Class: Html2rss::AutoSource::Article
- Inherits:
- Object (ancestors: Html2rss::AutoSource::Article < Object)
- Includes:
- Comparable, Enumerable
- Defined in:
- lib/html2rss/auto_source/article.rb
Overview
Article is a simple data object representing an article extracted from a page. It is enumerable and responds to all keys specified in PROVIDED_KEYS.
Constant Summary
- PROVIDED_KEYS = %i[id title description url image guid published_at scraper].freeze
Class Method Summary
- .contains_html?(text) ⇒ Boolean
  Checks if the text contains HTML tags.
- .remove_pattern_from_start(text, pattern, end_of_range: (text.size * 0.5).to_i) ⇒ String
  Removes the specified pattern from the beginning of the text within a given range if the pattern occurs before the range’s end.
Instance Method Summary
- #<=>(other) ⇒ Object
- #description ⇒ Object
- #each {|key, value| ... } ⇒ Enumerator
  Returns an Enumerator if no block is given.
- #guid ⇒ String
  Generates a unique identifier based on the URL and ID using CRC32.
- #id ⇒ Object
- #image ⇒ Addressable::URI?
- #initialize(**options) ⇒ Article (constructor)
  A new instance of Article.
- #published_at ⇒ DateTime?
  Parses and returns the published_at time.
- #scraper ⇒ Object
- #title ⇒ Object
- #url ⇒ Addressable::URI?
- #valid? ⇒ Boolean
  Checks if the article is valid based on the presence of URL, ID, and either title or description.
Constructor Details
#initialize(**options) ⇒ Article
Returns a new instance of Article.
    # File 'lib/html2rss/auto_source/article.rb', line 44
    def initialize(**options)
      @to_h = {}
      # Store every non-nil value, frozen, under its key.
      options.each_pair { |key, value| @to_h[key] = value.freeze if value }
      @to_h.freeze

      return unless (unknown_keys = options.keys - PROVIDED_KEYS).any?

      Log.warn "Article: unknown keys found: #{unknown_keys.join(', ')}"
    end
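The constructor's filtering can be tried outside the gem. The following is a minimal, standalone sketch: `build_attributes` is a hypothetical helper (not part of html2rss) that stands in for the constructor and returns the unknown keys instead of reporting them through `Log.warn`.

```ruby
# Standalone sketch; build_attributes is a hypothetical stand-in for
# Article#initialize and is not part of html2rss.
PROVIDED_KEYS = %i[id title description url image guid published_at scraper].freeze

def build_attributes(**options)
  attributes = {}
  # nil (and false) values are skipped; everything else is frozen and stored.
  options.each_pair { |key, value| attributes[key] = value.freeze if value }

  # Unknown keys are reported to the caller but are still stored.
  unknown_keys = options.keys - PROVIDED_KEYS
  [attributes.freeze, unknown_keys]
end

attributes, unknown_keys = build_attributes(id: 'post-1', title: nil, author: 'Jane')
puts attributes.inspect
puts unknown_keys.inspect
```

Note that an unknown key such as `:author` only triggers a report; its value is kept in the hash alongside the known keys.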
Class Method Details
.contains_html?(text) ⇒ Boolean
Checks if the text contains HTML tags.
    # File 'lib/html2rss/auto_source/article.rb', line 39
    def self.contains_html?(text)
      Nokogiri::HTML.fragment(text).children.any?(&:element?)
    end
.remove_pattern_from_start(text, pattern, end_of_range: (text.size * 0.5).to_i) ⇒ String
Removes the pattern from the text if its first occurrence starts before end_of_range (by default, half the text length). Otherwise, or if text or pattern is not a String, the text is returned unchanged.
    # File 'lib/html2rss/auto_source/article.rb', line 26
    def self.remove_pattern_from_start(text, pattern, end_of_range: (text.size * 0.5).to_i)
      return text unless text.is_a?(String) && pattern.is_a?(String)

      index = text.index(pattern)
      return text if index.nil? || index >= end_of_range

      text.gsub(/^(.{0,#{end_of_range}})#{Regexp.escape(pattern)}/, '\1')
    end
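To see the effect of `end_of_range`, the method body can be copied into a free function and run without the gem. This is only a sketch: the top-level `remove_pattern_from_start` below is a copy made for illustration, and the sample strings are invented.

```ruby
# Copy of the class method above as a free function, for illustration only.
def remove_pattern_from_start(text, pattern, end_of_range: (text.size * 0.5).to_i)
  return text unless text.is_a?(String) && pattern.is_a?(String)

  index = text.index(pattern)
  return text if index.nil? || index >= end_of_range

  text.gsub(/^(.{0,#{end_of_range}})#{Regexp.escape(pattern)}/, '\1')
end

# The pattern starts at index 0, which is inside the first half of the text.
puts remove_pattern_from_start('Title here. The body of the article follows.', 'Title here. ')
# prints "The body of the article follows."

# The pattern starts at index 38 of 43 characters, past the default range,
# so the text comes back unchanged.
puts remove_pattern_from_start('A long introduction comes first, then Title', 'Title')
# prints "A long introduction comes first, then Title"
```

This is how `#description` uses it: a title repeated at the start of the description is stripped, while a title mentioned late in the text is left alone.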
Instance Method Details
#<=>(other) ⇒ Object
    # File 'lib/html2rss/auto_source/article.rb', line 120
    def <=>(other)
      return nil unless other.is_a?(Article)

      0 if other.all? { |key, value| value == public_send(key) ? public_send(key) <=> value : false }
    end
#description ⇒ Object
    # File 'lib/html2rss/auto_source/article.rb', line 76
    def description
      return @description if defined?(@description)

      return if (description = @to_h[:description]).to_s.empty?

      @description = self.class.remove_pattern_from_start(description, title) if title

      if self.class.contains_html?(@description) && url
        @description = Html2rss::AttributePostProcessors::SanitizeHtml.get(description, url)
      else
        @description
      end
    end
#each {|key, value| ... } ⇒ Enumerator
Returns an Enumerator if no block is given.
    # File 'lib/html2rss/auto_source/article.rb', line 62
    def each
      return enum_for(:each) unless block_given?

      PROVIDED_KEYS.each { |key| yield(key, public_send(key)) }
    end
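The `enum_for` pattern used here is what makes the class work with `Enumerable`. As a rough illustration, the sketch below uses a hypothetical `MiniArticle` class with only two keys; it is not part of html2rss and only mirrors the shape of `#each`.

```ruby
# Hypothetical miniature of Article with two keys, for illustration only.
class MiniArticle
  include Enumerable

  KEYS = %i[id title].freeze

  attr_reader :id, :title

  def initialize(id:, title:)
    @id = id
    @title = title
  end

  # Yields each key with the value of the reader method of the same name.
  def each
    return enum_for(:each) unless block_given?

    KEYS.each { |key| yield(key, public_send(key)) }
  end
end

article = MiniArticle.new(id: 'post-1', title: 'Hello')

pairs = article.each.to_a # without a block, each returns an Enumerator
hash = article.to_h       # Enumerable methods are built on top of each

puts pairs.inspect
puts hash.inspect
```

Because values are read through `public_send(key)`, the enumeration sees the processed readers (for example a sanitized URL) rather than the raw constructor input.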
#guid ⇒ String
Generates a unique identifier based on the URL and ID using CRC32.
    # File 'lib/html2rss/auto_source/article.rb', line 102
    def guid
      @guid ||= Zlib.crc32([url, id].join('#!/')).to_s(36).encode('utf-8')
    end
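The derivation can be reproduced with the standard library alone. In this sketch, `guid_for` is a hypothetical free function that mirrors the method body without memoization, and plain strings stand in for the `Addressable::URI` that `#url` returns.

```ruby
require 'zlib'

# Hypothetical free function mirroring the body of Article#guid.
def guid_for(url, id)
  # Join URL and ID with '#!/', checksum the result, and render it in base 36.
  Zlib.crc32([url, id].join('#!/')).to_s(36).encode('utf-8')
end

first  = guid_for('https://example.com/post', '1')
second = guid_for('https://example.com/post', '2')

puts first
puts second
puts first == guid_for('https://example.com/post', '1') # same input, same guid
```

The result is deterministic for a given URL and ID. CRC32 is a 32-bit checksum, so two different articles can in principle share a guid, although this is unlikely within a single feed.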
#id ⇒ Object
    # File 'lib/html2rss/auto_source/article.rb', line 68
    def id
      @to_h[:id]
    end
#image ⇒ Addressable::URI?
    # File 'lib/html2rss/auto_source/article.rb', line 96
    def image
      @image ||= Html2rss::Utils.sanitize_url(@to_h[:image])
    end
#published_at ⇒ DateTime?
Parses and returns the published_at time.
    # File 'lib/html2rss/auto_source/article.rb', line 108
    def published_at
      return if (string = @to_h[:published_at].to_s.strip).empty?

      @published_at ||= DateTime.parse(string)
    rescue ArgumentError
      nil
    end
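The nil-on-failure behavior is easy to check in isolation. The sketch below assumes a hypothetical free function, `parse_published_at`, that mirrors the method without memoization; the input strings are invented.

```ruby
require 'date'

# Hypothetical free function mirroring Article#published_at without memoization.
def parse_published_at(raw)
  string = raw.to_s.strip
  return if string.empty?

  DateTime.parse(string)
rescue ArgumentError
  # DateTime.parse raises an ArgumentError (or a subclass) for unparseable input.
  nil
end

puts parse_published_at('2024-05-01T12:30:00+00:00').inspect
puts parse_published_at('   ').inspect       # blank input
puts parse_published_at('not a date').inspect # unparseable input
```

Blank, missing, and unparseable values all yield `nil`, so callers only need to handle one "no date" case.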
#scraper ⇒ Object
    # File 'lib/html2rss/auto_source/article.rb', line 116
    def scraper
      @to_h[:scraper]
    end
#title ⇒ Object
    # File 'lib/html2rss/auto_source/article.rb', line 72
    def title
      @to_h[:title]
    end
#url ⇒ Addressable::URI?
    # File 'lib/html2rss/auto_source/article.rb', line 91
    def url
      @url ||= Html2rss::Utils.sanitize_url(@to_h[:url])
    end
#valid? ⇒ Boolean
Checks if the article is valid based on the presence of URL, ID, and either title or description.
    # File 'lib/html2rss/auto_source/article.rb', line 56
    def valid?
      !url.to_s.empty? && (!title.to_s.empty? || !description.to_s.empty?) && !id.to_s.empty?
    end
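The rule reads more easily as a truth table. This sketch assumes a hypothetical free function, `article_valid?`, that applies the same expression to plain keyword arguments instead of the article's readers.

```ruby
# Hypothetical free function applying the same rule as Article#valid?.
def article_valid?(url:, id:, title: nil, description: nil)
  !url.to_s.empty? && (!title.to_s.empty? || !description.to_s.empty?) && !id.to_s.empty?
end

# URL, ID, and a title: valid.
puts article_valid?(url: 'https://example.com/a', id: 'a', title: 'Hello')
# URL, ID, and only a description: still valid.
puts article_valid?(url: 'https://example.com/a', id: 'a', description: 'Body text')
# Neither title nor description: invalid.
puts article_valid?(url: 'https://example.com/a', id: 'a')
# Missing ID: invalid, even with a title.
puts article_valid?(url: 'https://example.com/a', id: nil, title: 'Hello')
```

Calling `to_s` before `empty?` lets `nil` and empty strings be treated the same way.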