Class: Html2rss::AutoSource::Article

Inherits: Object

Includes: Comparable, Enumerable

Defined in: lib/html2rss/auto_source/article.rb

Overview

Article is a simple data object representing an article extracted from a page. It is enumerable and responds to all keys specified in PROVIDED_KEYS.

Constant Summary

PROVIDED_KEYS = %i[id title description url image guid published_at scraper].freeze

Class Method Summary

.contains_html?(text) ⇒ Boolean

Checks if the text contains HTML tags.
.remove_pattern_from_start(text, pattern, end_of_range: (text.size * 0.5).to_i) ⇒ String

Removes the pattern from the leading part of the text, provided the pattern's first occurrence starts before end_of_range. Otherwise the text is returned unchanged.

Instance Method Summary

#<=>(other) ⇒ Object
#description ⇒ Object
#each {|key, value| ... } ⇒ Enumerator

Yields each key in PROVIDED_KEYS with its value; returns an Enumerator if no block is given.
#guid ⇒ String

Generates a unique identifier based on the URL and ID using CRC32.
#id ⇒ Object
#image ⇒ Addressable::URI?
#initialize(**options) ⇒ Article

A new instance of Article.
#published_at ⇒ DateTime?

Parses and returns the published_at time.
#scraper ⇒ Object
#title ⇒ Object
#url ⇒ Addressable::URI?
#valid? ⇒ Boolean

Checks if the article is valid based on the presence of URL, ID, and either title or description.

Constructor Details

#initialize(**options) ⇒ `Article`

Returns a new instance of Article.

Parameters:

options (Hash<Symbol, String>)

# File 'lib/html2rss/auto_source/article.rb', line 44

def initialize(**options)
  @to_h = {}
  options.each_pair { |key, value| @to_h[key] = value.freeze if value }
  @to_h.freeze

  return unless (unknown_keys = options.keys - PROVIDED_KEYS).any?

  Log.warn "Article: unknown keys found: #{unknown_keys.join(', ')}"
end
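
The constructor's treatment of nil values, freezing, and unknown keys can be tried without the gem. This is a minimal sketch, not the library's code: `build_attributes` is a hypothetical helper that mirrors the body above and returns the unknown keys instead of calling `Log.warn`.

```ruby
# Copied from the Constant Summary above.
PROVIDED_KEYS = %i[id title description url image guid published_at scraper].freeze

# Hypothetical helper mirroring Article#initialize; it returns the unknown
# keys rather than logging them.
def build_attributes(**options)
  attributes = {}
  # Falsy values (nil, false) are skipped; everything else is frozen in place.
  options.each_pair { |key, value| attributes[key] = value.freeze if value }

  [attributes.freeze, options.keys - PROVIDED_KEYS]
end

title = +'Hello' # unary plus yields an unfrozen String
attributes, unknown_keys = build_attributes(title: title, url: nil, author: 'Jane')

p attributes.keys # => [:title, :author]
p unknown_keys    # => [:author]
p title.frozen?   # => true
```

Two details follow from the listing: an unknown key such as `author` is still stored (the warning does not reject it), and `value.freeze` freezes the caller's own object rather than a copy.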

Class Method Details

.contains_html?(text) ⇒ `Boolean`

Checks if the text contains HTML tags.

Parameters:

text (String)

Returns:

(Boolean)



# File 'lib/html2rss/auto_source/article.rb', line 39

def self.contains_html?(text)
  Nokogiri::HTML.fragment(text).children.any?(&:element?)
end

.remove_pattern_from_start(text, pattern, end_of_range: (text.size * 0.5).to_i) ⇒ `String`

Removes the pattern from the leading part of the text, provided the pattern's first occurrence starts before end_of_range. Otherwise the text is returned unchanged.

Parameters:

text (String)
pattern (String)
end_of_range (Integer) (defaults to: (text.size * 0.5).to_i)
Optional; defaults to half the size of the text.

Returns:

(String)

# File 'lib/html2rss/auto_source/article.rb', line 26

def self.remove_pattern_from_start(text, pattern, end_of_range: (text.size * 0.5).to_i)
  return text unless text.is_a?(String) && pattern.is_a?(String)

  index = text.index(pattern)
  return text if index.nil? || index >= end_of_range

  text.gsub(/^(.{0,#{end_of_range}})#{Regexp.escape(pattern)}/, '\1')
end
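
To see the effect of `end_of_range`, the method body can be run on its own. The sketch below copies the listing above into a top-level method; the sample strings are made up for illustration.

```ruby
# Standalone copy of the listing above, for experimentation outside the gem.
def remove_pattern_from_start(text, pattern, end_of_range: (text.size * 0.5).to_i)
  return text unless text.is_a?(String) && pattern.is_a?(String)

  index = text.index(pattern)
  return text if index.nil? || index >= end_of_range

  text.gsub(/^(.{0,#{end_of_range}})#{Regexp.escape(pattern)}/, '\1')
end

title = 'Breaking News'

# The title occurs at index 0, well inside the first half, so it is removed.
early = 'Breaking News: the city council approved the new budget today'
puts remove_pattern_from_start(early, title)
# prints ": the city council approved the new budget today"

# The title first occurs in the second half, so the text is left alone.
late = 'The city council approved the new budget today, Breaking News'
puts remove_pattern_from_start(late, title) == late # prints "true"
```

This is how `#description` avoids repeating the title at the start of a description while leaving later mentions of it intact.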

Instance Method Details

#<=>(other) ⇒ `Object`

# File 'lib/html2rss/auto_source/article.rb', line 120

def <=>(other)
  return nil unless other.is_a?(Article)

  0 if other.all? { |key, value| value == public_send(key) ? public_send(key) <=> value : false }
end
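
As written, `#<=>` only ever returns 0 or nil, so through Comparable it supports equality checks but not ordering. The sketch below shows that general Ruby behaviour with a hypothetical `Pair` class; it is not part of html2rss.

```ruby
# Hypothetical class used only to illustrate Comparable with a 0-or-nil <=>.
class Pair
  include Comparable

  attr_reader :left, :right

  def initialize(left, right)
    @left = left
    @right = right
  end

  # Answers 0 when both fields match, nil otherwise (never -1 or 1).
  def <=>(other)
    return nil unless other.is_a?(Pair)

    0 if [left, right] == [other.left, other.right]
  end
end

p Pair.new(1, 2) == Pair.new(1, 2) # => true
p Pair.new(1, 2) == Pair.new(1, 3) # => false

begin
  Pair.new(1, 2) < Pair.new(1, 3)
rescue ArgumentError => e
  # Comparable raises when <=> returns nil for an ordering operator.
  puts "ordering failed: #{e.class}"
end
```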

#description ⇒ `Object`

# File 'lib/html2rss/auto_source/article.rb', line 76

def description
  return @description if defined?(@description)

  return if (description = @to_h[:description]).to_s.empty?

  @description = self.class.remove_pattern_from_start(description, title) if title

  if self.class.contains_html?(@description) && url
    @description = Html2rss::AttributePostProcessors::SanitizeHtml.get(description, url)
  else
    @description
  end
end

#each {|key, value| ... } ⇒ `Enumerator`

Yields each key in PROVIDED_KEYS together with its value. Returns an Enumerator if no block is given.

Yields:

(key, value)

Returns:

(Enumerator) if no block is given

# File 'lib/html2rss/auto_source/article.rb', line 62

def each
  return enum_for(:each) unless block_given?

  PROVIDED_KEYS.each { |key| yield(key, public_send(key)) }
end
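
Defining `each` this way is what makes the Enumerable methods available. A minimal sketch of the same pattern follows; `KeyedRecord` and its `KEYS` are hypothetical stand-ins for Article and PROVIDED_KEYS.

```ruby
# Hypothetical stand-in for Article, reduced to three keys.
class KeyedRecord
  include Enumerable

  KEYS = %i[id title url].freeze

  def initialize(**values)
    @values = values
  end

  # One reader per key, so public_send(key) works inside #each.
  KEYS.each { |key| define_method(key) { @values[key] } }

  def each
    return enum_for(:each) unless block_given?

    KEYS.each { |key| yield(key, public_send(key)) }
  end
end

record = KeyedRecord.new(id: '42', title: 'Hello')

p record.each.class # => Enumerator
p record.to_a       # => [[:id, "42"], [:title, "Hello"], [:url, nil]]
p record.to_h[:id]  # => "42"
```

Because every key is yielded, keys without a value still appear, paired with nil.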

#guid ⇒ `String`

Generates a unique identifier based on the URL and ID using CRC32.

Returns:

(String)



# File 'lib/html2rss/auto_source/article.rb', line 102

def guid
  @guid ||= Zlib.crc32([url, id].join('#!/')).to_s(36).encode('utf-8')
end
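
The GUID computation needs only Ruby's standard `zlib`. The sketch below wraps the same expression in a hypothetical `guid_for` helper; the URLs and IDs are invented. Note that CRC32 is a 32-bit checksum, so the result is compact and deterministic but collisions between different articles are possible.

```ruby
require 'zlib'

# Hypothetical helper wrapping the expression from Article#guid.
def guid_for(url, id)
  Zlib.crc32([url, id].join('#!/')).to_s(36).encode('utf-8')
end

first  = guid_for('https://example.com/posts/1', '1')
second = guid_for('https://example.com/posts/1', '2')

puts first           # a short base-36 string
puts first == second # prints "false": the inputs differ in a single byte
puts first == guid_for('https://example.com/posts/1', '1') # prints "true"
```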

#id ⇒ `Object`



# File 'lib/html2rss/auto_source/article.rb', line 68

def id
  @to_h[:id]
end

#image ⇒ `Addressable::URI`?

Returns:

(Addressable::URI, nil)



# File 'lib/html2rss/auto_source/article.rb', line 96

def image
  @image ||= Html2rss::Utils.sanitize_url(@to_h[:image])
end

#published_at ⇒ `DateTime`?

Parses and returns the published_at time.

Returns:

(DateTime, nil)

# File 'lib/html2rss/auto_source/article.rb', line 108

def published_at
  return if (string = @to_h[:published_at].to_s.strip).empty?

  @published_at ||= DateTime.parse(string)
rescue ArgumentError
  nil
end
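
The nil-on-failure behaviour can be checked with the standard `date` library alone. This sketch uses a hypothetical `parse_published_at` helper that follows the listing above but takes the raw value as an argument and skips the memoization.

```ruby
require 'date'

# Hypothetical helper following Article#published_at, without memoization.
def parse_published_at(raw)
  string = raw.to_s.strip
  return if string.empty?

  DateTime.parse(string)
rescue ArgumentError
  # Unparseable input yields nil instead of raising.
  nil
end

p parse_published_at('2024-01-15T10:30:00Z').year # => 2024
p parse_published_at('   ')                       # => nil
p parse_published_at(nil)                         # => nil
p parse_published_at('???')                       # => nil
```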

#scraper ⇒ `Object`



# File 'lib/html2rss/auto_source/article.rb', line 116

def scraper
  @to_h[:scraper]
end

#title ⇒ `Object`



# File 'lib/html2rss/auto_source/article.rb', line 72

def title
  @to_h[:title]
end

#url ⇒ `Addressable::URI`?

Returns:

(Addressable::URI, nil)



# File 'lib/html2rss/auto_source/article.rb', line 91

def url
  @url ||= Html2rss::Utils.sanitize_url(@to_h[:url])
end

#valid? ⇒ `Boolean`

Checks if the article is valid based on the presence of URL, ID, and either title or description.

Returns:

(Boolean) true if the article is valid, otherwise false



# File 'lib/html2rss/auto_source/article.rb', line 56

def valid?
  !url.to_s.empty? && (!title.to_s.empty? || !description.to_s.empty?) && !id.to_s.empty?
end
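
The validity rule can be exercised in isolation. The sketch below applies the same expression to a hypothetical `ArticleSketch` Struct with plain string fields; the real class sanitizes `url` and post-processes `description` first, which this sketch leaves out.

```ruby
# Hypothetical Struct carrying only the fields that #valid? inspects.
ArticleSketch = Struct.new(:url, :id, :title, :description, keyword_init: true) do
  def valid?
    !url.to_s.empty? && (!title.to_s.empty? || !description.to_s.empty?) && !id.to_s.empty?
  end
end

with_title = ArticleSketch.new(url: 'https://example.com/a', id: 'a', title: 'Hello')
with_text  = ArticleSketch.new(url: 'https://example.com/a', id: 'a', description: 'Some text')
no_id      = ArticleSketch.new(url: 'https://example.com/a', title: 'Hello')
no_content = ArticleSketch.new(url: 'https://example.com/a', id: 'a')

p with_title.valid? # => true
p with_text.valid?  # => true
p no_id.valid?      # => false
p no_content.valid? # => false
```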
