Class: RubyCrawl::SiteCrawler::PageResult
- Inherits: Object
  - Object
  - RubyCrawl::SiteCrawler::PageResult
- Defined in: lib/rubycrawl/site_crawler.rb
Overview
Page result yielded to the caller's block. clean_markdown and clean_text are computed lazily, on first call, and then cached.
Instance Attribute Summary
- #clean_html ⇒ Object (readonly): Returns the value of attribute clean_html.
- #depth ⇒ Object (readonly): Returns the value of attribute depth.
- #html ⇒ Object (readonly): Returns the value of attribute html.
- #links ⇒ Object (readonly): Returns the value of attribute links.
- #metadata ⇒ Object (readonly): Returns the value of attribute metadata.
- #raw_text ⇒ Object (readonly): Returns the value of attribute raw_text.
- #url ⇒ Object (readonly): Returns the value of attribute url.
Instance Method Summary
- #clean_markdown ⇒ Object: Markdown derived from noise-stripped HTML.
- #clean_text ⇒ Object: Plain text derived from noise-stripped HTML.
- #final_url ⇒ Object: The final URL after redirects.
- #initialize(url:, html:, raw_text:, clean_html:, links:, metadata:, depth:) ⇒ PageResult (constructor): A new instance of PageResult.
Constructor Details
#initialize(url:, html:, raw_text:, clean_html:, links:, metadata:, depth:) ⇒ PageResult
Returns a new instance of PageResult.
    # File 'lib/rubycrawl/site_crawler.rb', line 13
    def initialize(url:, html:, raw_text:, clean_html:, links:, metadata:, depth:)
      @url = url
      @html = html
      @raw_text = raw_text
      @clean_html = clean_html
      @links = links
      @metadata = metadata
      @depth = depth
    end
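The constructor takes seven required keyword arguments and exposes each one through a read-only attribute. The sketch below illustrates that contract with a hypothetical stand-in class, PageResultSketch, so it runs without the gem installed. The sample values (an empty links array, a metadata hash holding only 'final_url') are assumptions chosen for illustration, not documented shapes.

```ruby
# Hypothetical stand-in mirroring the documented constructor and readers.
# It is NOT the real RubyCrawl::SiteCrawler::PageResult.
class PageResultSketch
  attr_reader :url, :html, :raw_text, :clean_html, :links, :metadata, :depth

  def initialize(url:, html:, raw_text:, clean_html:, links:, metadata:, depth:)
    @url = url
    @html = html
    @raw_text = raw_text
    @clean_html = clean_html
    @links = links
    @metadata = metadata
    @depth = depth
  end
end

# All seven keywords are required; the values here are illustrative only.
page = PageResultSketch.new(
  url: 'https://example.com/',
  html: '<html><body><p>Hello</p></body></html>',
  raw_text: 'Hello',
  clean_html: '<p>Hello</p>',
  links: [],
  metadata: { 'final_url' => 'https://example.com/' },
  depth: 0
)

# Omitting any keyword raises ArgumentError.
missing_keyword_error =
  begin
    PageResultSketch.new(url: 'https://example.com/')
    nil
  rescue ArgumentError => e
    e
  end

# attr_reader defines getters only, so no setter exists.
writable = page.respond_to?(:url=)
```

Because the attributes are read-only, a page result behaves as a value snapshot of one crawled page: code inside the block can read it freely but cannot reassign its fields.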
Instance Attribute Details
#clean_html ⇒ Object (readonly)
Returns the value of attribute clean_html.
    # File 'lib/rubycrawl/site_crawler.rb', line 11
    def clean_html
      @clean_html
    end
#depth ⇒ Object (readonly)
Returns the value of attribute depth.
    # File 'lib/rubycrawl/site_crawler.rb', line 11
    def depth
      @depth
    end
#html ⇒ Object (readonly)
Returns the value of attribute html.
    # File 'lib/rubycrawl/site_crawler.rb', line 11
    def html
      @html
    end
#links ⇒ Object (readonly)
Returns the value of attribute links.
    # File 'lib/rubycrawl/site_crawler.rb', line 11
    def links
      @links
    end
#metadata ⇒ Object (readonly)
Returns the value of attribute metadata.
    # File 'lib/rubycrawl/site_crawler.rb', line 11
    def metadata
      @metadata
    end
#raw_text ⇒ Object (readonly)
Returns the value of attribute raw_text.
    # File 'lib/rubycrawl/site_crawler.rb', line 11
    def raw_text
      @raw_text
    end
#url ⇒ Object (readonly)
Returns the value of attribute url.
    # File 'lib/rubycrawl/site_crawler.rb', line 11
    def url
      @url
    end
Instance Method Details
#clean_markdown ⇒ Object
Markdown derived from noise-stripped HTML. Computed lazily and cached, the same as Result#clean_markdown. Falls back to the full html when clean_html is empty.
    # File 'lib/rubycrawl/site_crawler.rb', line 32
    def clean_markdown
      source = clean_html.empty? ? html : clean_html
      @clean_markdown ||= MarkdownConverter.convert(source, base_url: final_url)
    end
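The lazy behaviour comes from the `||=` memoization: the converter runs on the first call and the cached string is returned afterwards. A minimal sketch of that pattern follows. StubConverter and LazyMarkdownDemo are hypothetical names introduced here; the stub only counts calls and does not reproduce what the real MarkdownConverter emits.

```ruby
# Hypothetical converter stub that records how often it is invoked.
module StubConverter
  @calls = 0

  class << self
    attr_reader :calls

    def convert(source, base_url:)
      @calls += 1
      "md(#{source}) @ #{base_url}"
    end
  end
end

# Hypothetical stand-in reproducing only the documented clean_markdown logic.
class LazyMarkdownDemo
  def initialize(html:, clean_html:, final_url:)
    @html = html
    @clean_html = clean_html
    @final_url = final_url
  end

  def clean_markdown
    # Fall back to the full HTML when noise stripping left nothing.
    source = @clean_html.empty? ? @html : @clean_html
    # Convert once, then reuse the cached result.
    @clean_markdown ||= StubConverter.convert(source, base_url: @final_url)
  end
end

page = LazyMarkdownDemo.new(
  html: '<p>raw</p>',
  clean_html: '', # empty, so the fallback to html applies
  final_url: 'https://example.com/'
)

first = page.clean_markdown
second = page.clean_markdown
```

One consequence of `||=` worth keeping in mind: a nil or false result is never cached, so a converter that returned nil would be invoked again on every call.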
#clean_text ⇒ Object
Plain text derived from noise-stripped HTML. Computed lazily and cached, the same as Result#clean_text, to which it delegates.
    # File 'lib/rubycrawl/site_crawler.rb', line 24
    def clean_text
      @clean_text ||= Result.new(
        html: html, raw_text: raw_text, clean_html: clean_html,
        links: links, metadata: metadata
      ).clean_text
    end
#final_url ⇒ Object
The final URL after redirects. Read from metadata['final_url'], falling back to url when that key is absent.
    # File 'lib/rubycrawl/site_crawler.rb', line 38
    def final_url
      metadata['final_url'] || url
    end
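The fallback can be sketched in isolation. FinalUrlDemo below is a hypothetical stand-in that carries only url and metadata, and it assumes metadata is a Hash with string keys, as the 'final_url' lookup suggests.

```ruby
# Hypothetical stand-in mirroring only the documented final_url logic.
class FinalUrlDemo
  attr_reader :url, :metadata

  def initialize(url:, metadata:)
    @url = url
    @metadata = metadata
  end

  def final_url
    # Prefer the redirect target when recorded, else the requested URL.
    metadata['final_url'] || url
  end
end

redirected = FinalUrlDemo.new(
  url: 'http://example.com',
  metadata: { 'final_url' => 'https://example.com/' }
)
direct = FinalUrlDemo.new(url: 'https://example.com/about', metadata: {})

redirected_target = redirected.final_url
direct_target = direct.final_url
```

Note that the lookup uses the string key 'final_url'. A metadata hash keyed by the symbol :final_url would miss the lookup and silently fall back to url.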