Class: RubyCrawl::Result
- Inherits: Object
  - Object
  - RubyCrawl::Result
- Defined in: lib/rubycrawl/result.rb
Overview
Immutable result object returned from every crawl. clean_text and clean_markdown are both derived lazily from clean_html (falling back to html when clean_html is empty), so they have consistent content coverage, including hidden/collapsed elements.
Instance Attribute Summary
- #clean_html ⇒ Object (readonly)
  Returns the value of attribute clean_html.
- #html ⇒ Object (readonly)
  Returns the value of attribute html.
- #links ⇒ Object (readonly)
  Returns the value of attribute links.
- #metadata ⇒ Object (readonly)
  Returns the value of attribute metadata.
- #raw_text ⇒ Object (readonly)
  Returns the value of attribute raw_text.
Instance Method Summary
- #clean_markdown ⇒ String
  Markdown derived from noise-stripped HTML.
- #clean_markdown? ⇒ Boolean
- #clean_text ⇒ String
  Plain text derived from noise-stripped HTML.
- #final_url ⇒ String?
  The final URL after redirects.
- #initialize(raw_text:, clean_html:, html:, links:, metadata:) ⇒ Result (constructor)
  A new instance of Result.
- #to_h ⇒ Object
Constructor Details
#initialize(raw_text:, clean_html:, html:, links:, metadata:) ⇒ Result
Returns a new instance of Result.
# File 'lib/rubycrawl/result.rb', line 12
def initialize(raw_text:, clean_html:, html:, links:, metadata:)
  @raw_text = raw_text
  @clean_html = clean_html
  @html = html
  @links = links
  @metadata = metadata
end
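Every keyword in this signature is required, so omitting one raises ArgumentError. The sketch below illustrates that with a stand-in class, `SketchResult`, which is a hypothetical name that mirrors only the constructor documented here; the sample values are made up.

```ruby
# Stand-in that mirrors the documented constructor signature.
# SketchResult is hypothetical; it is not RubyCrawl::Result itself.
class SketchResult
  attr_reader :raw_text, :clean_html, :html, :links, :metadata

  def initialize(raw_text:, clean_html:, html:, links:, metadata:)
    @raw_text = raw_text
    @clean_html = clean_html
    @html = html
    @links = links
    @metadata = metadata
  end
end

# All five keywords supplied: construction succeeds.
result = SketchResult.new(
  raw_text: 'Hello',
  clean_html: '<p>Hello</p>',
  html: '<html><body><p>Hello</p></body></html>',
  links: [],
  metadata: { 'final_url' => 'https://example.com/' }
)

# A missing required keyword fails fast with ArgumentError.
begin
  SketchResult.new(raw_text: 'Hello')
rescue ArgumentError => e
  warn "construction failed: #{e.message}"
end
```

This can be handy when building result objects by hand in tests, since a forgotten field surfaces at construction time rather than later.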
Instance Attribute Details
#clean_html ⇒ Object (readonly)
Returns the value of attribute clean_html.
# File 'lib/rubycrawl/result.rb', line 10
def clean_html
  @clean_html
end
#html ⇒ Object (readonly)
Returns the value of attribute html.
# File 'lib/rubycrawl/result.rb', line 10
def html
  @html
end
#links ⇒ Object (readonly)
Returns the value of attribute links.
# File 'lib/rubycrawl/result.rb', line 10
def links
  @links
end
#metadata ⇒ Object (readonly)
Returns the value of attribute metadata.
# File 'lib/rubycrawl/result.rb', line 10
def metadata
  @metadata
end
#raw_text ⇒ Object (readonly)
Returns the value of attribute raw_text.
# File 'lib/rubycrawl/result.rb', line 10
def raw_text
  @raw_text
end
Instance Method Details
#clean_markdown ⇒ String
Markdown derived from noise-stripped HTML. Preserves document structure (headings, lists, links). Lazy — computed on first access.
# File 'lib/rubycrawl/result.rb', line 34
def clean_markdown
  source = clean_html.empty? ? html : clean_html
  @clean_markdown ||= MarkdownConverter.convert(source, base_url: final_url)
end
#clean_markdown? ⇒ Boolean
Returns true once clean_markdown has been computed. Does not trigger the conversion itself.
# File 'lib/rubycrawl/result.rb', line 46
def clean_markdown?
  !@clean_markdown.nil?
end
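The lazy behaviour of clean_markdown and its predicate can be sketched as follows. `SketchConverter` and `LazyMarkdown` are hypothetical stand-ins: the real MarkdownConverter is not shown on this page, so the toy conversion here only illustrates the `||=` memoization and the fallback from clean_html to html.

```ruby
# Hypothetical stand-in for the converter; it counts calls so the
# memoization is observable. The conversion itself is a toy.
module SketchConverter
  @calls = 0

  class << self
    attr_reader :calls

    def convert(html, base_url: nil)
      @calls += 1
      # Turn <h1>...</h1> into a Markdown heading, then drop other tags.
      html.gsub(%r{<h1>(.*?)</h1>}, "# \\1\n").gsub(/<[^>]+>/, '').strip
    end
  end
end

# Hypothetical class reproducing only the lazy pattern documented above.
class LazyMarkdown
  def initialize(clean_html:, html:)
    @clean_html = clean_html
    @html = html
  end

  # Computed on first access, then served from the instance variable.
  def clean_markdown
    @clean_markdown ||= begin
      source = @clean_html.empty? ? @html : @clean_html
      SketchConverter.convert(source)
    end
  end

  # Reports whether the conversion has run, without running it.
  def clean_markdown?
    !@clean_markdown.nil?
  end
end

doc = LazyMarkdown.new(clean_html: '<h1>Title</h1><p>Body</p>', html: '')
doc.clean_markdown?  # false: nothing converted yet
doc.clean_markdown   # runs the converter once
doc.clean_markdown?  # true: the value is now cached
```

In this sketch the predicate is useful for deciding whether a cached value exists before serializing, without paying for a conversion that was never requested.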
#clean_text ⇒ String
Plain text derived from noise-stripped HTML. Captures hidden/collapsed content (accordions, tabs) that innerText misses. Lazy — computed on first access.
# File 'lib/rubycrawl/result.rb', line 25
def clean_text
  @clean_text ||= html_to_text(clean_html.empty? ? html : clean_html)
end
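To illustrate why text derived from markup can include collapsed content, here is a minimal sketch. `naive_html_to_text` and `text_source` are hypothetical helpers written for this example; the library's own html_to_text is not shown on this page and may work quite differently.

```ruby
# Hypothetical helper: a regex-based tag stripper, not the library's html_to_text.
def naive_html_to_text(html)
  html
    .gsub(%r{<(script|style)\b[^>]*>.*?</\1>}im, ' ') # drop non-content blocks
    .gsub(/<[^>]+>/, ' ')                              # drop remaining tags
    .gsub(/\s+/, ' ')                                  # collapse whitespace
    .strip
end

# Same fallback as clean_text: prefer clean_html, use html when it is empty.
def text_source(clean_html, html)
  clean_html.empty? ? html : clean_html
end

# The answer sits in an element marked hidden, so a rendered-text
# reading of the page would skip it; the markup still contains it.
ACCORDION_HTML = '<section><h2>FAQ</h2><div hidden>Collapsed answer</div></section>'

puts naive_html_to_text(text_source(ACCORDION_HTML, ''))
```

Because the text is taken from the markup rather than from what a browser renders, the hidden answer ends up in the output alongside the visible heading.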
#final_url ⇒ String?
The final URL after redirects.
# File 'lib/rubycrawl/result.rb', line 41
def final_url
  metadata['final_url']
end
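The `String?` return type follows from the lookup being a plain Hash read. The sketch below assumes metadata is a Hash with string keys, as the `['final_url']` lookup suggests; `final_url_from` is a hypothetical helper and the URL is made up.

```ruby
# Assumption: metadata is a string-keyed Hash, as the ['final_url'] lookup suggests.
def final_url_from(metadata)
  metadata['final_url']
end

redirected = { 'final_url' => 'https://example.com/new-path' }
no_redirect_info = {}
symbol_keyed = { final_url: 'https://example.com/new-path' }

final_url_from(redirected)        # the stored URL
final_url_from(no_redirect_info)  # nil, since the key is absent
final_url_from(symbol_keyed)      # nil as well: :final_url is a different key
```

Callers that need a URL unconditionally would have to handle the nil case themselves, for example by falling back to the URL they originally requested.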
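#to_h ⇒ Object
Returns the result as a Hash. clean_text and clean_markdown are read from their instance variables, so each is nil unless it was computed before the call.
# File 'lib/rubycrawl/result.rb', line 50
def to_h
  {
    raw_text: raw_text,
    clean_text: @clean_text,
    clean_html: clean_html,
    html: html,
    links: links,
    metadata: metadata,
    clean_markdown: @clean_markdown
  }
end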
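Since to_h reads `@clean_text` and `@clean_markdown` directly instead of calling the lazy accessors, those entries stay nil until something has triggered the computation. The sketch below reproduces that interaction with a hypothetical `SnapshotSketch` class and a toy tag stripper; it is not the library's code.

```ruby
# Hypothetical class showing how a lazy accessor interacts with a
# to_h that reads the instance variable directly.
class SnapshotSketch
  def initialize(clean_html:)
    @clean_html = clean_html
  end

  # Lazy: computed on first access (toy tag stripper for illustration).
  def clean_text
    @clean_text ||= @clean_html.gsub(/<[^>]+>/, '').strip
  end

  # Reads @clean_text without triggering the computation.
  def to_h
    { clean_html: @clean_html, clean_text: @clean_text }
  end
end

snapshot = SnapshotSketch.new(clean_html: '<p>Hi</p>')
snapshot.to_h[:clean_text] # nil: clean_text has not been accessed yet
snapshot.clean_text        # forces the computation
snapshot.to_h[:clean_text] # now holds the derived text
```

If the derived fields are needed in the serialized hash, calling clean_text and clean_markdown before to_h is one way to make sure they are populated.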