Class: RubyCrawl::Result

Inherits:
Object
Defined in:
lib/rubycrawl/result.rb

Overview

Immutable result object returned from every crawl. clean_text and clean_markdown are both derived lazily from clean_html so they have consistent content coverage (including hidden/collapsed elements).

Constructor Details

#initialize(raw_text:, clean_html:, html:, links:, metadata:) ⇒ Result

Returns a new instance of Result.



# File 'lib/rubycrawl/result.rb', line 12

def initialize(raw_text:, clean_html:, html:, links:, metadata:)
  @raw_text   = raw_text
  @clean_html = clean_html
  @html       = html
  @links      = links
  @metadata   = metadata
end

Instance Attribute Details

#clean_html ⇒ Object (readonly)

Returns the value of attribute clean_html.



# File 'lib/rubycrawl/result.rb', line 10

def clean_html
  @clean_html
end

#html ⇒ Object (readonly)

Returns the value of attribute html.



# File 'lib/rubycrawl/result.rb', line 10

def html
  @html
end

#links ⇒ Object (readonly)

Returns the value of attribute links.



# File 'lib/rubycrawl/result.rb', line 10

def links
  @links
end

#metadata ⇒ Object (readonly)

Returns the value of attribute metadata.



# File 'lib/rubycrawl/result.rb', line 10

def metadata
  @metadata
end

#raw_text ⇒ Object (readonly)

Returns the value of attribute raw_text.



# File 'lib/rubycrawl/result.rb', line 10

def raw_text
  @raw_text
end

Instance Method Details

#clean_markdown ⇒ String

Markdown derived from noise-stripped HTML. Preserves document structure (headings, lists, links). Falls back to html when clean_html is empty. Computed lazily on first access, then cached.

Returns:

  • (String)



# File 'lib/rubycrawl/result.rb', line 34

def clean_markdown
  source = clean_html.empty? ? html : clean_html
  @clean_markdown ||= MarkdownConverter.convert(source, base_url: final_url)
end

#clean_markdown? ⇒ Boolean

Returns:

  • (Boolean) true once clean_markdown has been computed and cached



# File 'lib/rubycrawl/result.rb', line 46

def clean_markdown?
  !@clean_markdown.nil?
end
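
The memoization pattern behind clean_markdown and clean_markdown? can be illustrated with a minimal stand-in. The LazyDoc class and its counting converter are hypothetical and are not part of RubyCrawl; the sketch only assumes the `||=` caching shown in the listings above.

```ruby
# Hypothetical stand-in for illustration; not part of RubyCrawl.
class LazyDoc
  attr_reader :conversions

  def initialize(html)
    @html = html
    @conversions = 0
  end

  # Computed on first access, then cached in @markdown.
  def markdown
    @markdown ||= convert(@html)
  end

  # True once #markdown has been called and produced a value.
  def markdown?
    !@markdown.nil?
  end

  private

  # Placeholder conversion: only handles <h1> so the sketch stays small.
  def convert(html)
    @conversions += 1
    html.gsub(%r{<h1>(.*?)</h1>}, '# \1')
  end
end

doc = LazyDoc.new('<h1>Title</h1>')
doc.markdown?    # false, nothing computed yet
doc.markdown     # runs the conversion
doc.markdown     # served from the cache
doc.markdown?    # true
doc.conversions  # 1
```

The predicate lets a caller check whether the conversion cost has already been paid without triggering it.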

#clean_text ⇒ String

Plain text derived from noise-stripped HTML. Captures hidden/collapsed content (accordions, tabs) that innerText misses. Falls back to html when clean_html is empty. Computed lazily on first access, then cached.

Returns:

  • (String)



# File 'lib/rubycrawl/result.rb', line 25

def clean_text
  @clean_text ||= html_to_text(clean_html.empty? ? html : clean_html)
end
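
The source selection in clean_text, which uses html when clean_html is empty, can be sketched in isolation. The html_to_text helper is not shown on this page, so naive_text below is a hypothetical regex-based stand-in, not RubyCrawl's implementation; pick_source is likewise an illustrative name.

```ruby
# Hypothetical stand-in for html_to_text: drops tags and collapses whitespace.
def naive_text(html)
  html.gsub(/<[^>]+>/, ' ').gsub(/\s+/, ' ').strip
end

# Mirrors the source selection in the listing: prefer clean_html,
# fall back to the full html when clean_html is empty.
def pick_source(clean_html, html)
  clean_html.empty? ? html : clean_html
end

full  = '<nav>Menu</nav><p>Body text</p>'
clean = '<p>Body text</p>'

naive_text(pick_source(clean, full))  # "Body text"
naive_text(pick_source('', full))     # "Menu Body text"
```

In the fallback case the navigation text comes through, which suggests the output is noisier whenever noise stripping produced nothing.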

#final_url ⇒ String?

The final URL after redirects.

Returns:

  • (String, nil)



# File 'lib/rubycrawl/result.rb', line 41

def final_url
  metadata['final_url']
end

#to_h ⇒ Object



# File 'lib/rubycrawl/result.rb', line 50

def to_h
  {
    raw_text:       raw_text,
    clean_text:     @clean_text,
    clean_html:     clean_html,
    html:           html,
    links:          links,
    metadata:       metadata,
    clean_markdown: @clean_markdown
  }
end
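
Because to_h reads @clean_text and @clean_markdown directly rather than calling the lazy readers, those keys appear to be nil until the corresponding method has been called at least once. A minimal sketch of that behavior, using a hypothetical Snapshot class rather than RubyCrawl::Result itself:

```ruby
# Hypothetical stand-in for illustration; not part of RubyCrawl.
class Snapshot
  def initialize(html)
    @html = html
  end

  # Lazy reader: computed on first access, then cached.
  def text
    @text ||= @html.gsub(/<[^>]+>/, '')
  end

  # Reads the instance variable directly, so it never triggers the computation.
  def to_h
    { html: @html, text: @text }
  end
end

snap = Snapshot.new('<p>Hello</p>')
before = snap.to_h  # :text is nil, nothing computed yet
snap.text           # forces the lazy value
after = snap.to_h   # :text is now "Hello"
```

Under that reading, callers who need the derived fields in the hash would call clean_text and clean_markdown before to_h.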