Class: Wgit::HTMLToText

Inherits:
Object
Includes:
Assertable
Defined in:
lib/wgit/html_to_text.rb

Overview

Class used to extract the visible page text from an HTML string. Its output is in turn used to set the return value of the Wgit::Document#text method.

Constant Summary

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::MIXED_ENUMERABLE_MSG, Assertable::NON_ENUMERABLE_MSG

Class Attribute Summary

Instance Attribute Summary

Instance Method Summary

Methods included from Assertable

#assert_arr_types, #assert_common_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(parser) ⇒ HTMLToText

Creates a new HTML to text extractor instance.

Parameters:

  • parser (Nokogiri::HTML4::Document)

    The Nokogiri parser object.

Raises:

  • (StandardError)

    If the given parser is of an invalid type.



# File 'lib/wgit/html_to_text.rb', line 102

def initialize(parser)
  assert_type(parser, Nokogiri::HTML4::Document)

  @parser = parser
end
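
The type check in the constructor can be illustrated with a minimal sketch. The helper `sketch_assert_type` below is a hypothetical stand-in written for this example, not Wgit's Assertable implementation, and `String` is used as the expected type so the snippet runs without Nokogiri installed.

```ruby
# Hypothetical stand-in for the constructor's type check (not Wgit's code).
# Returns the object when it is of the expected type, otherwise raises.
def sketch_assert_type(obj, type)
  return obj if obj.is_a?(type)

  raise StandardError, "Expected: #{type}, Actual: #{obj.class}"
end

# A value of the expected type passes straight through.
sketch_assert_type("<p>Hello</p>", String)

# Any other type is rejected with a StandardError, as documented above.
begin
  sketch_assert_type(:not_a_parser, String)
rescue StandardError => e
  puts "Rejected: #{e.message}"
end
```

In real use the argument would be a Nokogiri::HTML4::Document, as stated in the parameter list.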

Class Attribute Details

.text_elements ⇒ Object (readonly)

The HTML elements that make up the visible text on a page, stored as a Hash. These elements are used to initialize Wgit::Document#text. See the README.md for how to add to this Hash dynamically.



# File 'lib/wgit/html_to_text.rb', line 92

def text_elements
  @text_elements
end
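
As a rough illustration of adding to such a Hash, the sketch below assumes each key is an element name and each value is a display type of `:inline` or `:block` (the `display == :block` check in #extract_str suggests this, but it is an assumption here). `SKETCH_TEXT_ELEMENTS` and `with_text_element` are hypothetical names, and the element list is an illustrative subset rather than the library's real contents.

```ruby
# Illustrative subset only; the real Hash lives in lib/wgit/html_to_text.rb.
SKETCH_TEXT_ELEMENTS = {
  a: :inline,
  div: :block,
  p: :block,
  span: :inline
}.freeze

# Hypothetical helper: returns a copy of the Hash with one more element
# registered, leaving the original untouched.
def with_text_element(elements, name, display)
  unless %i[inline block].include?(display)
    raise ArgumentError, "display must be :inline or :block"
  end

  elements.merge(name.to_sym => display)
end

extended = with_text_element(SKETCH_TEXT_ELEMENTS, :table, :block)
p extended[:table]
```

The README.md remains the authority on how the real Hash should be modified.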

Instance Attribute Details

#parser ⇒ Object (readonly)

The Nokogiri::HTML document object initialized from an HTML string.



# File 'lib/wgit/html_to_text.rb', line 96

def parser
  @parser
end

Instance Method Details

#extract_arr ⇒ Array<String> Also known as: extract

Extracts and returns the text sentences from the @parser HTML.

Returns:

  • (Array<String>)

    An array of non-empty, whitespace-stripped text sentences.



# File 'lib/wgit/html_to_text.rb', line 111

def extract_arr
  return [] if @parser.to_s.empty?

  text_str = extract_str

  # Split the text_str into an Array of text sentences.
  text_str
    .split("\n")
    .map(&:strip)
    .reject(&:empty?)
end
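
The split, strip and reject chain above can be tried on a plain String without a parser. `split_sentences` is a hypothetical helper name used only for this sketch, and the sample input stands in for whatever #extract_str would return.

```ruby
# Hypothetical helper mirroring the chain at the end of #extract_arr.
def split_sentences(text_str)
  text_str
    .split("\n")       # one candidate sentence per line
    .map(&:strip)      # trim leading and trailing whitespace
    .reject(&:empty?)  # drop lines that were blank or whitespace only
end

sample = "Hello world\n  Second line  \n\n   \nThird line"
p split_sentences(sample)
# => ["Hello world", "Second line", "Third line"]
```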

#extract_str ⇒ String

Extracts and returns a text string from the @parser HTML.

Returns:

  • (String)

    A string of text with \n delimiting sentences.



# File 'lib/wgit/html_to_text.rb', line 126

def extract_str
  text_str = ""

  iterate_child_nodes(@parser) do |node, display|
    # Handle any special cases e.g. skip nodes we don't care about...
    # <pre> nodes should have their contents displayed exactly as is.
    if node_name(node) == :pre
      text_str << "\n"
      text_str << node.text
      next
    end

    # Skip any child node of <pre> since they're handled as a special case above.
    next if child_of?(:pre, node)

    if node.text?
      # Skip any text element that is purely whitespace.
      next unless valid_text_content?(node.text)
    else
      # Skip a concrete node if it has other concrete child nodes as these
      # will be iterated onto later.
      #
      # Process if node has no children or one child which is a valid text node.
      next unless node.children.empty? || parent_of_text_node_only?(node)
    end

    # Apply display rules deciding if a new line is needed before node.text.
    add_new_line = false
    prev = prev_sibling_or_parent(node)

    if node.text?
      add_new_line = true unless prev && inline?(prev)
    else
      add_new_line = true if display == :block
      add_new_line = true if prev && block?(prev)
    end

    text_str << "\n" if add_new_line
    text_str << format_text(node.text)
  end

  text_str
    .strip
    .squeeze("\n")
    .squeeze(" ")
end