Class: Wgit::HTMLToText

Inherits: Object

Object
  Wgit::HTMLToText

Includes: Assertable

Defined in: lib/wgit/html_to_text.rb

Overview

Class used to extract the visible page text from an HTML string. The extracted text is in turn used as the output of Wgit::Document#text.

Constant Summary

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::MIXED_ENUMERABLE_MSG, Assertable::NON_ENUMERABLE_MSG

Class Attribute Summary

.text_elements ⇒ Object readonly
Set of HTML elements that make up the visible text on a page.

Instance Attribute Summary

#parser ⇒ Object readonly
The Nokogiri::HTML document object initialized from an HTML string.

Instance Method Summary

#extract_arr ⇒ Array<String> (also: #extract)
Extracts and returns the text sentences from the @parser HTML.
#extract_str ⇒ String
Extracts and returns a text string from the @parser HTML.
#initialize(parser) ⇒ HTMLToText constructor
Creates a new HTML to text extractor instance.

Methods included from Assertable

#assert_arr_types, #assert_common_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(parser) ⇒ `HTMLToText`

Creates a new HTML to text extractor instance.

Parameters:

parser (Nokogiri::HTML4::Document) —
The nokogiri parser object.

Raises:

(StandardError) —
If the given parser is of an invalid type.

# File 'lib/wgit/html_to_text.rb', line 102

def initialize(parser)
  assert_type(parser, Nokogiri::HTML4::Document)

  @parser = parser
end
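
The type guard in the constructor can be illustrated without the nokogiri or wgit gems installed. The following is only a sketch: StubDocument and SketchExtractor are hypothetical stand-ins (not part of Wgit), and the error message is an assumption rather than the text Assertable actually produces.

```ruby
# Hypothetical stand-in for Nokogiri::HTML4::Document, so the sketch
# runs without the nokogiri gem.
class StubDocument; end

# Hypothetical extractor that mirrors the constructor's type guard.
class SketchExtractor
  attr_reader :parser

  def initialize(parser)
    # Assumption: like assert_type, reject anything that is not the
    # expected document class by raising a StandardError.
    unless parser.is_a?(StubDocument)
      raise StandardError, "Expected StubDocument, got #{parser.class}"
    end

    @parser = parser
  end
end

# A parsed document object is accepted and exposed via #parser.
extractor = SketchExtractor.new(StubDocument.new)

# A raw HTML String is rejected; it must be parsed first.
begin
  SketchExtractor.new('<p>raw HTML</p>')
rescue StandardError => e
  warn "rejected: #{e.message}"
end
```

With the real class, the equivalent point is that a raw HTML String is not a valid argument; the caller passes an already parsed Nokogiri::HTML4::Document.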

Class Attribute Details

.text_elements ⇒ `Object` (readonly)

Set of HTML elements that make up the visible text on a page. These elements are used to build the text returned by Wgit::Document#text. See the README.md for how to add to this Hash dynamically.



# File 'lib/wgit/html_to_text.rb', line 92

def text_elements
  @text_elements
end

Instance Attribute Details

#parser ⇒ `Object` (readonly)

The Nokogiri::HTML document object initialized from an HTML string.



# File 'lib/wgit/html_to_text.rb', line 96

def parser
  @parser
end

Instance Method Details

#extract_arr ⇒ `Array<String>` Also known as: extract

Extracts and returns the text sentences from the @parser HTML.

Returns:

(Array<String>) —
An array of unique text sentences.

# File 'lib/wgit/html_to_text.rb', line 111

def extract_arr
  return [] if @parser.to_s.empty?

  text_str = extract_str

  # Split the text_str into an Array of text sentences.
  text_str
    .split("\n")
    .map(&:strip)
    .reject(&:empty?)
end
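
The splitting step above can be tried in isolation on a plain String. This is a minimal sketch: split_text_lines is a hypothetical helper (not part of Wgit) and the input string is made up for illustration.

```ruby
# Hypothetical helper mirroring the split/strip/reject chain used by
# #extract_arr, applied to an arbitrary String.
def split_text_lines(text_str)
  text_str
    .split("\n")      # one candidate entry per line
    .map(&:strip)     # trim leading and trailing whitespace
    .reject(&:empty?) # drop lines that are blank after trimming
end

lines = split_text_lines("Hello world\n\n   Second line  \n")
puts lines.inspect # prints ["Hello world", "Second line"]
```

Note that this chain on its own keeps repeated lines, so any de-duplication would have to happen elsewhere.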

#extract_str ⇒ `String`

Extracts and returns a text string from the @parser HTML.