Class: Wgit::Document

Inherits: Object

Includes: Assertable

Defined in: lib/wgit/document.rb

Overview

Class modeling/serialising an HTML web document, although other MIME types (e.g. images) will also work. Also doubles as a search result when loading Documents from the database via Wgit::Database#search.

The initialize method dynamically initializes instance variables from the Document's HTML or database object, e.g. text. This step is dynamic so that the Document class can be easily extended, allowing you to extract the parts of a webpage that are important to you. See Wgit::Document.define_extractor.

Constant Summary

REGEX_EXTRACTOR_NAME = /[a-z0-9_]+/.freeze
Regex for the allowed var names when defining an extractor.

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::NON_ENUMERABLE_MSG

Class Attribute Summary

.extractors ⇒ Object readonly
Set of Symbols representing the defined Document extractors.
.text_elements ⇒ Object readonly
Set of HTML elements that make up the visible text on a page.

Instance Attribute Summary

#html ⇒ Object (also: #content) readonly
The content/HTML of the document, an instance of String.
#parser ⇒ Object readonly
The Nokogiri::HTML document object initialized from @html.
#score ⇒ Object readonly
The score is only used following a Database#search and records matches.
#url ⇒ Object readonly
The URL of the webpage, an instance of Wgit::Url.

Class Method Summary

.define_extractor(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol
Defines a content extractor, which extracts HTML elements/content into instance variables upon Document initialization.
.remove_extractor(var) ⇒ Boolean
Removes the init_* methods created when an extractor is defined.
.remove_extractors ⇒ Object
Removes all default and defined extractors by calling Document.remove_extractor underneath.
.text_elements_xpath ⇒ String
Uses Document.text_elements to build an xpath String, used to obtain all of the combined visual text on a webpage.

Instance Method Summary

#==(other) ⇒ Boolean
Determines if both the url and html match.
#[](range) ⇒ String
Shortcut for calling Document#html[range].
#at_css(selector) ⇒ Nokogiri::XML::Element
Uses Nokogiri's at_css method to search the doc's html and return the result.
#at_xpath(xpath) ⇒ Nokogiri::XML::Element
Uses Nokogiri's at_xpath method to search the doc's html and return the result.
#base_url(link: nil) ⇒ Wgit::Url
Returns the base URL of this Wgit::Document.
#css(selector) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's css method to search the doc's html and return the results.
#empty? ⇒ Boolean
Determine if this Document's HTML is empty or not.
#external_links ⇒ Array<Wgit::Url> (also: #external_urls)
Returns all unique external links from this Document in absolute form.
#extract(xpath, singleton: true, text_content_only: true) {|value, source, type| ... } ⇒ String, Object
Extracts a value/object from this Document's @html using the given xpath parameter.
#extract_from_html(xpath, singleton: true, text_content_only: true) {|value, source, type| ... } ⇒ String, Object protected
Extracts a value/object from this Document's @html using the given xpath parameter.
#extract_from_object(obj, key, singleton: true) {|value, source, type| ... } ⇒ String, Object protected
Returns a value from the obj using the given key via obj#fetch.
#init_nokogiri {|config| ... } ⇒ Nokogiri::HTML protected
Initializes the nokogiri object using @html, which cannot be nil.
#initialize(url_or_obj, html = '', encode: true) ⇒ Document constructor
Initialize takes either two strings (representing the URL and HTML) or an object representing a database record (of an HTTP crawled web page).
#inspect ⇒ String
Overrides Object#inspect to shorten the printed output of a Document.
#internal_absolute_links ⇒ Array<Wgit::Url> (also: #internal_absolute_urls)
Returns all unique internal links from this Document in absolute form by appending them to self's #base_url.
#internal_links ⇒ Array<Wgit::Url> (also: #internal_urls)
Returns all unique internal links from this Document in relative form.
#search(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ Array<String>
Searches the @text for the given query and returns the results.
#search!(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ String
Performs a text search (see Document#search for details) but assigns the results to the @text instance variable.
#size ⇒ Integer
Determine the size of this Document's HTML.
#stats ⇒ Hash (also: #statistics)
Returns a Hash containing this Document's instance variables and their #length (if they respond to it).
#to_h(include_html: false, include_score: true) ⇒ Hash
Returns a Hash containing this Document's instance vars.
#to_json(include_html: false) ⇒ String
Converts this Document's #to_h return value to a JSON String.
#xpath(xpath) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's xpath method to search the doc's html and return the results.

Methods included from Assertable

#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(url_or_obj, html = '', encode: true) ⇒ `Document`

Initialize takes either two strings (representing the URL and HTML) or an object representing a database record (of an HTTP crawled web page). This allows for initialisation from both crawled web pages and documents/web pages retrieved from the database.

During initialisation, the Document will call any private init_*_from_html and init_*_from_object methods it can find. See the Wgit::Document.define_extractor method for more details.

Parameters:

url_or_obj (String, Wgit::Url, #fetch) —
Either a String representing a URL or a Hash-like object responding to :fetch, e.g. a MongoDB collection object. The object's :fetch method should support Strings as keys.
html (String, NilClass) (defaults to: '') —
The crawled web page's content/HTML. This param is only used if url_or_obj is a String representing the web page's URL. Otherwise, the HTML comes from the database object. An html value of nil defaults to an empty String.
encode (Boolean) (defaults to: true) —
Whether or not to UTF-8 encode the html. Set to false if the Document content is an image etc.

# File 'lib/wgit/document.rb', line 77

def initialize(url_or_obj, html = '', encode: true)
  if url_or_obj.is_a?(String)
    init_from_strings(url_or_obj, html, encode: encode)
  else
    init_from_object(url_or_obj, encode: encode)
  end
end
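
The dispatch described above can be illustrated with a minimal, self-contained sketch. It is not the Wgit implementation: DispatchSketch, call_init_methods and the regex-based title lookup are hypothetical stand-ins, and a plain Hash plays the role of the database record.

```ruby
# Hypothetical stand-in for illustration only; not the Wgit implementation.
class DispatchSketch
  attr_reader :url, :html, :title

  def initialize(url_or_obj, html = '')
    if url_or_obj.is_a?(String)
      @url  = url_or_obj
      @html = html || '' # A nil html defaults to an empty String.
      call_init_methods(/\Ainit_.*_from_html\z/)
    else
      @url  = url_or_obj.fetch('url')
      @html = url_or_obj.fetch('html', '')
      call_init_methods(/\Ainit_.*_from_object\z/, url_or_obj)
    end
  end

  private

  # Calls every private method of this class whose name matches the pattern.
  def call_init_methods(pattern, *args)
    self.class.private_instance_methods(false).grep(pattern).each do |name|
      send(name, *args)
    end
  end

  # A Regexp stands in for a real HTML parser here (an assumption).
  def init_title_from_html
    @title = @html[%r{<title>(.*?)</title>}m, 1]
  end

  def init_title_from_object(obj)
    @title = obj.fetch('title', nil)
  end
end

crawled = DispatchSketch.new('http://example.com', '<title>Hi</title>')
stored  = DispatchSketch.new({ 'url' => 'http://example.com', 'title' => 'Hi' })
```

Both objects end up with the same title, although one was built from strings and the other from a record responding to #fetch.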

Class Attribute Details

.extractors ⇒ `Object` (readonly)

Set of Symbols representing the defined Document extractors. Is read-only. Use Wgit::Document.define_extractor for a new extractor.



# File 'lib/wgit/document.rb', line 43

def extractors
  @extractors
end

.text_elements ⇒ `Object` (readonly)

Set of HTML elements that make up the visible text on a page. These elements are used to initialize the Wgit::Document#text. See the README.md for how to add to this Set dynamically.



# File 'lib/wgit/document.rb', line 39

def text_elements
  @text_elements
end

Instance Attribute Details

#html ⇒ `Object` (readonly) Also known as: content

The content/HTML of the document, an instance of String.



# File 'lib/wgit/document.rb', line 50

def html
  @html
end

#parser ⇒ `Object` (readonly)

The Nokogiri::HTML document object initialized from @html.



# File 'lib/wgit/document.rb', line 53

def parser
  @parser
end

#score ⇒ `Object` (readonly)

The score is only used following a Database#search and records matches.



# File 'lib/wgit/document.rb', line 56

def score
  @score
end

#url ⇒ `Object` (readonly)

The URL of the webpage, an instance of Wgit::Url.



# File 'lib/wgit/document.rb', line 47

def url
  @url
end

Class Method Details

.define_extractor(var, xpath, opts = {}) {|value, source, type| ... } ⇒ `Symbol`

Defines a content extractor, which extracts HTML elements/content into instance variables upon Document initialization. See the default extractors defined in 'document_extractors.rb' as examples. Defining an extractor means that every subsequently crawled/initialized document will attempt to extract the xpath's content. Use #extract for a one-off content extraction on any document.

Note that defined extractors work for both Documents initialized from HTML (via Wgit::Crawler methods) and from database objects. Once defined, an extractor initializes a private instance variable with the xpath or database object result(s).

When initialising from HTML, a singleton value of true will only ever return the first result found; otherwise all the results are returned in an Enumerable. When initialising from a database object, the value is taken as is and singleton is only used to define the default empty value. If a value cannot be found (in either the HTML or database object), then a default will be used. The default value is: singleton ? nil : [].

Parameters:

var (Symbol) —
The name of the variable to be initialised, that will contain the extracted content. A getter and setter method is defined for the initialised variable.
xpath (String, #call) —
The xpath used to find the element(s) of the webpage. Only used when initializing from HTML.

Pass a callable object (proc etc.) if you want the xpath value to be derived on Document initialisation (instead of when the extractor is defined). The call method must return a valid xpath String.
opts (Hash) (defaults to: {}) —
The options to define an extractor with. The options are only used when initializing from HTML, not the database.

Options Hash (opts):

:singleton (Boolean) —
The singleton option determines whether or not the result(s) should be in an Enumerable. If multiple results are found and singleton is true then the first result will be used. Defaults to true.
:text_content_only (Boolean) —
The text_content_only option if true will use the text #content of the Nokogiri result object, otherwise the Nokogiri object itself is returned. The type of Nokogiri object returned depends on the given xpath query. See the Nokogiri documentation for more information. Defaults to true.

Yields:

The block is executed when a Wgit::Document is initialized, regardless of the source. Use it (optionally) to process the result value.

Yield Parameters:

value (Object) —
The result value to be assigned to the new var.
source (Wgit::Document, Object) —
The source of the value.
type (Symbol) —
The source type, either :document or (DB) :object.

Yield Returns:

(Object) —
The return value of the block becomes the new var's value. Return the block's value param unchanged if you simply want to inspect it.

Returns:

(Symbol) —
The given var Symbol if successful.

Raises:

(StandardError) —
If the var param isn't valid.

# File 'lib/wgit/document.rb', line 151

def self.define_extractor(var, xpath, opts = {}, &block)
  var = var.to_sym
  defaults = { singleton: true, text_content_only: true }
  opts = defaults.merge(opts)

  raise "var must match #{REGEX_EXTRACTOR_NAME}" unless \
  var =~ REGEX_EXTRACTOR_NAME

  # Define the private init_*_from_html method for HTML.
  # Gets the HTML's xpath value and creates a var for it.
  func_name = Document.send(:define_method, "init_#{var}_from_html") do
    result = extract_from_html(xpath, **opts, &block)
    init_var(var, result)
  end
  Document.send(:private, func_name)

  # Define the private init_*_from_object method for a Database object.
  # Gets the Object's 'key' value and creates a var for it.
  func_name = Document.send(
    :define_method, "init_#{var}_from_object"
  ) do |obj|
    result = extract_from_object(
      obj, var.to_s, singleton: opts[:singleton], &block
    )
    init_var(var, result)
  end
  Document.send(:private, func_name)

  @extractors << var
  var
end
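
The pattern used above (a class-level macro that defines a private init method, registers the variable name and applies an optional block) can be tried in isolation. The following is a rough sketch under stated assumptions: ExtractorSketch is hypothetical, a Regexp stands in for the xpath, and only the HTML path is modelled.

```ruby
require 'set'

# Hypothetical miniature of the pattern; a Regexp stands in for the xpath.
class ExtractorSketch
  @extractors = Set.new

  class << self
    attr_reader :extractors

    def define_extractor(var, pattern, singleton: true, &block)
      var = var.to_sym
      raise "invalid extractor name: #{var}" unless var.match?(/\A[a-z0-9_]+\z/)

      # Define a private init method that runs when an instance is created.
      method_name = define_method("init_#{var}_from_html") do
        matches = @html.scan(pattern).flatten
        result  = singleton ? matches.first : matches
        result  = block.call(result, self, :document) if block
        instance_variable_set("@#{var}", result)
      end
      private method_name

      attr_reader var
      @extractors << var
      var
    end
  end

  def initialize(html)
    @html = html
    self.class.extractors.each { |var| send("init_#{var}_from_html") }
  end
end

ExtractorSketch.define_extractor(:headings, %r{<h1>(.*?)</h1>}, singleton: false)
ExtractorSketch.define_extractor(:title, %r{<title>(.*?)</title>}) { |v| v&.upcase }

doc = ExtractorSketch.new('<title>Home</title><h1>A</h1><h1>B</h1>')
```

Here doc.headings holds every match because singleton is false, while doc.title holds only the first match after the block has processed it.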

.remove_extractor(var) ⇒ `Boolean`

Removes the init_* methods created when an extractor is defined. Therefore, this is the opposing method to Document.define_extractor. Returns true if successful or false if the method(s) cannot be found.

Parameters:

var (Symbol) —
The extractor variable to remove.

Returns:

(Boolean) —
True if the extractor var was found and removed; otherwise false.

# File 'lib/wgit/document.rb', line 190

def self.remove_extractor(var)
  Document.send(:remove_method, "init_#{var}_from_html")
  Document.send(:remove_method, "init_#{var}_from_object")

  @extractors.delete(var.to_sym)

  true
rescue NameError
  false
end
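
The true/false return value relies on remove_method raising a NameError when the method does not exist. A small sketch of that behaviour, using a hypothetical RemovableSketch class rather than Wgit::Document:

```ruby
# Hypothetical class used only to show the remove_method / NameError behaviour.
class RemovableSketch
  def init_title_from_html; end

  def self.remove_init(var)
    remove_method("init_#{var}_from_html")
    true
  rescue NameError
    false
  end
end

first  = RemovableSketch.remove_init(:title) # The method exists, so it is removed.
second = RemovableSketch.remove_init(:title) # Already gone, so the NameError is rescued.
```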

.remove_extractors ⇒ `Object`

Removes all default and defined extractors by calling Document.remove_extractor underneath. See its documentation.



# File 'lib/wgit/document.rb', line 203

def self.remove_extractors
  @extractors.each { |var| remove_extractor(var) }
end

.text_elements_xpath ⇒ `String`

Uses Document.text_elements to build an xpath String, used to obtain all of the combined visual text on a webpage.

Returns:

(String) —
An xpath String to obtain a webpage's text elements.

# File 'lib/wgit/document.rb', line 91

def self.text_elements_xpath
  Wgit::Document.text_elements.each_with_index.reduce('') do |xpath, (el, i)|
    xpath += ' | ' unless i.zero?
    xpath += format('//%s/text()', el)
  end
end
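
To see the shape of the String this produces, here is an equivalent standalone sketch. The three element names are an assumed sample; the real Set comes from Wgit::Document.text_elements.

```ruby
require 'set'

# Assumed sample elements; the real Set lives in Wgit::Document.text_elements.
text_elements = Set.new(%i[p h1 li])

# Build one '//el/text()' expression per element and union them with ' | '.
xpath = text_elements.map { |el| format('//%s/text()', el) }.join(' | ')
```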

Instance Method Details

#==(other) ⇒ `Boolean`

Determines if both the url and html match. Use doc.object_id == other.object_id for exact object comparison.

Parameters:

other (Wgit::Document) —
To compare self against.

Returns:

(Boolean) —
True if @url and @html are equal, false if not.

# File 'lib/wgit/document.rb', line 221

def ==(other)
  return false unless other.is_a?(Wgit::Document)

  (@url == other.url) && (@html == other.html)
end

#[](range) ⇒ `String`

Shortcut for calling Document#html[range].

Parameters:

range (Range) —
The range of @html to return.

Returns:

(String) —
The given range of @html.



# File 'lib/wgit/document.rb', line 231

def [](range)
  @html[range]
end

#at_css(selector) ⇒ `Nokogiri::XML::Element`

Uses Nokogiri's at_css method to search the doc's html and return the result. Use #css for returning several results.

Parameters:

selector (String) —
The CSS selector to search the @html with.

Returns:

(Nokogiri::XML::Element) —
The result of the CSS search.



# File 'lib/wgit/document.rb', line 381

def at_css(selector)
  @parser.at_css(selector)
end

#at_xpath(xpath) ⇒ `Nokogiri::XML::Element`

Uses Nokogiri's at_xpath method to search the doc's html and return the result. Use #xpath for returning several results.

Parameters:

xpath (String) —
The xpath to search the @html with.

Returns:

(Nokogiri::XML::Element) —
The result of the xpath search.



# File 'lib/wgit/document.rb', line 363

def at_xpath(xpath)
  @parser.at_xpath(xpath)
end

#base_url(link: nil) ⇒ `Wgit::Url`

Returns the base URL of this Wgit::Document. The base URL is either the <base> element's href value or @url (if @base is nil). If @base is present and relative, then @url.to_origin + @base is returned. Use this method instead of doc.url.to_origin etc. when manually building absolute links from relative links; alternatively, use link.make_absolute(doc).

Provide the link: parameter to get the correct base URL for that type of link. For example, a link of #top would always return @url because it applies to that page, not a different one. Query strings work in the same way. Use this parameter if manually concatenating Urls, e.g.

relative_link = Wgit::Url.new('?q=hello')
absolute_link = doc.base_url(link: relative_link).concat(relative_link)

This is similar to how Wgit::Document#internal_absolute_links works.

Parameters:

link (Wgit::Url, String) (defaults to: nil) —
The link to obtain the correct base URL for; must be relative, not absolute.

Returns:

(Wgit::Url) —
The base URL of this Document e.g. 'http://example.com/public'.

Raises:

(StandardError) —
If link is absolute or if a base URL can't be established e.g. the doc @url is relative and <base> is nil.

# File 'lib/wgit/document.rb', line 257

def base_url(link: nil)
  if @url.relative? && @base.nil?
    raise "Document @url ('#{@url}') cannot be relative if <base> is nil"
  end

  if @url.relative? && @base&.relative?
    raise "Document @url ('#{@url}') and <base> ('#{@base}') both can't \
be relative"
  end

  get_base = -> { @base.relative? ? @url.to_origin.concat(@base) : @base }

  if link
    link = Wgit::Url.new(link)
    raise "link must be relative: #{link}" unless link.relative?

    if link.is_fragment? || link.is_query?
      base_url = @base ? get_base.call : @url
      return base_url.omit_fragment.omit_query
    end
  end

  base_url = @base ? get_base.call : @url.to_origin
  base_url.omit_fragment.omit_query
end
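
The rule for fragment and query links can be approximated with the standard library alone. This sketch assumes a page without a <base> element and ignores ports; base_for is a hypothetical helper built on URI, not part of Wgit.

```ruby
require 'uri'

# Hypothetical helper mirroring the rule above for a page with no <base> element.
def base_for(page_url, link)
  raise ArgumentError, "link must be relative: #{link}" if URI.parse(link).host

  page = URI.parse(page_url)
  page.fragment = nil
  page.query    = nil

  # Fragment and query links apply to the current page itself.
  return page.to_s if link.start_with?('#', '?')

  # Any other relative link is resolved against the origin (port ignored here).
  "#{page.scheme}://#{page.host}"
end

page = 'http://example.com/public/page?x=1#top'
query_base = base_for(page, '?q=hello')
path_base  = base_for(page, 'about')
```

The query link keeps the page's path as its base, whereas the plain relative link falls back to the origin.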

#css(selector) ⇒ `Nokogiri::XML::NodeSet`

Uses Nokogiri's css method to search the doc's html and return the results. Use #at_css for returning the first result only.

Parameters:

selector (String) —
The CSS selector to search the @html with.

Returns:

(Nokogiri::XML::NodeSet) —
The result set of the CSS search.



# File 'lib/wgit/document.rb', line 372

def css(selector)
  @parser.css(selector)
end

#empty? ⇒ `Boolean`

Determine if this Document's HTML is empty or not.

Returns:

(Boolean) —
True if @html is nil/empty, false otherwise.

# File 'lib/wgit/document.rb', line 343

def empty?
  return true if @html.nil?

  @html.empty?
end

#external_links ⇒ `Array<Wgit::Url>` Also known as: external_urls

Returns all unique external links from this Document in absolute form. External meaning a link to a different host.

Returns:

(Array<Wgit::Url>) —
Self's unique external Urls in absolute form.

# File 'lib/wgit/document.rb', line 422

def external_links
  return [] if @links.empty?

  links = @links
          .map do |link|
            if link.scheme_relative?
              link.prefix_scheme(@url.to_scheme.to_sym)
            else
              link
            end
          end
          .reject { |link| link.relative?(host: @url.to_origin) }
          .map(&:omit_trailing_slash)

  Wgit::Utils.sanitize(links)
end

#extract(xpath, singleton: true, text_content_only: true) {|value, source, type| ... } ⇒ `String`, `Object`

Extracts a value/object from this Document's @html using the given xpath parameter.

Parameters:

xpath (String, #call) —
Used to find the value/object in @html.
singleton (Boolean) (defaults to: true) —
singleton ? results.first (single Object) : results (Enumerable).
text_content_only (Boolean) (defaults to: true) —
text_content_only ? result.content (String) : result (Nokogiri Object).

Yields:

(Optionally) —
Pass a block to read/write the result value before it's returned.

Yield Parameters:

value (Object) —
The result value to be returned.
source (Wgit::Document, Object) —
This Document instance.
type (Symbol) —
The source type, which is :document.

Yield Returns:

(Object) —
The return value of the block gets returned. Return the block's value param unchanged if you simply want to inspect it.

Returns:

(String, Object) —
The value found in the html or the default value (singleton ? nil : []).

# File 'lib/wgit/document.rb', line 535

def extract(xpath, singleton: true, text_content_only: true, &block)
  send(
    :extract_from_html, xpath,
    singleton: singleton, text_content_only: text_content_only,
    &block
  )
end

#extract_from_html(xpath, singleton: true, text_content_only: true) {|value, source, type| ... } ⇒ `String`, `Object` (protected)

Extracts a value/object from this Document's @html using the given xpath parameter.

Parameters:

xpath (String, #call) —
Used to find the value/object in @html.
singleton (Boolean) (defaults to: true) —
singleton ? results.first (single Object) : results (Enumerable).
text_content_only (Boolean) (defaults to: true) —
text_content_only ? result.content (String) : result (Nokogiri Object).

Yields:

(Optionally) —
Pass a block to read/write the result value before it's returned.

Yield Parameters:

value (Object) —
The result value to be returned.
source (Wgit::Document, Object) —
This Document instance.
type (Symbol) —
The source type, which is :document.

Yield Returns:

(Object) —
The return value of the block gets returned. Return the block's value param unchanged if you simply want to inspect it.

Returns:

(String, Object) —
The value found in the html or the default value (singleton ? nil : []).

# File 'lib/wgit/document.rb', line 576

def extract_from_html(xpath, singleton: true, text_content_only: true)
  xpath  = xpath.call if xpath.respond_to?(:call)
  result = singleton ? at_xpath(xpath) : xpath(xpath)

  if result && text_content_only
    result = singleton ? result.content : result.map(&:content)
  end

  Wgit::Utils.sanitize(result)
  result = yield(result, self, :document) if block_given?
  result
end

#extract_from_object(obj, key, singleton: true) {|value, source, type| ... } ⇒ `String`, `Object` (protected)

Returns a value from the obj using the given key via obj#fetch.

Parameters:

obj (#fetch) —
The object containing the key/value.
key (String) —
Used to find the value in the obj.
singleton (Boolean) (defaults to: true) —
True if a single value, false otherwise.

Yields:

The block is executed when a Wgit::Document is initialized, regardless of the source. Use it (optionally) to process the result value.

Yield Parameters:

value (Object) —
The result value to be returned.
source (Wgit::Document, Object) —
The source of the value.
type (Symbol) —
The source type, either :document or (DB) :object.

Yield Returns:

(Object) —
The return value of the block gets returned. Return the block's value param unchanged if you simply want to inspect it.

Returns:

(String, Object) —
The value found in the obj or the default value (singleton ? nil : []).

# File 'lib/wgit/document.rb', line 605

def extract_from_object(obj, key, singleton: true)
  assert_respond_to(obj, :fetch)

  default = singleton ? nil : []
  result  = obj.fetch(key.to_s, default)

  Wgit::Utils.sanitize(result)
  result = yield(result, obj, :object) if block_given?
  result
end
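
The default-value behaviour is easy to check with a plain Hash standing in for a database record. The fetch_value helper below is a hypothetical reduction of the method above, without the sanitising step.

```ruby
# A plain Hash stands in for a database record here (an assumption).
record = { 'title' => 'Home', 'links' => ['/about'] }

# Hypothetical reduction of the method above, without sanitising.
def fetch_value(obj, key, singleton: true)
  raise ArgumentError, 'obj must respond to #fetch' unless obj.respond_to?(:fetch)

  default = singleton ? nil : []
  result  = obj.fetch(key.to_s, default)
  result  = yield(result, obj, :object) if block_given?
  result
end

title    = fetch_value(record, :title)
keywords = fetch_value(record, :keywords, singleton: false)
shouted  = fetch_value(record, :title) { |value, _source, _type| value.upcase }
```

A missing key yields nil for a singleton and an empty Array otherwise, and the block (when given) has the final say on the returned value.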

#init_nokogiri {|config| ... } ⇒ `Nokogiri::HTML` (protected)

Initializes the nokogiri object using @html, which cannot be nil. Override this method to custom configure the Nokogiri object returned. Gets called from Wgit::Document.new upon initialization.

Yields:

(config) —
The given block is passed to Nokogiri::HTML for initialisation.

Returns:

(Nokogiri::HTML) —
The initialised Nokogiri HTML object.

Raises:

(StandardError) —
If @html isn't set.

# File 'lib/wgit/document.rb', line 553

def init_nokogiri(&block)
  raise '@html must be set' unless @html

  Nokogiri::HTML(@html, &block)
end

#inspect ⇒ `String`

Overrides Object#inspect to shorten the printed output of a Document.

Returns:

(String) —
A short textual representation of this Document.



# File 'lib/wgit/document.rb', line 212

def inspect
  "#<Wgit::Document url=\"#{@url}\" html=#{size} bytes>"
end

#internal_absolute_links ⇒ `Array<Wgit::Url>` Also known as: internal_absolute_urls

Returns all unique internal links from this Document in absolute form by appending them to self's #base_url. Also see Wgit::Document#internal_links.

Returns:

(Array<Wgit::Url>) —
Self's unique internal Urls in absolute form.



# File 'lib/wgit/document.rb', line 414

def internal_absolute_links
  internal_links.map { |link| link.make_absolute(self) }
end

#internal_links ⇒ `Array<Wgit::Url>` Also known as: internal_urls

Returns all unique internal links from this Document in relative form. Internal meaning a link to another document on the same host.

This Document's host is used to determine if an absolute URL is actually a relative link, e.g. for a Document representing http://www.server.com/about, an absolute link to another page on the same host will be recognized and returned as an internal link because both Documents live on the same host. Also see Wgit::Document#internal_absolute_links.

Returns:

(Array<Wgit::Url>) —
Self's unique internal Urls in relative form.

# File 'lib/wgit/document.rb', line 396

def internal_links
  return [] if @links.empty?

  links = @links
          .select { |link| link.relative?(host: @url.to_origin) }
          .map(&:omit_base)
          .map do |link| # Map @url.to_host into / as it's a duplicate.
    link.to_host == @url.to_host ? Wgit::Url.new('/') : link
  end

  Wgit::Utils.sanitize(links)
end
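
The host comparison that separates internal from external links can be sketched with the standard library's URI class. partition_links is a hypothetical helper; unlike the methods above it leaves links in their original form rather than converting them to relative or absolute Urls.

```ruby
require 'uri'

# Hypothetical helper: splits unique links into internal and external by host.
def partition_links(page_url, links)
  page = URI.parse(page_url)

  links.uniq.partition do |link|
    uri = URI.parse(link)
    # A link without a host is relative; a matching host is also internal.
    uri.host.nil? || uri.host == page.host
  end
end

internal, external = partition_links(
  'http://www.server.com/about',
  ['/contact', 'http://www.server.com/search', 'http://other.com/', '/contact']
)
```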

#search(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ `Array<String>`

Searches the @text for the given query and returns the results.

The number of search hits for each sentence is recorded internally and used to rank/sort the search results before they are returned. Whereas the Wgit::Database#search method searches all documents for the most hits, this method searches each document's @text for the most hits.

Each search result comprises a sentence of a given length. The length is based on the sentence_limit parameter or the full length of the original sentence, whichever is less. The algorithm ensures that the search query is visible somewhere in the sentence.

Parameters:

query (Regexp, #to_s) —
The regex or text value to search the document's @text for.
case_sensitive (Boolean) (defaults to: false) —
Whether character case must match.
whole_sentence (Boolean) (defaults to: true) —
Whether the query is matched as a whole phrase; if false, each word of the query is searched for separately.
sentence_limit (Integer) (defaults to: 80) —
The max length of each search result sentence; must be an even number.

Returns:

(Array<String>) —
A subset of @text, matching the query.

# File 'lib/wgit/document.rb', line 459

def search(
  query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
)
  raise 'The sentence_limit value must be even' if sentence_limit.odd?

  if query.is_a?(Regexp)
    regex = query
  else # query.respond_to? :to_s == true
    query = query.to_s
    query = query.gsub(' ', '|') unless whole_sentence
    regex = Regexp.new(query, !case_sensitive)
  end

  results = {}

  @text.each do |sentence|
    sentence = sentence.strip
    next if results[sentence]

    hits = sentence.scan(regex).count
    next unless hits.positive?

    index = sentence.index(regex) # Index of first match.
    Wgit::Utils.format_sentence_length(sentence, index, sentence_limit)

    results[sentence] = hits
  end

  return [] if results.empty?

  results = Hash[results.sort_by { |_k, v| v }]
  results.keys.reverse
end
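
The ranking idea (count the hits per sentence, drop sentences without hits, sort by hit count descending) can be sketched without Wgit. This simplified version is an illustration only: rank_sentences is hypothetical, it performs no sentence truncation, and unlike the method above it escapes the query before building the Regexp.

```ruby
# Hypothetical, simplified ranking: no sentence truncation is performed.
def rank_sentences(sentences, query, case_sensitive: false, whole_sentence: true)
  words   = whole_sentence ? [query] : query.split
  pattern = words.map { |word| Regexp.escape(word) }.join('|')
  regex   = Regexp.new(pattern, case_sensitive ? 0 : Regexp::IGNORECASE)

  sentences.map(&:strip).uniq
           .map { |sentence| [sentence, sentence.scan(regex).count] }
           .select { |_sentence, hits| hits.positive? }
           .sort_by { |_sentence, hits| -hits }
           .map(&:first)
end

text = ['Ruby is fun.', 'I like Ruby because Ruby is expressive.', 'Python too.']
results = rank_sentences(text, 'ruby')
```

The sentence mentioning the query twice is ranked first. Passing whole_sentence: false with a query such as 'python ruby' would match either word instead of the whole phrase.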

#search!(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ `String`

Performs a text search (see Document#search for details) but assigns the results to the @text instance variable. This can be used for sub search functionality. The original text is returned; no other reference to it is kept thereafter.

Parameters:

query (Regexp, #to_s) —
The regex or text value to search the document's @text for.
case_sensitive (Boolean) (defaults to: false) —
Whether character case must match.
whole_sentence (Boolean) (defaults to: true) —
Whether the query is matched as a whole phrase; if false, each word of the query is searched for separately.
sentence_limit (Integer) (defaults to: 80) —
The max length of each search result sentence; must be an even number.

Returns:

(String) —
This Document's original @text value.

# File 'lib/wgit/document.rb', line 506

def search!(
  query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
)
  orig_text = @text
  @text = search(
    query, case_sensitive: case_sensitive,
           whole_sentence: whole_sentence, sentence_limit: sentence_limit
  )

  orig_text
end

#size ⇒ `Integer`

Determine the size of this Document's HTML.

Returns:

(Integer) —
The total number of @html bytes.



# File 'lib/wgit/document.rb', line 336

def size
  stats[:html]
end

#stats ⇒ `Hash` Also known as: statistics

Returns a Hash containing this Document's instance variables and their #length (if they respond to it). Works dynamically so that any user defined extractors (and their created instance vars) will appear in the returned Hash as well. The number of text snippets as well as total number of textual bytes are always included in the returned Hash.

Returns:

(Hash) —
Containing self's HTML page statistics.

# File 'lib/wgit/document.rb', line 315

def stats
  hash = {}
  instance_variables.each do |var|
    # Add up the total bytes of text as well as the length.
    if var == :@text
      hash[:text]       = @text.length
      hash[:text_bytes] = @text.sum(&:length)
    # Else take the var's #length method return value.
    else
      next unless instance_variable_get(var).respond_to?(:length)

      hash[var[1..-1].to_sym] = instance_variable_get(var).send(:length)
    end
  end

  hash
end
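
A standalone sketch of the same reflection technique follows. StatsSketch and its three instance variables are made up for illustration; the point is that any variable responding to #length is picked up automatically.

```ruby
# Hypothetical object with a few instance variables of differing types.
class StatsSketch
  def initialize
    @html  = '<p>Hello</p>'
    @text  = %w[Hello world]
    @score = 1.5 # Floats have no #length, so this variable is skipped.
  end

  def stats
    instance_variables.each_with_object({}) do |var, hash|
      value = instance_variable_get(var)
      next unless value.respond_to?(:length)

      hash[var.to_s.delete_prefix('@').to_sym] = value.length
      # For the text snippets, also record their combined length.
      hash[:text_bytes] = value.sum(&:length) if var == :@text
    end
  end
end

stats = StatsSketch.new.stats
```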

#to_h(include_html: false, include_score: true) ⇒ `Hash`

Returns a Hash containing this Document's instance vars. Used when storing the Document in a Database e.g. MongoDB etc. By default the @html var is excluded from the returned Hash.

Parameters:

include_html (Boolean) (defaults to: false) —
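Whether or not to include @html in the returned Hash.
include_score (Boolean) (defaults to: true) —
Whether or not to include @score in the returned Hash.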

Returns:

(Hash) —
Containing self's instance vars.

# File 'lib/wgit/document.rb', line 290

def to_h(include_html: false, include_score: true)
  ignore = include_html ? [] : ['@html']
  ignore << '@score' unless include_score
  ignore << '@parser' # Always ignore the Nokogiri object.

  Wgit::Utils.to_h(self, ignore: ignore)
end

#to_json(include_html: false) ⇒ `String`

Converts this Document's #to_h return value to a JSON String.

Parameters:

include_html (Boolean) (defaults to: false) —
Whether or not to include @html in the returned JSON String.

Returns:

(String) —
This Document represented as a JSON String.

# File 'lib/wgit/document.rb', line 303

def to_json(include_html: false)
  h = to_h(include_html: include_html)
  JSON.generate(h)
end
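
The ignore-list approach shared by #to_h and #to_json can be sketched as follows. SerialisableSketch is hypothetical and builds the Hash by reflection directly, whereas the real method delegates to Wgit::Utils.to_h.

```ruby
require 'json'

# Hypothetical class; the real #to_h delegates to Wgit::Utils.to_h.
class SerialisableSketch
  def initialize(url, html)
    @url    = url
    @html   = html
    @score  = 0.0
    @parser = Object.new # Stand-in for a parser object that is never serialised.
  end

  def to_h(include_html: false, include_score: true)
    ignore = [:@parser]
    ignore << :@html  unless include_html
    ignore << :@score unless include_score

    (instance_variables - ignore).to_h do |var|
      [var.to_s.delete_prefix('@'), instance_variable_get(var)]
    end
  end

  def to_json(include_html: false)
    JSON.generate(to_h(include_html: include_html))
  end
end

doc  = SerialisableSketch.new('http://example.com', '<p>Hi</p>')
json = doc.to_json
```

By default the html is left out of both the Hash and the JSON String, which keeps stored records small.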

#xpath(xpath) ⇒ `Nokogiri::XML::NodeSet`

Uses Nokogiri's xpath method to search the doc's html and return the results. Use #at_xpath for returning the first result only.

Parameters:

xpath (String) —
The xpath to search the @html with.

Returns:

(Nokogiri::XML::NodeSet) —
The result set of the xpath search.



# File 'lib/wgit/document.rb', line 354

def xpath(xpath)
  @parser.xpath(xpath)
end
