Method: Wgit::Document.define_extractor

Defined in:
lib/wgit/document.rb

.define_extractor(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol

Defines a content extractor, which extracts HTML elements/content into instance variables upon Document initialization. See the default extractors defined in 'document_extractors.rb' for examples. Once an extractor is defined, every subsequently crawled/initialized document will attempt to extract the xpath's content. Use #extract for a one-off content extraction on any document.

Note that defined extractors work both for Documents initialized from HTML (via Wgit::Crawler methods) and for Documents initialized from database objects. Once defined, an extractor initializes a private instance variable with the xpath or database object result(s).

When initializing from HTML, a singleton value of true returns only the first result found; otherwise all the results are returned in an Enumerable. When initializing from a database object, the value is taken as is and singleton is only used to define the default empty value. If a value cannot be found (in either the HTML or the database object), a default is used instead. The default value is: singleton ? nil : [].
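The singleton and default-value rule above can be sketched in plain Ruby. This is only an illustration: `pick_result` is a hypothetical helper, not a Wgit method, and it models the HTML case, where a list of matches is reduced according to singleton.

```ruby
# Hypothetical helper (not part of Wgit) modelling the rule described above:
# singleton picks the first match, and the empty default is nil or [].
def pick_result(results, singleton: true)
  default = singleton ? nil : []
  return default if results.nil? || results.empty?

  singleton ? results.first : results
end

first_only = pick_result(%w[Home About], singleton: true)
all_found  = pick_result(%w[Home About], singleton: false)
no_single  = pick_result([], singleton: true)
no_many    = pick_result([], singleton: false)
```

Here `first_only` holds a single String, `all_found` holds the whole Array, and the two empty cases fall back to nil and [] respectively.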

Parameters:

  • var (Symbol)

    The name of the variable to be initialized, which will contain the extracted content. A getter and a setter method are defined for the initialized variable.

  • xpath (String, #call, nil)

    The xpath used to find the element(s) of the webpage. Only used when initializing from HTML. Passing nil will skip the HTML extraction, which sometimes isn't required.

    Pass a callable object (proc etc.) if you want the xpath value to be derived on Document initialization (instead of when the extractor is defined). The call method must return a valid xpath String.

  • opts (Hash) (defaults to: {})

    The options to define an extractor with. The options are only used when initializing from HTML, not the database.
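The difference between a String xpath and a callable xpath can be illustrated without Wgit. The sketch below assumes the callable is invoked with no arguments; `resolve_xpath` is a hypothetical helper, not part of the library.

```ruby
# Hypothetical helper (not Wgit's API): resolves the xpath argument at the
# moment a document is initialized. Assumes the callable takes no arguments.
def resolve_xpath(xpath)
  xpath.respond_to?(:call) ? xpath.call : xpath
end

section = 'nav'
dynamic = -> { "//div[@id='#{section}']//a" } # evaluated later, not now
section = 'footer'                            # changed after definition

static_xpath  = resolve_xpath('//title')      # a plain String is used as is
dynamic_xpath = resolve_xpath(dynamic)        # sees the latest section value
skipped       = resolve_xpath(nil)            # nil means no HTML extraction
```

Because the lambda is only called at resolution time, `dynamic_xpath` reflects the value of `section` at that point rather than the value it had when the lambda was written.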

Options Hash (opts):

  • :singleton (Boolean)

    The singleton option determines whether or not the result(s) should be in an Enumerable. If multiple results are found and singleton is true then the first result will be used. Defaults to true.

  • :text_content_only (Boolean)

    The text_content_only option if true will use the text #content of the Nokogiri result object, otherwise the Nokogiri object itself is returned. The type of Nokogiri object returned depends on the given xpath query. See the Nokogiri documentation for more information. Defaults to true.

Yields:

  • The block is executed when a Wgit::Document is initialized, regardless of the source. Use it (optionally) to process the result value.

Yield Parameters:

  • value (Object)

    The result value to be assigned to the new var.

  • source (Wgit::Document, Object)

    The source of the value.

  • type (Symbol)

    The source type, either :document or (DB) :object.

Yield Returns:

  • (Object)

    The return value of the block becomes the new var's value. If you only want to inspect the value, return the block's value param unchanged.
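The block contract above (three yield parameters, return value becomes the stored value) can be modelled with a small stand-in. `run_extractor_block` is a hypothetical method written for this illustration only; it is not how Wgit is implemented.

```ruby
# Hypothetical stand-in for how an extractor hands its result to the block.
# Without a block the value passes through untouched.
def run_extractor_block(value, source, type)
  return value unless block_given?

  yield(value, source, type)
end

seen_types = []

# Inspect only: record the source type and return the value unchanged.
inspected = run_extractor_block('  Hello  ', { 'title' => '  Hello  ' }, :object) do |value, _source, type|
  seen_types << type
  value
end

# Process: whatever the block returns is what gets stored.
stripped = run_extractor_block('  Hello  ', :some_document, :document) do |value, _source, _type|
  value.strip
end
```

The first call leaves the value as it was, while the second replaces it with the stripped String.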

Returns:

  • (Symbol)

    The given var Symbol if successful.

Raises:

  • (StandardError)

    If the var param isn't valid.



# File 'lib/wgit/document.rb', line 139

def self.define_extractor(var, xpath, opts = {}, &block)
  var = var.to_sym
  defaults = { singleton: true, text_content_only: true }
  opts = defaults.merge(opts)

  raise "var must match #{REGEX_EXTRACTOR_NAME}" unless \
  var =~ REGEX_EXTRACTOR_NAME

  # Define the private init_*_from_html method for HTML.
  # Gets the HTML's xpath value and creates a var for it.
  func_name = Document.send(:define_method, "init_#{var}_from_html") do
    result = extract_from_html(xpath, **opts, &block)
    init_var(var, result)
  end
  Document.send(:private, func_name)

  # Define the private init_*_from_object method for a Database object.
  # Gets the Object's 'key' value and creates a var for it.
  func_name = Document.send(
    :define_method, "init_#{var}_from_object"
  ) do |obj|
    result = extract_from_object(
      obj, var.to_s, singleton: opts[:singleton], &block
    )
    init_var(var, result)
  end
  Document.send(:private, func_name)

  @extractors << var
  var
end
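The source above relies on define_method to generate a private init_* method per extractor. A much reduced, self-contained sketch of that pattern follows. Everything in it is an assumption made for illustration: `MiniDocument`, `EXTRACTOR_NAME` and the `default:` keyword are hypothetical, it only handles Hash sources, and it is not a substitute for Wgit::Document.

```ruby
# Minimal sketch of the define_method pattern, using a hypothetical class.
class MiniDocument
  # Hypothetical pattern; the real REGEX_EXTRACTOR_NAME is not shown above.
  EXTRACTOR_NAME = /\A[a-z_]+\z/

  @extractors = []

  class << self
    attr_reader :extractors

    def define_extractor(var, default: nil, &block)
      var = var.to_sym
      raise "var must match #{EXTRACTOR_NAME}" unless var =~ EXTRACTOR_NAME

      # Generate a private init_<var>_from_object method for Hash sources.
      init_name = define_method("init_#{var}_from_object") do |obj|
        value = obj.fetch(var.to_s, default)
        value = block.call(value, obj, :object) if block
        instance_variable_set("@#{var}", value)
      end
      private init_name

      attr_accessor var # getter and setter for the new variable
      @extractors << var unless @extractors.include?(var)
      var
    end
  end

  def initialize(obj)
    # Run every extractor defined so far against the source object.
    self.class.extractors.each { |var| send("init_#{var}_from_object", obj) }
  end
end

MiniDocument.define_extractor(:title)
MiniDocument.define_extractor(:keywords, default: []) do |value, _source, _type|
  value.map(&:downcase)
end

doc   = MiniDocument.new('title' => 'Hello', 'keywords' => %w[Ruby Crawler])
empty = MiniDocument.new({})
```

As in the original, the extractor name is validated first, the generated method is made private, and the var Symbol is returned so the call can be used as an expression.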