Class: Wgit::Document
- Inherits:
-
Object
- Object
- Wgit::Document
- Includes:
- Assertable
- Defined in:
- lib/wgit/document.rb
Overview
Class modeling/serialising an HTML web document, although other MIME types
will work e.g. images etc. Also doubles as a search result when
loading Documents from the database via Wgit::Database#search.
The initialize method dynamically initializes instance variables from the
Document HTML / Database object e.g. text. This bit is dynamic so that the
Document class can be easily extended allowing you to extract the bits of
a webpage that are important to you. See Wgit::Document.define_extractor.
Constant Summary
- REGEX_EXTRACTOR_NAME = /[a-z0-9_]+/.freeze
Regex for the allowed var names when defining an extractor.
Constants included from Assertable
Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::NON_ENUMERABLE_MSG
Class Attribute Summary
-
.extractors ⇒ Object
readonly
Set of Symbols representing the defined Document extractors.
-
.text_elements ⇒ Object
readonly
Set of HTML elements that make up the visible text on a page.
Instance Attribute Summary
-
#html ⇒ Object
(also: #content)
readonly
The content/HTML of the document, an instance of String.
-
#parser ⇒ Object
readonly
The Nokogiri::HTML document object initialized from @html.
-
#score ⇒ Object
readonly
The score is only used following a Database#search and records matches.
-
#url ⇒ Object
readonly
The URL of the webpage, an instance of Wgit::Url.
Class Method Summary
-
.define_extractor(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol
Defines a content extractor, which extracts HTML elements/content into instance variables upon Document initialization.
-
.remove_extractor(var) ⇒ Boolean
Removes the init_* methods created when an extractor is defined.
-
.remove_extractors ⇒ Object
Removes all default and defined extractors by calling Document.remove_extractor underneath.
-
.text_elements_xpath ⇒ String
Uses Document.text_elements to build an xpath String, used to obtain all of the combined visual text on a webpage.
Instance Method Summary
-
#==(other) ⇒ Boolean
Determines if both the url and html match.
-
#[](range) ⇒ String
Shortcut for calling Document#html[range].
-
#at_css(selector) ⇒ Nokogiri::XML::Element
Uses Nokogiri's at_css method to search the doc's html and return the result.
-
#at_xpath(xpath) ⇒ Nokogiri::XML::Element
Uses Nokogiri's at_xpath method to search the doc's html and return the result.
-
#base_url(link: nil) ⇒ Wgit::Url
Returns the base URL of this Wgit::Document.
-
#css(selector) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's css method to search the doc's html and return the results.
-
#empty? ⇒ Boolean
Determine if this Document's HTML is empty or not.
-
#external_links ⇒ Array<Wgit::Url>
(also: #external_urls)
Returns all unique external links from this Document in absolute form.
-
#extract(xpath, singleton: true, text_content_only: true) {|Optionally| ... } ⇒ String, Object
Extracts a value/object from this Document's @html using the given xpath parameter.
-
#extract_from_html(xpath, singleton: true, text_content_only: true) {|Optionally| ... } ⇒ String, Object
protected
Extracts a value/object from this Document's @html using the given xpath parameter.
-
#extract_from_object(obj, key, singleton: true) {|value, source, type| ... } ⇒ String, Object
protected
Returns a value from the obj using the given key via obj#fetch.
-
#init_nokogiri {|config| ... } ⇒ Nokogiri::HTML
protected
Initializes the nokogiri object using @html, which cannot be nil.
-
#initialize(url_or_obj, html = '', encode: true) ⇒ Document
constructor
Initialize takes either two strings (representing the URL and HTML) or an object representing a database record (of an HTTP-crawled web page).
-
#inspect ⇒ String
Overrides Object#inspect to shorten the printed output of a Document.
-
#internal_absolute_links ⇒ Array<Wgit::Url>
(also: #internal_absolute_urls)
Returns all unique internal links from this Document in absolute form by appending them to self's #base_url.
-
#internal_links ⇒ Array<Wgit::Url>
(also: #internal_urls)
Returns all unique internal links from this Document in relative form.
-
#search(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ Array<String>
Searches the @text for the given query and returns the results.
-
#search!(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ String
Performs a text search (see Document#search for details) but assigns the results to the @text instance variable.
-
#size ⇒ Integer
Determine the size of this Document's HTML.
-
#stats ⇒ Hash
(also: #statistics)
Returns a Hash containing this Document's instance variables and their #length (if they respond to it).
-
#to_h(include_html: false, include_score: true) ⇒ Hash
Returns a Hash containing this Document's instance vars.
-
#to_json(include_html: false) ⇒ String
Converts this Document's #to_h return value to a JSON String.
-
#xpath(xpath) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's xpath method to search the doc's html and return the results.
Methods included from Assertable
#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types
Constructor Details
#initialize(url_or_obj, html = '', encode: true) ⇒ Document
Initialize takes either two strings (representing the URL and HTML) or an object representing a database record (of an HTTP-crawled web page). This allows for initialisation from both crawled web pages and documents/web pages retrieved from the database.
During initialisation, the Document will call any private init_*_from_html and init_*_from_object methods it can find. See the Wgit::Document.define_extractor method for more details.
# File 'lib/wgit/document.rb', line 77
def initialize(url_or_obj, html = '', encode: true)
  if url_or_obj.is_a?(String)
    init_from_strings(url_or_obj, html, encode: encode)
  else
    init_from_object(url_or_obj, encode: encode)
  end
end
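The dispatch described above can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the Wgit implementation: `MiniDoc`, its hand-written `init_title_*` methods, and the naive regex standing in for an xpath lookup are all hypothetical.

```ruby
# Hypothetical stand-in for the String-vs-object dispatch; not Wgit::Document.
class MiniDoc
  attr_reader :url, :html, :title

  def initialize(url_or_obj, html = '')
    if url_or_obj.is_a?(String)
      init_from_strings(url_or_obj, html)
    else
      init_from_object(url_or_obj)
    end
  end

  private

  def init_from_strings(url, html)
    @url  = url
    @html = html
    # Call every private init_*_from_html method that happens to be defined.
    private_methods.grep(/\Ainit_.*_from_html\z/).each { |m| send(m) }
  end

  def init_from_object(obj)
    @url  = obj.fetch('url')
    @html = obj.fetch('html', '')
    # Same discovery step, but each method receives the record.
    private_methods.grep(/\Ainit_.*_from_object\z/).each { |m| send(m, obj) }
  end

  # A hand-written "extractor" pair; the regex is a crude substitute for xpath.
  def init_title_from_html
    @title = @html[%r{<title>(.*?)</title>}m, 1]
  end

  def init_title_from_object(obj)
    @title = obj.fetch('title', nil)
  end
end

crawled = MiniDoc.new('http://example.com', '<html><title>Hi</title></html>')
stored  = MiniDoc.new({ 'url' => 'http://example.com', 'title' => 'Hi' })
```

Both objects end up with the same `title`, whichever path built them.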
Class Attribute Details
.extractors ⇒ Object (readonly)
Set of Symbols representing the defined Document extractors. Is read-only. Use Wgit::Document.define_extractor for a new extractor.
# File 'lib/wgit/document.rb', line 43
def extractors
  @extractors
end
.text_elements ⇒ Object (readonly)
Set of HTML elements that make up the visible text on a page. These elements are used to initialize the Wgit::Document#text. See the README.md for how to add to this Set dynamically.
# File 'lib/wgit/document.rb', line 39
def text_elements
  @text_elements
end
Instance Attribute Details
#html ⇒ Object (readonly) Also known as: content
The content/HTML of the document, an instance of String.
# File 'lib/wgit/document.rb', line 50
def html
  @html
end
#parser ⇒ Object (readonly)
The Nokogiri::HTML document object initialized from @html.
# File 'lib/wgit/document.rb', line 53
def parser
  @parser
end
#score ⇒ Object (readonly)
The score is only used following a Database#search and records matches.
# File 'lib/wgit/document.rb', line 56
def score
  @score
end
#url ⇒ Object (readonly)
The URL of the webpage, an instance of Wgit::Url.
# File 'lib/wgit/document.rb', line 47
def url
  @url
end
Class Method Details
.define_extractor(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol
Defines a content extractor, which extracts HTML elements/content
into instance variables upon Document initialization. See the default
extractors defined in 'document_extractors.rb' as examples. Defining an
extractor means that every subsequently crawled/initialized document
will attempt to extract the xpath's content. Use #extract for a one-off
content extraction on any document.
Note that defined extractors work for both Documents initialized from HTML (via Wgit::Crawler methods) and from database objects. An extractor once defined, initializes a private instance variable with the xpath or database object result(s).
When initialising from HTML, a singleton value of true will only
ever return the first result found; otherwise all the results are
returned in an Enumerable. When initialising from a database object, the
value is taken as is and singleton is only used to define the default
empty value. If a value cannot be found (in either the HTML or database
object), then a default will be used. The default value is: singleton ? nil : [].
# File 'lib/wgit/document.rb', line 151
def self.define_extractor(var, xpath, opts = {}, &block)
  var = var.to_sym
  defaults = { singleton: true, text_content_only: true }
  opts = defaults.merge(opts)

  raise "var must match #{REGEX_EXTRACTOR_NAME}" unless \
  var =~ REGEX_EXTRACTOR_NAME

  # Define the private init_*_from_html method for HTML.
  # Gets the HTML's xpath value and creates a var for it.
  func_name = Document.send(:define_method, "init_#{var}_from_html") do
    result = extract_from_html(xpath, **opts, &block)
    init_var(var, result)
  end
  Document.send(:private, func_name)

  # Define the private init_*_from_object method for a Database object.
  # Gets the Object's 'key' value and creates a var for it.
  func_name = Document.send(
    :define_method, "init_#{var}_from_object"
  ) do |obj|
    result = extract_from_object(
      obj, var.to_s, singleton: opts[:singleton], &block
    )
    init_var(var, result)
  end
  Document.send(:private, func_name)

  @extractors << var
  var
end
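To make the registration pattern easier to follow, here is a small sketch of the same idea without Nokogiri or a database. `TinyDoc` is hypothetical, a Regexp stands in for the xpath, and the `\A...\z` anchors on the name check are an addition for this sketch (the constant documented above is unanchored).

```ruby
# Hypothetical miniature of an extractor registry; not the Wgit implementation.
class TinyDoc
  NAME_REGEX = /\A[a-z0-9_]+\z/.freeze

  @extractors = []

  class << self
    attr_reader :extractors

    # pattern: a Regexp with one capture group (stand-in for an xpath).
    # singleton: true keeps the first match, false keeps all matches.
    def define_extractor(var, pattern, singleton: true, &block)
      var = var.to_sym
      raise "var must match #{NAME_REGEX}" unless var =~ NAME_REGEX

      name = define_method("init_#{var}_from_html") do
        matches = @html.scan(pattern).flatten
        result  = singleton ? matches.first : matches
        # The optional block may transform the value before it is stored.
        result  = block.call(result, self, :document) if block
        instance_variable_set("@#{var}", result)
      end
      private name

      attr_reader var
      @extractors << var
      var
    end
  end

  def initialize(html)
    @html = html
    # Run every registered extractor against this document's HTML.
    self.class.extractors.each { |var| send("init_#{var}_from_html") }
  end
end

TinyDoc.define_extractor(:headings, %r{<h1>(.*?)</h1>}, singleton: false)
TinyDoc.define_extractor(:first_link, /href="(.*?)"/) do |value, _source, _type|
  value&.upcase
end

doc = TinyDoc.new('<h1>A</h1><h1>B</h1><a href="/x">x</a>')
empty = TinyDoc.new('')
```

With no match, the singleton extractor yields nil and the non-singleton one yields an empty Array, which mirrors the `singleton ? nil : []` default described above.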
.remove_extractor(var) ⇒ Boolean
Removes the init_* methods created when an extractor is defined.
Therefore, this is the opposing method to Document.define_extractor.
Returns true if successful or false if the method(s) cannot be found.
# File 'lib/wgit/document.rb', line 190
def self.remove_extractor(var)
  Document.send(:remove_method, "init_#{var}_from_html")
  Document.send(:remove_method, "init_#{var}_from_object")

  @extractors.delete(var.to_sym)

  true
rescue NameError
  false
end
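The true/false return value comes from rescuing the NameError that `remove_method` raises for a missing method. A minimal sketch of that pattern, using a hypothetical `Widget` class and helper name:

```ruby
# Hypothetical class with a pair of init_* methods to remove.
class Widget
  def init_title_from_html; end

  def init_title_from_object; end

  # Mirrors the remove-and-rescue pattern: true on success, false if the
  # methods are not (or no longer) defined on this class.
  def self.remove_init_methods(var)
    remove_method("init_#{var}_from_html")
    remove_method("init_#{var}_from_object")
    true
  rescue NameError
    false
  end
end

first  = Widget.remove_init_methods(:title) # methods exist
second = Widget.remove_init_methods(:title) # already removed
```

Calling it twice shows both outcomes: the first call succeeds, the second finds nothing to remove.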
.remove_extractors ⇒ Object
Removes all default and defined extractors by calling Document.remove_extractor underneath. See its documentation.
# File 'lib/wgit/document.rb', line 203
def self.remove_extractors
  @extractors.each { |var| remove_extractor(var) }
end
.text_elements_xpath ⇒ String
Uses Document.text_elements to build an xpath String, used to obtain all of the combined visual text on a webpage.
# File 'lib/wgit/document.rb', line 91
def self.text_elements_xpath
  Wgit::Document.text_elements.each_with_index.reduce('') do |xpath, (el, i)|
    xpath += ' | ' unless i.zero?
    xpath += format('//%s/text()', el)
  end
end
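As a rough illustration of the string this produces, the snippet below builds the same kind of xpath union from a small, hypothetical element Set (the real Set is Wgit::Document.text_elements). It also shows an equivalent `map`/`join` form for comparison.

```ruby
require 'set'

# Hypothetical element list for illustration only.
elements = Set.new(%i[p h1 li])

# reduce form, following the shape of the listing above.
reduced = elements.each_with_index.reduce('') do |xpath, (el, i)|
  xpath += ' | ' unless i.zero?
  xpath + format('//%s/text()', el)
end

# Equivalent map/join form.
joined = elements.map { |el| format('//%s/text()', el) }.join(' | ')
```

Both forms give `//p/text() | //h1/text() | //li/text()` for this input.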
Instance Method Details
#==(other) ⇒ Boolean
Determines if both the url and html match. Use doc.object_id == other.object_id for exact object comparison.
# File 'lib/wgit/document.rb', line 221
def ==(other)
  return false unless other.is_a?(Wgit::Document)

  (@url == other.url) && (@html == other.html)
end
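The difference between this value comparison and object identity can be seen with a tiny stand-in class. `Page` is hypothetical and only mimics the url/html comparison shown above.

```ruby
# Hypothetical value object comparing url and html, like the listing above.
class Page
  attr_reader :url, :html

  def initialize(url, html)
    @url  = url
    @html = html
  end

  def ==(other)
    return false unless other.is_a?(Page)

    url == other.url && html == other.html
  end
end

a = Page.new('http://example.com', '<p>hi</p>')
b = Page.new('http://example.com', '<p>hi</p>')

same_value  = (a == b)                      # compares url and html
same_object = (a.object_id == b.object_id)  # compares identity
```

Here `same_value` is true while `same_object` is false, since `a` and `b` are separate objects with equal contents.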
#[](range) ⇒ String
Shortcut for calling Document#html[range].
# File 'lib/wgit/document.rb', line 231
def [](range)
  @html[range]
end
#at_css(selector) ⇒ Nokogiri::XML::Element
Uses Nokogiri's at_css method to search the doc's html and return the
result. Use #css for returning several results.
# File 'lib/wgit/document.rb', line 381
def at_css(selector)
  @parser.at_css(selector)
end
#at_xpath(xpath) ⇒ Nokogiri::XML::Element
Uses Nokogiri's at_xpath method to search the doc's html and return the
result. Use #xpath for returning several results.
# File 'lib/wgit/document.rb', line 363
def at_xpath(xpath)
  @parser.at_xpath(xpath)
end
#base_url(link: nil) ⇒ Wgit::Url
Returns the base URL of this Wgit::Document. The base URL is either the
<base> element's href value or, when the page has no <base>, doc.url.to_origin.
A relative <base> is resolved against doc.url.to_origin. Prefer this method
over calling doc.url.to_origin directly when manually building
absolute links from relative links; or use link.make_absolute(doc).
Provide the link: parameter to get the correct base URL for that type
of link. For example, when the page has no <base>, a link of #top resolves
against @url because it applies to that page, not a different one. Query
strings work in the same way. Use this parameter if manually concatenating Urls e.g.
relative_link = Wgit::Url.new('?q=hello')
absolute_link = doc.base_url(link: relative_link).concat(relative_link)
This is similar to how Wgit::Document#internal_absolute_links works.
# File 'lib/wgit/document.rb', line 257
def base_url(link: nil)
  if @url.relative? && @base.nil?
    raise "Document @url ('#{@url}') cannot be relative if <base> is nil"
  end

  if @url.relative? && @base&.relative?
    raise "Document @url ('#{@url}') and <base> ('#{@base}') both can't \
be relative"
  end

  get_base = -> { @base.relative? ? @url.to_origin.concat(@base) : @base }

  if link
    link = Wgit::Url.new(link)
    raise "link must be relative: #{link}" unless link.relative?

    if link.is_fragment? || link.is_query?
      base_url = @base ? get_base.call : @url
      return base_url.omit_fragment.omit_query
    end
  end

  base_url = @base ? get_base.call : @url.to_origin
  base_url.omit_fragment.omit_query
end
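The decision logic in the listing can be approximated with Ruby's standard `uri` library. This is a simplified sketch, not Wgit::Url: `base_for` is a hypothetical helper, it assumes an absolute page URL, ignores non-default ports when building the origin, and skips the error checks.

```ruby
require 'uri'

# Hypothetical helper: page_url is absolute; base_href is the <base> href or nil.
def base_for(page_url, base_href = nil, link: nil)
  page   = URI.parse(page_url)
  origin = "#{page.scheme}://#{page.host}"

  # A relative <base> is resolved against the page's origin.
  base = base_href
  base = URI.join(origin, base_href).to_s if base_href && URI.parse(base_href).relative?

  # Fragment and query links apply to the page itself, not the site root.
  page_scoped = link && link.start_with?('#', '?')
  chosen = base || (page_scoped ? page_url : origin)

  # Drop any fragment and query string from the chosen base.
  uri = URI.parse(chosen)
  uri.fragment = nil
  uri.query = nil
  uri.to_s
end

no_base   = base_for('http://example.com/about?x=1#top')
for_query = base_for('http://example.com/about?x=1#top', link: '?q=hello')
rel_base  = base_for('http://example.com/about', '/docs')
abs_base  = base_for('http://example.com/about', 'http://cdn.example.com/assets', link: '#top')
```

Without a `<base>`, an ordinary link resolves against the origin, while a query or fragment link resolves against the page URL with its own query and fragment removed.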
#css(selector) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's css method to search the doc's html and return the
results. Use #at_css for returning the first result only.
# File 'lib/wgit/document.rb', line 372
def css(selector)
  @parser.css(selector)
end
#empty? ⇒ Boolean
Determine if this Document's HTML is empty or not.
# File 'lib/wgit/document.rb', line 343
def empty?
  return true if @html.nil?

  @html.empty?
end
#external_links ⇒ Array<Wgit::Url> Also known as: external_urls
Returns all unique external links from this Document in absolute form. External meaning a link to a different host.
# File 'lib/wgit/document.rb', line 422
def external_links
  return [] if @links.empty?

  links = @links
          .map do |link|
            if link.scheme_relative?
              link.prefix_scheme(@url.to_scheme.to_sym)
            else
              link
            end
          end
          .reject { |link| link.relative?(host: @url.to_origin) }
          .map(&:omit_trailing_slash)

  Wgit::Utils.sanitize(links)
end
#extract(xpath, singleton: true, text_content_only: true) {|Optionally| ... } ⇒ String, Object
Extracts a value/object from this Document's @html using the given xpath parameter.
# File 'lib/wgit/document.rb', line 535
def extract(xpath, singleton: true, text_content_only: true, &block)
  send(
    :extract_from_html, xpath,
    singleton: singleton, text_content_only: text_content_only,
    &block
  )
end
#extract_from_html(xpath, singleton: true, text_content_only: true) {|Optionally| ... } ⇒ String, Object (protected)
Extracts a value/object from this Document's @html using the given xpath parameter.
# File 'lib/wgit/document.rb', line 576
def extract_from_html(xpath, singleton: true, text_content_only: true)
  xpath = xpath.call if xpath.respond_to?(:call)
  result = singleton ? at_xpath(xpath) : xpath(xpath)

  if result && text_content_only
    result = singleton ? result.content : result.map(&:content)
  end

  Wgit::Utils.sanitize(result)
  result = yield(result, self, :document) if block_given?
  result
end
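The first line of the listing accepts either a plain xpath String or anything that responds to `call`. One plausible reason, sketched below with a hypothetical `resolve_xpath` helper and element list, is that a callable lets the xpath be computed at extraction time rather than when the extractor was defined.

```ruby
# Hypothetical helper reproducing the String-or-callable check.
def resolve_xpath(xpath)
  xpath.respond_to?(:call) ? xpath.call : xpath
end

elements = %w[p h1]

# A lambda defers building the xpath until it is needed.
lazy = -> { elements.map { |el| "//#{el}/text()" }.join(' | ') }

before = resolve_xpath(lazy)
elements << 'li' # change the inputs after the lambda was created
after  = resolve_xpath(lazy)
plain  = resolve_xpath('//title')
```

The second call picks up the added element, whereas a String argument is returned unchanged.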
#extract_from_object(obj, key, singleton: true) {|value, source, type| ... } ⇒ String, Object (protected)
Returns a value from the obj using the given key via obj#fetch.
# File 'lib/wgit/document.rb', line 605
def extract_from_object(obj, key, singleton: true)
  assert_respond_to(obj, :fetch)

  default = singleton ? nil : []
  result = obj.fetch(key.to_s, default)

  Wgit::Utils.sanitize(result)
  result = yield(result, obj, :object) if block_given?
  result
end
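The fetch-with-default behaviour is easy to try on a plain Hash. The sketch below uses a hypothetical `fetch_value` helper and record; it leaves out the sanitising step from the listing.

```ruby
# Hypothetical helper following the fetch-with-default pattern above.
def fetch_value(obj, key, singleton: true)
  raise ArgumentError, 'obj must respond to #fetch' unless obj.respond_to?(:fetch)

  default = singleton ? nil : []
  result  = obj.fetch(key.to_s, default)
  # An optional block may transform the fetched value.
  result  = yield(result, obj, :object) if block_given?
  result
end

record = { 'title' => 'Home', 'keywords' => %w[a b] }

title    = fetch_value(record, :title)
author   = fetch_value(record, :author)
links    = fetch_value(record, :links, singleton: false)
keywords = fetch_value(record, :keywords, singleton: false) do |value, _obj, _type|
  value.map(&:upcase)
end
```

Missing keys fall back to nil or an empty Array depending on `singleton`, and the block sees the value before it is returned.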
#init_nokogiri {|config| ... } ⇒ Nokogiri::HTML (protected)
Initializes the nokogiri object using @html, which cannot be nil. Override this method to custom configure the Nokogiri object returned. Gets called from Wgit::Document.new upon initialization.
# File 'lib/wgit/document.rb', line 553
def init_nokogiri(&block)
  raise '@html must be set' unless @html

  Nokogiri::HTML(@html, &block)
end
#inspect ⇒ String
Overrides Object#inspect to shorten the printed output of a Document.
# File 'lib/wgit/document.rb', line 212
def inspect
  "#<Wgit::Document url=\"#{@url}\" html=#{size} bytes>"
end
#internal_absolute_links ⇒ Array<Wgit::Url> Also known as: internal_absolute_urls
Returns all unique internal links from this Document in absolute form by appending them to self's #base_url. Also see Wgit::Document#internal_links.
# File 'lib/wgit/document.rb', line 414
def internal_absolute_links
  internal_links.map { |link| link.make_absolute(self) }
end
#internal_links ⇒ Array<Wgit::Url> Also known as: internal_urls
Returns all unique internal links from this Document in relative form. Internal meaning a link to another document on the same host.
This Document's host is used to determine if an absolute URL is actually a relative link. For example, for a Document representing http://www.server.com/about, an absolute link to another page on the same host will be recognized and returned as an internal link because both Documents live on the same host. Also see Wgit::Document#internal_absolute_links.
# File 'lib/wgit/document.rb', line 396
def internal_links
  return [] if @links.empty?

  links = @links
          .select { |link| link.relative?(host: @url.to_origin) }
          .map(&:omit_base)
          .map do |link| # Map @url.to_host into / as it's a duplicate.
            link.to_host == @url.to_host ? Wgit::Url.new('/') : link
          end

  Wgit::Utils.sanitize(links)
end
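The internal/external split can be approximated with the standard `uri` library. The page URL and link list below are made up for illustration, and the conversion to relative form is a simplification of what Wgit::Url does.

```ruby
require 'uri'

# Hypothetical page and links.
page  = URI.parse('http://www.server.com/about')
links = ['/contact', 'http://www.server.com/search', 'https://other.org/', 'faq.html']

# A link is internal if it is relative or points at the page's own host.
internal, external = links.partition do |link|
  uri = URI.parse(link)
  uri.relative? || uri.host == page.host
end

# Express internal links in relative form by keeping only path (and query).
relative = internal.map do |link|
  uri = URI.parse(link)
  uri.relative? ? link : uri.request_uri
end
```

The absolute link to www.server.com ends up in the internal group and is reduced to `/search`, while the link to other.org is the only external one.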
#search(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ Array<String>
Searches the @text for the given query and returns the results.
The number of search hits for each sentence is recorded internally and used to rank/sort the search results before being returned. Where the Wgit::Database#search method searches all documents for the most hits, this method searches each document's @text for the most hits.
Each search result comprises a sentence of a given length. The length will be based on the sentence_limit parameter or the full length of the original sentence, whichever is less. The algorithm ensures that the search query is visible somewhere in the sentence.
# File 'lib/wgit/document.rb', line 459
def search(
  query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
)
  raise 'The sentence_limit value must be even' if sentence_limit.odd?

  if query.is_a?(Regexp)
    regex = query
  else # query.respond_to? :to_s == true
    query = query.to_s
    query = query.gsub(' ', '|') unless whole_sentence
    regex = Regexp.new(query, !case_sensitive)
  end

  results = {}

  @text.each do |sentence|
    sentence = sentence.strip
    next if results[sentence]

    hits = sentence.scan(regex).count
    next unless hits.positive?

    index = sentence.index(regex) # Index of first match.
    Wgit::Utils.format_sentence_length(sentence, index, sentence_limit)

    results[sentence] = hits
  end

  return [] if results.empty?

  results = Hash[results.sort_by { |_k, v| v }]
  results.keys.reverse
end
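The ranking step (count hits per sentence, then sort by count) can be tried in isolation. The sketch below is a simplified stand-in: `rank_sentences` and the sample text are hypothetical, the query is escaped before being compiled (unlike the listing), and sentence truncation to `sentence_limit` is omitted.

```ruby
# Hypothetical ranking helper: most hits first, sentences without hits dropped.
def rank_sentences(sentences, query, case_sensitive: false, whole_sentence: true)
  pattern =
    if whole_sentence
      Regexp.escape(query)
    else
      # Match any individual word of the query.
      query.split.map { |word| Regexp.escape(word) }.join('|')
    end
  regex = Regexp.new(pattern, case_sensitive ? 0 : Regexp::IGNORECASE)

  hits = {}
  sentences.each do |sentence|
    sentence = sentence.strip
    count = sentence.scan(regex).count
    hits[sentence] = count if count.positive?
  end

  hits.sort_by { |_sentence, count| -count }.map(&:first)
end

text = ['Ruby is fun.', ' I like Ruby, ruby, and more RUBY. ', 'Nothing here.']

ranked    = rank_sentences(text, 'ruby')
any_word  = rank_sentences(text, 'fun nothing', whole_sentence: false)
as_phrase = rank_sentences(text, 'fun nothing')
strict    = rank_sentences(text, 'ruby', case_sensitive: true)
```

With `whole_sentence: false` the query words are matched individually, so "fun nothing" hits two sentences; as a whole phrase it hits none.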
#search!(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ String
Performs a text search (see Document#search for details) but assigns the results to the @text instance variable. This can be used for sub search functionality. The original text is returned; no other reference to it is kept thereafter.
# File 'lib/wgit/document.rb', line 506
def search!(
  query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
)
  orig_text = @text
  @text = search(
    query, case_sensitive: case_sensitive,
           whole_sentence: whole_sentence, sentence_limit: sentence_limit
  )

  orig_text
end
#size ⇒ Integer
Determine the size of this Document's HTML.
# File 'lib/wgit/document.rb', line 336
def size
  stats[:html]
end
#stats ⇒ Hash Also known as: statistics
Returns a Hash containing this Document's instance variables and their #length (if they respond to it). Works dynamically so that any user defined extractors (and their created instance vars) will appear in the returned Hash as well. The number of text snippets as well as total number of textual bytes are always included in the returned Hash.
# File 'lib/wgit/document.rb', line 315
def stats
  hash = {}
  instance_variables.each do |var|
    # Add up the total bytes of text as well as the length.
    if var == :@text
      hash[:text] = @text.length
      hash[:text_bytes] = @text.sum(&:length)
    # Else take the var's #length method return value.
    else
      next unless instance_variable_get(var).respond_to?(:length)

      hash[var[1..-1].to_sym] = instance_variable_get(var).send(:length)
    end
  end

  hash
end
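The dynamic part of this method is the walk over `instance_variables`. A small sketch with a hypothetical `Note` class shows the resulting Hash shape; instance variables that do not respond to `length` are skipped.

```ruby
# Hypothetical class with a mix of instance variable types.
class Note
  def initialize
    @title = 'Hello'
    @text  = %w[one three]
    @score = 1.5 # Float has no #length, so it is left out of the stats.
  end

  def stats
    instance_variables.each_with_object({}) do |var, hash|
      value = instance_variable_get(var)

      if var == :@text
        hash[:text]       = value.length         # number of snippets
        hash[:text_bytes] = value.sum(&:length)  # combined snippet length
      elsif value.respond_to?(:length)
        hash[var.to_s.delete_prefix('@').to_sym] = value.length
      end
    end
  end
end

note_stats = Note.new.stats
```

Note that String#length counts characters, so for multibyte text the `text_bytes` figure would differ from a true byte count (String#bytesize).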
#to_h(include_html: false, include_score: true) ⇒ Hash
Returns a Hash containing this Document's instance vars. Used when storing the Document in a Database e.g. MongoDB etc. By default the @html var is excluded from the returned Hash.
# File 'lib/wgit/document.rb', line 290
def to_h(include_html: false, include_score: true)
  ignore = include_html ? [] : ['@html']
  ignore << '@score' unless include_score
  ignore << '@parser' # Always ignore the Nokogiri object.

  Wgit::Utils.to_h(self, ignore: ignore)
end
#to_json(include_html: false) ⇒ String
Converts this Document's #to_h return value to a JSON String.
# File 'lib/wgit/document.rb', line 303
def to_json(include_html: false)
  h = to_h(include_html: include_html)
  JSON.generate(h)
end
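The ignore-list approach used by #to_h and #to_json can be sketched with the standard `json` library. `Record` is a hypothetical class and its `to_h` is a simplified stand-in for Wgit::Utils.to_h, whose exact output format is not shown here.

```ruby
require 'json'

# Hypothetical class serialising its instance variables, minus an ignore list.
class Record
  def initialize
    @url    = 'http://example.com'
    @html   = '<p>hi</p>'
    @score  = 0.0
    @parser = Object.new # stands in for a parser object; never serialised
  end

  def to_h(include_html: false, include_score: true)
    ignore = [:@parser]
    ignore << :@html unless include_html
    ignore << :@score unless include_score

    (instance_variables - ignore).to_h do |var|
      [var.to_s.delete_prefix('@'), instance_variable_get(var)]
    end
  end

  def to_json(include_html: false)
    JSON.generate(to_h(include_html: include_html))
  end
end

record    = Record.new
default_h = record.to_h
full_h    = record.to_h(include_html: true, include_score: false)
parsed    = JSON.parse(record.to_json)
```

By default the HTML and the parser object are left out, which keeps the serialised form small.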
#xpath(xpath) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's xpath method to search the doc's html and return the
results. Use #at_xpath for returning the first result only.
# File 'lib/wgit/document.rb', line 354
def xpath(xpath)
  @parser.xpath(xpath)
end