Class: IMW::Parsers::HtmlMatchers::MatchRegexp

Inherits:

Matcher

Object
Matcher
IMW::Parsers::HtmlMatchers::MatchRegexp

Defined in: lib/imw/parsers/html_parser/matchers.rb

Overview

Concrete subclass of IMW::Parsers::HtmlMatchers::Matcher for using a regular expression to match against text in an HTML document.

Instance Attribute Summary

#options ⇒ Object

Returns the value of attribute options.
#re ⇒ Object

Returns the value of attribute re.

Attributes inherited from Matcher

#matcher, #selector

Instance Method Summary

#initialize(selector, re, matcher = nil, options = {}) ⇒ MatchRegexp constructor

Use the regular expression re to return captures from the text of the elements collected by selector in an HTML document (if selector is nil, match against the full text of the document).
#match(doc) ⇒ Object

Grab the first element from doc matching the selector this object was initialized with.

Constructor Details

#initialize(selector, re, matcher = nil, options = {}) ⇒ `MatchRegexp`

Use the regular expression re to return captures from the text of the elements collected by selector in an HTML document. If selector is nil, match against the full text of the document. If the :capture option is given, return the corresponding group (indexing follows regular expressions, so 1 is the first capture); otherwise return an array of all captures. If matcher is given, apply it to the capture(s) before returning.

FIXME Shouldn’t the matcher come BEFORE the regexp capture, not after?
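The :capture rule described above can be sketched in plain Ruby, without Hpricot or IMW. The helper name `select_captures` is hypothetical and not part of the library; it only assumes Ruby's built-in `Regexp#match` and `MatchData#captures`.

```ruby
# Hypothetical helper (not part of IMW) mirroring the capture selection:
# a 1-based :capture option picks one group, otherwise all groups are returned.
def select_captures(text, re, options = {})
  m = re.match(text)
  return nil if m.nil?                  # the regexp did not match at all
  if options.key?(:capture)
    m.captures[options[:capture] - 1]   # :capture => 1 is the first group
  else
    m.captures                          # every group, as an array
  end
end

text = "Homepage of John Chimpo"
one  = select_captures(text, /Homepage of (\w+) (\w+)$/, :capture => 2)
all  = select_captures(text, /Homepage of (\w+) (\w+)$/)
none = select_captures(text, /Blog of (.*)$/)
```

Here `one` is the second group only, `all` is an array holding both groups, and `none` is nil because the pattern does not match.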

# File 'lib/imw/parsers/html_parser/matchers.rb', line 153

def initialize selector, re, matcher=nil, options={}
  super selector, matcher
  self.options = options
  self.re = re
end

Instance Attribute Details

#options ⇒ `Object`

Returns the value of attribute options.



# File 'lib/imw/parsers/html_parser/matchers.rb', line 140

def options
  @options
end

#re ⇒ `Object`

Returns the value of attribute re.



# File 'lib/imw/parsers/html_parser/matchers.rb', line 139

def re
  @re
end

Instance Method Details

#match(doc) ⇒ `Object`

Grab the first element from doc matching the selector this object was initialized with. Use re and the (optional) capture group this object was initialized with to capture a string (or an array of strings if no capture group was specified) from the collected element, treated as text. If initialized with a matcher, return the matcher’s match against the capture(s); otherwise return the capture(s). If re does not match, the value is nil, and nil is what the matcher, if any, receives.

m = MatchRegexp.new('span#bio/a.homepage', /Homepage of (.*)$/, nil, :capture => 1 )
m.match('<span id="bio"><a class="homepage" href="http://foo.bar">Homepage of John Chimpo</a></span>')
# => "John Chimpo"

# File 'lib/imw/parsers/html_parser/matchers.rb', line 170

def match doc
  doc = Hpricot(doc) if doc.is_a?(String)
  el = selector ? doc.contents_of(selector) : doc
  m = re.match(el.to_s)
  val = case
        when m.nil? then nil
        when self.options.key?(:capture) then m.captures[self.options[:capture] - 1] # -1 to match regexp indexing
        else m.captures
        end
  # pass to matcher, if any
  matcher ? matcher.match(val) : val
end
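The last line of the listing hands the captured value to the nested matcher, if there is one. A minimal sketch of that ordering follows, assuming only that a nested matcher is any object responding to `match`. Both `UpcaseMatcher` and `capture_then_match` are hypothetical stand-ins, not IMW classes or methods.

```ruby
# Hypothetical stand-in for a nested matcher: any object responding to #match.
class UpcaseMatcher
  def match(value)
    value.nil? ? nil : value.upcase   # must tolerate nil (see below)
  end
end

# Hypothetical helper: regexp capture first, then the optional nested matcher.
def capture_then_match(text, re, capture, matcher = nil)
  m = re.match(text)
  val = m.nil? ? nil : m.captures[capture - 1]
  matcher ? matcher.match(val) : val  # the nested matcher sees the capture, or nil
end

re      = /Homepage of (.*)$/
plain   = capture_then_match("Homepage of John Chimpo", re, 1)
chained = capture_then_match("Homepage of John Chimpo", re, 1, UpcaseMatcher.new)
missed  = capture_then_match("No match here", re, 1, UpcaseMatcher.new)
```

Because the nested matcher is still called when the regexp fails, it needs to handle a nil argument, as the stand-in does here.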