Class: IMW::Parsers::HtmlMatchers::MatchRegexp
- Defined in:
- lib/imw/parsers/html_parser/matchers.rb
Overview
Concrete subclass of IMW::Parsers::HtmlMatchers::Matcher
for using a regular expression to match against text in an HTML document.
Instance Attribute Summary
- #options ⇒ Object
  Returns the value of attribute options.
- #re ⇒ Object
  Returns the value of attribute re.
Attributes inherited from Matcher
Instance Method Summary
- #initialize(selector, re, matcher = nil, options = {}) ⇒ MatchRegexp (constructor)
  Use the regular expression re to return captures from the elements collected by selector (treated as text) in an HTML document (if selector is nil then match against the full text of the document).
- #match(doc) ⇒ Object
  Grab the first element from doc matching the selector this object was initialized with.
Constructor Details
#initialize(selector, re, matcher = nil, options = {}) ⇒ MatchRegexp
Use the regular expression re to return captures from the elements collected by selector (treated as text) in an HTML document (if selector is nil then match against the full text of the document). If the keyword argument :capture is specified then return the corresponding group (indexing is that of regular expressions; "1" is the first capture), else return an array of all captures. If a matcher is given, then apply it to the capture(s) before returning.
FIXME Shouldn’t the matcher come BEFORE the regexp capture, not after?
# File 'lib/imw/parsers/html_parser/matchers.rb', line 153
def initialize selector, re, matcher=nil, options={}
  super selector, matcher
  self.options = options
  self.re = re
end
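The :capture indexing described for the constructor can be tried out with plain Ruby regular expressions, without IMW or an HTML parser. The following is a minimal sketch only: the helper select_captures is hypothetical (it is not part of IMW) and simply mirrors the documented rule that :capture => 1 selects the first group.

```ruby
# Hypothetical helper, NOT part of IMW: mirrors the documented :capture rule.
def select_captures(text, re, options = {})
  m = re.match(text)
  return nil if m.nil?                    # the regexp did not match at all
  if options.key?(:capture)
    m.captures[options[:capture] - 1]     # :capture => 1 is the first group
  else
    m.captures                            # no :capture given: all groups as an array
  end
end

text  = "Homepage of John Chimpo"
first = select_captures(text, /Homepage of (\w+) (\w+)$/, :capture => 1)
all   = select_captures(text, /Homepage of (\w+) (\w+)$/)
none  = select_captures(text, /Blog of (.*)$/)

p first  # "John"
p all    # ["John", "Chimpo"]
p none   # nil
```

Note the off-by-one: MatchData#captures is zero-based, while the :capture option follows the one-based numbering of regular expression groups.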
Instance Attribute Details
#options ⇒ Object
Returns the value of attribute options.
# File 'lib/imw/parsers/html_parser/matchers.rb', line 140
def options
  @options
end
#re ⇒ Object
Returns the value of attribute re.
# File 'lib/imw/parsers/html_parser/matchers.rb', line 139
def re
  @re
end
Instance Method Details
#match(doc) ⇒ Object
Grab the first element from doc matching the selector this object was initialized with. Use the re and the (optional) capture group this object was initialized with to capture a string (or array of strings if no capture group was specified) from the collected element (treated as text). If initialized with a matcher, then return the matcher's match against the value of the capture(s), else just return the capture(s).
m = MatchRegexp.new('span#bio/a.homepage', /Homepage of (.*)$/, nil, :capture => 1 )
m.match('<span id="bio"><a class="homepage" href="http://foo.bar">Homepage of John Chimpo</a></span>')
# => "John Chimpo"
# File 'lib/imw/parsers/html_parser/matchers.rb', line 170
def match doc
  doc = Hpricot(doc) if doc.is_a?(String)
  el = selector ? doc.contents_of(selector) : doc
  m = re.match(el.to_s)
  val = case
        when m.nil? then nil
        when self.options.key?(:capture) then m.captures[self.options[:capture] - 1] # -1 to match regexp indexing
        else m.captures
        end
  # pass to matcher, if any
  matcher ? matcher.match(val) : val
end
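To make the order of operations in #match concrete (regexp first, chained matcher second), here is a rough sketch that works on plain strings rather than a parsed HTML document. Both UpcaseMatcher and regexp_then_matcher are hypothetical stand-ins, not IMW classes or methods; the only assumption about a real matcher is that it responds to match.

```ruby
# Hypothetical stand-in for a chained matcher: any object responding to #match.
class UpcaseMatcher
  def match(value)
    value.nil? ? nil : value.to_s.upcase
  end
end

# Hypothetical, parser-free imitation of the flow in MatchRegexp#match.
def regexp_then_matcher(text, re, matcher = nil, options = {})
  m = re.match(text.to_s)
  val = if m.nil?
          nil                                  # regexp failed: nothing captured
        elsif options.key?(:capture)
          m.captures[options[:capture] - 1]    # one-based group index
        else
          m.captures
        end
  # The chained matcher sees the capture(s), not the original text.
  matcher ? matcher.match(val) : val
end

re      = /Homepage of (.*)$/
plain   = regexp_then_matcher("Homepage of John Chimpo", re, nil, :capture => 1)
chained = regexp_then_matcher("Homepage of John Chimpo", re, UpcaseMatcher.new, :capture => 1)
missed  = regexp_then_matcher("No homepage here", re, UpcaseMatcher.new, :capture => 1)

p plain    # "John Chimpo"
p chained  # "JOHN CHIMPO"
p missed   # nil
```

As in the listing above, a failed regexp still hands nil to the chained matcher, so a matcher used this way should be prepared to receive nil.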