Class: WebPageParser::BbcNewsPageParserV1
- Inherits: BaseRegexpParser
  - Object
  - BaseParser
  - BaseRegexpParser
  - WebPageParser::BbcNewsPageParserV1
- Defined in: lib/web-page-parser/parsers/bbc_news_page_parser.rb
Overview
BbcNewsPageParserV1 parses BBC News web pages exactly as the old News Sniffer BbcNewsPage class did. It should only ever be used for backwards compatibility with News Sniffer, and it is never supplied for use by a factory.
Constant Summary
- TITLE_RE = ORegexp.new('<meta name="Headline" content="(.*)"', 'i')
- DATE_RE = ORegexp.new('<meta name="OriginalPublicationDate" content="(.*)"', 'i')
- CONTENT_RE = ORegexp.new('S (?:SF) -->(.*?)<!-- E BO', 'm')
- STRIP_TAGS_RE = ORegexp.new('</?(div|img|tr|td|!--|table)[^>]*>','i')
- WHITESPACE_RE = ORegexp.new('\t|')
- PARA_RE = Regexp.new(/<p>/i)
Constants inherited from BaseRegexpParser
WebPageParser::BaseRegexpParser::HTML_ENTITIES_DECODER, WebPageParser::BaseRegexpParser::KILL_CHARS_RE
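To illustrate what these patterns capture, here is a minimal sketch that applies equivalent expressions to a small page. It makes several assumptions: Ruby's built-in Regexp is swapped in for ORegexp so the snippet runs without extra gems, the sample HTML is invented, and the `extract_fields` helper and the order in which the patterns are applied are hypothetical rather than taken from the parser's source.

```ruby
# Stand-ins for the documented constants, written with Ruby's built-in
# Regexp instead of ORegexp (an assumption made for portability).
TITLE_RE      = /<meta name="Headline" content="(.*)"/i
DATE_RE       = /<meta name="OriginalPublicationDate" content="(.*)"/i
CONTENT_RE    = /S (?:SF) -->(.*?)<!-- E BO/m
STRIP_TAGS_RE = /<\/?(div|img|tr|td|!--|table)[^>]*>/i
PARA_RE       = /<p>/i

# Hypothetical sample page, invented for illustration only.
SAMPLE_HTML = <<~HTML
  <meta name="Headline" content="Example headline"/>
  <meta name="OriginalPublicationDate" content="2009/01/01 12:00:00"/>
  <!-- S SF --><div class="story"><p>First paragraph.<p>Second paragraph.</div><!-- E BO -->
HTML

# Hypothetical helper: pulls the title, date and paragraphs out of a page.
def extract_fields(html)
  title = html[TITLE_RE, 1]          # first capture group of the headline meta tag
  date  = html[DATE_RE, 1]           # raw date string, left unparsed here
  body  = html[CONTENT_RE, 1].to_s   # text between the story markers
  paragraphs = body.gsub(STRIP_TAGS_RE, '')  # drop layout tags
                   .split(PARA_RE)           # one entry per <p>
                   .map(&:strip)
                   .reject(&:empty?)
  { title: title, date: date, content: paragraphs }
end

fields = extract_fields(SAMPLE_HTML)
puts fields[:title]
puts fields[:date]
puts fields[:content].inspect
```

The title and date come from meta tags, while the content is whatever sits between the two BBC story-marker comments, which is why CONTENT_RE needs the multiline flag.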
Instance Attribute Summary
Attributes inherited from BaseParser
Instance Method Summary
Methods inherited from BaseRegexpParser
#content, #date, #decode_entities, #encode, #initialize, #page, #retrieve_page, #title
Methods inherited from BaseParser
#content, #date, #guid, #guid_from_url, #initialize, #page, #retrieve_page, #title
Constructor Details
This class inherits a constructor from WebPageParser::BaseRegexpParser
Instance Method Details
#hash ⇒ Object
# File 'lib/web-page-parser/parsers/bbc_news_page_parser.rb', line 35
def hash
  # Old News Sniffer only hashed the content, not the title
  Digest::MD5.hexdigest(content.join('').to_s)
end
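The practical effect of hashing only the content can be sketched as follows. The `FakePage` class is a hypothetical stand-in, not part of the gem, and it assumes that `content` is an array of paragraph strings, as the `join` call in the method above suggests.

```ruby
require 'digest/md5'

# Hypothetical stand-in for a parsed page; not the gem's real class.
class FakePage
  attr_reader :title, :content

  def initialize(title, content)
    @title = title
    @content = content   # assumed to be an array of paragraph strings
  end

  # Same expression as the documented method: the title is ignored.
  def hash
    Digest::MD5.hexdigest(content.join('').to_s)
  end
end

original = FakePage.new('Original headline', ['First paragraph.', 'Second paragraph.'])
retitled = FakePage.new('Edited headline',   ['First paragraph.', 'Second paragraph.'])
rewritten = FakePage.new('Original headline', ['First paragraph.', 'Changed paragraph.'])

puts original.hash == retitled.hash   # true: only the title differs
puts original.hash == rewritten.hash  # false: the content differs
```

Because the paragraphs are joined with an empty separator before hashing, paragraph boundaries do not contribute to the digest either: `['ab', 'c']` and `['a', 'bc']` would produce the same value.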