Class: WebPageParser::BbcNewsPageParserV2
- Inherits:
- BaseRegexpParser
- Inheritance hierarchy:
- WebPageParser::BbcNewsPageParserV2 < BaseRegexpParser < BaseParser < Object
- Defined in:
- lib/web-page-parser/parsers/bbc_news_page_parser.rb
Overview
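BbcNewsPageParserV2 parses BBC News web pages, using the regular expressions listed under Constant Summary to extract each page's title, publication date, and body content.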
Direct Known Subclasses
Constant Summary
- TITLE_RE =
ORegexp.new('<meta name="Headline" content="(.*)"', 'i')
- DATE_RE =
ORegexp.new('<meta name="OriginalPublicationDate" content="(.*)"', 'i')
- CONTENT_RE =
ORegexp.new('S BO -->(.*?)<!-- E BO', 'm')
- STRIP_BLOCKS_RE =
ORegexp.new('<(table|noscript|script|object|form)[^>]*>.*?</\1>', 'i')
- STRIP_TAGS_RE =
ORegexp.new('</?(b|div|img|tr|td|br|font|span)[^>]*>','i')
- STRIP_COMMENTS_RE =
ORegexp.new('<!--.*?-->')
- STRIP_CAPTIONS_RE =
ORegexp.new('<!-- caption .+?<!-- END - caption -->')
- WHITESPACE_RE =
ORegexp.new('[\t ]+')
- PARA_RE =
Regexp.new('</?p[^>]*>', Regexp::IGNORECASE)
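To illustrate how patterns like these can fit together, here is a minimal sketch that uses Ruby's built-in Regexp instead of ORegexp. The module name BbcRegexpSketch, the extract method, the sample page, and the order of the clean-up steps are all assumptions made for illustration; they are not taken from the parser's source, and STRIP_CAPTIONS_RE is left out for brevity.

```ruby
# Hypothetical stand-in for the parser's constants, rebuilt with the core
# Regexp class. The real class uses ORegexp; behaviour may differ in detail.
module BbcRegexpSketch
  TITLE_RE          = /<meta name="Headline" content="(.*)"/i
  DATE_RE           = /<meta name="OriginalPublicationDate" content="(.*)"/i
  CONTENT_RE        = /S BO -->(.*?)<!-- E BO/m
  STRIP_BLOCKS_RE   = /<(table|noscript|script|object|form)[^>]*>.*?<\/\1>/i
  STRIP_TAGS_RE     = /<\/?(b|div|img|tr|td|br|font|span)[^>]*>/i
  STRIP_COMMENTS_RE = /<!--.*?-->/
  WHITESPACE_RE     = /[\t ]+/
  PARA_RE           = /<\/?p[^>]*>/i

  # Made-up sample page; the headline and date values are placeholders.
  SAMPLE_PAGE = <<~HTML
    <html><head>
    <meta name="Headline" content="Example headline"/>
    <meta name="OriginalPublicationDate" content="2009/05/09 12:00:00"/>
    </head><body>
    <!-- S BO -->
    <p>First <b>bold</b> paragraph.</p>
    <script type="text/javascript">var x = 1;</script>
    <p>Second   paragraph.</p>
    <!-- E BO -->
    </body></html>
  HTML

  # Extracts the title, the raw date string, and a list of paragraphs.
  # The step order below is an assumption, not the library's documented flow.
  def self.extract(html)
    title = html[TITLE_RE, 1]
    date  = html[DATE_RE, 1]
    body  = html[CONTENT_RE, 1].to_s

    body = body.gsub(STRIP_BLOCKS_RE, '')   # drop whole script/table/form blocks
    body = body.gsub(STRIP_COMMENTS_RE, '') # drop HTML comments
    body = body.gsub(STRIP_TAGS_RE, '')     # drop inline tags, keep their text
    body = body.gsub(WHITESPACE_RE, ' ')    # collapse runs of tabs and spaces

    # Split on <p> and </p> tags, then discard blank fragments.
    paragraphs = body.split(PARA_RE).map(&:strip).reject(&:empty?)

    { title: title, date: date, paragraphs: paragraphs }
  end
end

result = BbcRegexpSketch.extract(BbcRegexpSketch::SAMPLE_PAGE)
puts result[:title]
puts result[:paragraphs].inspect
```

Note that STRIP_BLOCKS_RE is compiled without the multiline option, so in this sketch it only removes blocks whose opening and closing tags sit on the same line.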
Constants inherited from BaseRegexpParser
WebPageParser::BaseRegexpParser::HTML_ENTITIES_DECODER, WebPageParser::BaseRegexpParser::KILL_CHARS_RE
Instance Attribute Summary
Attributes inherited from BaseParser
Method Summary
Methods inherited from BaseRegexpParser
#content, #date, #decode_entities, #encode, #initialize, #page, #retrieve_page, #title
Methods inherited from BaseParser
#content, #date, #guid, #guid_from_url, #hash, #initialize, #page, #retrieve_page, #title
Constructor Details
This class inherits a constructor from WebPageParser::BaseRegexpParser