Class: WebPageParser::BbcNewsPageParserV1

Inherits:
BaseRegexpParser
Defined in:
lib/web-page-parser/parsers/bbc_news_page_parser.rb

Overview

BbcNewsPageParserV1 parses BBC News web pages exactly as the old News Sniffer BbcNewsPage class did. It should only be used for backwards compatibility with News Sniffer and is never supplied by a factory.

Constant Summary

TITLE_RE =
ORegexp.new('<meta name="Headline" content="(.*)"', 'i')
DATE_RE =
ORegexp.new('<meta name="OriginalPublicationDate" content="(.*)"', 'i')
CONTENT_RE =
ORegexp.new('S (?:SF) -->(.*?)<!-- E BO', 'm')
STRIP_TAGS_RE =
ORegexp.new('</?(div|img|tr|td|!--|table)[^>]*>','i')
WHITESPACE_RE =
ORegexp.new('\t|')
PARA_RE =
Regexp.new(/<p>/i)
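
As a rough illustration of how the TITLE_RE and DATE_RE patterns capture their values, the sketch below applies the same pattern strings with Ruby's built-in Regexp rather than ORegexp. The substitution is an assumption made to keep the example dependency-free, and the sample HTML, variable names, and meta values are invented for illustration.

```ruby
# Stand-ins for TITLE_RE and DATE_RE built with Ruby's core Regexp.
# The class itself uses ORegexp; behaviour may differ in edge cases.
title_re = Regexp.new('<meta name="Headline" content="(.*)"', Regexp::IGNORECASE)
date_re  = Regexp.new('<meta name="OriginalPublicationDate" content="(.*)"', Regexp::IGNORECASE)

# Hypothetical page fragment, not taken from a real BBC News page.
sample_html = <<~HTML
  <meta name="Headline" content="Example headline" />
  <meta name="OriginalPublicationDate" content="2009/01/01 12:00:00" />
HTML

# String#[] with a regexp and a group index returns the first capture group,
# or nil when the pattern does not match.
title = sample_html[title_re, 1]
date  = sample_html[date_re, 1]

puts title
puts date
```

Because the capture group is a greedy `(.*)`, it extends to the last double quote on the same line, so a meta tag with further quoted attributes after `content` would capture more than the headline.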

Constants inherited from BaseRegexpParser

WebPageParser::BaseRegexpParser::HTML_ENTITIES_DECODER, WebPageParser::BaseRegexpParser::KILL_CHARS_RE

Instance Attribute Summary

Attributes inherited from BaseParser

#url

Instance Method Summary

Methods inherited from BaseRegexpParser

#content, #date, #decode_entities, #encode, #initialize, #page, #retrieve_page, #title

Methods inherited from BaseParser

#content, #date, #guid, #guid_from_url, #initialize, #page, #retrieve_page, #title

Constructor Details

This class inherits a constructor from WebPageParser::BaseRegexpParser

Instance Method Details

#hash ⇒ Object

# File 'lib/web-page-parser/parsers/bbc_news_page_parser.rb', line 35

def hash
  # Old News Sniffer only hashed the content, not the title
  Digest::MD5.hexdigest(content.join('').to_s)
end
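
The following standalone sketch mirrors the hashing step above so its behaviour can be tried outside the parser. It assumes that #content returns an array of paragraph strings; the helper name content_hash and the sample paragraphs are hypothetical.

```ruby
require 'digest/md5'

# Hypothetical helper mirroring the body of #hash: join the paragraphs
# with no separator and take the MD5 hex digest of the result.
def content_hash(paragraphs)
  Digest::MD5.hexdigest(paragraphs.join('').to_s)
end

same_a  = content_hash(['First paragraph.', 'Second paragraph.'])
same_b  = content_hash(['First paragraph.', 'Second paragraph.'])
changed = content_hash(['First paragraph.', 'Changed paragraph.'])

# Joining with an empty string means paragraph boundaries are not hashed.
split_one = content_hash(['ab', 'c'])
split_two = content_hash(['a', 'bc'])

puts same_a == same_b        # identical content, identical hash
puts same_a == changed       # different content, different hash
puts split_one == split_two  # same joined text, identical hash
```

Since the title is not part of the input, two versions of a page that differ only in their headline produce the same hash under this scheme.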