Class: WebPageParser::GuardianPageParserV1
- Inherits:
- BaseRegexpParser
- Inheritance chain (root first):
- Object
- BaseParser
- BaseRegexpParser
- WebPageParser::GuardianPageParserV1
- Defined in:
- lib/web-page-parser/parsers/guardian_page_parser.rb
Overview
GuardianPageParserV1 parses Guardian web pages using regular expressions.
Constant Summary
- TITLE_RE =
ORegexp.new('<meta property="og:title" content="(.*)"', 'i')
- DATE_RE =
ORegexp.new('<meta property="article:published_time" content="(.*)"', 'i')
- CONTENT_RE =
ORegexp.new('article-body-blocks">(.*?)<div id="related"', 'm')
- STRIP_TAGS_RE =
ORegexp.new('</?(a|span|div|img|tr|td|!--|table)[^>]*>','i')
- STRIP_SCRIPTS_RE =
ORegexp.new('<script[^>]*>.*?</script>','i')
- PARA_RE =
Regexp.new(/<(p|h2)[^>]*>(.*?)<\/\1>/i)
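As an illustration of how the metadata patterns above behave, here is a minimal sketch that applies equivalents of TITLE_RE and DATE_RE to a sample document head. It makes several assumptions: Ruby's built-in Regexp stands in for ORegexp (which comes from the oniguruma gem), the sample markup is invented rather than taken from a real Guardian page, and the `first_capture` helper is hypothetical and not part of the library.

```ruby
# Built-in Regexp stand-ins for TITLE_RE and DATE_RE (assumption: for these
# simple patterns the built-in engine matches the same text as ORegexp).
title_re = /<meta property="og:title" content="(.*)"/i
date_re  = /<meta property="article:published_time" content="(.*)"/i

# Hypothetical sample markup, for illustration only.
sample_head = <<~HTML
  <meta property="og:title" content="Example headline"/>
  <meta property="article:published_time" content="2012-01-15T10:30:00Z"/>
HTML

# Hypothetical helper: return the first capture group, or nil if no match.
def first_capture(regexp, html)
  match = regexp.match(html)
  match && match[1]
end

title = first_capture(title_re, sample_head)
date  = first_capture(date_re, sample_head)

puts title
puts date
```

Note that `(.*)` is greedy and stops at the last double quote on the line, so a meta tag with further quoted attributes after `content` on the same line would be over-captured by this sketch.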
Constants inherited from BaseRegexpParser
BaseRegexpParser::HTML_ENTITIES_DECODER, BaseRegexpParser::KILL_CHARS_RE
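The body-related constants (CONTENT_RE, STRIP_SCRIPTS_RE, STRIP_TAGS_RE and PARA_RE) suggest a pipeline of extracting the article body, removing scripts and unwanted tags, then collecting paragraphs. The sketch below shows one plausible ordering under stated assumptions: built-in Regexp replaces ORegexp, the sample page is invented, and the step order is a guess rather than the parser's confirmed implementation. Entity decoding and character cleanup via the inherited constants are left out.

```ruby
# Built-in Regexp stand-ins for the documented ORegexp constants (assumption).
content_re       = /article-body-blocks">(.*?)<div id="related"/m
strip_scripts_re = /<script[^>]*>.*?<\/script>/i
strip_tags_re    = /<\/?(a|span|div|img|tr|td|!--|table)[^>]*>/i
para_re          = /<(p|h2)[^>]*>(.*?)<\/\1>/i

# Hypothetical sample page, for illustration only.
sample_page = <<~HTML
  <div id="article-body-blocks"><h2>Subheading</h2>
  <p>First <a href="/x">linked</a> paragraph.</p>
  <script type="text/javascript">track();</script>
  <p><span class="b">Second</span> paragraph.</p>
  <div id="related">Related links</div>
HTML

# 1. Take everything between the body marker and the "related" block.
body = sample_page[content_re, 1].to_s

# 2. Drop inline scripts, then the listed presentational tags
#    (the text inside tags such as <a> and <span> is kept).
body = body.gsub(strip_scripts_re, '')
body = body.gsub(strip_tags_re, '')

# 3. Collect the text of each <p> or <h2> element; scan yields [tag, text].
paragraphs = body.scan(para_re).map { |_tag, text| text.strip }

paragraphs.each { |para| puts para }
```

In this sketch the script pattern has no `m` option, so a script block spanning several lines would not be removed; whether that matters depends on the pages being parsed.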
Instance Attribute Summary
Attributes inherited from BaseParser
Method Summary
Methods inherited from BaseRegexpParser
#content, #date, #decode_entities, #encode, #initialize, #page, #retrieve_page, #title
Methods inherited from BaseParser
#content, #date, #guid, #guid_from_url, #hash, #initialize, #page, #retrieve_page, #title
Constructor Details
This class inherits a constructor from WebPageParser::BaseRegexpParser.