Class: WebPageParser::BaseParser
- Inherits:
- Object
- Inheritance chain: Object, WebPageParser::BaseParser
- Includes:
- Oniguruma
- Defined in:
- lib/web-page-parser/base_parser.rb
Overview
BaseParser is designed to be subclassed to write new parsers. It provides some basic help, but most of the work needs to be done by the subclass.
Simple pages could be implemented by just defining new regular expression constants, but more advanced parsing can be achieved with the *_processor methods.
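The subclassing pattern described above can be sketched as follows. This is a minimal illustration, not the gem's code: SketchBaseParser is a hypothetical stand-in reduced to the title path so it runs without Oniguruma or Iconv, ExamplePageParser and its page layout are made up, and the class_const lookup via const_get is an assumption about how subclass constants take precedence.

```ruby
# Hypothetical stand-in for WebPageParser::BaseParser, reduced to the
# title path so the sketch runs without the gem or its dependencies.
class SketchBaseParser
  TITLE_RE = //   # placeholder, meant to be redefined by subclasses

  attr_reader :url, :page

  def initialize(options = {})
    @url  = options[:url]
    @page = options[:page]
  end

  def title
    return @title if @title
    if matches = class_const(:TITLE_RE).match(page)
      @title = matches[1].to_s.strip
      title_processor
      @title
    end
  end

  private

  # Assumption: constants are looked up on the concrete class, so a
  # subclass definition shadows the one in the base class.
  def class_const(name)
    self.class.const_get(name)
  end

  # Hook for subclasses; does nothing by default.
  def title_processor; end
end

# Hypothetical parser for a made-up page layout.
class ExamplePageParser < SketchBaseParser
  TITLE_RE = /<h1[^>]*>(.*?)<\/h1>/m

  private

  # Collapse runs of whitespace left over from the markup.
  def title_processor
    @title = @title.gsub(/\s+/, ' ')
  end
end

parser = ExamplePageParser.new(
  :url  => 'http://example.com/story',
  :page => "<html><h1 class=\"headline\">  Hello\n  world </h1></html>"
)
puts parser.title
```

A simple page needs only the constant; the title_processor override is there to show where extra clean-up would go.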
Direct Known Subclasses
TestPageParser, BbcNewsPageParserV1, BbcNewsPageParserV2, GuardianPageParserV1
Constant Summary
- ICONV =
Iconv.new("utf8", "iso-8859-1")
- TITLE_RE =
The regular expression to extract the title
//
- DATE_RE =
The regular expression to extract the date
//
- CONTENT_RE =
The regular expression to extract the content
//
- KILL_CHARS_RE =
The regular expression to find all characters that should be removed from any content.
ORegexp.new('[\n\r]+')
- HTML_ENTITIES_DECODER =
The object used to turn HTML entities into real characters
HTMLEntities.new
Instance Attribute Summary
-
#guid ⇒ Object
readonly
Returns the value of attribute guid.
-
#page ⇒ Object
readonly
Returns the value of attribute page.
-
#url ⇒ Object
readonly
Returns the value of attribute url.
Instance Method Summary
-
#content ⇒ Object
The content method returns the important body text of the web page.
-
#date ⇒ Object
The date method returns the timestamp of the web page, as a DateTime object.
-
#decode_entities(s) ⇒ Object
Convert HTML entities to Unicode.
-
#hash ⇒ Object
Return a hash representing the textual content of this web page.
-
#initialize(options = { }) ⇒ BaseParser
constructor
Takes a hash of options.
-
#title ⇒ Object
The title method returns the title of the web page.
Constructor Details
#initialize(options = { }) ⇒ BaseParser
Takes a hash of options. The :url option passes the page URL, and the :page option passes the raw HTML page content for parsing.
# File 'lib/web-page-parser/base_parser.rb', line 42
def initialize(options = { })
  @url = options[:url]
  @page = options[:page]
end
Instance Attribute Details
#guid ⇒ Object (readonly)
Returns the value of attribute guid.
# File 'lib/web-page-parser/base_parser.rb', line 20
def guid
  @guid
end
#page ⇒ Object (readonly)
Returns the value of attribute page.
# File 'lib/web-page-parser/base_parser.rb', line 20
def page
  @page
end
#url ⇒ Object (readonly)
Returns the value of attribute url.
# File 'lib/web-page-parser/base_parser.rb', line 20
def url
  @url
end
Instance Method Details
#content ⇒ Object
The content method returns the important body text of the web page.
It extracts and pre-processes the page content, then calls the content_processor method for any custom processing. Finally, it strips whitespace, decodes HTML entities, drops empty entries, and returns the content as an array of strings.
When writing a new parser, the CONTENT_RE constant should be defined in the subclass. The KILL_CHARS_RE constant can be overridden if necessary.
# File 'lib/web-page-parser/base_parser.rb', line 87
def content
  return @content if @content
  matches = class_const(:CONTENT_RE).match(page)
  if matches
    @content = class_const(:KILL_CHARS_RE).gsub(matches[1].to_s, '')
    @content = iconv(@content)
    content_processor
    @content.collect! { |p| decode_entities(p.strip) }
    @content.delete_if { |p| p == '' or p.nil? }
  end
  @content = [] if @content.nil?
  @content
end
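The stages of the content method can be traced with a small standalone sketch. Several things here are assumptions made for illustration: the ContentSketch module and its CONTENT_RE page layout are hypothetical, Ruby's built-in Regexp replaces ORegexp, splitting on paragraph tags is only a guess at what a content_processor might do, and CGI.unescapeHTML stands in for the HTMLEntities decoder (it covers far fewer entities).

```ruby
require 'cgi'

# Hypothetical, self-contained walk through the content pipeline.
module ContentSketch
  CONTENT_RE    = /<div id="body">(.*?)<\/div>/m   # made-up page layout
  KILL_CHARS_RE = /[\n\r]+/

  def self.extract(page)
    matches = CONTENT_RE.match(page)
    return [] unless matches

    # Pre-processing: remove the characters matched by KILL_CHARS_RE.
    text = matches[1].to_s.gsub(KILL_CHARS_RE, '')

    # Stand-in for content_processor: split into paragraphs, drop tags.
    paragraphs = text.split(/<p>/i).map { |para| para.gsub(/<[^>]+>/, '') }

    # Post-processing: strip whitespace, decode entities, drop empties.
    paragraphs.map! { |para| CGI.unescapeHTML(para.strip) }
    paragraphs.reject { |para| para.empty? }
  end
end

page = "<div id=\"body\">\n<p>Fish &amp; chips</p>\n<p> </p>\n<p>Second para</p>\n</div>"
paragraphs = ContentSketch.extract(page)
puts paragraphs.inspect
```

The sketch returns an empty array when the expression does not match, mirroring the `@content = []` fallback in the method above.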
#date ⇒ Object
The date method returns the timestamp of the web page, as a DateTime object.
It does the basic extraction using the DATE_RE regular expression but the work of converting the text into a DateTime object needs to be done by the date_processor method.
# File 'lib/web-page-parser/base_parser.rb', line 68
def date
  return @date if @date
  if matches = class_const(:DATE_RE).match(page)
    @date = matches[1].to_s.strip
    date_processor
    @date
  end
end
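Because the base method leaves @date as raw text, a subclass would supply the conversion. The sketch below shows one possible date_processor built on DateTime.parse from the standard library. The DateSketch class, its DATE_RE, and the markup are hypothetical, and falling back to nil on unparseable text is a design choice of this sketch, not documented behaviour of the gem.

```ruby
require 'date'

# Hypothetical parser fragment showing one way to write date_processor.
class DateSketch
  DATE_RE = /<span class="date">(.*?)<\/span>/   # made-up page layout

  def initialize(page)
    @page = page
  end

  def date
    return @date if @date
    if matches = DATE_RE.match(@page)
      @date = matches[1].to_s.strip
      date_processor
      @date
    end
  end

  private

  # Replace the extracted text with a DateTime; give up with nil if the
  # text cannot be parsed (an assumption made for this sketch).
  def date_processor
    @date = DateTime.parse(@date)
  rescue ArgumentError
    @date = nil
  end
end

d = DateSketch.new('<span class="date"> 2009/03/12 14:30:00 </span>').date
puts d.year, d.month, d.day, d.hour
```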
#decode_entities(s) ⇒ Object
Convert HTML entities to Unicode.
# File 'lib/web-page-parser/base_parser.rb', line 110
def decode_entities(s)
  HTML_ENTITIES_DECODER.decode(s)
end
#hash ⇒ Object
Return a hash representing the textual content of this web page.
# File 'lib/web-page-parser/base_parser.rb', line 102
def hash
  digest = Digest::MD5.new
  digest << title.to_s
  digest << content.to_s
  digest.to_s
end
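A standalone sketch of the same digest logic shows how the value can be used to detect changes between two fetches of a page. The content_hash helper and the sample strings are hypothetical. Note that on Ruby 1.9 and later, Array#to_s returns the inspect form, so the exact digest depends on the Ruby version in use.

```ruby
require 'digest/md5'

# Hypothetical helper mirroring the hash method: feed the title and the
# content into one MD5 digest and return its hex string.
def content_hash(title, content)
  digest = Digest::MD5.new
  digest << title.to_s
  digest << content.to_s
  digest.to_s
end

a = content_hash('Headline', ['First para', 'Second para'])
b = content_hash('Headline', ['First para', 'Second para'])
c = content_hash('Headline', ['First para', 'Changed para'])

puts a        # 32 hexadecimal characters
puts a == b   # same text gives the same digest
puts a == c   # edited text gives a different digest
```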
#title ⇒ Object
The title method returns the title of the web page.
It does the basic extraction using the TITLE_RE regular expression and handles text encoding. More advanced parsing can be done by overriding the title_processor method.
# File 'lib/web-page-parser/base_parser.rb', line 52
def title
  return @title if @title
  if matches = class_const(:TITLE_RE).match(page)
    @title = matches[1].to_s.strip
    title_processor
    @title = iconv(@title)
    @title = decode_entities(@title)
  end
end
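The ICONV constant suggests that the encoding step converts ISO-8859-1 text to UTF-8. The iconv helper itself is not shown on this page, and the Iconv library is no longer part of the Ruby standard library, so the following is only a sketch of the equivalent conversion using String#encode (Ruby 2.0 or later). The latin1_to_utf8 name is hypothetical.

```ruby
# Hypothetical replacement for the Iconv-based conversion: treat the raw
# bytes as ISO-8859-1 and transcode them to UTF-8.
def latin1_to_utf8(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# "\xE9" is the single-byte ISO-8859-1 encoding of e-acute.
raw   = "Caf\xE9 society".b
title = latin1_to_utf8(raw)

puts title.encoding
puts title
```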