Class: WebPageParser::BaseParser

Inherits:
Object
Includes:
Oniguruma
Defined in:
lib/web-page-parser/base_parser.rb

Overview

BaseParser is designed to be subclassed to write new parsers. It provides some basic helpers, but most of the work must be done by the subclass.

Simple pages could be implemented by just defining new regular expression constants, but more advanced parsing can be achieved with the *_processor methods.
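The subclassing pattern described above can be sketched as follows. This is a minimal illustration, not the library's code: `SketchParser` is a hypothetical stand-in for BaseParser that uses Ruby's built-in Regexp instead of Oniguruma, and its `class_const` body is an assumption about how constants are looked up on the runtime class. `ExampleNewsParser` and its TITLE_RE pattern are invented for the example.

```ruby
# Hypothetical stand-in for BaseParser, reduced to the title path only.
class SketchParser
  TITLE_RE = // # placeholder, meant to be overridden by subclasses

  attr_reader :url, :page

  def initialize(options = {})
    @url = options[:url]
    @page = options[:page]
  end

  # Assumed behaviour: resolve the constant on the runtime class,
  # so a subclass definition takes precedence over the base one.
  def class_const(name)
    self.class.const_get(name)
  end

  def title
    return @title if @title
    if (matches = class_const(:TITLE_RE).match(page))
      @title = matches[1].to_s.strip
      title_processor
    end
    @title
  end

  # Hook for subclasses; does nothing by default.
  def title_processor; end
end

# A simple parser: one regular expression constant plus one processor hook.
class ExampleNewsParser < SketchParser
  TITLE_RE = %r{<h1[^>]*>(.*?)</h1>}m

  def title_processor
    # Collapse runs of whitespace left over from the markup.
    @title = @title.gsub(/\s+/, ' ')
  end
end

parser = ExampleNewsParser.new(page: "<html><h1 class='x'>  Hello\n  world </h1></html>")
puts parser.title
```

The subclass never touches the extraction logic itself; it only supplies the pattern and, optionally, the clean-up step.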

Constant Summary

ICONV = Iconv.new("utf8", "iso-8859-1")

TITLE_RE = //
    The regular expression to extract the title.

DATE_RE = //
    The regular expression to extract the date.

CONTENT_RE = //
    The regular expression to extract the content.

KILL_CHARS_RE = ORegexp.new('[\n\r]+')
    The regular expression to find all characters that should be removed from any content.

HTML_ENTITIES_DECODER = HTMLEntities.new
    The object used to turn HTML entities into real characters.

Constructor Details

#initialize(options = { }) ⇒ BaseParser

Takes a hash of options. The :url option passes the page URL, and the :page option passes the raw HTML page content for parsing.



# File 'lib/web-page-parser/base_parser.rb', line 42

def initialize(options = { })
  @url = options[:url]
  @page = options[:page]
end

Instance Attribute Details

#guid ⇒ Object (readonly)

Returns the value of attribute guid.



# File 'lib/web-page-parser/base_parser.rb', line 20

def guid
  @guid
end

#page ⇒ Object (readonly)

Returns the value of attribute page.



# File 'lib/web-page-parser/base_parser.rb', line 20

def page
  @page
end

#url ⇒ Object (readonly)

Returns the value of attribute url.



# File 'lib/web-page-parser/base_parser.rb', line 20

def url
  @url
end

Instance Method Details

#content ⇒ Object

The content method returns the important body text of the web page.

It does basic extraction and pre-processing of the page content and then calls the content_processor method for any further custom processing. Lastly, it does some basic post-processing and returns the content as an array of strings (an empty array if CONTENT_RE does not match).

When writing a new parser, the CONTENT_RE constant should be defined in the subclass. The KILL_CHARS_RE constant can be overridden if necessary.



# File 'lib/web-page-parser/base_parser.rb', line 87

def content
  return @content if @content
  matches = class_const(:CONTENT_RE).match(page)
  if matches
    @content = class_const(:KILL_CHARS_RE).gsub(matches[1].to_s, '')
    @content = iconv(@content)
    content_processor
    @content.collect! { |p| decode_entities(p.strip) }
    @content.delete_if { |p| p == '' or p.nil? }
  end
  @content = [] if @content.nil?
  @content
end
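The pipeline in the listing above (match, strip line breaks, custom processing, trim, drop empties) can be sketched in plain Ruby. This is a simplified illustration under stated assumptions: the `extract_content` function, the CONTENT_RE pattern, and the paragraph-splitting step standing in for content_processor are all hypothetical, built-in Regexp replaces ORegexp, and the encoding conversion and entity decoding steps are left out.

```ruby
# Hypothetical pattern for a page whose body text sits in <div id="story">.
CONTENT_RE = %r{<div id="story">(.*?)</div>}m
# Same character class as the documented KILL_CHARS_RE, as a built-in Regexp.
KILL_CHARS_RE = /[\n\r]+/

def extract_content(page)
  matches = CONTENT_RE.match(page)
  return [] unless matches

  # Pre-processing: remove line breaks from the captured text.
  text = matches[1].to_s.gsub(KILL_CHARS_RE, '')

  # Stand-in for content_processor: turn the string into an array
  # of paragraphs by splitting on <p> and </p> tags.
  paragraphs = text.split(%r{</?p[^>]*>})

  # Post-processing: trim each paragraph and drop the empty ones.
  paragraphs.map(&:strip).reject(&:empty?)
end

page = "<div id=\"story\">\n<p>First para.</p>\n<p> Second para. </p>\n</div>"
p extract_content(page)
```

The custom step is what changes the value from a string into an array, which is why the post-processing can iterate over paragraphs.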

#date ⇒ Object

The date method returns the timestamp of the web page as a DateTime object, or nil if DATE_RE does not match.

It does the basic extraction using the DATE_RE regular expression but the work of converting the text into a DateTime object needs to be done by the date_processor method.



# File 'lib/web-page-parser/base_parser.rb', line 68

def date
  return @date if @date
  if matches = class_const(:DATE_RE).match(page)
    @date = matches[1].to_s.strip
    date_processor
    @date
  end
end
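As a rough sketch of the conversion work left to date_processor, the captured text can be handed to the standard library's `DateTime.parse`. The `extract_date` function and the DATE_RE pattern here are hypothetical, and a real page may need a format-specific parse instead.

```ruby
require 'date'

# Hypothetical pattern for a page that carries its timestamp in a meta tag.
DATE_RE = %r{<meta name="date" content="([^"]+)"}

def extract_date(page)
  matches = DATE_RE.match(page)
  return nil unless matches

  raw = matches[1].to_s.strip
  # Stand-in for date_processor: convert the raw text to a DateTime.
  DateTime.parse(raw)
end

date = extract_date('<meta name="date" content="2009-07-14T10:30:00+00:00">')
puts date.year
puts date.hour
```

`DateTime.parse` raises an error on text it cannot interpret, so a processor for messier pages would need to handle that case.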

#decode_entities(s) ⇒ Object

Converts HTML entities to Unicode characters.



# File 'lib/web-page-parser/base_parser.rb', line 110

def decode_entities(s)
  HTML_ENTITIES_DECODER.decode(s)
end

#hash ⇒ Object

Returns an MD5 hex digest of the title and content, representing the textual content of this web page.



# File 'lib/web-page-parser/base_parser.rb', line 102

def hash
  digest = Digest::MD5.new
  digest << title.to_s
  digest << content.to_s
  digest.to_s
end
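The digest construction in the listing above can be tried in isolation. In this sketch, `content_fingerprint` is a hypothetical free-standing function that takes the title and content as arguments rather than reading them from a parser instance.

```ruby
require 'digest/md5'

# Feed the title and the content into one MD5 digest, piece by piece.
def content_fingerprint(title, paragraphs)
  digest = Digest::MD5.new
  digest << title.to_s
  digest << paragraphs.to_s
  digest.to_s # hexadecimal string, 32 characters
end

a = content_fingerprint("Headline", ["First para.", "Second para."])
b = content_fingerprint("Headline", ["First para.", "Second para."])
c = content_fingerprint("Headline", ["First para.", "Edited para."])

puts a == b
puts a == c
```

Identical text yields an identical digest, so the value can be compared across fetches to detect edits. Note that a method named hash shadows Object#hash, which Ruby expects to return an Integer.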

#title ⇒ Object

The title method returns the title of the web page.

It does the basic extraction using the TITLE_RE regular expression and handles text encoding. More advanced parsing can be done by overriding the title_processor method.



# File 'lib/web-page-parser/base_parser.rb', line 52

def title
  return @title if @title
  if matches = class_const(:TITLE_RE).match(page)
    @title = matches[1].to_s.strip
    title_processor
    @title = iconv(@title)
    @title = decode_entities(@title)
  end
end
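The text encoding step converts ISO-8859-1 bytes to UTF-8, as the ICONV constant indicates. The Iconv library is no longer part of Ruby's standard library, so the sketch below swaps in the built-in `String#encode` to show the same conversion. The `to_utf8` helper is hypothetical and is not the class's iconv method.

```ruby
# Reinterpret the raw bytes as ISO-8859-1, then transcode them to UTF-8.
def to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# "café" as served by an ISO-8859-1 page: the final byte 0xE9 is "é".
raw = "caf\xE9".b
title = to_utf8(raw)

puts title.encoding
puts title.valid_encoding?
puts title.bytes.length
```

The single byte 0xE9 becomes a two-byte UTF-8 sequence, so the byte length of the converted string grows by one.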