Class: WebPageParser::BaseParser

Inherits: Object

Object
WebPageParser::BaseParser

Includes: Oniguruma

Defined in: lib/web-page-parser/base_parser.rb

Overview

BaseParser is designed to be subclassed to write new parsers. It provides some basic help, but most of the work needs to be done by the subclass.

Simple pages can be handled by just defining new regular expression constants in the subclass. More advanced parsing can be achieved with the *_processor methods (title_processor, date_processor and content_processor).
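As a minimal sketch of this pattern, the block below overrides one regular expression constant and one processor hook. It does not use the real WebPageParser::BaseParser, which depends on Oniguruma, Iconv and HTMLEntities. Instead it uses a hypothetical stand-in, SketchBaseParser, written with core Ruby only. The stand-in, the ExamplePageParser subclass, the sample HTML, and the assumed behaviour of class_const (a constant lookup on the subclass) are illustrations, not the library's implementation.

```ruby
# Hypothetical stand-in for WebPageParser::BaseParser, using only core Ruby.
class SketchBaseParser
  TITLE_RE = // # subclasses are expected to override this

  attr_reader :url, :page

  def initialize(options = {})
    @url = options[:url]
    @page = options[:page]
  end

  def title
    return @title if @title
    if (matches = class_const(:TITLE_RE).match(page))
      @title = matches[1].to_s.strip
      title_processor
    end
    @title
  end

  private

  # Assumed behaviour: resolve the constant on the subclass first.
  def class_const(name)
    self.class.const_get(name)
  end

  # Hook for subclasses; does nothing by default.
  def title_processor; end
end

# A "simple page" parser: only the regular expression is required.
class ExamplePageParser < SketchBaseParser
  TITLE_RE = %r{<h1>(.*?)</h1>}m

  # Optional hook: collapse runs of whitespace inside the title.
  def title_processor
    @title = @title.gsub(/\s+/, ' ')
  end
end

parser = ExamplePageParser.new(
  url: 'http://example.com/story',
  page: '<html><h1>  Hello   world </h1></html>'
)
puts parser.title
```

A subclass that needs no custom processing could omit title_processor entirely and define only the constant.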

Direct Known Subclasses

TestPageParser, BbcNewsPageParserV1, BbcNewsPageParserV2, GuardianPageParserV1

Constant Summary

ICONV = Iconv.new("utf8", "iso-8859-1")

TITLE_RE = //
The regular expression to extract the title.

DATE_RE = //
The regular expression to extract the date.

CONTENT_RE = //
The regular expression to extract the content.

KILL_CHARS_RE = ORegexp.new('[\n\r]+')
The regular expression to find all characters that should be removed from any content.

HTML_ENTITIES_DECODER = HTMLEntities.new
The object used to turn HTML entities into real characters.

Instance Attribute Summary

#guid ⇒ Object (readonly)

Returns the value of attribute guid.

#page ⇒ Object (readonly)

Returns the value of attribute page.

#url ⇒ Object (readonly)

Returns the value of attribute url.

Instance Method Summary

#content ⇒ Object

The content method returns the important body text of the web page.

#date ⇒ Object

The date method returns the timestamp of the web page, as a DateTime object.

#decode_entities(s) ⇒ Object

Convert HTML entities to Unicode.

#hash ⇒ Object

Return a hash representing the textual content of this web page.

#initialize(options = { }) ⇒ BaseParser (constructor)

Takes a hash of options.

#title ⇒ Object

The title method returns the title of the web page.

Constructor Details

#initialize(options = { }) ⇒ `BaseParser`

Takes a hash of options. The :url option passes the page URL, and the :page option passes the raw HTML page content for parsing.

# File 'lib/web-page-parser/base_parser.rb', line 42

def initialize(options = { })
  @url = options[:url]
  @page = options[:page]
end

Instance Attribute Details

#guid ⇒ `Object` (readonly)

Returns the value of attribute guid.



# File 'lib/web-page-parser/base_parser.rb', line 20

def guid
  @guid
end

#page ⇒ `Object` (readonly)

Returns the value of attribute page.



# File 'lib/web-page-parser/base_parser.rb', line 20

def page
  @page
end

#url ⇒ `Object` (readonly)

Returns the value of attribute url.



# File 'lib/web-page-parser/base_parser.rb', line 20

def url
  @url
end

Instance Method Details

#content ⇒ `Object`

The content method returns the important body text of the web page.

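It does basic extraction and pre-processing of the page content, then calls the content_processor method for any custom processing. Lastly, it does some basic post-processing and returns the content. Judging by the source below, which calls collect! and delete_if on the result and falls back to an empty array, the return value is an array of strings rather than a single string.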

When writing a new parser, the CONTENT_RE constant should be defined in the subclass. The KILL_CHARS_RE constant can be overridden if necessary.

# File 'lib/web-page-parser/base_parser.rb', line 87

def content
  return @content if @content
  matches = class_const(:CONTENT_RE).match(page)
  if matches
    @content = class_const(:KILL_CHARS_RE).gsub(matches[1].to_s, '')
    @content = iconv(@content)
    content_processor
    @content.collect! { |p| decode_entities(p.strip) }
    @content.delete_if { |p| p == '' or p.nil? }        
  end
  @content = [] if @content.nil?
  @content
end
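The steps above can be sketched with core Ruby alone. The block below is a hypothetical stand-in, SketchContentParser, and not the library's code: it skips the Iconv and HTMLEntities steps, uses plain Regexp instead of ORegexp, and assumes a page whose body sits in a div with class "body". Because collect! is called after content_processor, the sketch assumes the hook is expected to turn @content from a string into an array of paragraphs.

```ruby
# Hypothetical stand-in mirroring the steps of #content with core Ruby only.
class SketchContentParser
  CONTENT_RE = %r{<div class="body">(.*?)</div>}m # assumed page markup
  KILL_CHARS_RE = /[\n\r]+/

  def initialize(options = {})
    @page = options[:page]
  end

  def content
    return @content if @content
    matches = self.class::CONTENT_RE.match(@page)
    if matches
      # Pre-processing: remove line breaks from the extracted text.
      @content = matches[1].to_s.gsub(self.class::KILL_CHARS_RE, '')
      # Custom step supplied by the parser author.
      content_processor
      # Post-processing: trim each paragraph and drop empty ones.
      @content.collect!(&:strip)
      @content.delete_if(&:empty?)
    end
    @content ||= []
  end

  private

  # Split on paragraph tags, then drop any remaining markup.
  def content_processor
    @content = @content.split(%r{<p>|</p>}).map { |para| para.gsub(/<[^>]+>/, '') }
  end
end

html = %(<div class="body">\n<p>First <b>para</b>.</p>\n<p> Second para. </p>\n</div>)
p SketchContentParser.new(page: html).content
```

When CONTENT_RE does not match, the sketch returns an empty array, as the original method does.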

#date ⇒ `Object`

The date method returns the timestamp of the web page, as a DateTime object.

It does the basic extraction using the DATE_RE regular expression, but the work of converting the extracted text into a DateTime object needs to be done by the date_processor method.

# File 'lib/web-page-parser/base_parser.rb', line 68

def date
  return @date if @date
  if matches = class_const(:DATE_RE).match(page)
    @date = matches[1].to_s.strip
    date_processor
    @date
  end
end
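A date_processor along these lines might look like the sketch below. SketchDateParser is a hypothetical stand-in written with the standard library only. The meta tag markup, the use of DateTime.parse, and the choice to fall back to nil on unparseable text are all assumptions made for illustration.

```ruby
require 'date'

# Hypothetical stand-in mirroring #date: extract text, then let a hook convert it.
class SketchDateParser
  DATE_RE = /<meta name="date" content="([^"]*)"/ # assumed page markup

  def initialize(options = {})
    @page = options[:page]
  end

  def date
    return @date if @date
    if (matches = self.class::DATE_RE.match(@page))
      @date = matches[1].to_s.strip
      date_processor
      @date
    end
  end

  private

  # Convert the extracted text; fall back to nil if it cannot be parsed.
  def date_processor
    @date = DateTime.parse(@date)
  rescue ArgumentError
    @date = nil
  end
end

page = '<meta name="date" content="2009-03-14T10:30:00+00:00" />'
parsed = SketchDateParser.new(page: page).date
puts parsed.iso8601
```

Sites that print dates in a fixed, non-standard layout may be better served by DateTime.strptime with an explicit format string.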

#decode_entities(s) ⇒ `Object`

Convert HTML entities to Unicode.



# File 'lib/web-page-parser/base_parser.rb', line 110

def decode_entities(s)
  HTML_ENTITIES_DECODER.decode(s)
end

#hash ⇒ `Object`

Return a hash (an MD5 hex digest) representing the textual content of this web page.

# File 'lib/web-page-parser/base_parser.rb', line 102

def hash
  digest = Digest::MD5.new
  digest << title.to_s
  digest << content.to_s
  digest.to_s
end
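One plausible use of such a digest is detecting whether a page's text has changed between two fetches. The sketch below reproduces the same digest steps in a hypothetical free-standing helper, page_fingerprint, so that it runs without the library. The helper name and the sample strings are assumptions.

```ruby
require 'digest'

# Hypothetical helper mirroring #hash: an MD5 hex digest over title and content.
def page_fingerprint(title, content)
  digest = Digest::MD5.new
  digest << title.to_s
  digest << content.to_s
  digest.to_s
end

before = page_fingerprint('Headline', ['First paragraph.'])
after  = page_fingerprint('Headline', ['First paragraph, edited.'])

puts before
puts(before == after ? 'unchanged' : 'changed')
```

Identical title and content always give the same 32-character hex string, so two fetches can be compared without storing the full text.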

#title ⇒ `Object`

The title method returns the title of the web page.

It does the basic extraction using the TITLE_RE regular expression and handles text encoding. More advanced parsing can be done by overriding the title_processor method.

# File 'lib/web-page-parser/base_parser.rb', line 52

def title
  return @title if @title
  if matches = class_const(:TITLE_RE).match(page)
    @title = matches[1].to_s.strip
    title_processor
    @title = iconv(@title)
    @title = decode_entities(@title)
  end
end
