htmltokenizer README

htmltokenizer is a Ruby port of the idea behind Perl's HTML::TokeParser::Simple.
It treats a web page as a series of tokens, each of which is text, an HTML
tag, or an HTML comment.  The HTMLTokenizer class returns these tokens in
sequence, either one at a time regardless of type, or by skipping ahead to
the next tag from a list of tags you are interested in.
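
To make the token idea concrete, here is a minimal sketch of splitting a
page into text, tag, and comment tokens. It is an illustration only, not
htmltokenizer's own code: SketchToken, sketch_tokens, and sketch_tag_name
are hypothetical names, and the regular expression ignores many real-world
HTML edge cases.

```ruby
# Illustrative only: a tiny regex-based splitter, NOT htmltokenizer itself.
# SketchToken, sketch_tokens and sketch_tag_name are hypothetical names.
SketchToken = Struct.new(:kind, :raw)

def sketch_tokens(html)
  # Alternation order matters: try comments first, then tags; the rest is text.
  html.scan(/<!--.*?-->|<[^>]*>|[^<]+/m).map do |raw|
    kind = if raw.start_with?('<!--')
             :comment
           elsif raw.start_with?('<')
             :tag
           else
             :text
           end
    SketchToken.new(kind, raw)
  end
end

# Lower-cased tag name, with a leading "/" kept for closing tags.
def sketch_tag_name(token)
  token.raw[/\A<\s*(\/?\w+)/, 1].to_s.downcase
end

page = '<p>Hello <!-- note --><a href="/x">link</a></p>'
tokens = sketch_tokens(page)

# Mode 1: every token in sequence, regardless of type.
tokens.each { |t| puts "#{t.kind}: #{t.raw.inspect}" }

# Mode 2: only the tags we care about.
links = tokens.select do |t|
  t.kind == :tag && %w[a /a].include?(sketch_tag_name(t))
end
links.each { |t| puts "interesting tag: #{t.raw}" }
```

The second mode mirrors the idea of asking only for a list of interesting
tags, although this sketch filters an array rather than advancing a cursor
through the page.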

Requirements


* Ruby

Install


Decompress the archive and enter its top directory.
Then type:

  $ ruby install.rb config
  $ ruby install.rb setup
  $ su -c "ruby install.rb install"

or

  $ ruby install.rb config
  $ ruby install.rb setup
  $ sudo ruby install.rb install

You can also install the files into a directory of your choice
by passing options to install.rb. Try "ruby install.rb --help".

Usage


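  require 'html/htmltokenizer'

  # Placeholder: obtain the page's HTML as a String by any means you like.
  page = getSomePageFromTheInternetAsAString()

  tokenizer = HTMLTokenizer.new(page)

  # Step through the page, stopping at each tag of interest.
  while token = tokenizer.getTag('a', 'font', '/tr', 'div')
    if 'div' == token.tag_name
      if 'headlinesheader' == token.attr_hash['class']
        puts "Header is: " + tokenizer.getTrimmedText('/div')
      else
        # Skip to the end of this div, then look at the next link.
        tokenizer.getTag('/div')
        token = tokenizer.getTag('a')
        # getTag may find no further <a> tag, so guard against nil.
        if token && token.attr_hash['href']
          puts "Found a link after a div going to #{token.attr_hash['href']}"
        end
      end
    end
  end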
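
The usage example leaves getSomePageFromTheInternetAsAString() as a
placeholder. One possible way to fill it in, assuming Net::HTTP from Ruby's
standard library is acceptable for your needs, is sketched here. fetch_page
is a hypothetical helper name, not part of htmltokenizer, and the URL in
the comment is only an example.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of htmltokenizer): returns the response
# body for a URL as a String. Net::HTTP.get does not follow redirects and
# does not raise on HTTP error statuses, so treat this as a starting point.
def fetch_page(url)
  Net::HTTP.get(URI(url))
end

# Example call (requires network access):
#   page = fetch_page('http://www.example.com/')
#   tokenizer = HTMLTokenizer.new(page)
```

Any other way of getting the page into a String, such as reading a saved
file, should work just as well, since the tokenizer is only given the text.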

License


Ruby's license, see http://www.ruby-lang.org/en/LICENSE.txt

Ben Giddings <[email protected]>