Version: 1.06

  1. September, 2003

This is a Ruby library for building trees representing HTML structure.

See the file INSTALL for installation instructions.

Copyright (C) 2003, Johannes Brodwall
Copyright (C) 2002, Ned Konz

License: Ruby's

See for the most recent version.

This project includes SGML-parser, ported from Python by Takahiro Maebashi <> (see:


Ruby 1.8


The tests run using Test::Unit, which is part of the standard Ruby install as of 1.8.


XPath support requires REXML, which is part of the standard Ruby install as of 1.8.


Changes from 1.09:

  • Some minor bugfixes

  • SGMLParser.src_range makes it very easy to write applications which parse HTML files into components and manipulate the corresponding source code without altering it. (by Philip Dorrell)
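
The idea behind source ranges can be sketched without the library: record where each tag sits in the original text, then edit only that slice so the rest of the source stays byte-for-byte unchanged. This is a minimal illustration, not SGMLParser.src_range itself (its signature is not described here); tag_ranges is a hypothetical helper based on a simple regular expression.

```ruby
# Hypothetical helper, not part of the library: returns [tag_text, range]
# pairs, where range is the character range of the tag within source.
def tag_ranges(source)
  ranges = []
  source.scan(/<[^>]+>/) do
    m = Regexp.last_match
    ranges << [m[0], m.begin(0)...m.end(0)]
  end
  ranges
end

source = '<p>Hello <B>world</B></p>'
tags = tag_ranges(source)

# Look up the component we want to change and remember where it came from.
_text, range = tags.find { |text, _| text == '<B>' }

# Replace only that slice; everything outside the range is left untouched.
edited = source.dup
edited[range] = '<b class="x">'
```

Only the opening tag changes in edited; the unmatched closing tag is deliberately left as it was to show that nothing else is rewritten.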

Changes from 1.08:

  • Fixed xpath script and added tests

  • Fixed bug #681 (xhtml)

  • Added GemSpec

Changes from 1.07:

  • Fixed tc_xpath test_match_all after it was broken by upgrade of REXML.

  • Refactored utility code for printing node paths into rexml-nodepath.rb

Changes from 1.06:

  • Included stuff that I had forgotten to package into the tarball.

Changes from 1.05:

  • Updated everything to work with Ruby 1.8.

Changes from 1.04:

  • Made sure that unknown entities and characters are not discarded, in both html/tree.rb and html/xmltree.rb

  • Added handling of DOCTYPE to html/xmltree.rb
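
What "not discarded" means for entities can be illustrated with a small sketch: known entities are decoded, and anything unrecognised is passed through verbatim instead of being dropped. The table and the decode_entities helper are assumptions made for this example, not the code in html/tree.rb or html/xmltree.rb.

```ruby
# Deliberately tiny entity table, for illustration only.
KNOWN_ENTITIES = { 'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"' }.freeze

# Hypothetical helper: decodes known entities and keeps unknown ones as-is.
def decode_entities(text)
  text.gsub(/&(\w+);/) do |match|
    # Fall back to the original "&name;" text when the name is unknown.
    KNOWN_ENTITIES.fetch(Regexp.last_match(1), match)
  end
end

decoded = decode_entities('a &lt; b &amp; &foo; c')
```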

Changes from 1.03:

  • Added HTMLTree::XMLParser, which makes a REXML document from the given HTML.

  • Changed HTMLTree::Element::print_on() to write()

  • Made it so that a string or IO can be passed to HTMLTree::Element::dump()

  • Made it so that a string or IO can be passed to HTMLTree::Element::write()
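
Accepting either a String or an IO is straightforward in Ruby because both respond to <<. The sketch below shows that pattern with a hypothetical ToyElement class; it is not the implementation of HTMLTree::Element#write or #dump.

```ruby
require 'stringio'

# Hypothetical stand-in for a tree node; not the library's HTMLTree::Element.
class ToyElement
  def initialize(tag, children = [])
    @tag = tag
    @children = children
  end

  # Appends markup to out, which may be a String or any IO-like object,
  # since both respond to <<. Returns out.
  def write(out = $stdout)
    out << "<#{@tag}>"
    @children.each do |child|
      if child.is_a?(ToyElement)
        child.write(out)
      else
        out << child.to_s
      end
    end
    out << "</#{@tag}>"
    out
  end
end

node = ToyElement.new('p', ['Hello, ', ToyElement.new('b', ['world'])])

buffer = +''          # String destination
node.write(buffer)

io = StringIO.new     # IO destination
node.write(io)
```

Both destinations end up with the same markup, so callers can collect output in memory or stream it to a file with the same call.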

Changes from 1.02:

  • added XPath and XML conversion (needs REXML)

  • Wrapped all code in namespaces. The following class names have changed:

    – in html/element.rb
        HTMLDocument => HTMLTree::Document
        HTMLElement  => HTMLTree::Element
        HTMLData     => HTMLTree::Data
        HTMLComment  => HTMLTree::Comment
        HTMLSpecial  => HTMLTree::Special

    – in html/tags.rb
        HTMLTag              => HTML::Tag
        HTMLBlockTag         => HTML::BlockTag
        HTMLInlineTag        => HTML::InlineTag
        HTMLBlockOrInlineTag => HTML::BlockOrInlineTag
        HTMLEmptyTag         => HTML::EmptyTag

    – in html/tree.rb
        HTMLTreeParser => HTMLTree::Parser

    – in html/stparser.rb
        StackingParser => HTML::StackingParser

  • added HTMLTree::Element.root()
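
Code written against the old class names could be kept working with constant aliases. The following is a sketch of that approach for the HTMLTree names only; the empty classes are stand-ins so the example runs on its own, and LEGACY_NAMES is a name invented for this illustration. The library itself is not known to ship such a shim.

```ruby
# Stand-ins so the sketch is self-contained; with the real library loaded,
# these classes would already exist and this module body would be omitted.
module HTMLTree
  class Document; end
  class Element; end
  class Data; end
  class Comment; end
  class Special; end
  class Parser; end
end

# Old top-level name => new namespaced class, taken from the rename list.
LEGACY_NAMES = {
  HTMLDocument:   HTMLTree::Document,
  HTMLElement:    HTMLTree::Element,
  HTMLData:       HTMLTree::Data,
  HTMLComment:    HTMLTree::Comment,
  HTMLSpecial:    HTMLTree::Special,
  HTMLTreeParser: HTMLTree::Parser
}.freeze

# Define each old name as an alias unless something already claims it.
LEGACY_NAMES.each do |old_name, new_class|
  Object.const_set(old_name, new_class) unless Object.const_defined?(old_name)
end

legacy_parser = HTMLTreeParser.new   # resolves to HTMLTree::Parser
```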

Changes from 1.01:

  • documented change to sgml-parser.

  • added bin/ebaySearch.rb example

Changes from 1.0:

  • attributes now maintain their order. Though this probably isn't strictly necessary under HTML, it may make it easier to compare document versions.

  • the generated tree now has a top-level node for the document itself, so the DTD can be stored. THIS WILL REQUIRE CODE CHANGES if you have code that assumes that the root node is always <html>. To find the <html> node, you can use the new methods HTMLTreeParser#html() or HTMLDocument#html_node():

    html = parser.html()

    Or, querying the tree:

    html = parser.tree.html_node()

  • comments are stored in the tree

  • added HTMLElement#print_on() to print a (sub)tree to an IO stream
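
To see why a top-level document node breaks code that treats the root as <html>, consider this small sketch. ToyNode and find_html are hypothetical and only mimic the shape of the tree; in real code, use the html() and html_node() methods mentioned above.

```ruby
# Hypothetical minimal node type; the library's own classes are richer.
ToyNode = Struct.new(:tag, :children)

# Returns the first <html> element at or below node, or nil if none exists.
def find_html(node)
  return node if node.tag == 'html'

  node.children.each do |child|
    found = find_html(child)
    return found if found
  end
  nil
end

html = ToyNode.new('html', [ToyNode.new('head', []), ToyNode.new('body', [])])

# The document node sits above <html> so the DTD has somewhere to live.
document = ToyNode.new(nil, [ToyNode.new('!DOCTYPE', []), html])

root_is_html = (document.tag == 'html')   # the old assumption no longer holds
located = find_html(document)             # search below the root instead
```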
