Saxxy: A Ruby DSL for SAX parsers

Saxxy is a DSL for creating SAX parsers. If anyone tells you that you are a masochist because you are SAX-parsing HTML, show them Saxxy.

It currently supports Nokogiri, Ox and LibXML, and it is easy to implement your own parser bindings. It can parse XML out of the box, but HTML SAX parsing depends heavily on how the underlying parser handles HTML. LibXML cannot handle malformed HTML at all. Ox and Nokogiri handle HTML (even malformed HTML) well, so I recommend them.


Saxxy requires Ruby >= 1.9, or JRuby with JRUBY_OPTS=--1.9.


Add this line to your application's Gemfile:

gem 'saxxy'

Or install it independently of Bundler:

$ gem install saxxy

Getting started


First, create a service object with a specified parser. The constructor accepts a symbol (:nokogiri, :libxml, :ox) or a class if you wrote your own parser implementation. When you pass a block, the service builds a context tree (see Saxxy::ContextTree for more details) and registers the callbacks it will call while parsing. E.g.

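require "saxxy/parsers/nokogiri"

# The constructor is assumed to be Saxxy::Service.new; pass :nokogiri, :ox,
# :libxml or your own parser class.
service = Saxxy::Service.new(:nokogiri) do
  under("div", class: /cool$/) do
    # Runs for every span or div with rel="foo" inside a matching div.
    on(/span|div/, rel: "foo") do |inner_text, element, attributes|
      puts "Under a #{element} found some text: " + inner_text
    end

    under("table", class: "main") do
      under("tr", class: "header") do
        # Runs for every td inside a header row of the main table.
        on("td") do |inner_text, element, attributes|
          puts "Found some other text in a table cell: " + inner_text
        end
      end
    end
  end
end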

The service provides parse_file, parse_string and parse_io methods; use whichever fits your needs. Each method accepts its corresponding source (with the respective source type) as the first argument and an optional encoding as the second argument.

service.parse_string <<-eos
  <div class="so-cool">
    <span rel="foo">Hey I am in a span! <em>And I am nested in a span!</em></span>
    <div rel="foo">Hey I am in a div!</div>
  </div>
eos

# => Under a span found some text: Hey I am in a span! And I am nested in a span!
# => Under a div found some text: Hey I am in a div!

If the parser doesn't raise some funny error, you should see your registered callbacks being called with the text, the element name and the attributes found at the matching node.
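The output above suggests that the text passed to a callback includes the text of nested elements (the em inside the span). One way this could work on top of SAX events is sketched below. It is a minimal illustration, not Saxxy's actual implementation: the inner_texts helper and the hand-written event list are hypothetical stand-ins for real parser callbacks.

```ruby
# Hypothetical sketch: collect the inner text of every element named `target`
# from a list of [type, value] events that stand in for SAX callbacks.
def inner_texts(events, target)
  open_buffers = [] # one text buffer per currently open target element
  results = []

  events.each do |type, value|
    case type
    when :start
      open_buffers << String.new if value == target
    when :text
      # Text belongs to every open target element, however deeply it is nested.
      open_buffers.each { |buffer| buffer << value }
    when :end
      results << open_buffers.pop if value == target
    end
  end

  results
end

events = [
  [:start, "span"],
  [:text, "Hey I am in a span! "],
  [:start, "em"],
  [:text, "And I am nested in a span!"],
  [:end, "em"],
  [:end, "span"]
]

p inner_texts(events, "span")
# => ["Hey I am in a span! And I am nested in a span!"]
```

Keeping one buffer per open element means a nested element never needs to know about its ancestors; each text event is simply appended to every buffer that is still open.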


Saxxy uses a DSL to create a context tree and register callbacks. The two most significant methods for doing so are on and under. The on method declares a condition, and the block it accepts is the callback that runs when a node meets that condition.

The following example shows a callback that runs when the parser encounters a heading element (h1 to h6) with a class that matches /foo$/:

on(/^h[1-6]{1}/, class: /foo$/) do |text, element, attributes|
  p "Element name is: #{element} and the inner text is: #{text}"
end
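To make the idea of a condition concrete, here is a minimal sketch of how an element name and its attributes could be tested against String or Regexp patterns like the ones passed to on. The node_matches? helper is hypothetical and is not part of Saxxy; it only assumes that attributes arrive as a hash with string keys.

```ruby
# Hypothetical helper, not part of Saxxy. A pattern is a String or a Regexp:
# Regexp#=== tests for a match and String#=== tests for equality, so a single
# `===` call covers both kinds of pattern.
def node_matches?(name_pattern, attr_patterns, element, attributes)
  return false unless name_pattern === element

  attr_patterns.all? do |key, pattern|
    value = attributes[key.to_s]
    !value.nil? && pattern === value
  end
end

heading = /^h[1-6]{1}/

puts node_matches?(heading, { class: /foo$/ }, "h2", { "class" => "title foo" }) # true
puts node_matches?(heading, { class: /foo$/ }, "h2", { "class" => "foobar" })    # false
puts node_matches?(heading, { class: /foo$/ }, "p", { "class" => "foo" })        # false
puts node_matches?("td", {}, "td", {})                                           # true
```

An element with no attribute conditions matches on its name alone, which mirrors the on("td") call in the earlier example.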

Now suppose you want to restrict the on call to, say, headings inside a div element with the class footer. To do that, nest the on call in an under call, which restricts the range of the callbacks it contains. E.g.

under("div", class: "footer") do
  on(/^h[1-6]{1}/, class: /foo$/) do |text, element, attributes|
    p "Element name is: #{element} and the inner text is: #{text}"
  end
end
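The scoping that under provides can be pictured as a stack of open elements that is updated on every start and end event. The sketch below is a hypothetical illustration rather than Saxxy's context tree: the scoped_headings helper and the event triples are made up for the example, and it assumes that a scope covers descendants at any depth.

```ruby
# Hypothetical sketch, not Saxxy's implementation. Events are hand-written
# [type, name, attributes] triples standing in for real SAX callbacks.
def scoped_headings(events, scope_name, scope_class)
  stack = [] # one boolean per open element: does it open the scope?
  found = []

  events.each do |type, name, attributes|
    case type
    when :start
      in_scope = stack.any? # true when any open ancestor opened the scope
      if in_scope && /^h[1-6]/.match?(name) && /foo$/.match?(attributes["class"].to_s)
        found << name
      end
      stack << (name == scope_name && attributes["class"] == scope_class)
    when :end
      stack.pop
    end
  end

  found
end

events = [
  [:start, "body", {}],
  [:start, "h1", { "class" => "big foo" }],   # outside the footer: ignored
  [:end, "h1"],
  [:start, "div", { "class" => "footer" }],
  [:start, "h2", { "class" => "foo" }],       # inside the footer: matched
  [:end, "h2"],
  [:start, "section", {}],
  [:start, "h3", { "class" => "small foo" }], # nested deeper, still inside
  [:end, "h3"],
  [:end, "section"],
  [:end, "div"],
  [:start, "h4", { "class" => "foo" }],       # footer already closed: ignored
  [:end, "h4"],
  [:end, "body"]
]

p scoped_headings(events, "div", "footer")
# => ["h2", "h3"]
```

Because the stack is popped on every end event, the scope closes automatically when the footer div ends, so later headings are not reported.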


You can find the documentation here.


TODO

  1. Add support for a clean DSL for easily constructing highly nested contexts
  2. Switch to a lazily evaluated context tree
  3. Add more integration tests

Known Issues


Nokogiri

No issues

Ox

No issues

LibXML

  1. Does not handle malformed HTML (raises exceptions)
  2. Triggers the callbacks twice on the nodes


Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Added some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request