Gammo - A pure-Ruby HTML5 parser

Gammo is an implementation of the HTML5 parsing algorithm that conforms to the WHATWG specification and has no dependencies. Given an HTML string, Gammo parses it and builds a DOM tree using the tokenization and tree-construction algorithms defined in the WHATWG parsing specification.

The name Gammo is inspired by Gumbo. Gammo itself, however, is a fried tofu fritter made with vegetables.

require 'gammo'
require 'open-uri'

# URI.open is required on Ruby 3.0+, where Kernel#open no longer opens URLs.
parser = Gammo.new(URI.open('https://google.com'))
parser.parse #=> #<Gammo::Node::Document>

Overview

Features

Tokenization

Gammo::Tokenizer implements the tokenization algorithm defined by the WHATWG specification. You can retrieve tokens in order by calling Gammo::Tokenizer#next_token.

Here is a simple example that runs only the tokenizer.

require 'gammo'

def dump_for(token)
  puts "data: #{token.data}, class: #{token.class}"
end

tokenizer = Gammo::Tokenizer.new('<!doctype html><input type="button"><frameset>')
dump_for tokenizer.next_token #=> data: html, class: Gammo::Tokenizer::DoctypeToken
dump_for tokenizer.next_token #=> data: input, class: Gammo::Tokenizer::StartTagToken
dump_for tokenizer.next_token #=> data: frameset, class: Gammo::Tokenizer::StartTagToken
dump_for tokenizer.next_token #=> data: end of string, class: Gammo::Tokenizer::ErrorToken

The parser described below depends on this tokenizer: it applies the WHATWG parsing algorithm, in order, to the tokens the tokenizer produces.
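
In practice a tokenizer is usually drained in a loop rather than by repeated manual calls. The following is a minimal sketch of such a loop. `collect_tokens` and its `stop_class` parameter are hypothetical names introduced here for illustration and are not part of Gammo; the sketch assumes only that the tokenizer responds to `next_token` and signals the end of input with a token of the given class, as in the example above.

```ruby
# Hypothetical helper (not part of Gammo): drains any object that responds to
# #next_token and returns the tokens seen before the first stop_class token.
def collect_tokens(tokenizer, stop_class)
  tokens = []
  loop do
    token = tokenizer.next_token
    break if token.is_a?(stop_class)
    tokens << token
  end
  tokens
end

# Assumed usage with Gammo loaded (require 'gammo'):
#   tokenizer = Gammo::Tokenizer.new('<p>hello</p>')
#   collect_tokens(tokenizer, Gammo::Tokenizer::ErrorToken).each do |token|
#     puts "#{token.class}: #{token.data}"
#   end
```

Passing the stop class as an argument keeps the helper independent of Gammo itself, so it can be exercised with any stand-in tokenizer.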

Token types

Each token generated by the tokenizer is one of the following types:

Token type                             Description
Gammo::Tokenizer::ErrorToken           Represents an error token; it usually means end-of-string.
Gammo::Tokenizer::TextToken            Represents a text token like "foo", the inner text of an element.
Gammo::Tokenizer::StartTagToken        Represents a start tag token like <a>.
Gammo::Tokenizer::EndTagToken          Represents an end tag token like </a>.
Gammo::Tokenizer::SelfClosingTagToken  Represents a self-closing tag token like <img />.
Gammo::Tokenizer::CommentToken         Represents a comment token like <!-- comment -->.
Gammo::Tokenizer::DoctypeToken         Represents a doctype token like <!doctype html>.
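
Because every token is an instance of one of these classes, the class itself can be used to group tokens. The sketch below counts tokens by the last segment of their class name. `tally_token_types` is a hypothetical helper introduced for illustration, not a Gammo API; it assumes only that each token's class has a `::`-separated name.

```ruby
# Hypothetical helper (not part of Gammo): counts tokens by the last segment
# of their class name, e.g. "Gammo::Tokenizer::StartTagToken" => "StartTagToken".
def tally_token_types(tokens)
  tokens.each_with_object(Hash.new(0)) do |token, counts|
    counts[token.class.name.split('::').last] += 1
  end
end
```

Fed with the tokens of a document, this gives a quick summary of its makeup, for example how many start tags and text runs it contains.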

Parsing

Gammo::Parser implements processing in the tree-construction stage based on the tokenization described above.

After a successful parse, the parser exposes the root document through its document accessor (the same object returned by Gammo::Parser#parse). From that document, you can traverse the DOM tree constructed by the parser.

require 'gammo'
require 'pp'

document = Gammo.new('<!doctype html><input type="button">').parse

# Recursively appends each node (as a hash) to strm, nesting children
# under the :children key of their parent's hash.
def dump_for(node, strm)
  strm << node.to_h
  child = node.first_child
  while child
    dump_for(child, (strm.last[:children] ||= []))
    child = child.next_sibling
  end
  strm
end

pp dump_for(document, [])

Notes

Currently, it is not possible to traverse the DOM tree with CSS selectors or XPath, as Nokogiri allows. However, Gammo plans to implement these features in the future.
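
Until selector support exists, nodes can be located with a manual walk over the tree. The following is a minimal sketch of that idea. `select_nodes` is a hypothetical helper introduced for illustration, not part of Gammo; it assumes only that nodes respond to `first_child` and `next_sibling`, and it leaves the matching rule to the caller's block.

```ruby
# Hypothetical helper (not part of Gammo): walks the tree depth-first, in
# document order, and returns every node for which the block returns true.
# Assumes only that nodes respond to #first_child and #next_sibling.
def select_nodes(root, &predicate)
  matches = []
  stack = [root]
  until stack.empty?
    node = stack.pop
    matches << node if predicate.call(node)
    # Gather the children, then push them reversed so the first child
    # is popped (and therefore visited) first.
    children = []
    child = node.first_child
    while child
      children << child
      child = child.next_sibling
    end
    stack.concat(children.reverse)
  end
  matches
end

# Assumed usage with Gammo loaded (require 'gammo'):
#   document = Gammo.new('<!doctype html><p>a</p><p>b</p>').parse
#   elements = select_nodes(document) { |node| node.is_a?(Gammo::Node::Element) }
```

An explicit stack is used instead of recursion so that deeply nested documents do not exhaust the call stack.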

Node

Each node generated by the parser is one of the following types:

Node type              Description
Gammo::Node::Error     Represents an error node; it usually means end-of-string.
Gammo::Node::Text      Represents a text node like "foo", the inner text of an element.
Gammo::Node::Document  Represents the root document type. It is always returned by Gammo::Parser#document.
Gammo::Node::Element   Represents any HTML element, like <p>.
Gammo::Node::Comment   Represents a comment like <!-- foo -->.
Gammo::Node::Doctype   Represents a doctype like <!doctype html>.

Some nodes, such as Gammo::Node::Element and Gammo::Node::Document, hold pointers to related nodes, which are accessible through methods such as Gammo::Node#next_sibling and Gammo::Node#first_child. In addition, APIs such as Gammo::Node#append_child and Gammo::Node#remove_child, which perform operations defined in the DOM Living Standard, are provided.
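
The sibling pointers and the mutation APIs can be combined into small utilities. The sketch below lists a node's direct children and moves a child between parents. `children_of` and `reparent` are hypothetical helpers introduced for illustration, not part of Gammo. The sketch assumes that `append_child` and `remove_child` each take the child node as their only argument, which the text above does not confirm.

```ruby
# Hypothetical helpers (not part of Gammo).

# Returns the direct children of node by following the sibling chain.
def children_of(node)
  children = []
  child = node.first_child
  while child
    children << child
    child = child.next_sibling
  end
  children
end

# Moves child from old_parent to new_parent. Assumes #remove_child and
# #append_child each take the child node as their only argument.
def reparent(child, old_parent, new_parent)
  old_parent.remove_child(child)
  new_parent.append_child(child)
  child
end
```

The child is removed before it is appended because DOM-style implementations commonly refuse to append a node that still has a parent.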

Performance

As mentioned in the features at the beginning, Gammo does not prioritize performance. It is therefore not suitable for performance-sensitive applications (for example, parsing synchronously while handling an incoming request from an end user). Instead, the goal is to work well in batch processing such as crawlers. Gammo's highest priority is making HTML easy to parse, without depending on native extensions or external gems.

References

This was developed with reference to the following software.

  • x/net/html: I've been working on this package, and it gave me a strong reason to make this happen.
  • Blink: Blink gave me a great impression of tree construction.
  • html5lib-tests: Gammo relies on this test suite.

License

The gem is available as open source under the terms of the MIT License.