Nokogumbo - a Nokogiri interface to the Gumbo HTML5 parser.
At the moment, this is a proof of concept, allowing a Ruby program to invoke the Gumbo HTML5 parser and access the result as a Nokogiri parsed document.
Usage:
require 'nokogumbo'
doc = Nokogiri::HTML5(string)
Notes:
The
Nokogumbo.parsefunction takes a string and passes it to thegumbo_parse_with_optionsmethod, using the default options. The resulting Gumbo parse tree is the walked, producing a Nokogiri parse tree. The original Gumbo parse tree is then destroyed, and the Nokogiri parse tree is returned.Instead of uppercase element names, lowercase element names are produced.
Instead of returning 'unknown' as the element name for unknown tags, the original tag name is returned verbatim.
Nothing meaningful is done with the
GumboDocumentstruct, i.e., no NokogiriEntityDeclis produced.The gem itself includes a copy of the Nokogumbo HTML5 parser.
Installation:
Execute
rake gem[sudo] gem install pkg/nokogumbo*.gem
Related efforts:
- ruby-gumbo - a ruby binding for the Gumbo HTML5 parser.