Yajl::FFI

Yajl::FFI is a JSON parser, based on FFI bindings into the native YAJL library, that generates events for each state change. This allows streaming both the JSON document into memory and the parsed object graph out of memory to some other process.

This is similar to an XML SAX parser that generates events during parsing. There is no requirement for the document, or the object graph, to be fully buffered in memory. Yajl::FFI is best suited for huge JSON documents that won't fit in memory.

Usage

The simplest way to parse is to read the full JSON document into memory and then parse it into a full object graph. This is fine for small documents because we have room for both the text and parsed object in memory.

require 'yajl/ffi'
json = File.read('/tmp/test.json')
obj = Yajl::FFI::Parser.parse(json)

While it's possible to do this with Yajl::FFI, we should really use the standard library's [json](json gem for documents like this. It's faster because it doesn't need to generate events and notify observers each time the parser changes state. It parses and builds the Ruby object entirely in native code and hands it back to us, fully formed.

For larger documents, we can use an IO object to stream it into the parser. We still need room for the parsed object, but the document itself is never fully read into memory.

require 'yajl/ffi'
stream = File.open('/tmp/test.json')
obj = Yajl::FFI::Parser.parse(stream)

However, when streaming small documents from disk, or over the network, the yajl-ruby gem will give us the best performance.

Huge documents arriving over the network in small chunks to an EventMachine receive_data loop is where Yajl::FFI is uniquely suited. Inside an EventMachine::Connection subclass we might have:

def post_init
  @parser = Yajl::FFI::Parser.new
  @parser.start_document { puts "start document" }
  @parser.end_document   { puts "end document" }
  @parser.start_object   { puts "start object" }
  @parser.end_object     { puts "end object" }
  @parser.start_array    { puts "start array" }
  @parser.end_array      { puts "end array" }
  @parser.key            {|k| puts "key: #{k}" }
  @parser.value          {|v| puts "value: #{v}" }
end

def receive_data(data)
  begin
    @parser << data
  rescue Yajl::FFI::ParserError => e
    close_connection
  end
end

The parser accepts chunks of the JSON document and parses up to the end of the available buffer. Passing in more data resumes the parse from the prior state. When an interesting state change happens, the parser notifies all registered callback procs of the event.

The event callback is where we can do interesting data filtering and passing to other processes. The above example simply prints state changes, but the callbacks might look for an array named rows and process sets of these row objects in small batches. Millions of rows, streaming over the network, can be processed in constant memory space this way.

Dependencies

Library loading

FFI uses the the dlopen system call to dynamically load the libyajl library into memory at runtime. It searches the usual directories for the library file, like /usr/lib and /usr/local/lib, and raises an error if it's not found. If libyajl is installed in an unusual directory, we can tell dlopen where to look by setting the LD_LIBRARY_PATH environment variable.

# test normal library load
$ ruby -r 'yajl/ffi' -e 'puts Yajl::FFI::VERSION'

# if it fails, specify the search path
$ LD_LIBRARY_PATH=/somewhere/yajl/lib \
  ruby -r 'yajl/ffi' -e 'puts Yajl::FFI::VERSION'

Installation

The libyajl library needs to be installed before this gem can bind to it.

OS X

Use Homebrew or compile from source below.

$ brew install yajl

Fedora

Fedora 20 provides libyajl2 in a package. Older versions might need to compile the latest yajl version from source.

$ sudo yum install yajl

Ubuntu

Ubuntu 14.04 provides a libyajl2 package. Older versions might also need to compile yajl from source.

$ sudo apt-get install libyajl2

Source

By default, this compiles and installs to /usr/local. Use ./configure -p /tmp/somewhere to install to a different directory. Setting LD_LIBRARY_PATH will be required in that case.

$ git clone https://github.com/lloyd/yajl
$ cd yajl
$ ./configure
$ make && make install

Alternatives

This gem provides a benchmark script to test the relative performance of several parsers. Here's a sample run.

$ rake benchmark
                  user     system      total        real
json          0.130000   0.010000   0.140000 (  0.136225)
yajl-ruby     0.130000   0.010000   0.140000 (  0.133872)
yajl-ffi      0.800000   0.010000   0.810000 (  0.812818)
json-stream   9.500000   0.080000   9.580000 (  9.580571)

Yajl::FFI is about 6x slower than the pure native parsers. JSON::Stream is a pure Ruby parser, and it performs accordingly. But it's useful in cases where you're unable to use native bindings or when the limiting factor is the network, rather than processor speed.

So if you need to parse many small JSON documents, the json and yajl-ruby gems are the best options. If you need to stream, and incrementally parse, pieces of a large document in constant memory space, yajl-ffi and json-stream are good choices.

License

Yajl::FFI is released under the MIT license. Check the LICENSE file for details.