This gem contains an implementation of "Merkle Hash Trees" (MHT). Specifically, it implements the variant described in RFC6962, as the initial use-case for this gem was for an implementation of a Certificate Transparency log server.

Installation

Installation should be trivial, if you're using rubygems:

gem install merkle-hash-tree

If you want to install directly from the git repo, run rake install.

Usage

Using MerkleHashTree is relatively straightforward, although it does have one or two intricacies. Because MHTs typically deal with large volumes of data, it isn't enough to just load a giant list of objects into memory and go to town -- you'll run out of memory pretty quickly, and on a large tree you'll likely burn a lot of CPU time computing hashes. Instead, in order to instantiate an MHT you must first construct an object that implements a specific interface, which the MHT implementation then uses to interact with your dataset.

Basic Usage

For now, though, let's assume that you have such an object, named mht_data, and we'll look at how to use the MHT. (It might be useful to understand how MHT proofs work before you go too deeply into this).

For starters, we'll create a new MHT:

mht = MerkleHashTree.new(mht_data, Digest::SHA256)

The MerkleHashTree constructor takes exactly two arguments: an object that implements the data access interface we'll talk about later, and a class (or object) which implements the same digest method signature as the core Digest::Base class. Typically, this will simply be a Digest subclass, such as Digest::MD5, Digest::SHA1, or (as in the example above) Digest::SHA256. This second argument is the way that the MHT calculates hashes in the tree -- it simply calls the #digest method on whatever you pass in as the second argument, passing in a string and expecting raw octets out the other end.

Once we have our MHT object, we can start to do things with it. For example, we can get the hash of the "head" of the tree:

mht.head   # => "<some long string of octets>"

You can also get the head of any subtree, by specifying the first and last elements of the list to be covered by the subtree:

mht.head(16, 20)   # => "<some more octets>"

Note that the beginning element must be a power of 2.

If you want to get the subtree from 0 to an arbitrary element in the list, you can just specify the last element:

mht.head(42)   # => "<some other long string of octets>"
# equivalent to
mht.head(0, 42)

We can also ask for a "consistency proof" between any two subtrees:

mht.consistency_proof(42, 69)   # => ["<hash>", "<hash>", ... ]

If we want a consistency proof between a subtree and the current head, we can drop the second parameter:

mht.consistency_proof(42)   # => ["<hash>", "<hash>", ... ]

I'm not going to describe Merkle consistency proofs here; the Internet does a far better job than I ever will. The return value of #consistency_proof is simply an array of the hashes that are required by a client to prove that the smaller subtree is, indeed, a subtree of the larger one (and nothing dodgy has gone on behind the scenes). RFC6962, s2.1.2 has all the gory details of how to calculate it and how to use the result.

There are also such things as "audit proofs" (again, I'm not going to explain them here), which you get by specifying a single leaf number and a subtree ID:

mht.audit_proof(13, 42)   # => ["<hash>", "<hash>", ... ]

In this example, the audit proof will return a list of hashes, starting from the leaf node's sibling and working up towards the root node for a hash tree containing 42 elements, that demonstrate that leaf 13 is in the tree and hasn't been removed or altered.

You can also drop the second argument, in which case you get an audit proof for the tree that represents the entire list as it currently exists:

mht.audit_proof(13)   # => ["<hash>", "<hash>", ... ]

And that's it! There really isn't much you can do from the outside. All the fun happens inside.

The Data Access Interface

Rather than trying to work with an entire dataset in memory, MerkleHashTree is capable of working with a dataset far larger than what could fit in memory, by using a data access object to fetch items and cache intermediate results (the hashes of nodes in the tree). To do this, though, a fair number of methods need to be implemented.

How you implement them is up to you -- you could query a backend database, or just make up data as you felt like it. In the minimal case, you can pass in an instance of Array, although I doubt you'll enjoy the performance on any but the smallest possible hash tree.

The complete interface definition is given in doc/DAI.md, for those who wish to implement their own interface. Essentially, you must to implement [](n), which returns the nth entry in the (zero-indexed) list, as well as length, which returns the current size of the list. You can also implement cache_set(n1, n2, s) and cache_get(n1, n2), which set and get entries in the cache of node values. If you don't implement these, then MerkleHashTree will need to recalculate every hash in the tree repeatedly for most every operation -- which will be very slow for anything other than the most trivial result.

As I said before, you can just use Array, if you want to, which could look something like this:

a = Array.new
mht = MerkleHashTree.new(a, Digest::MD5)

a << 'a'
a << 'b'
a << 'c'
a << 'd'
a << 'e'

mht.head   # => "O\xA2\x03\x12\xF6\x0F\xFBtU\x95GY\xE53\x17\x8D"

Further Info

In a reversal of standard operating procedure, I heavily document all the methods and interfaces I write. You can get complete API documentation by using ri (or a descendent thereof), or via your web-based rdoc browser of choice.