Beanstalk + Tokyo Tyrant = Edamame, a fast persistent distributed priority job queue

Edamame combines the Beanstalk priority queue with a Tokyo Tyrant database and God monitoring to produce a persistent distributed priority job queue system.

- fast, scalable, lightweight and distributed
- persistent and recoverable
- scalable up to your memory limits
- queryable and enumerable jobs
- named jobs
- reasonably good availability

Like beanstalk, it is a job queue, not just a message queue:

- priority job scheduling, not just FIFO
- multiple queues (‘tubes’)
- reliable scheduling: jobs that time out are re-assigned
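
To make those three properties concrete, here is a toy in-memory model of priority scheduling with a time-to-run (TTR) deadline. It is only a sketch: `ToyQueue` and its methods are hypothetical names, time is passed in by hand so the behaviour is repeatable, and none of this is Edamame's or beanstalkd's actual code.

```ruby
# Toy model of beanstalk-style scheduling. Illustrative only.
class ToyQueue
  Job = Struct.new(:id, :body, :priority, :ttr, :deadline)

  def initialize
    @ready    = []   # jobs waiting for a worker
    @reserved = {}   # id => job currently held by a worker
    @next_id  = 0
  end

  # Lower priority numbers are served first, as in beanstalkd.
  def put(body, priority: 65_536, ttr: 120)
    @next_id += 1
    @ready << Job.new(@next_id, body, priority, ttr, nil)
    @next_id
  end

  # Hand the most urgent ready job to a worker and start its TTR clock.
  def reserve(now)
    release_expired(now)
    job = @ready.min_by { |j| [j.priority, j.id] }
    return nil unless job
    @ready.delete(job)
    job.deadline = now + job.ttr
    @reserved[job.id] = job
    job
  end

  # A worker deletes a job once it has finished it.
  def delete(id)
    !@reserved.delete(id).nil?
  end

  private

  # Jobs whose worker went silent past the deadline go back to ready.
  def release_expired(now)
    expired = @reserved.values.select { |j| j.deadline <= now }
    expired.each do |j|
      @reserved.delete(j.id)
      j.deadline = nil
      @ready << j
    end
  end
end

queue = ToyQueue.new
queue.put('low priority', priority: 100)
queue.put('urgent', priority: 1, ttr: 10)

first   = queue.reserve(0)    # 'urgent' wins despite arriving second
second  = queue.reserve(5)    # 'low priority' is next
retried = queue.reserve(10)   # 'urgent' was never deleted, so it is handed out again
```

The last call is the behaviour that distinguishes a job queue from a plain message queue: a job stays owned by the queue until a worker explicitly deletes it.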

It includes a few nifty toys:

- Scripts for God to monitor and restart the daemons
- Command-line management scripts to load, enumerate, empty, and show stats for the db+queue
- The start of a lightweight web frontend in Sinatra

Hi,

I’ve slapped together the beanstalk distributed priority queue (http://bit.ly/beanstalkd) and the Tokyo Tyrant lightweight database (http://bit.ly/ttyrant – http://bit.ly/ttyrantruby) to create a serviceable persistent distributed priority queue.

If you’re willing to accept its weaknesses, it gives you

- persistence
- queryability and enumeration of jobs
- named jobs

Design

Jobs are persisted to a tokyo tyrant backing store. Beanstalkd

(No failover or discovery, but yes restarting and reloading.)

Caveats

Weaknesses? Mainly that it will make an Erlang’er cry for its lack of concurrency correctness. Its goal is to work pretty well and to recover gracefully, but its design limits .

We store jobs in two places: the central DB and the distributed queue.

As always, your jobs must either be idempotent, or harmless if re-run: a job could start and do some or all of its job — but lose contact with the queue, causing the job to be re-run. This is inherent in beanstalkd (and most comparable solutions), not just edamame.
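
One common way to make a re-run harmless is to key each job by its name and skip names that have already completed. The sketch below assumes a hypothetical `IdempotentWorker` with an in-memory `Set` standing in for a durable record of finished jobs; it is not part of Edamame.

```ruby
require 'set'

# Hypothetical worker that remembers which named jobs it has finished.
class IdempotentWorker
  attr_reader :side_effects

  def initialize
    @done = Set.new       # stand-in for a durable "completed" record
    @side_effects = []    # stand-in for the real work (writes, API calls)
  end

  # Returns :done the first time a name is seen and :skipped on any re-run.
  def run(job_name)
    return :skipped if @done.include?(job_name)
    @side_effects << "fetched #{job_name}"
    @done << job_name
    :done
  end
end

worker = IdempotentWorker.new
worker.run('user/1234')   # performs the work
worker.run('user/1234')   # the queue re-delivered the job; nothing happens twice
```

Note that a crash between doing the work and recording it would still cause one repeat, so the work itself should also be safe to perform twice.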

Although God

TODOs

- Restarting is still manual: you have to run bin/sync.rb to reload the queue from the database.
- Right now each job lives in full in beanstalkd. This carries a heavy memory cost but gives a performance gain (the alternative is to do a second query to the DB once the job has been retrieved). I’m going to make both implementations available.
- The Sinatra queue viewer doesn’t work at the moment.
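
The first two items can be pictured with a small sketch. A `Hash` plays the Tokyo Tyrant store and an `Array` plays beanstalkd; `reload_queue`, `resolve` and the job fields are hypothetical names chosen for illustration, and this is not the code in bin/sync.rb.

```ruby
# Stand-ins: a Hash plays the database, an Array plays the queue.
# With full_body: true the whole job is queued (more memory, no extra lookup).
# With full_body: false only the name and priority are queued.
def reload_queue(store, queue, full_body: true)
  store.each_value do |job|
    payload = full_body ? job : job.reject { |key, _| key == 'body' }
    queue << payload
  end
  queue.size
end

# Worker side: a key-only payload needs a second lookup in the store.
def resolve(store, payload)
  payload.key?('body') ? payload : store.fetch(payload['name'])
end

store = {
  'job-a' => { 'name' => 'job-a', 'priority' => 10, 'body' => 'fetch page 1' },
  'job-b' => { 'name' => 'job-b', 'priority' => 20, 'body' => 'fetch page 2' }
}

full = []
reload_queue(store, full)                    # queue holds complete jobs

slim = []
reload_queue(store, slim, full_body: false)  # queue holds names and priorities only
```

Either way the worker ends up with the same job; the choice only moves the cost between queue memory and an extra database read.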

Requirements and Installation

For the beanstalk part, you’ll need libevent >= 1.4, beanstalkd >= 1.3, and beanstalk-client:


    cd /usr/local/src/ ;
    pkg=libevent-1.4.11-stable ;( wget -nc http://monkey.org/~provos/${pkg}.tar.gz     && tar xvzf ${pkg}.tar.gz && cd $pkg && ./configure --prefix=/usr/local                         && make -j2 && sudo make install ) ;
    pkg=beanstalkd-1.3         ;( wget -nc http://xph.us/dist/beanstalkd/${pkg}.tar.gz && tar xvzf ${pkg}.tar.gz && cd $pkg && ./configure --prefix=/usr/local --with-event=/usr/local && make -j2 && sudo make install ) ;
    sudo gem install --no-ri --no-rdoc dustin-beanstalk-client ;

For the tokyotyrant part, you’ll need tokyocabinet, tokyotyrant and their corresponding ruby libraries:


    cd /usr/local/src/ ;
    ttbase=http://downloads.sourceforge.net/sourceforge/tokyocabinet
    pkg=tokyocabinet-1.4.29    ;( wget -nc ${ttbase}/${pkg}.tar.gz && tar xvzf ${pkg}.tar.gz && cd $pkg && ./configure --prefix=/usr/local && make -j2 && sudo make install ) ;
    pkg=tokyotyrant-1.1.30     ;( wget -nc ${ttbase}/${pkg}.tar.gz && tar xvzf ${pkg}.tar.gz && cd $pkg && ./configure --prefix=/usr/local && make -j2 && sudo make install ) ;
    pkg=tokyocabinet-ruby-1.27 ;( wget -nc ${ttbase}/${pkg}.tar.gz && tar xvzf ${pkg}.tar.gz && cd $pkg && ruby extconf.rb                 && make -j2 && sudo make install ) ;
    pkg=tokyotyrant-ruby-1.10  ;( wget -nc ${ttbase}/${pkg}.tar.gz && tar xvzf ${pkg}.tar.gz && cd $pkg && sudo ruby install.rb ) ;
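
Once everything is installed, a quick way to confirm that Ruby can see the bindings is to try requiring each one. This is a small sketch rather than a script shipped with Edamame; the library names are the ones these packages are usually required by, so treat them as assumptions and adjust if your install differs.

```ruby
# Assumed require names for the libraries installed above.
LIBRARIES = %w[beanstalk-client tokyocabinet tokyotyrant].freeze

# Returns { library name => true/false } without raising on a missing library.
def check_libraries(names = LIBRARIES)
  names.each_with_object({}) do |name, found|
    begin
      require name
      found[name] = true
    rescue LoadError
      found[name] = false
    end
  end
end

check_libraries.each do |name, ok|
  puts format('%-18s %s', name, ok ? 'ok' : 'MISSING')
end
```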

Why you should, or shouldn’t, use Edamame

**Beanstalkd**’s strengths:

- fast
- loosely-ordered priority queuing (few others have this)
- multiple queues (via ‘tubes’)
- reliable: jobs that time out are re-assigned
- lightweight
- distributed
- scalable up to your memory limits
- simple implementation
- simple wire protocol and thus tons of language libraries

It lacks

- persistence
- security (via, e.g., signed jobs)
- failover / high availability
- discovery of new instances

Edamame adds persistence and (with the God scripts) ‘good-enough availability’ (note that it still lacks discovery). It also introduces a bit of complexity and some risk of inconsistent state or duplicated jobs (see above).

Other distributed queues

RabbitMQ, Kestrel and Amazon’s SQS seem to be the best industrial-strength distributed queue systems. Note that they are messaging queues and lack some of the specifically job queuing features of Beanstalk.

RabbitMQ
- fast
- FIFO message queue: poor job support
- ?? NO multiple queues
- ?? NO reliable: jobs that time out are re-assigned
- lightweight
- distributed
- scalable up to your memory limits
- Uses the industrial-strength AMQP protocol
- HA, failover, discovery
- Strong support for Python and . Libraries with weak documentation exist in most other languages.

Kestrel, a reimplementation of Starling
- fast
- Scheduling is loosely-ordered FIFO (no priority)
- multiple named queues
- reliable: jobs that time out are re-assigned
- lightweight
- distributed
- scalable up to your memory limits
- persistent and journaled
- Written in Scala (runs on the JVM). Uses the memcached protocol: more or less perfectly cross-platform.
- Documentation is sparse, though Starling’s and Memcached’s have most of what you need.
- Note that Starling lacks many of these features; Kestrel makes a better job queue and should be a functional replacement
- see: Kestrel on github – Kestrel announcement

Amazon SQS
- Not as fast
- Costs money: enqueuing and dequeuing 1M requests costs $2.00 in request fees; if your server is not on AWS then you’ll also pay data charges (an additional $0.27 per kB per million jobs).
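
As a rough worked example of that arithmetic, the sketch below plugs in the two prices quoted above. They are the figures from this README, not current AWS pricing, and `sqs_cost` is a hypothetical helper.

```ruby
# Prices as quoted above; check current AWS pricing before relying on them.
REQUEST_COST_PER_MILLION = 2.00   # dollars per million jobs, enqueue + dequeue
DATA_COST_PER_KB_MILLION = 0.27   # dollars per kB per million jobs, off-AWS only

# Estimated cost in dollars for a number of jobs of a given size.
def sqs_cost(jobs, kb_per_job, on_aws: true)
  millions = jobs / 1_000_000.0
  cost = millions * REQUEST_COST_PER_MILLION
  cost += millions * kb_per_job * DATA_COST_PER_KB_MILLION unless on_aws
  cost.round(2)
end

puts sqs_cost(5_000_000, 2)                 # request fees only
puts sqs_cost(5_000_000, 2, on_aws: false)  # request fees plus data charges
```

For five million 2 kB jobs this comes to $10.00 in request fees, plus $2.70 in data charges when the workers run outside AWS.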

This comparison of message queues describes one group’s opinionated survey of the industrial-strength distributed messaging queue ecosystem. Note carefully their criteria; ours were quite different, hence edamame.

Other worker queues

Most of these are heavy-weight job queuing solutions that play nice with Rails:

This talk by Rob Mack on Background Processing in Ruby on Rails (at the April Austin on Rails meeting) has a great overview of job queuing solutions for Rails and in general.

Note on Patches/Pull Requests

- Fork the project.
- Make your feature addition or bug fix.
- Add tests for it. This is important so I don’t break it in a future version unintentionally.
- Commit; do not mess with the rakefile, version, or history. (If you want to have your own version, that is fine, but bump the version in a commit by itself that I can ignore when I pull.)
- Send me a pull request. Bonus points for topic branches.

Endnotes

Origin of the name edamame
This library was written to support the Monkeyshines distributed API scraper.
Beanstalk:
- Beanstalk, a fast, distributed, in-memory workqueue service
- Beanstalkd code
- FAQ
- Beanstalk Ruby Client
- Tutorial from nubyonrails
- Mailing list
- Some beanstalk utilities — edamame has its own take on some of these.
Tokyo Tyrant:
- Tokyo Tyrant
- Tokyo Tyrant Ruby libs
- You’ll need the Tokyo Cabinet libs and the Tokyo Cabinet Ruby libs
God process monitoring framework
- http://railscasts.com/episodes/130-monitoring-with-god
- Some code for the god conf is inspired by that railscast, this pastie, the one from the god docs, and Configuring GMail notifiers in God

Alternatives to God include (in order of complexity): Monit, perhaps with Munin; Cacti and Hyperic