Beanstalk + Tokyo Tyrant = Edamame, a fast persistent distributed priority job queue

Edamame combines the Beanstalk priority queue with a Tokyo Tyrant database and God monitoring to produce a persistent distributed priority job queue system.

  • fast, scalable, lightweight and distributed
  • persistent and recoverable
  • scalable up to your memory limits
  • queryable and enumerable jobs
  • named jobs
  • reasonably-good availability.

Like beanstalk, it is a job queue, not just a message queue:

  • priority job scheduling, not just FIFO
  • Supports multiple queues (‘tubes’)
  • reliable scheduling: jobs that time out are re-assigned
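
The scheduling behaviour listed above can be illustrated with a small in-memory model. This is only a sketch: `MiniTube` and its methods are hypothetical names, it models a single tube, and it talks to no real beanstalkd server. It assumes beanstalkd's convention that a smaller priority number is more urgent, and that a reserved job becomes available again once its time-to-run (ttr) has elapsed.

```ruby
# In-memory model of beanstalkd-style scheduling; not the real client or server.
class MiniTube
  Job = Struct.new(:id, :body, :pri, :ttr, :deadline)

  def initialize
    @ready    = []   # jobs waiting to be reserved
    @reserved = {}   # id => job currently held by a worker
    @next_id  = 0
  end

  # Smaller pri numbers are more urgent, as in beanstalkd.
  def put(body, pri: 65_536, ttr: 120)
    @next_id += 1
    @ready << Job.new(@next_id, body, pri, ttr, nil)
    @next_id
  end

  # Hand out the most urgent ready job; ties go to the oldest job.
  def reserve(now)
    release_expired(now)
    job = @ready.min_by { |j| [j.pri, j.id] }
    return nil unless job
    @ready.delete(job)
    job.deadline = now + job.ttr
    @reserved[job.id] = job
    job
  end

  # A worker deletes a job once it has finished it.
  def delete(id)
    !@reserved.delete(id).nil?
  end

  private

  # Jobs whose time-to-run has elapsed go back to the ready list.
  def release_expired(now)
    expired = @reserved.values.select { |j| j.deadline <= now }
    expired.each do |j|
      @reserved.delete(j.id)
      j.deadline = nil
      @ready << j
    end
  end
end

tube = MiniTube.new
tube.put('rebuild index', pri: 100, ttr: 30)
tube.put('send alert',    pri: 1,   ttr: 30)

first = tube.reserve(0)     # the more urgent job, despite being added second
tube.delete(first.id)       # the worker finished it
second = tube.reserve(0)    # a worker takes the other job, then goes silent
again  = tube.reserve(31)   # once the ttr has elapsed, the job is handed out again
```

Time is passed in explicitly (`now`) so the timeout path can be exercised without sleeping.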

It includes a few nifty toys:

  • Scripts for God to monitor and restart the daemons
  • Command-line management scripts to load, enumerate, empty, and show stats for the db+queue
  • The start of a lightweight web frontend in Sinatra.

Hi,

I’ve slapped together the beanstalk distributed priority queue (http://bit.ly/beanstalkd) and the Tokyo Tyrant lightweight database (http://bit.ly/ttyrant – http://bit.ly/ttyrantruby) to create a serviceable persistent distributed priority queue.

If you’re willing to accept its weaknesses, it gives you

  • persistence
  • queryability and enumeration of jobs
  • named jobs

Design

Jobs are persisted to a tokyo tyrant backing store. Beanstalkd

(No failover or discovery, but yes restarting and reloading.)
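
The persist-then-reload idea can be sketched as follows. Everything here is a stand-in rather than Edamame's actual code: `PersistentQueue` is a hypothetical class, a plain Hash plays the role of the Tokyo Tyrant store, and an Array plays the role of the beanstalkd queue. Writing to the store before the queue is an assumption made for the sketch.

```ruby
# Hypothetical stand-ins: a Hash plays the backing store, an Array plays the queue.
class PersistentQueue
  attr_reader :store, :queue

  def initialize(store = {})
    @store = store   # name => job record; assumed to survive a restart
    @queue = []      # in-memory only; lost when the queue daemon dies
  end

  # Jobs are keyed by name, so re-adding a name updates the record
  # instead of queuing a duplicate.
  def put(name, body, pri)
    @store[name] = { 'body' => body, 'pri' => pri }
    @queue << name unless @queue.include?(name)
    name
  end

  # A finished job is removed from both places.
  def finish(name)
    @queue.delete(name)
    @store.delete(name)
  end

  # Rebuild the queue from the store, most urgent (smallest pri) first.
  def reload!
    @queue = @store.keys.sort_by { |name| @store[name]['pri'] }
    @queue.size
  end
end

pq = PersistentQueue.new
pq.put('crawl:example.com', 'fetch page', 10)
pq.put('email:digest', 'send digest', 1)
pq.finish('email:digest')

# Simulate a queue restart: the store is intact, the queue starts empty.
restarted = PersistentQueue.new(pq.store)
restarted.reload!
```

After `reload!`, only the unfinished job is back in the queue, which is the behaviour a reload from the database is meant to give.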

Caveats

Weaknesses? Mainly that it will make an Erlang’er cry for its lack of concurrency correctness. Its goal is to work pretty well and to recover gracefully, but its design limits .

  • We store jobs in two places: the central DB and the distributed queue.
  • As always, your jobs must either be idempotent, or harmless if re-run: a job could start and do some or all of its job — but lose contact with the queue, causing the job to be re-run. This is inherent in beanstalkd (and most comparable solutions), not just edamame.
  • Although God
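
One common way to make a job safe to re-run is to record which named jobs have already taken effect and skip repeats. The sketch below assumes jobs carry a unique name; `IdempotentWorker` is a hypothetical class, and the in-memory `Set` stands in for whatever durable record a real worker would use.

```ruby
require 'set'

# Hypothetical worker that tolerates the same job being delivered twice.
class IdempotentWorker
  attr_reader :sent

  def initialize
    @done = Set.new   # names of jobs whose effects are already applied
    @sent = []        # stands in for a side effect, e.g. sending an email
  end

  # Returns true if the work ran, false if it was skipped as a duplicate.
  def perform(job_name, payload)
    return false if @done.include?(job_name)
    @sent << payload
    @done << job_name
    true
  end
end

worker = IdempotentWorker.new
worker.perform('email:42', 'welcome message')   # does the work
worker.perform('email:42', 'welcome message')   # re-delivery: skipped
```

This narrows the problem but does not remove it: a worker that dies after the side effect and before recording the name will still repeat the work, so the side effect itself should be harmless to repeat where possible.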

TODOs

  • Restarting is still manual: you have to run bin/sync.rb to reload the queue from the database
  • Right now each job lives in full in beanstalkd. This carries a heavy memory cost but a performance gain (the alternative is to do a second query to the DB once the job has been retrieved). I’m going to make both implementations available.
  • The sinatra queue viewer doesn’t work at the moment.
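
The trade-off between keeping the whole job in the queue and keeping only its key can be sketched like this. `JobStorage` and its methods are hypothetical; Arrays stand in for the queue and a Hash for the database, and JSON is used as the serialization purely for illustration.

```ruby
require 'json'

# Hypothetical sketch of the two strategies; not Edamame's actual code.
module JobStorage
  module_function

  # Strategy 1: the queue holds the whole serialized job.
  # More memory per queued job, but no database round trip on retrieval.
  def enqueue_full(queue, db, key)
    queue << JSON.generate(db.fetch(key))
  end

  def dequeue_full(queue)
    JSON.parse(queue.shift)
  end

  # Strategy 2: the queue holds only the key.
  # Less memory per queued job, but the worker makes a second query.
  def enqueue_key(queue, key)
    queue << key
  end

  def dequeue_key(queue, db)
    db.fetch(queue.shift)
  end
end

db = { 'job:1' => { 'url' => 'http://example.com/', 'depth' => 2 } }
full_queue = []
key_queue  = []

JobStorage.enqueue_full(full_queue, db, 'job:1')
JobStorage.enqueue_key(key_queue, 'job:1')

full_bytes = full_queue.first.bytesize   # grows with the size of the job
key_bytes  = key_queue.first.bytesize    # stays the size of the key
```

Both strategies hand the worker the same job; they differ only in where the bytes sit while the job waits.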

Requirements and Installation

For the beanstalk part, you’ll need libevent >= 1.4, beanstalkd >= 1.3, and beanstalk-client:


    cd /usr/local/src/ ;
    pkg=libevent-1.4.11-stable ;( wget -nc http://monkey.org/~provos/${pkg}.tar.gz     && tar xvzf ${pkg}.tar.gz && cd $pkg && ./configure --prefix=/usr/local                         && make -j2 && sudo make install ) ;
    pkg=beanstalkd-1.3         ;( wget -nc http://xph.us/dist/beanstalkd/${pkg}.tar.gz && tar xvzf ${pkg}.tar.gz && cd $pkg && ./configure --prefix=/usr/local --with-event=/usr/local && make -j2 && sudo make install ) ;
    sudo gem install --no-ri --no-rdoc dustin-beanstalk-client ;

For the tokyotyrant part, you’ll need tokyocabinet, tokyotyrant and their corresponding ruby libraries:


    cd /usr/local/src/ ;
    ttbase=http://downloads.sourceforge.net/sourceforge/tokyocabinet
    pkg=tokyocabinet-1.4.29    ;( wget -nc ${ttbase}/${pkg}.tar.gz && tar xvzf ${pkg}.tar.gz && cd $pkg && ./configure --prefix=/usr/local && make -j2 && sudo make install ) ;
    pkg=tokyotyrant-1.1.30     ;( wget -nc ${ttbase}/${pkg}.tar.gz && tar xvzf ${pkg}.tar.gz && cd $pkg && ./configure --prefix=/usr/local && make -j2 && sudo make install ) ;
    pkg=tokyocabinet-ruby-1.27 ;( wget -nc ${ttbase}/${pkg}.tar.gz && tar xvzf ${pkg}.tar.gz && cd $pkg && ruby extconf.rb                 && make -j2 && sudo make install ) ;
    pkg=tokyotyrant-ruby-1.10  ;( wget -nc ${ttbase}/${pkg}.tar.gz && tar xvzf ${pkg}.tar.gz && cd $pkg && sudo ruby install.rb ) ;
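
Once everything is installed, a quick way to confirm that Ruby can see the libraries is to try requiring each one. The helper below is a hypothetical convenience, not part of Edamame, and the require names (`beanstalk-client`, `tokyocabinet`, `tokyotyrant`) are assumptions based on the packages installed above; adjust them if your versions differ.

```ruby
# Returns the names from libs that cannot be loaded with require.
def missing_libraries(libs)
  libs.reject do |lib|
    begin
      require lib
      true
    rescue LoadError
      false
    end
  end
end

# Assumed require names for the Ruby libraries installed above.
wanted  = %w[beanstalk-client tokyocabinet tokyotyrant]
missing = missing_libraries(wanted)
puts(missing.empty? ? 'all libraries load' : "missing: #{missing.join(', ')}")
```

This only checks that the Ruby bindings load; it does not check that the beanstalkd or Tokyo Tyrant daemons are running.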

Why you should, or shouldn’t, use Edamame

**Beanstalkd**’s strengths:

  • fast
  • loosely-ordered priority queuing (few others have this)
  • multiple queues (via ‘tubes’)
  • reliable: jobs that time out are re-assigned
  • lightweight
  • distributed
  • scalable up to your memory limits
  • simple implementation
  • simple wire protocol and thus tons of language libraries

It lacks

  • persistence
  • security (via, e.g., signed jobs)
  • failover / high availability
  • discovery of new instances

Edamame adds persistence and (with the God scripts) ‘good-enough availability’ (note that it still lacks discovery). It also introduces a bit of complexity and some risk of inconsistent state or duplicated jobs (see above).

Other distributed queues

RabbitMQ, Kestrel and Amazon’s SQS seem to be the best industrial-strength distributed queue systems. Note that they are messaging queues and lack some of the specifically job queuing features of Beanstalk.

  • RabbitMQ
    • fast
    • FIFO message queue: poor job support
    • ?? NO multiple queues
    • ?? NO reliable: jobs that time out are re-assigned
    • lightweight
    • distributed
    • scalable up to your memory limits
    • Uses the industrial-strength AMQP protocol
    • HA, failover, discovery
    • Strong support for Python and . Libraries with weak documentation exist in most other languages.
  • Kestrel, a reimplementation of Starling
    • fast
    • Scheduling is loosely-ordered FIFO (no priority)
    • multiple named queues
    • reliable: jobs that time out are re-assigned
    • lightweight
    • distributed
    • scalable up to your memory limits
    • persistent and journaled
    • Written in Scala (runs on the JVM). Uses the memcached protocol: more or less perfectly cross-platform.
    • Documentation is sparse, though Starling’s and Memcached’s have most of what you need.
    • Note that Starling lacks many of these features; Kestrel makes a better job queue and should be a functional replacement
    • see: Kestrel on github, Kestrel announcement
  • Amazon SQS
    • Not as fast
    • Costs money: enqueuing and dequeuing 1M requests costs $2.00 in request fees; if your server is not on AWS then you’ll also pay data charges (an additional $0.27 per kB per million jobs).
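
To get a feel for the SQS figures quoted above, here is a rough estimate in code. It uses only the prices stated in this README (which should not be taken as current AWS pricing) and reads them as: $2.00 per million jobs for enqueue plus dequeue, and $0.27 per kB of job size per million jobs when the server is outside AWS. `SqsCost` is a hypothetical helper.

```ruby
# Rough cost estimate from the prices quoted in this README (not current pricing).
module SqsCost
  REQUESTS_USD_PER_MILLION        = 2.00   # enqueue + dequeue
  TRANSFER_USD_PER_KB_PER_MILLION = 0.27   # only when the server is off AWS

  def self.estimate(jobs, kb_per_job: 0, on_aws: true)
    millions = jobs / 1_000_000.0
    cost = millions * REQUESTS_USD_PER_MILLION
    cost += millions * kb_per_job * TRANSFER_USD_PER_KB_PER_MILLION unless on_aws
    cost
  end
end

# Five million 2 kB jobs processed from a server outside AWS.
puts format('%.2f', SqsCost.estimate(5_000_000, kb_per_job: 2, on_aws: false))
```

For that example the request fees come to $10.00 and the data charges to $2.70.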

This comparison of message queues describes one group’s opinionated survey of the industrial-strength distributed messaging queue ecosystem. Note carefully their criteria; ours were quite different, hence edamame.

Other worker queues

Most of these are heavy-weight job queuing solutions that play nice with Rails:

This talk by Rob Mack on Background Processing in Ruby on Rails (at the April Austin on Rails meeting) has a great overview of job queuing solutions for Rails and in general.

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Add tests for it. This is important so I don’t break it in a future version unintentionally.
  • Commit, and do not mess with the rakefile, version, or history. (If you want to have your own version, that is fine, but bump the version in a commit by itself so that I can ignore it when I pull.)
  • Send me a pull request. Bonus points for topic branches.

Endnotes