Slingshot

Slingshot is a Ruby client for the ElasticSearch search engine/database.

ElasticSearch is a scalable, distributed, cloud-ready, highly-available, full-text search engine and database, communicating by JSON over RESTful HTTP, based on Lucene, written in Java.

This document provides just a brief overview of Slingshot's features. Be sure to also check out the extensive documentation at http://karmi.github.com/slingshot/ if you're interested.

Installation

First, you need a running ElasticSearch server. Thankfully, it's easy. Let's define easy:

$ curl -k -L -o elasticsearch-0.16.0.tar.gz http://github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.16.0.tar.gz
$ tar -zxvf elasticsearch-0.16.0.tar.gz
$ ./elasticsearch-0.16.0/bin/elasticsearch -f

OK. Easy. On a Mac, you can also use Homebrew:

$ brew install elasticsearch

OK. Let's install the gem via Rubygems:

$ gem install slingshot-rb

Of course, you can install it from the source as well:

$ git clone git://github.com/karmi/slingshot.git
$ cd slingshot
$ rake install

Usage

Slingshot exposes an easy-to-use domain-specific language for fluent communication with ElasticSearch.

It also blends with your ActiveModel classes for convenient usage in Rails applications.

To test-drive the core ElasticSearch functionality, let's require the gem:

require 'rubygems'
require 'slingshot'

Please note that you can copy these snippets from the much more extensive and heavily annotated file in examples/slingshot-dsl.rb.

OK. Let's create an index named articles and store/index some documents:

Slingshot.index 'articles' do
  delete
  create

  store :title => 'One',   :tags => ['ruby']
  store :title => 'Two',   :tags => ['ruby', 'python']
  store :title => 'Three', :tags => ['java']
  store :title => 'Four',  :tags => ['ruby', 'php']

  refresh
end

We can also create the index with custom mapping for a specific document type:

Slingshot.index 'articles' do
  create :mappings => {
    :article => {
      :properties => {
        :id       => { :type => 'string', :index => 'not_analyzed', :include_in_all => false },
        :title    => { :type => 'string', :boost => 2.0,            :analyzer => 'snowball'  },
        :tags     => { :type => 'string', :analyzer => 'keyword'                             },
        :content  => { :type => 'string', :analyzer => 'snowball'                            }
      }
    }
  }
end

Of course, we may have large amounts of data, and it may be impossible or impractical to add them to the index one by one. We can use ElasticSearch's bulk storage:

articles = [
  { :id => '1', :title => 'one'   },
  { :id => '2', :title => 'two'   },
  { :id => '3', :title => 'three' }
]

Slingshot.index 'bulk' do
  import articles
end
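
To get a feel for what bulk storage means on the wire, here is a minimal sketch of a request body in the newline-delimited format that ElasticSearch's bulk API accepts. It is not Slingshot's internal implementation: the bulk_payload helper and the 'document' type name are assumptions made up for this example.

```ruby
require 'json'

# Hypothetical helper: builds a newline-delimited bulk request body.
# The 'document' type name is an assumption made for this sketch.
def bulk_payload(index_name, documents, type = 'document')
  lines = documents.flat_map do |document|
    action = { :index => { :_index => index_name, :_type => type, :_id => document[:id] } }
    # Each document becomes two lines: the action, then the source
    [action.to_json, document.to_json]
  end
  # The bulk API expects every line, including the last one, to end with a newline
  lines.join("\n") + "\n"
end

articles = [
  { :id => '1', :title => 'one'   },
  { :id => '2', :title => 'two'   },
  { :id => '3', :title => 'three' }
]

payload = bulk_payload('bulk', articles)
puts payload
```

All three documents travel in a single request body, which is why this beats storing them one by one.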

We can also easily manipulate the documents before storing them in the index, by passing a block to the import method:

Slingshot.index 'bulk' do
  import articles do |documents|

    documents.each { |document| document[:title].capitalize! }
  end
end

OK. Now, let's go search all the data.

We will be searching for articles whose title begins with the letter “T”, sorted by title in descending order, filtering them for ones tagged “ruby”, and also retrieving some facets from the database:

s = Slingshot.search 'articles' do
  query do
    string 'title:T*'
  end

  filter :terms, :tags => ['ruby']

  sort { title 'desc' }

  facet 'global-tags' do
    terms :tags, :global => true
  end

  facet 'current-tags' do
    terms :tags
  end
end

(Of course, we may also page the results with from and size query options, retrieve only specific fields or highlight content matching our query, etc.)

Let's display the results:

s.results.each do |document|
  puts "* #{ document.title } [tags: #{document.tags.join(', ')}]"
end

# * Two [tags: ruby, python]

Let's display the global facets (distribution of tags across the whole database):

s.results.facets['global-tags']['terms'].each do |f|
  puts "#{f['term'].ljust(10)} #{f['count']}"
end

# ruby       3
# python     1
# php        1
# java       1

Now, let's display the facets based on the current query (notice that the count for articles tagged with 'java' is included, even though it's not returned by our query; the count for articles tagged 'php' is excluded, since they don't match the current query):

s.results.facets['current-tags']['terms'].each do |f|
  puts "#{f['term'].ljust(10)} #{f['count']}"
end

# ruby       1
# python     1
# java       1
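
If the difference between the hits and the two facets seems puzzling, a plain-Ruby simulation may help. This is only a sketch of the reasoning, assuming the four documents stored earlier; it says nothing about how ElasticSearch computes facets internally, and the tag_counts helper is made up for the example.

```ruby
documents = [
  { :title => 'One',   :tags => ['ruby'] },
  { :title => 'Two',   :tags => ['ruby', 'python'] },
  { :title => 'Three', :tags => ['java'] },
  { :title => 'Four',  :tags => ['ruby', 'php'] }
]

# Hypothetical helper: counts how many documents carry each tag
def tag_counts(docs)
  docs.flat_map { |doc| doc[:tags] }.each_with_object(Hash.new(0)) do |tag, counts|
    counts[tag] += 1
  end
end

# The query (title:T*) narrows the documents the 'current-tags' facet sees
matching_query = documents.select { |doc| doc[:title].start_with?('T') }

# The filter narrows the returned hits only, not the facets
hits = matching_query.select { |doc| doc[:tags].include?('ruby') }

global_tags  = tag_counts(documents)      # like :global => true
current_tags = tag_counts(matching_query) # restricted by the query

puts hits.map { |doc| doc[:title] }.inspect
puts global_tags.inspect
puts current_tags.inspect
```

The filter removes “Three” from the hits, but the facet still counts its 'java' tag, because facets follow the query and not the filter.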

If configuring the search payload with a block somehow feels too weak for you, you can simply pass a Ruby Hash (or JSON string) with the query declaration to the search method:

Slingshot.search 'articles', :query => { :fuzzy => { :title => 'Sour' } }

If this sounds like a great idea to you, you are probably able to write your application using just curl, sed and awk.
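
Since the payload is just a Hash, it can be assembled with ordinary Ruby. The sketch below builds one carrying the from and size paging options mentioned earlier; the paged_payload helper and its page arithmetic are assumptions for illustration, not part of Slingshot.

```ruby
require 'json'

# Hypothetical helper translating a page number into from/size options
def paged_payload(query, page, per_page = 10)
  { :query => query, :from => (page - 1) * per_page, :size => per_page }
end

# Third page of ten results for the fuzzy query shown above
payload = paged_payload({ :fuzzy => { :title => 'Sour' } }, 3)
puts payload.to_json
```

A Hash like this is the kind of declaration the search method is described as accepting.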

We can display the full query JSON for close inspection:

puts s.to_json
# {"facets":{"current-tags":{"terms":{"field":"tags"}},"global-tags":{"global":true,"terms":{"field":"tags"}}},"query":{"query_string":{"query":"title:T*"}},"filter":{"terms":{"tags":["ruby"]}},"sort":[{"title":"desc"}]}

Or, better, we can display the corresponding curl command to recreate and debug the request in the terminal:

puts s.to_curl
# curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d '{"facets":{"current-tags":{"terms":{"field":"tags"}},"global-tags":{"global":true,"terms":{"field":"tags"}}},"query":{"query_string":{"query":"title:T*"}},"filter":{"terms":{"tags":["ruby"]}},"sort":[{"title":"desc"}]}'

However, we can simply log every search query (and other requests) in this curl-friendly format:

Slingshot.configure { logger 'elasticsearch.log' }

When you set the log level to debug:

Slingshot.configure { logger 'elasticsearch.log', :level => 'debug' }

the JSON responses are logged as well. This is not a great idea for a production environment, but it's priceless when you want to paste a complicated transaction to the mailing list or IRC channel.

The Slingshot DSL tries hard to provide a strong Ruby-like API for the main ElasticSearch features.

By default, Slingshot wraps the results collection in an enumerable Results::Collection class, and result items in a Results::Item class, which looks like a child of Hash and OpenStruct, for smooth iterating and displaying the results.

You may wrap the result items in your own class by setting the Slingshot.configuration.wrapper property. Your class must take a Hash of attributes on initialization.
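
A wrapper class can be very small. Here is a minimal sketch of one that satisfies the "takes a Hash of attributes" requirement; the ArticleCard name and its accessors are hypothetical, and the step of registering it with Slingshot is left out on purpose.

```ruby
# Hypothetical wrapper class; the name and accessors are made up for this sketch
class ArticleCard
  attr_reader :attributes

  # Takes a Hash of attributes on initialization
  def initialize(attributes = {})
    # Normalize keys to strings, since parsed JSON usually has string keys
    @attributes = attributes.each_with_object({}) do |(key, value), hash|
      hash[key.to_s] = value
    end
  end

  def title
    attributes['title']
  end

  def tags
    Array(attributes['tags'])
  end

  def to_s
    "#{title} [tags: #{tags.join(', ')}]"
  end
end

card = ArticleCard.new(:title => 'Two', 'tags' => ['ruby', 'python'])
puts card
```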

If that seems like a great idea to you, there's a big chance you already have such a class, and one would bet it's an ActiveRecord or ActiveModel class, containing a model of your Rails application.

Fortunately, Slingshot makes blending ElasticSearch features into your models trivially possible.

ActiveModel Integration

Let's suppose you have an Article class in your Rails application. To make it searchable with Slingshot, you just include it:

class Article < ActiveRecord::Base
  include Slingshot::Model::Search
  include Slingshot::Model::Callbacks
end

When you now save a record:

Article.create :title =>   "I Love ElasticSearch",
               :content => "...",
               :author =>  "Captain Nemo",
               :published_on => Time.now

it is automatically added into the index, because of the included callbacks. The document attributes are indexed exactly as when you call the Article#to_json method.

Now you can search the records:

Article.search 'love'

OK. Often, this is where the game stops. Not here.

First of all, you may use the full query DSL, as explained above, with filters, sorting, advanced facet aggregation, highlighting, etc:

q = 'love'
Article.search do
  query { string q }
  facet('timeline') { date :published_on, :interval => 'month' }
  sort  { published_on 'desc' }
end

Dynamic mapping is a godsend when you're prototyping. For serious usage, though, you'll definitely want to define a custom mapping for your model:

class Article < ActiveRecord::Base
  include Slingshot::Model::Search
  include Slingshot::Model::Callbacks

  mapping do
    indexes :id,           :type => 'string',  :index => 'not_analyzed'
    indexes :title,        :type => 'string',  :analyzer => 'snowball', :boost => 100
    indexes :content,      :type => 'string',  :analyzer => 'snowball'
    indexes :author,       :type => 'string',  :analyzer => 'keyword'
    indexes :published_on, :type => 'date',    :include_in_all => false
  end
end

In this case, only the defined model attributes are indexed when adding to the index.

When you want a tight grip on how your model attributes are added to the index, just provide the to_indexed_json method yourself:

class Article < ActiveRecord::Base
  include Slingshot::Model::Search
  include Slingshot::Model::Callbacks

  def to_indexed_json
    names      = author.split(/\W/)
    last_name  = names.pop
    first_name = names.join

    {
      :title   => title,
      :content => content,
      :author  => {
        :first_name => first_name,
        :last_name  => last_name
      }
    }.to_json
  end

end
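
To see what such a method produces, the same logic can be run outside Rails. In this sketch a plain Struct stands in for the ActiveRecord model, which is an assumption made only so the snippet is self-contained:

```ruby
require 'json'

# Stand-in for the model, so the method can run without Rails
Article = Struct.new(:title, :content, :author) do
  def to_indexed_json
    names      = author.split(/\W/)
    last_name  = names.pop
    first_name = names.join

    {
      :title   => title,
      :content => content,
      :author  => { :first_name => first_name, :last_name => last_name }
    }.to_json
  end
end

article = Article.new('I Love ElasticSearch', '...', 'Captain Nemo')
puts article.to_indexed_json
```

Note that names.join glues the remaining parts together without a separator, so an author such as "Jules Gabriel Verne" would end up with the first name "JulesGabriel".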

Note that Slingshot-enhanced models are fully compatible with will_paginate, so you can pass any parameters to the search method in the controller, as usual:

@articles = Article.search params[:q], :page => (params[:page] || 1)

OK. Chances are, you have lots of records stored in the underlying database. How will you get them to ElasticSearch? Easy:

Article.index.import Article.all

However, this way, all your records are loaded into memory, serialized into JSON, and sent down the wire to ElasticSearch. Not practical, you say? You're right.

Provided your model implements some sort of pagination (and it probably does, for so much data), you can just run:

Article.import

In this case, the Article.paginate method is called, and your records are sent to the index in chunks of 1000. If that number doesn't suit you, just provide a better one:

Article.import :per_page => 100

Any other parameters you provide to the import method are passed down to the paginate method.
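
The idea behind the chunked import can be sketched in plain Ruby. The snippet below is not Slingshot's implementation: RECORDS and each_page are hypothetical stand-ins for a paginated model, and it only records batch sizes where a real import would send each batch to the index.

```ruby
# Hypothetical stand-in for a paginated model with 2,500 records
RECORDS = (1..2500).map { |id| { :id => id.to_s, :title => "Article #{id}" } }

# Yields the records page by page instead of as one giant batch
def each_page(records, per_page = 1000)
  page = 1
  loop do
    batch = records[(page - 1) * per_page, per_page] || []
    break if batch.empty?
    yield batch, page
    page += 1
  end
end

batch_sizes = []
each_page(RECORDS, 1000) do |batch, page|
  # A real import would send the batch to the index here
  batch_sizes << batch.size
end

puts batch_sizes.inspect
```

Only one page of records is handled at a time, which is what makes this approach practical for large tables.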

Are we saying you have to fiddle with this thing in a Rails console or silly Ruby scripts? No. Just call the included Rake task on the command line:

$ rake environment slingshot:import CLASS='Article'

You can also force-import the data by deleting the index first (and creating it with mapping provided by the mapping block in your model):

$ rake environment slingshot:import CLASS='Article' FORCE=true

When you spend more time with ElasticSearch, you'll notice how index aliases are the best idea since the invention of the inverted index. You can index your data into a fresh index (and possibly update an alias if everything's fine):

$ rake environment slingshot:import CLASS='Article' INDEX='articles-2011-05'
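
Updating the alias afterwards is a separate step. As a rough sketch, this is the kind of body one could POST to ElasticSearch's /_aliases endpoint to repoint an alias; the alias_swap helper, the 'articles' alias name and the 'articles-2011-04' index are all hypothetical.

```ruby
require 'json'

# Hypothetical helper: builds the body for a POST to the /_aliases endpoint
def alias_swap(alias_name, old_index, new_index)
  {
    :actions => [
      { :remove => { :index => old_index, :alias => alias_name } },
      { :add    => { :index => new_index, :alias => alias_name } }
    ]
  }
end

puts alias_swap('articles', 'articles-2011-04', 'articles-2011-05').to_json
```

Both actions go in one request, so searches against the alias switch from the old index to the fresh one in a single step.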

If you're the type who has no time for long introductions, you can generate a fully working example Rails application, with an ActiveRecord model and a search form, to play with:

$ rails new searchapp -m https://github.com/karmi/slingshot/raw/master/examples/rails-application-template.rb

OK. All this time we have been talking about ActiveRecord models, since ActiveRecord is the reasonable Rails default for the storage layer.

But what if you use another database, such as MongoDB, with another object mapping library, such as Mongoid?

Well, things stay mostly the same:

class Article
  include Mongoid::Document
  field :title, :type => String
  field :content, :type => String

  include Slingshot::Model::Search
  include Slingshot::Model::Callbacks

  # Let's use a different index name so stuff doesn't get mixed up
  #
  index_name 'mongo-articles'

  # These Mongo guys sure do some funky stuff with their IDs
  # in +serializable_hash+, let's fix it.
  #
  def to_indexed_json
    self.to_json
  end

end

Article.create :title => 'I Love ElasticSearch'

Article.search 'love'

That's kinda nice. But there's more.

Slingshot implements not only searchable features, but also persistence features.

This means that you can use a Slingshot model instead of your database, not just for searching your database. Why would you like to do that?

Well, because you're tired of database migrations and lots of hand-holding with your database to store stuff like { :name => 'Slingshot', :tags => [ 'ruby', 'search' ] }. Because what you need is to just dump a JSON representation of your data into a database and load it back when needed. Because you've noticed that searching your data is a much more effective way of retrieval than constructing elaborate database query conditions. Because you have lots of data and want to use ElasticSearch's advanced distributed features.

To use the persistence features, you have to include the Slingshot::Model::Persistence module in your class and define the properties (analogous to the way you do with CouchDB- or MongoDB-based models):

class Article
  include Slingshot::Model::Persistence
  include Slingshot::Model::Search
  include Slingshot::Model::Callbacks

  validates_presence_of :title, :author

  property :title
  property :author
  property :content
  property :published_on

end

Of course, not all validations or ActionPack helpers will be available to your models, but if you can live with that, you've just got a schema-free, highly-scalable storage and retrieval engine for your data.

Todo, Plans & Ideas

Slingshot is already used in production by its authors. Nevertheless, it's not considered finished yet.

There are todos, plans and ideas, some of which are listed below, in the order of importance:

Other Clients

Check out other ElasticSearch clients.

Feedback

You can send feedback via e-mail or via GitHub Issues.


Karel Minarik