Chewy

Chewy is ODM and wrapper for official elasticsearch client (https://github.com/elasticsearch/elasticsearch-ruby)

Why chewy?

Multi-model indexes.

Index classes are independent from ORM/ODM models. Now implementing, e.g. cross-model autocomplete is much easier. You can just define index and work with it in object-oriented style. You can define several types for index - one per indexed model.

Every index is observable by all the related models.

Most of the indexed models are related to other and sometimes it is nessesary to denormalize this related data and put at the same object. For example, you need to index array of tags with article together. Chewy allows you to specify updatable index for every model separately. So, corresponding articles will be reindexed on any tag update.

Bulk import everywhere.

Chewy utilizes bulk ES API for full reindexing or index updates. Also it uses atomic updates concept. All the changed objects are collected inside the atomic block and index is updated once at the end of it with all the collected object. See Chewy.strategy(:atomic) for more details.

Powerful querying DSL.

Chewy has AR-style query DSL. It is chainable, mergable and lazy. So you can produce queries in the most efficient way. Also it has object-oriented query and filter builders.

Installation

Add this line to your application's Gemfile:

gem 'chewy'

And then execute:

$ bundle

Or install it yourself as:

$ gem install chewy

Usage

Client settings

There are 2 ways to configure Chewy client: Chewy.settings hash and chewy.yml

You can create this file manually or run rails g chewy:install to do that with yaml way

# config/initializers/chewy.rb
Chewy.settings = {host: 'localhost:9250'} # do not use environments

# config/chewy.yml
# separate environment configs
test:
  host: 'localhost:9250'
  prefix: 'test'
development:
  host: 'localhost:9200'

The result config merges both hashes. Client options are passed as is to Elasticsearch::Transport::Client except the :prefix - it is used internally by chewy to create prefixed index names:

  Chewy.settings = {prefix: 'test'}
  UsersIndex.index_name # => 'test_users'

Also logger might be set explicitly:

Chewy.logger = Logger.new

See config.rb for more details.

Index definition

Create /app/chewy/users_index.rb

  class UsersIndex < Chewy::Index

  end

Add one or more types mapping

  class UsersIndex < Chewy::Index
    define_type User.active # or just model instead_of scope: define_type User
  end

Newly-defined index type class is accessible via UsersIndex.user or UsersIndex::User

Add some type mappings

  class UsersIndex < Chewy::Index
    define_type User.active.includes(:country, :badges, :projects) do
      field :first_name, :last_name # multiple fields without additional options
      field :email, analyzer: 'email' # elasticsearch-related options
      field :country, value: ->(user) { user.country.name } # custom value proc
      field :badges, value: ->(user) { user.badges.map(&:name) } # passing array values to index
      field :projects do # the same block syntax for multi_field, if `:type` is specified
        field :title
        field :description # default data type is `string`
        # additional top-level objects passed to value proc:
        field :categories, value: ->(project, user) { project.categories.map(&:name) if user.active? }
      end
      field :rating, type: 'integer' # custom data type
      field :created, type: 'date', include_in_all: false,
        value: ->{ created_at } # value proc for source object context
    end
  end

Mapping definitions - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping.html

Add some index- and type-related settings. Analyzers repositories might be used as well. See Chewy::Index.settings docs for details:

  class UsersIndex < Chewy::Index
    settings analysis: {
      analyzer: {
        email: {
          tokenizer: 'keyword',
          filter: ['lowercase']
        }
      }
    }

    define_type User.active.includes(:country, :badges, :projects) do
      root date_detection: false do
        template 'about_translations.*', type: 'string', analyzer: 'standard'

        field :first_name, :last_name
        field :email, analyzer: 'email'
        field :country, value: ->(user) { user.country.name }
        field :badges, value: ->(user) { user.badges.map(&:name) }
        field :projects do
          field :title
          field :description
        end
        field :about_translations, type: 'object' # pass object type explicitely if necessary
        field :rating, type: 'integer'
        field :created, type: 'date', include_in_all: false,
          value: ->{ created_at }
      end
    end
  end

Index settings - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html Root object settings - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-root-object-type.html

See mapping.rb for more details.

Add model observing code

  class User < ActiveRecord::Base
    update_index('users#user') { self } # specifying index, type and backreference
                                        # for updating after user save or destroy
  end

  class Country < ActiveRecord::Base
    has_many :users

    update_index('users#user') { users } # return single object or collection
  end

  class Project < ActiveRecord::Base
    update_index('users#user') { user if user.active? } # you can return even `nil` from the backreference
  end

  class Badge < ActiveRecord::Base
    has_and_belongs_to_many :users

    update_index('users') { users } # if index has only one type
                                    # there is no need to specify updated type
  end

Also, you can use second argument for method name passing:

  update_index('users#user', :self)
  update_index('users#user', :users)

Types access

You are able to access index-defined types with the following API:

UsersIndex::User # => UsersIndex::User
UsersIndex.type_hash['user'] # => UsersIndex::User
UsersIndex.user # => UsersIndex::User
UsersIndex.types # => [UsersIndex::User]
UsersIndex.type_names # => ['user']

Index manipulation

UsersIndex.delete # destroy index if exists
UsersIndex.delete!

UsersIndex.create
UsersIndex.create! # use bang or non-bang methods

UsersIndex.purge
UsersIndex.purge! # deletes then creates index

UsersIndex::User.import # import with 0 arguments process all the data specified in type definition
                        # literally, User.active.includes(:country, :badges, :projects).find_in_batches
UsersIndex::User.import User.where('rating > 100') # or import specified users scope
UsersIndex::User.import User.where('rating > 100').to_a # or import specified users array
UsersIndex::User.import [1, 2, 42] # pass even ids for import, it will be handled in the most effective way

UsersIndex.import # import every defined type
UsersIndex.import user: User.where('rating > 100') # import only active users to `user` type.
  # Other index types, if exists, will be imported with default scope from the type definition.
UsersIndex.reset! # purges index and imports default data for all types

Also if passed user is #destroyed?, or satisfy delete_if type option, or specified id does not exists in the database, import will perform delete from index action for this object.

define_type User, delete_if: :deleted_at
define_type User, delete_if: -> { deleted_at }
define_type User, delete_if: ->(user) { user.deleted_at }

See actions.rb for more details.

Index update strategies

Assume you've got the following code:

class City < ActiveRecord::Base
  update_index 'cities#city', :self
end

class CitiesIndex < Chewy::Index
  define_type City do
    field :name
  end
end

If you'll perform something like City.first.save! you'll get UndefinedUpdateStrategy exception instead of normal object saving and index update. This exception forces you to choose appropriate update strategy for current context.

If you want to return behavior was before 0.7.0 - just set Chewy.root_strategy = :bypass.

`:atomic`

The main strategy here is :atomic. Assume you have to update a lot of records in db.

Chewy.strategy(:atomic) do
  City.popular.map(&:do_some_update_action!)
end

Using this strategy delays index update request until the end of block. Updated records are aggregated and index update happens with bulk API. So this strategy is highly optimized.

`:urgent`

Next strategy is convenient if you are going to update documents in index one-by-one.

Chewy.strategy(:urgent) do
  City.popular.map(&:do_some_update_action!)
end

This code would perform City.popular.count requests for ES documents update.

Seems to be convenient for usage in e.g. rails console with non-block notation:

> Chewy.strategy(:urgent)
> City.popular.map(&:do_some_update_action!)

`:bypass`

Bypass strategy simply silences index updates.

Nesting

Strategies are designed to allow nesting, so it is possible to redefine it for nested contexts.

Chewy.strategy(:atomic) do
  city1.do_update!
  Chewy.strategy(:urgent) do
    city2.do_update!
    city3.do_update!
    # there will be 2 update index requests for city2 and city3
  end
  city4..do_update!
  # city1 and city4 will be grouped in one index update request
end

Non-block notation

It is possible to nest strategies without blocks:

Chewy.strategy(:urgent)
city1.do_update! # index updated
Chewy.strategy(:bypass)
city2.do_update! # update bypassed
Chewy.strategy.pop
city3.do_update! # index updated again

Designing own strategies

Async strategy is not implemented yet, but it is planned. So it would be a good idea to implements own async strategy for particular delayed jobs library or simply threads.

See strategy/base.rb for more details. See strategy/atomic.rb for example.

Async reindexing

Chewy is not support async index update, but it's planned. Until you can use third-party solutions, such as https://github.com/averell23/chewy_kiqqer

Not sure it works currently.

Rails application strategies integration

There is a couple of pre-defined strategies for your rails application. At first, rails console uses :urgent strategy by default, except the sandbox case. Whan you are running sandbox it switches to bypass strategy to avoid index polluting.

Also migrations are wrapped with :bypass strategy. Because the main behavor implies that indexes are resetted after migration, so there is no need for extra index updates. Also indexing might be broken during migrations because of the outdated schema.

Controller actions are wrapped with :atomic strategy with middleware just to reduce amount of index update requests inside actions.

It is also a good idea to set up :bypass strategy inside your test suite and import objects manually only when needed, plus use Chewy.massacre when needed to flush test ES indexes before every example. This will allow to minimize unnecessary ES requests and reduce overhead.

RSpec.configure do |config|
  config.before(:suite) do
    Chewy.strategy(:bypass)
  end
end

Index querying

scope = UsersIndex.query(term: {name: 'foo'})
  .filter(range: {rating: {gte: 100}})
  .order(created: :desc)
  .limit(20).offset(100)

scope.to_a # => will produce array of UserIndex::User or other types instances
scope.map { |user| user.email }
scope.total_count # => will return total objects count

scope.per(10).page(3) # supports kaminari pagination
scope.explain.map { |user| user._explanation }
scope.only(:id, :email) # returns ids and emails only

scope.merge(other_scope) # queries could be merged

Also, queries can be performed on a type individually

UsersIndex::User.filter(term: {name: 'foo'}) # will return UserIndex::User collection only

If you are performing more than one filter or query in the chain, all the filters and queries will be concatenated in the way specified by filter_mode and query_mode respectively.

Default filter_mode is :and and default query_mode is bool.

Available filter modes are: :and, :or, :must, :should and any minimum_should_match-acceptable value

Available query modes are: :must, :should, :dis_max, any minimum_should_match-acceptable value or float value for dis_max query with tie_breaker specified.

UsersIndex::User.filter{ name == 'Fred' }.filter{ age < 42 } # will be wrapped with `and` filter
UsersIndex::User.filter{ name == 'Fred' }.filter{ age < 42 }.filter_mode(:should) # will be wrapped with bool `should` filter
UsersIndex::User.filter{ name == 'Fred' }.filter{ age < 42 }.filter_mode('75%') # will be wrapped with bool `should` filter with `minimum_should_match: '75%'`

See query.rb for more details.

Additional query action.

You may also perform additional actions on query scope, such as deleting of all the scope documents:

UsersIndex.delete_all
UsersIndex::User.delete_all
UsersIndex.filter{ age < 42 }.delete_all
UsersIndex::User.filter{ age < 42 }.delete_all

Filters query DSL.

There is a test version of filters creating DSL:

UsersIndex.filter{ name == 'Fred' } # will produce `term` filter.
UsersIndex.filter{ age <= 42 } # will produce `range` filter.

The basis of the DSL is expression. There are 2 types of expressions:

Simple function

  UsersIndex.filter{ s('doc["num"] > 1') } # script expression
  UsersIndex.filter{ q(query_string: {query: 'lazy fox'}) } # query expression

Field-dependant composite expression. Consists of the field name (with dot notation or not), value and action operator between them. Field name might take additional options for passing to the result expression.

  UsersIndex.filter{ name == 'Name' } # simple field term filter
  UsersIndex.filter{ name(:bool) == ['Name1', 'Name2'] } # terms query with `execution: :bool` option passed
  UsersIndex.filter{ answers.title =~ /regexp/ } # regexp filter for `answers.title` field

You can combine expressions as you wish with combination operators help

UsersIndex.filter{ (name == 'Name') & (email == 'Email') } # combination produces `and` filter
UsersIndex.filter{
  must(
    should(name =~ 'Fr').should_not(name == 'Fred') & (age == 42), email =~ /gmail\.com/
  ) | ((roles.admin == true) & name?)
} # many of the combination possibilities

Also, there is a special syntax for cache enabling:

UsersIndex.filter{ ~name == 'Name' } # you can apply tilda to the field name
UsersIndex.filter{ ~(name == 'Name') } # or to the whole expression

# if you are applying cache to the one part of range filter
# the whole filter will be cached:
UsersIndex.filter{ ~(age > 42) & (age <= 50) }

# You can pass cache options as a field option also.
UsersIndex.filter{ name(cache: true) == 'Name' }
UsersIndex.filter{ name(cache: false) == 'Name' }

# With regexp filter you can pass _cache_key
UsersIndex.filter{ name(cache: 'name_regexp') =~ /Name/ }
# Or not
UsersIndex.filter{ name(cache: true) =~ /Name/ }

Compliance cheatsheet for filters and DSL expressions:

Term filter

  {"term": {"name": "Fred"}}
  {"not": {"term": {"name": "Johny"}}}

  UsersIndex.filter{ name == 'Fred' }
  UsersIndex.filter{ name != 'Johny' }

Terms filter

  {"terms": {"name": ["Fred", "Johny"]}}
  {"not": {"terms": {"name": ["Fred", "Johny"]}}}

  {"terms": {"name": ["Fred", "Johny"], "execution": "or"}}

  {"terms": {"name": ["Fred", "Johny"], "execution": "and"}}

  {"terms": {"name": ["Fred", "Johny"], "execution": "bool"}}

  {"terms": {"name": ["Fred", "Johny"], "execution": "fielddata"}}

  UsersIndex.filter{ name == ['Fred', 'Johny'] }
  UsersIndex.filter{ name != ['Fred', 'Johny'] }

  UsersIndex.filter{ name(:|) == ['Fred', 'Johny'] }
  UsersIndex.filter{ name(:or) == ['Fred', 'Johny'] }
  UsersIndex.filter{ name(execution: :or) == ['Fred', 'Johny'] }

  UsersIndex.filter{ name(:&) == ['Fred', 'Johny'] }
  UsersIndex.filter{ name(:and) == ['Fred', 'Johny'] }
  UsersIndex.filter{ name(execution: :and) == ['Fred', 'Johny'] }

  UsersIndex.filter{ name(:b) == ['Fred', 'Johny'] }
  UsersIndex.filter{ name(:bool) == ['Fred', 'Johny'] }
  UsersIndex.filter{ name(execution: :bool) == ['Fred', 'Johny'] }

  UsersIndex.filter{ name(:f) == ['Fred', 'Johny'] }
  UsersIndex.filter{ name(:fielddata) == ['Fred', 'Johny'] }
  UsersIndex.filter{ name(execution: :fielddata) == ['Fred', 'Johny'] }

Regexp filter (== and =~ are equivalent)

  {"regexp": {"name.first": "s.*y"}}

  {"not": {"regexp": {"name.first": "s.*y"}}}

  {"regexp": {"name.first": {"value": "s.*y", "flags": "ANYSTRING|INTERSECTION"}}}

  UsersIndex.filter{ name.first == /s.*y/ }
  UsersIndex.filter{ name.first =~ /s.*y/ }

  UsersIndex.filter{ name.first != /s.*y/ }
  UsersIndex.filter{ name.first !~ /s.*y/ }

  UsersIndex.filter{ name.first(:anystring, :intersection) == /s.*y/ }
  UsersIndex.filter{ name.first(flags: [:anystring, :intersection]) == /s.*y/ }

Prefix filter

  {"prefix": {"name": "Fre"}}
  {"not": {"prefix": {"name": "Joh"}}}

  UsersIndex.filter{ name =~ re' }
  UsersIndex.filter{ name !~ 'Joh' }

Exists filter

  {"exists": {"field": "name"}}

  UsersIndex.filter{ name? }
  UsersIndex.filter{ !!name }
  UsersIndex.filter{ !!name? }
  UsersIndex.filter{ name != nil }
  UsersIndex.filter{ !(name == nil) }

Missing filter

  {"missing": {"field": "name", "existence": true, "null_value": false}}
  {"missing": {"field": "name", "existence": true, "null_value": true}}
  {"missing": {"field": "name", "existence": false, "null_value": true}}

  UsersIndex.filter{ !name }
  UsersIndex.filter{ !name? }
  UsersIndex.filter{ name == nil }

Range

  {"range": {"age": {"gt": 42}}}
  {"range": {"age": {"gte": 42}}}
  {"range": {"age": {"lt": 42}}}
  {"range": {"age": {"lte": 42}}}

  {"range": {"age": {"gt": 40, "lt": 50}}}
  {"range": {"age": {"gte": 40, "lte": 50}}}

  {"range": {"age": {"gt": 40, "lte": 50}}}
  {"range": {"age": {"gte": 40, "lt": 50}}}

  UsersIndex.filter{ age > 42 }
  UsersIndex.filter{ age >= 42 }
  UsersIndex.filter{ age < 42 }
  UsersIndex.filter{ age <= 42 }

  UsersIndex.filter{ age == (40..50) }
  UsersIndex.filter{ (age > 40) & (age < 50) }
  UsersIndex.filter{ age == [40..50] }
  UsersIndex.filter{ (age >= 40) & (age <= 50) }

  UsersIndex.filter{ (age > 40) & (age <= 50) }
  UsersIndex.filter{ (age >= 40) & (age < 50) }

Bool filter

  {"bool": {
    "must": [{"term": {"name": "Name"}}],
    "should": [{"term": {"age": 42}}, {"term": {"age": 45}}]
  }}

  UsersIndex.filter{ must(name == 'Name').should(age == 42, age == 45) }

And filter

  {"and": [{"term": {"name": "Name"}}, {"range": {"age": {"lt": 42}}}]}

  UsersIndex.filter{ (name == 'Name') & (age < 42) }

Or filter

  {"or": [{"term": {"name": "Name"}}, {"range": {"age": {"lt": 42}}}]}

  UsersIndex.filter{ (name == 'Name') | (age < 42) }

  {"not": {"term": {"name": "Name"}}}
  {"not": {"range": {"age": {"lt": 42}}}}

  UsersIndex.filter{ !(name == 'Name') } # or UsersIndex.filter{ name != 'Name' }
  UsersIndex.filter{ !(age < 42) }

Match all filter

  {"match_all": {}}

  UsersIndex.filter{ match_all }

Has child filter

  {"has_child": {"type": "blog_tag", "query": {"term": {"tag": "something"}}}
  {"has_child": {"type": "comment", "term": {"term": {"user": "john"}}}

  UsersIndex.filter{ has_child(:blog_tag).query(term: {tag: 'something'}) }
  UsersIndex.filter{ has_child(:comment).filter{ user == 'john' } }

Has parent filter

  {"has_parent": {"type": "blog", "query": {"term": {"tag": "something"}}}}
  {"has_parent": {"type": "blog", "filter": {"term": {"text": "bonsai three"}}}}

  UsersIndex.filter{ has_parent(:blog).query(term: {tag: 'something'}) }
  UsersIndex.filter{ has_parent(:blog).filter{ text == 'bonsai three' } }

See filters.rb for more details.

Faceting

Facets are an optional sidechannel you can request from elasticsearch describing certain fields of the resulting collection. The most common use for facets is to allow the user continue filtering specifically within the subset, as opposed to the global index.

For instance, let's request the country field as a facet along with our users collection. We can do this with the #facets method like so:

UsersIndex.filter{ [...] }.facets({countries: {terms: {field: 'country'}}})

Let's look at what we asked from elasticsearch. The facets setter method accepts a hash. You can choose custom/semantic key names for this hash for your own convinience (in this case I used the plural version of the actual field), in our case: countries. The following nested hash tells ES to grab and aggregate values (terms) from the country field on our indexed records.

When the response comes back, it will have the :facets sidechannel included:

< { ... ,"facets":{"countries":{"_type":"terms","missing":?,"total":?,"other":?,"terms":[{"term":"USA","count":?},{"term":"Brazil","count":?}, ...}}

Script scoring

Script scoring is used to score the search results. All scores are added to the search request and combined according to boost mode and score mode. This can be useful if, for example, a score function is computationally expensive and it is sufficient to compute the score on a filtered set of documents. For example, you might want to multiply the score by another numeric field in the doc:

UsersIndex.script_score("_score * doc['my_numeric_field'].value")

Boost Factor

Boost factors are a way to add a boost to a query where documents match the filter. If you have some users who are experts and some are regular users, you might want to give the experts a higher score and boost to the top of the search results. You can accomplish this by using the #boost_factor method and adding a boost score for an expert user of 5:

UsersIndex.boost_factor(5, filter: {term: {type: 'Expert'}})

Objects loading

It is possible to load source objects from database for every search result:

scope = UsersIndex.filter(range: {rating: {gte: 100}})

scope.load # => scope is marked to return User instances array
scope.load.query(...) # => since objects are loaded lazily you can complete scope
scope.load(user: { scope: ->{ includes(:country) }}) # you can also pass loading scopes for each
                                                     # possibly returned type
scope.load(user: { scope: User.includes(:country) }) # the second scope passing way.
scope.load(scope: ->{ includes(:country) }) # and more common scope applied to every loaded object type.

scope.only(:id).load # it is optimal to request ids only if you are not planning to use type objects

The preload method takes the same options as load and ORM/ODM objects will be loaded, but scope will still return array of Chewy wrappers. To access real objects use _object wrapper method:

UsersIndex.filter(range: {rating: {gte: 100}}).preload(...).query(...).map(&:_object)

See loading.rb for more details.

`ActiveSupport::Notifications` support

Chewy has notifing the following events:

`search_query.chewy` payload

payload[:index]: requested index class
payload[:request]: request hash

`import_objects.chewy` payload

payload[:type]: currently imported type
payload[:import]: imports stast, total imported and deleted objects count:
```
{index: 30, delete: 5}
```

payload[:errors]: might not exists. Contains grouped errors with objects ids list:

{index: {
  'error 1 text' => ['1', '2', '3'],
  'error 2 text' => ['4']
}, delete: {
  'delete error text' => ['10', '12']
}}

NewRelic integration

To integrate with NewRelic you may use the following example source (config/initializers/chewy.rb):

ActiveSupport::Notifications.subscribe('import_objects.chewy') do |name, start, finish, id, payload|
  metric_name = "Database/ElasticSearch/import"
  duration = (finish - start).to_f
  logged = "#{payload[:type]} #{payload[:import].to_a.map{ |i| i.join(':') }.join(', ')}"

  self.class.trace_execution_scoped([metric_name]) do
    NewRelic::Agent.instance.transaction_sampler.notice_sql(logged, nil, duration)
    NewRelic::Agent.instance.sql_sampler.notice_sql(logged, metric_name, nil, duration)
    NewRelic::Agent.record_metric(metric_name, duration)
  end
end

ActiveSupport::Notifications.subscribe('search_query.chewy') do |name, start, finish, id, payload|
  metric_name = "Database/ElasticSearch/search"
  duration = (finish - start).to_f
  logged = "#{payload[:type].presence || payload[:index]} #{payload[:request]}"

  self.class.trace_execution_scoped([metric_name]) do
    NewRelic::Agent.instance.transaction_sampler.notice_sql(logged, nil, duration)
    NewRelic::Agent.instance.sql_sampler.notice_sql(logged, metric_name, nil, duration)
    NewRelic::Agent.record_metric(metric_name, duration)
  end
end

Rake tasks

Inside Rails application some index mantaining rake tasks are defined.

rake chewy:reset:all # resets all the existing indexes, declared in app/chewy
rake chewy:reset # alias for chewy:reset:all
rake chewy:reset[users] # resets UsersIndex

rake chewy:update:all # updates all the existing indexes, declared in app/chewy
rake chewy:update # alias for chewy:update:all
rake chewy:update[users] # updates UsersIndex

Rspec integration

Just add require 'chewy/rspec' to your spec_helper.rb and you will get additional features: See update_index.rb for more details.

TODO a.k.a coming soon:

Typecasting support
Advanced (simplyfied) query DSL: UsersIndex.query { email == '[email protected]' } will produce term query
update_all support
Maybe, closer ORM/ODM integration, creating index classes implicitly
Async indexes updating

Contributing

Fork it ( http://github.com/toptal/chewy/fork )
Create your feature branch (git checkout -b my-new-feature)
Implement your changes, cover it with specs and make sure old specs are passing
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create new Pull Request

Use the following Rake tasks to control ElasticSearch cluster while developing.

rake elasticsearch:start # start Elasticsearch cluster on 9250 port for tests
rake elasticsearch:stop # stop Elasticsearch