NAME

mongoid-haystack.rb

DESCRIPTION

mongoid-haystack provides a zero-config, POLS, pure mongo, fulltext search solution for your mongoid models.

INSTALL

rubygems: gem install mongoid-haystack

Gemfile: gem 'mongoid-haystack'

rake db:mongoid:create_indexes # IMPORTANT


    # you might want this in lib/tasks/db.rake ...
    #

      namespace :db do
        namespace :mongoid do
          task :create_indexes do
            Mongoid::Haystack.create_indexes
          end
        end
      end

SYNOPSIS


  # simple usage is simple
  #
    class Article
      include Mongoid::Document
      include Mongoid::Haystack

      field(:content, :type => String)
    end

    Article.create!(:content => 'teh cats')

    results = Article.search('cat')

    article = results.first.model

  # by default 'search' returns a Mongoid::Criteria object.  the result set will
  # be full of objects that refer to a model in your app via a polymorphic
  # relation out.  aka
  #
  #   Article.search('foobar').first.class       #=> Mongoid::Haystack::Index
  #   Article.search('foobar').first.model.class #=> Article
  #
  # in an index view you are not going to want to expand the search index
  # objects into full blown models one at a time (N+1) so you can use the
  # 'models' method on the collection to efficiently expand the collection into
  # your application models with the fewest possible queries.  note that
  # 'models' is a terminal operator.  that is to say it returns an array and,
  # afterwards, no more fancy query language is gonna work.
  #
    @results =
      Mongoid::Haystack.search('needle').models

  # pagination is supported *out of the box*.  note that you should chain it
  # *before* any call to 'models' as 'models' is a terminal operator: it returns
  # an array and *not* a Mongoid::Criteria object
  #
    @models = 
      Mongoid::Haystack.search('needle').
        paginate(:page => 3, :size => 42).
          models


  # haystack stems the search terms and does score based sorting all using a
  # fast b-tree 
  #
    a = Article.create!(:content => 'cats are awesome')
    b = Article.create!(:content => 'dogs eat cats')
    c = Article.create!(:content => 'dogs dogs dogs')

    results = Article.search('dogs cats').models
    results == [b, a, c] #=> true

    results = Article.search('awesome').models
    results == [a] #=> true


  # cross model searching (site search) is supported out of the box, and models
  # can customise how they are indexed:
  #
  # - a global score lets some models appear higher in the global results
  #
  # - keywords count more than fulltext 
  #
    class Article
      include Mongoid::Document
      include Mongoid::Haystack

      field(:title, :type => String)
      field(:content, :type => String)

      def to_haystack
        { :score => 11, :keywords => title, :fulltext => content }
      end
    end

    class Comment
      include Mongoid::Document
      include Mongoid::Haystack

      field(:content, :type => String)

      def to_haystack
        { :score => -11, :fulltext => content }
      end
    end

    a1 = Article.create!(:title => 'hot pants', :content => 'teh b 52s rock')
    a2 = Article.create!(:title => 'boring title', :content => 'but hot content that rocks')

    c = Comment.create!(:content => 'those guys rock')

    results = Mongoid::Haystack.search('rock')
    results.count #=> 3

    models = results.models
    models == [a1, a2, c]  #=> true. articles first because we generally score them higher

    results = Mongoid::Haystack.search('hot')
    models = results.models
    models == [a1, a2]  #=> true. because keywords score higher than general fulltext


  # you can decorate your search items with arbitrary meta data and filter
  # searches by it later.  this too uses a speedy b-tree index.
  #
    class Article
      include Mongoid::Document
      include Mongoid::Haystack

      belongs_to :author, :class_name => '::User'

      field(:title, :type => String)
      field(:content, :type => String)

      def to_haystack
        { 
          :score    => author.popularity,
          :keywords => title,
          :fulltext => content,
          :facets   => {:author_id => author.id}
        }
      end
    end

    a = 
      author.articles.create!(
        :title => 'iggy and keith',
        :content => 'seen the needles and the damage done...'
      )

    articles_for_teh_author =
      Article.search('needle', :facets => {:author_id => author.id})


DESCRIPTION

there are two main pathways to understand in the code.

1) shit going into the index. 2) shit coming out of the index.

shit going in entails:

  • stem and stopword the terms being indexed
  • create or update a token for each
  • create an index item referencing all the tokens with precomputed scores

for example the terms 'dog dogs cat' might result in these tokens


  [
    {
      '_id'   : '0x1',
      'value' : 'dog',
      'count' : 2
    },


    {
      '_id'   : '0x2',
      'value' : 'cat',
      'count' : 1
    }
  ]

being created|updated and this index item


    {
      '_id'        : '50c11759a04745961e000001',

      'model_type' : 'Article',
      'model_id'   : '50c11775a04745461f000001',

      'tokens'     : ['0x1', '0x2'],

      'score'      : 10,

      'keyword_scores' : {
        '0x1' : 2,
        '0x2' : 1
      },

      'fulltext_scores' : {
      }
    }


being built
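
the indexing steps above can be sketched in plain ruby, with no mongo involved. this is only an illustration and not the library's code: STOPWORDS, stem, tokenize, TokenStore and build_index_item are hypothetical names, the 'stemmer' just strips a trailing 's', and the stopword list is made up.

```ruby
# hypothetical sketch of the indexing path, not the library's implementation
STOPWORDS = %w[a an and are of the to].freeze

# crude stand-in for a real stemmer: strips a single trailing 's'
def stem(word)
  word.sub(/s\z/, '')
end

# lowercase, split into words, drop stopwords, then stem what is left
def tokenize(text)
  words = text.downcase.scan(/[[:alnum:]]+/)
  words.reject { |word| STOPWORDS.include?(word) }.map { |word| stem(word) }
end

# in-memory stand-in for the token collection
class TokenStore
  attr_reader :tokens

  def initialize
    @tokens = {} # stem => token hash
    @sequence = 0
  end

  # create or update the token for a stem, bumping its global count
  def add(value)
    token = (@tokens[value] ||= { '_id' => next_id, 'value' => value, 'count' => 0 })
    token['count'] += 1
    token
  end

  private

  # sequence generator that hands out hex ids: 0x1, 0x2, ...
  def next_id
    @sequence += 1
    format('0x%x', @sequence)
  end
end

# build one index item with per-token scores precomputed for both fields
def build_index_item(store, model_type, model_id, keywords: '', fulltext: '', score: 0)
  item = {
    'model_type' => model_type, 'model_id' => model_id, 'tokens' => [],
    'score' => score, 'keyword_scores' => Hash.new(0), 'fulltext_scores' => Hash.new(0)
  }
  { 'keyword_scores' => keywords, 'fulltext_scores' => fulltext }.each do |field, text|
    tokenize(text).each do |value|
      id = store.add(value)['_id']
      item['tokens'] << id unless item['tokens'].include?(id)
      item[field][id] += 1
    end
  end
  item
end

store = TokenStore.new
item = build_index_item(store, 'Article', 42, :keywords => 'dog dogs cat', :score => 10)

item['tokens']         #=> ['0x1', '0x2']
item['keyword_scores'] #=> {'0x1' => 2, '0x2' => 1}
```

a real stemmer and stopword list would behave differently on most input. the point is only the shape of the data: one token per stem, counted globally, and one index item per model holding token ids and scores.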

some other information is tracked, but the two normal mongoid models

  • Mongoid::Haystack::Token
  • Mongoid::Haystack::Index

are simple to look at and comprise 80% of the library functionality.

a few things to notice:

  • tokens are counted and auto-id'd using hex notation and a sequence generator. the reason for this is so that their ids are legit hash keys in the keyword and fulltext score hashes (they are also smaller than 12 byte object_ids or the words themselves). aka this sort can be constructed:
    order_by('keyword_scores.0x1' => :desc, 'keyword_scores.0x2' => :desc)

  • the data structure above allows both filtering for index items that have certain tokens and ordering them based on global, keyword, and fulltext score without resorting to map-reduce: a b-tree index can be used.

  • all tokens have their text/stem stored exactly once. aka: we do not store 'hugewords' all over the place but store it once and count occurrences of it to keep the total index much smaller
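
to make the hash key point concrete, here is a small in-memory sketch of filtering and ordering index items by per-token keyword scores. the items array and the plain ruby sort are assumptions that stand in for mongo documents and a real order_by. only the shape of the sort spec is taken from the notes above.

```ruby
# hypothetical index items: in the library these would be mongo documents
items = [
  { 'model_id' => 'a', 'keyword_scores' => { '0x1' => 1 } },
  { 'model_id' => 'b', 'keyword_scores' => { '0x1' => 2, '0x2' => 1 } },
  { 'model_id' => 'c', 'keyword_scores' => { '0x2' => 3 } },
  { 'model_id' => 'd', 'keyword_scores' => { '0x3' => 9 } }
]

token_ids = %w[0x1 0x2]

# because token ids are plain hex strings they slot straight into dotted
# field paths, giving a sort spec of the shape order_by expects
sort_spec = token_ids.to_h { |id| ["keyword_scores.#{id}", :desc] }

# emulate a union (or) filter: keep items that mention any of the tokens
matches = items.select { |item| (item['keyword_scores'].keys & token_ids).any? }

# emulate the descending sort: compare on the first token's score, then the
# second, treating a missing score as 0
ordered = matches.sort_by do |item|
  token_ids.map { |id| -item['keyword_scores'].fetch(id, 0) }
end

ordered.map { |item| item['model_id'] } #=> ['b', 'a', 'c']
```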

pulling objects back out in a search involves these logical steps:

  • filter the search terms through the same tokenizer as when indexed

  • look up tokens for each of the tokens in the search string

  • using the count for each token, plus the global token count that has been tracked, we can decide to order the results by relatively rare words first and, all else being equal (same rarity bin: 0.10, 0.20, 0.30, etc.), the order in which the user typed the words

  • this approach applies whether we are doing a union (or) or intersection (all) search and regardless of whether facets are included in the search. facets, however, never affect the order unless done so by the user manually. eg


  results =
    Mongoid::Haystack.
      search('foo bar', :facets => {:hotness.gte => 11}).
        order_by('facets.hotness' => :desc)
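
the 'rare words first, then typed order' rule from the steps above can be sketched like this. the binning formula (share of all tokens, rounded up to the next 0.10) is an assumption made for illustration, as are the names rarity_bin and order_search_tokens and the counts themselves. the library may compute its bins differently.

```ruby
# hypothetical per-token counts, as tracked on the token documents
token_counts = { 'the' => 500, 'dog' => 240, 'cat' => 210, 'axolotl' => 50 }
total = token_counts.values.sum # global token count

# assumed binning: a token's share of all tokens, rounded up to a 0.10 step
def rarity_bin(count, total)
  (count * 10.0 / total).ceil / 10.0
end

# order search tokens by rarity bin (rarest first), breaking ties within a
# bin by the position at which the user typed the word
def order_search_tokens(tokens, counts, total)
  ranked = tokens.each_with_index.sort_by do |token, position|
    [rarity_bin(counts.fetch(token, 0), total), position]
  end
  ranked.map(&:first)
end

# 'dog' and 'cat' land in the same bin, so typed order decides between them
order_search_tokens(%w[dog the cat axolotl], token_counts, total)
#=> ['axolotl', 'dog', 'cat', 'the']

order_search_tokens(%w[cat the dog axolotl], token_counts, total)
#=> ['axolotl', 'cat', 'dog', 'the']
```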

SEE ALSO

tests: ./test/mongoid-haystack_test.rb