ActiveRecord support for ts_vectors
This small library extends ActiveRecord to support PostgreSQL's tsvector type. PostgreSQL's tsvector stores an array of keywords that can be efficiently queried. The tsvector type is therefore particularly suitable for storing tags, since it's possible to store all tags in a single column instead of a table.
Requirements
- ActiveRecord
- PostgreSQL with
tsvectorsupport. (This has been included out of the box since 8.3.)
Usage
Add gem to your Gemfile:
gem 'ar-tsvectors', :require => 'activerecord_tsvectors'
To test, add a text column to your table in a migration:
add_column :posts, :tags, :text
Then extend your ActiveRecord model with a declaration:
class Post < ActiveRecord::Base
ts_vector :tags
end
Now you can assign tags:
post = Post.new
post. = ['pizza', 'pepperoni', 'Italian', 'mushrooms']
post.save
You can now query based on the tags:
Post.('pizza')
Post.(['pizza', 'pepperoni'])
For queries to perform well, you must add an index on the column. This is slightly more complicated; again, in a migration:
execute("create index index_posts_on_tags on posts using gin(to_tsvector('simple', tags))")
Options
The ts_vector mixin takes the following options:
:configuration- the name of the PostgreSQL text configuration to use. Defaults tosimple, which is appropriate for keyword-type indexing.:format- the format of the column, must be either:text(the default) or:tsvector.:normalize- a proc that overrides the default normalization. It must take a single string and return a normalized copy of the string.
Methods
Declaring a tsvector column tags will dynamically add the following methods:
with_all_tags(t)- returns a scope that searches fort, which may be either a single value or an array of values. Only rows matching all values will be returned.without_all_tags(t)- returns a scope that excludest, which may be either a single value or an array of values. Only rows matching all values will be excluded.with_any_tags(t)- returns a scope that searches fort, which may be either a single value or an array of values. Rows matching at least one of the provided values will be returned.without_any_tags(t)- returns a scope that excludest, which may be either a single value or an array of values. Rows matching at least one of the provided values will be excluded.order_by_tags_rank(t, direction = 'DESC')- returns a scope that orders the rows by the rank, ie. a score computed based on the overlap oftand the row's value. See below for an example.
Ranking
It's trivial to rank results according to how many values match a given row:
@posts = Post.(["pizza", "pepperoni"]).
(["pizza", "pepperoni"])
This orders the rows such that rows matching both pizza and pepperoni will be ordered above rows matching either pizza or pepperoni but not both.
Normalization
The default behaviour is to downcase and strip leading and trailing spaces from assigned values. By default, matches are case-sensitive; so something tagged with PIZZA will not match a search for pizza. To ensure case insensitivity, you can provide a normalization function, like so:
class Post < ActiveRecord::Base
ts_vector :tags, :normalize => lambda { |v| v.downcase }
end
You can also use this to strip unwanted characters:
class Post < ActiveRecord::Base
ts_vector :tags, :normalize => lambda { |v| v.downcase.gsub(/[^\w]/, '') }
end
The values are normalized both when performing queries, and when assigning new values:
post.tags = [' WTF#$%^ &*??!']
post.tags
=> ['wtf']
Using ts_vector directly
It's possible to use the tsvector type for the column directly. In other words, in your migration would have something like:
add_column :posts, :tags, :tsvector
This is supported by supplying a format option to the ts_vector declaration in your model:
class Post < ActiveRecord::Base
ts_vector :tags, :format => :tsvector
end
When you do this, the way you define a table index is slightly different:
execute("create index index_posts_on_tags on posts using gin(tags)")
While using a tsvector column works, it has some drawbacks:
The
tsvectortype is lossy. When you assign values to atsvector, the text will be tokenized by PostgreSQL's text parser, which will remove all punctuation and quotes from the text. In this system, a quoted tag such as"New York"will be split into two different tagsNewandYork, and it's no longer possible to search forNew Yorkas a distinct tag.Due to a limitation in ActiveRecord, stored column values (on
INSERTandUPDATE) are passed to PostgreSQL as strings, and are therefore not normalized using the text configuration's rules. This means that if you override the normalization function, you must make sure you always strip and downcase in addition to whatever other normalization you do, otherwise queries will potentially not match all rows. A forthcoming version of ActiveRecord will provide the plumbing that will allow us to solve this issue.