Ruby Vector Space Model (VSM) with tf*idf weights
Calculates the similarity between texts using a bag-of-words Vector Space Model with Term Frequency-Inverse Document Frequency weights. If your use case demands performance, use Lucene or similar (see below).
Usage
require 'tf-idf-similarity'
corpus = TfIdfSimilarity::Collection.new
corpus << TfIdfSimilarity::Document.new("Lorem ipsum dolor sit amet...")
corpus << TfIdfSimilarity::Document.new("Pellentesque sed ipsum dui...")
corpus << TfIdfSimilarity::Document.new("Nam scelerisque dui sed leo...")
p corpus.similarity_matrix
This gem will use the gsl gem if available, for faster matrix multiplication.
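To see why matrix multiplication matters here, the following is a minimal sketch using only Ruby's standard matrix library. It assumes each column of a term-document matrix is a tf*idf vector already scaled to unit length, in which case multiplying the transposed matrix by the matrix itself gives every pairwise cosine similarity at once. The method similarity_matrix below is a hypothetical stand-alone helper for illustration, not this gem's implementation.

```ruby
require 'matrix'

# Hypothetical helper: takes a term-document matrix (one row per term,
# one column per document) whose columns are assumed to have unit length.
def similarity_matrix(term_document_matrix)
  # Entry (i, j) is the dot product of documents i and j, which equals
  # their cosine similarity when every column has length 1.
  term_document_matrix.transpose * term_document_matrix
end

# Made-up weights for two terms and three documents.
weights = Matrix[
  [0.6, 1.0, 0.0],
  [0.8, 0.0, 1.0]
]

p similarity_matrix(weights)
```

The result is a square, symmetric matrix with one row and one column per document. Libraries such as GSL speed up exactly this multiplication step.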
Optimizations
NArray
gem install narray
GNU Scientific Library (GSL)
The latest gsl gem (1.14.7) is not compatible with the gsl package (1.15) in Homebrew:
cd /usr/local
git checkout -b gsl-1.14 83ed49411f076e30ced04c2cbebb054b2645a431
brew install gsl
git checkout master
git branch -d gsl-1.14
gem install gsl
Automatically Tuned Linear Algebra Software (ATLAS)
You may know this software through Linear Algebra PACKage (LAPACK) or Basic Linear Algebra Subprograms (BLAS). You can use it through version 0.0.2 of the nmatrix gem. As of writing, 0.0.2 is not released, so follow these instructions to install it. You may need additional instructions for Mac OS X Lion.
Other Options
The nmatrix gem has no easy way to normalize all columns to unit vectors. Ruby-LAPACK is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. Linalg and RNum are old and not available as gems.
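For reference, normalizing all columns to unit vectors can be sketched with Ruby's standard matrix library alone. This is only an illustration of the operation in question; normalize_columns is a hypothetical helper and is not part of nmatrix, Ruby-LAPACK, or this gem.

```ruby
require 'matrix'

# Hypothetical helper: scale every column of a matrix to unit
# (Euclidean) length.
def normalize_columns(matrix)
  columns = matrix.column_vectors.map do |column|
    # Vector#normalize raises on a zero vector, so all-zero columns
    # are passed through unchanged.
    column.zero? ? column : column.normalize
  end
  Matrix.columns(columns.map(&:to_a))
end

# Made-up example: the first column has length 5, the second length 2.
p normalize_columns(Matrix[[3.0, 0.0], [4.0, 2.0]])
```

The standard library is much slower than native bindings on large matrices, so this is a readability aid rather than an optimization.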
Extras
You can access more term frequency, document frequency, and normalization formulas with:
require 'tf-idf-similarity/extras/collection'
require 'tf-idf-similarity/extras/document'
The default tf*idf formula follows the Lucene Conceptual Scoring Formula.
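As a rough sketch of what that formula looks like, the term frequency and inverse document frequency components documented for Lucene's classic TFIDFSimilarity are tf = sqrt(count) and idf = 1 + ln(N / (df + 1)), where N is the number of documents and df is the number of documents containing the term. The method names below are hypothetical and do not reflect this gem's API, and other Lucene versions may define idf slightly differently.

```ruby
# Term frequency component: the square root of the raw count.
def tf(count)
  Math.sqrt(count)
end

# Inverse document frequency component, using the natural logarithm.
# The 1.0 forces floating-point division.
def idf(document_count, document_frequency)
  1 + Math.log(document_count / (document_frequency + 1.0))
end

# Weight of a term in a document, before any vector normalization.
def tf_idf(count, document_count, document_frequency)
  tf(count) * idf(document_count, document_frequency)
end

# A term appearing 4 times in a document and in 2 of 3 documents overall.
p tf_idf(4, 3, 2)
```

The square root damps the effect of repeated terms, and the idf term gives rarer terms a larger weight.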
Why?
The treat, tf-idf, similarity and rsimilarity gems divide the frequency of a term in a document by the number of terms in that document (a weighting that, as far as I can tell, never occurs in the academic literature) and have no normalization component. The vss gem uses plain term and document frequencies, with no damping or normalization.
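The difference between these term frequency choices can be made concrete with a small sketch. The helpers below are hypothetical and written only for illustration; they are not taken from any of the gems named above. They compare a count divided by document length with a square-root damped count.

```ruby
# Count occurrences of each whitespace-separated term.
def term_counts(text)
  text.downcase.split.each_with_object(Hash.new(0)) do |term, counts|
    counts[term] += 1
  end
end

# Raw count divided by the total number of terms in the document.
def length_normalized_tf(counts, term)
  counts[term].to_f / counts.values.inject(0, :+)
end

# Square-root damping of the raw count, with no length normalization.
def damped_tf(counts, term)
  Math.sqrt(counts[term])
end

short = term_counts("dui dui sed leo")
long  = term_counts("dui dui sed leo dui dui sed leo")

p length_normalized_tf(short, "dui"), length_normalized_tf(long, "dui")
p damped_tf(short, "dui"), damped_tf(long, "dui")
```

Repeating the text leaves the length-normalized value unchanged, while the damped value grows with the count, which is then typically balanced by a separate vector normalization step.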
Reference
- G. Salton and C. Buckley. "Term Weighting Approaches in Automatic Text Retrieval." Technical Report. Cornell University, Ithaca, NY, USA. 1987.
- E. Chisholm and T. G. Kolda. "New term weighting formulas for the vector space method in information retrieval." Technical Report Number ORNL-TM-13756. Oak Ridge National Laboratory, Oak Ridge, TN, USA. 1999.
Further Reading
Lucene implements many more similarity functions, such as:
- a divergence from randomness (DFR) framework
- a framework for the family of information-based models
- a language model with Bayesian smoothing using Dirichlet priors
- a language model with Jelinek-Mercer smoothing
Lucene can even combine similarity measures.
Bugs? Questions?
This gem's main repository is on GitHub: http://github.com/opennorth/tf-idf-similarity, where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
Copyright (c) 2012 Open North Inc., released under the MIT license