Ruby Vector Space Model (VSM) with tf*idf weights

Calculates the similarity between texts using a bag-of-words Vector Space Model with Term Frequency-Inverse Document Frequency weights. If your use case demands performance, use Lucene (or similar), which also implements other information retrieval functions like BM25.

Usage

require 'tf-idf-similarity'

corpus = TfIdfSimilarity::Collection.new
corpus << TfIdfSimilarity::Document.new("Lorem ipsum dolor sit amet...")
corpus << TfIdfSimilarity::Document.new("Pellentesque sed ipsum dui...")
corpus << TfIdfSimilarity::Document.new("Nam scelerisque dui sed leo...")

p corpus.similarity_matrix

This gem will use the gsl gem if available, for faster matrix multiplication.
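To illustrate why matrix multiplication matters here, the following is a minimal sketch in plain Ruby, not the gem's own code. It assumes the similarity matrix is the product of the transposed term-document matrix with itself, and that each document column has already been scaled to unit length, so every entry is a cosine similarity. The method name `cosine_similarity_matrix` and the sample weights are hypothetical.

```ruby
# Multiply the transpose of a term-document matrix by the matrix itself.
# Rows are terms, columns are documents. Entry [i][j] of the result is the
# dot product of document columns i and j, which equals their cosine
# similarity when the columns are unit length (assumed here).
def cosine_similarity_matrix(term_document)
  documents = term_document.transpose
  documents.map do |a|
    documents.map do |b|
      a.zip(b).sum { |x, y| x * y }
    end
  end
end

# Hypothetical weights: two terms, three documents, unit-length columns.
weights = [
  [0.6, 1.0, 0.0],
  [0.8, 0.0, 1.0],
]

cosine_similarity_matrix(weights).each { |row| p row }
```

The result is symmetric, with values near 1.0 on the diagonal because each document is identical to itself. The nested loops do the same work a matrix library performs in optimized native code.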

Optimizations

GNU Scientific Library (GSL)

The latest gsl gem (1.14.7) is not compatible with the gsl package (1.15) in Homebrew. To install the older gsl package instead, check out an earlier Homebrew commit before installing:

cd /usr/local
git checkout -b gsl-1.14 83ed49411f076e30ced04c2cbebb054b2645a431
brew install gsl
git checkout master
git branch -d gsl-1.14
gem install gsl

Automatically Tuned Linear Algebra Software (ATLAS)

You may know this software through Linear Algebra PACKage (LAPACK) or Basic Linear Algebra Subprograms (BLAS).

The nmatrix gem (0.0.1) can't find the cblas.h and clapack.h header files. Either set the C_INCLUDE_PATH:

export C_INCLUDE_PATH=/System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/

Or create links before installing the gem:

sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/cblas.h /usr/include/cblas.h
sudo ln -s /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers/clapack.h /usr/include/clapack.h

Version 0.0.2 of the nmatrix gem doesn't compile on Mac OS X Lion.

Other Considerations

The narray and nmatrix gems have no method to calculate the magnitude of a vector. Ruby-LAPACK is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme. Linalg and RNum are old and not available as gems.
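For reference, the magnitude (Euclidean length) of a vector is the square root of the sum of its squared components. A minimal sketch on plain Ruby arrays follows. The helper names `magnitude` and `normalize` are hypothetical and are not part of this gem, narray, or nmatrix.

```ruby
# Euclidean magnitude (L2 norm) of a vector stored as a plain Array.
def magnitude(vector)
  Math.sqrt(vector.sum { |x| x * x })
end

# Scale a vector to unit length. A zero vector is returned unchanged
# to avoid dividing by zero.
def normalize(vector)
  length = magnitude(vector)
  length.zero? ? vector : vector.map { |x| x / length }
end

p magnitude([3.0, 4.0])
p normalize([3.0, 4.0])
```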

Extras

You can access more term frequency, document frequency, and normalization formulas with:

require 'tf-idf-similarity/extras/collection'
require 'tf-idf-similarity/extras/document'

The default tf*idf formula follows the Lucene Conceptual Scoring Formula.
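As a rough illustration, Lucene's classic TF-IDF similarity documents the term frequency as the square root of the raw count and the inverse document frequency as 1 + ln(numDocs / (docFreq + 1)). The sketch below implements those two formulas in plain Ruby. The method names are hypothetical, and whether the gem matches these formulas in every detail is an assumption.

```ruby
# Term frequency: square root of the number of times the term occurs
# in the document.
def term_frequency(count)
  Math.sqrt(count)
end

# Inverse document frequency: 1 + ln(document_count / (documents_with_term + 1)).
def inverse_document_frequency(document_count, documents_with_term)
  1 + Math.log(document_count / (documents_with_term + 1).to_f)
end

# Weight of a term in a document: tf * idf.
def tf_idf(count, document_count, documents_with_term)
  term_frequency(count) *
    inverse_document_frequency(document_count, documents_with_term)
end

# A term occurring 4 times in a document, in a corpus of 3 documents,
# 2 of which contain the term.
p tf_idf(4, 3, 2)
```

Under these formulas a term that appears in fewer documents receives a larger weight, and repeated occurrences within one document raise the weight only by their square root.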

Reference

Bugs? Questions?

This gem's main repository is on GitHub: http://github.com/opennorth/tf-idf-similarity, where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.

Copyright (c) 2012 Open North Inc., released under the MIT license