tomoto

🍅 tomoto - high-performance topic modeling - for Ruby

Installation

Add this line to your application’s Gemfile:

gem 'tomoto'

It can take 10-20 minutes to compile the extension.

Getting Started

Train a model

require "tomoto" # not needed if Bundler already loads the gem
model = Tomoto::LDA.new(k: 2) # k is the number of topics
model.add_doc("text from document one")
model.add_doc("text from document two")
model.add_doc("text from document three")
model.train(100) # iterations

Get the summary

model.summary

Get topic words

model.topic_words

Save the model to a file

model.save("model.bin")

Load the model from a file

model = Tomoto::LDA.load("model.bin")

Get topic probabilities for a document

doc = model.docs[0]
doc.topics

Get the number of words for each topic

model.count_by_topics

Get the vocab

model.vocabs

Get the log likelihood per word

model.ll_per_word
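
The per-word log likelihood is often easier to compare across models as perplexity, which is conventionally defined as the exponential of the negative per-word log likelihood. A minimal sketch of that conversion follows; the perplexity helper and the sample value are illustrative assumptions and are not taken from this library.

```ruby
# Hypothetical helper (not part of tomoto): converts a per-word log
# likelihood into perplexity using the conventional definition
# perplexity = exp(-ll_per_word). Lower perplexity indicates a better fit.
def perplexity(ll_per_word)
  Math.exp(-ll_per_word)
end

# Made-up value standing in for model.ll_per_word
sample_ll = -Math.log(8)
puts perplexity(sample_ll)
```

With a trained model, the same helper could be called as perplexity(model.ll_per_word).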

Perform inference for unseen documents

doc = model.make_doc("unseen doc")
topic_dist, ll = model.infer(doc)
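
Once a topic distribution is available, a common next step is to pick out the most probable topics. The sketch below assumes topic_dist is an array with one probability per topic, indexed by topic ID; that shape and the top_topics helper are assumptions made for illustration, not documented behavior.

```ruby
# Hypothetical helper (not part of tomoto): returns the n most probable
# topics as [topic_id, probability] pairs, highest probability first.
# Assumes topic_dist is an Array of Floats indexed by topic ID.
def top_topics(topic_dist, n = 3)
  topic_dist.each_with_index
            .map { |prob, topic_id| [topic_id, prob] }
            .max_by(n) { |_, prob| prob }
end

# Made-up distribution standing in for the first value returned by model.infer
sample_dist = [0.1, 0.6, 0.3]
p top_topics(sample_dist, 2) # => [[1, 0.6], [2, 0.3]]
```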

Models

Supports:

Latent Dirichlet Allocation (LDA)
Labeled LDA (LLDA)
Partially Labeled LDA (PLDA)
Supervised LDA (SLDA)
Dirichlet Multinomial Regression (DMR)
Generalized Dirichlet Multinomial Regression (GDMR)
Hierarchical Dirichlet Process (HDP)
Hierarchical LDA (HLDA)
Multi Grain LDA (MGLDA)
Pachinko Allocation (PA)
Hierarchical PA (HPA)
Correlated Topic Model (CT)
Dynamic Topic Model (DT)

API

This library follows the tomotopy API. There are a few changes to make it more Ruby-like:

The get_ prefix has been removed from methods (topic_words instead of get_topic_words)
Methods that return booleans use ? instead of is_ (live_topic? instead of is_live_topic)
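
The two renaming rules above can be expressed as a small mapping. This is only an illustration of the convention; the ruby_method_name helper is hypothetical and is not part of the library, and individual methods may differ from what the rules predict.

```ruby
# Hypothetical helper (not part of tomoto): applies the two renaming
# rules to a tomotopy (Python) method name and returns the Ruby-style name.
def ruby_method_name(python_name)
  case python_name
  when /\Aget_(.+)\z/ then Regexp.last_match(1)        # drop the get_ prefix
  when /\Ais_(.+)\z/  then "#{Regexp.last_match(1)}?"  # is_ becomes a trailing ?
  else python_name                                     # other names are unchanged
  end
end

puts ruby_method_name("get_topic_words") # prints "topic_words"
puts ruby_method_name("is_live_topic")   # prints "live_topic?"
```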

If a method or option you need isn’t supported, feel free to open an issue.

Examples

Tokenization

Documents are tokenized on whitespace by default. To use your own tokenization, pass an array of tokens instead of a string:

model.add_doc(["tokens", "from", "document", "one"])
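
As a sketch of what custom tokenization might look like, the following plain-Ruby helper lowercases the text, strips punctuation, and removes stop words before the tokens are handed to the model. The tokenize helper and the stop word list are hypothetical, and the pattern only handles ASCII text.

```ruby
# Hypothetical stop word list for illustration; choose one suited to your corpus.
STOP_WORDS = %w[a an and of the to].freeze

# Hypothetical helper (not part of tomoto): lowercases the text, splits it
# on anything that is not an ASCII letter, digit, or apostrophe, and drops
# empty strings and stop words.
def tokenize(text)
  text.downcase
      .split(/[^a-z0-9']+/)
      .reject { |token| token.empty? || STOP_WORDS.include?(token) }
end

tokens = tokenize("The quick brown fox, and the lazy dog!")
p tokens # => ["quick", "brown", "fox", "lazy", "dog"]

# The resulting array can be passed in place of a string:
# model.add_doc(tokens)
```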

Performance

tomoto uses AVX2, AVX, or SSE2 instructions to increase performance on machines that support them. Check which instruction set architecture it’s using with:

Tomoto.isa

Parallelism

Choose a parallelism algorithm with:

model.train(parallel: :partition)

Supported values are :default, :none, :copy_merge, and :partition.

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

Report bugs
Fix bugs and submit pull requests
Write, clarify, or fix documentation
Suggest or add new features

To get started with development:

git clone --recursive https://github.com/ankane/tomoto.git
cd tomoto
bundle install
bundle exec rake compile
bundle exec rake test