tomoto

:tomato: tomoto - high performance topic modeling - for Ruby


Installation

Add this line to your application’s Gemfile:

gem 'tomoto'

It can take 10-20 minutes to compile the extension.

Getting Started

Train a model

model = Tomoto::LDA.new(k: 2)
model.add_doc("text from document one")
model.add_doc("text from document two")
model.add_doc("text from document three")
model.train(100) # iterations

Get the summary

model.summary

Get topic words

model.topic_words

Save the model to a file

model.save("model.bin")

Load the model from a file

model = Tomoto::LDA.load("model.bin")

Get topic probabilities for a document

doc = model.docs[0]
doc.topics

Get the number of words for each topic

model.count_by_topics

Get the vocab

model.vocabs

Get the log likelihood per word

model.ll_per_word

Perform inference for unseen documents

doc = model.make_doc("unseen doc")
topic_dist, ll = model.infer(doc)
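
To act on an inference result, it can help to rank the topics by probability. The following is a minimal sketch, assuming `topic_dist` is an array of per-topic probabilities indexed by topic id (as in tomotopy). `rank_topics` is a hypothetical helper, not part of tomoto, and the sample distribution is made up for illustration.

```ruby
# Hypothetical helper (not part of tomoto): rank topic ids by probability.
# Assumes topic_dist is an Array of Floats indexed by topic id.
def rank_topics(topic_dist, top_n: 3)
  topic_dist
    .each_with_index
    .sort_by { |prob, _topic_id| -prob }
    .first(top_n)
    .map { |prob, topic_id| [topic_id, prob] }
end

# Made-up distribution standing in for the first value returned by model.infer
sample_dist = [0.1, 0.7, 0.2]
ranked = rank_topics(sample_dist, top_n: 2)
ranked.each { |topic_id, prob| puts "topic #{topic_id}: #{prob.round(3)}" }
```

With the gem installed, `sample_dist` would be replaced by the `topic_dist` value from `model.infer(doc)`.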

Models

Supports:

  • Latent Dirichlet Allocation (LDA)
  • Labeled LDA (LLDA)
  • Partially Labeled LDA (PLDA)
  • Supervised LDA (SLDA)
  • Dirichlet Multinomial Regression (DMR)
  • Generalized Dirichlet Multinomial Regression (GDMR)
  • Hierarchical Dirichlet Process (HDP)
  • Hierarchical LDA (HLDA)
  • Multi Grain LDA (MGLDA)
  • Pachinko Allocation (PA)
  • Hierarchical PA (HPA)
  • Correlated Topic Model (CT)
  • Dynamic Topic Model (DT)

API

This library follows the tomotopy API. There are a few changes to make it more Ruby-like:

  • The get_ prefix has been removed from methods (topic_words instead of get_topic_words)
  • Methods that return booleans use ? instead of is_ (live_topic? instead of is_live_topic)

If a method or option you need isn’t supported, feel free to open an issue.
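
When porting tomotopy code, the two renaming rules listed above can be applied mechanically. The following is a rough sketch: `ruby_method_name` is a hypothetical helper, not part of tomoto, and it only guesses a name. It does not check whether tomoto actually implements the method.

```ruby
# Hypothetical helper (not part of tomoto): apply the two renaming rules
# to a tomotopy method name to guess its Ruby counterpart.
def ruby_method_name(tomotopy_name)
  case tomotopy_name
  when /\Aget_(.+)\z/ then Regexp.last_match(1)        # get_topic_words -> topic_words
  when /\Ais_(.+)\z/  then "#{Regexp.last_match(1)}?"  # is_live_topic -> live_topic?
  else tomotopy_name                                   # other names assumed unchanged
  end
end

puts ruby_method_name("get_topic_words")
puts ruby_method_name("is_live_topic")
```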

Examples

Tokenization

Documents are tokenized on whitespace by default. To use your own tokenization, pass an array of tokens instead of a string:

model.add_doc(["tokens", "from", "document", "one"])
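
As an illustration of custom tokenization, the sketch below lowercases the text, keeps runs of letters, digits, and apostrophes, and drops a few stopwords. `tokenize` and its stopword list are hypothetical and not part of tomoto. Real preprocessing will depend on your corpus and language.

```ruby
# Hypothetical tokenizer (not part of tomoto): lowercase the text, keep runs
# of letters/digits/apostrophes, and drop a small illustrative stopword list.
def tokenize(text, stopwords: %w[a an and the of from])
  text.downcase.scan(/[a-z0-9']+/).reject { |token| stopwords.include?(token) }
end

tokens = tokenize("Text from the Document, one!")
p tokens

# With the gem installed, the tokens could then be passed to add_doc:
#   model.add_doc(tokens)
```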

Performance

tomoto uses AVX2, AVX, or SSE2 instructions to increase performance on machines that support them. Check which instruction set architecture it’s using with:

Tomoto.isa

Parallelism

Choose a parallelism algorithm with:

model.train(parallel: :partition)

Supported values are :default, :none, :copy_merge, and :partition.
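
If the parallel value comes from configuration or user input, it may be worth validating it before training. The following is a minimal sketch: `checked_parallel` is a hypothetical helper, not part of tomoto, and its list simply mirrors the values named above. The commented call at the end assumes the iteration count and `parallel:` can be passed together.

```ruby
# Hypothetical helper (not part of tomoto): reject unsupported parallel
# values before they reach train. The list mirrors the values named above.
def checked_parallel(value)
  supported = %i[default none copy_merge partition]
  unless supported.include?(value)
    raise ArgumentError, "unsupported parallel value: #{value.inspect}"
  end
  value
end

puts checked_parallel(:partition)

# With the gem installed, this could guard a call such as (assumed usage):
#   model.train(100, parallel: checked_parallel(:partition))
```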

History

View the changelog

Contributing

Everyone is encouraged to help improve this project.

To get started with development:

git clone --recursive https://github.com/ankane/tomoto.git
cd tomoto
bundle install
bundle exec rake compile
bundle exec rake test