tomoto
🍅 tomoto - high performance topic modeling - for Ruby
Installation
Add this line to your application’s Gemfile:
gem 'tomoto'
It can take 10-20 minutes to compile the extension.
Getting Started
Train a model
model = Tomoto::LDA.new(k: 2)
model.add_doc("text from document one")
model.add_doc("text from document two")
model.add_doc("text from document three")
model.train(100) # iterations
Get the summary
model.summary
Get topic words
model.topic_words
Save the model to a file
model.save("model.bin")
Load the model from a file
model = Tomoto::LDA.load("model.bin")
Get topic probabilities for a document
doc = model.docs[0]
doc.topics
Get the number of words for each topic
model.count_by_topics
Get the vocab
model.vocabs
Get the log likelihood per word
model.ll_per_word
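The log likelihood per word is often easier to compare across models when converted to perplexity, which is the exponential of its negation. The sketch below assumes ll_per_word returns a Float; the perplexity helper is hypothetical and not part of this library.

```ruby
# Hypothetical helper (not part of tomoto): converts a per-word
# log likelihood into perplexity. Lower perplexity is better.
def perplexity(ll_per_word)
  Math.exp(-ll_per_word)
end

# With a trained model this would be called as:
#   perplexity(model.ll_per_word)
puts perplexity(0.0) # a per-word log likelihood of 0 gives a perplexity of 1.0
```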
Perform inference for unseen documents
doc = model.make_doc("unseen doc")
topic_dist, ll = model.infer(doc)
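A common next step is to pick the most probable topics from the inferred distribution. The sketch below assumes topic_dist is an array holding one probability per topic, indexed by topic id; top_topics is a hypothetical helper, and the sample distribution is made up for illustration.

```ruby
# Hypothetical helper (not part of tomoto): returns the n most probable
# topics as [topic_id, probability] pairs, most probable first.
# Assumes topic_dist is an Array with one probability per topic.
def top_topics(topic_dist, n = 1)
  topic_dist.each_with_index
            .map { |prob, topic_id| [topic_id, prob] }
            .max_by(n) { |_topic_id, prob| prob }
end

# Made-up distribution standing in for the first value returned by model.infer
example_dist = [0.2, 0.7, 0.1]

top_topics(example_dist)    # => [[1, 0.7]]
top_topics(example_dist, 2) # => [[1, 0.7], [0, 0.2]]
```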
Models
Supports:
- Latent Dirichlet Allocation (LDA)
- Labeled LDA (LLDA)
- Partially Labeled LDA (PLDA)
- Supervised LDA (SLDA)
- Dirichlet Multinomial Regression (DMR)
- Generalized Dirichlet Multinomial Regression (GDMR)
- Hierarchical Dirichlet Process (HDP)
- Hierarchical LDA (HLDA)
- Multi Grain LDA (MGLDA)
- Pachinko Allocation (PA)
- Hierarchical PA (HPA)
- Correlated Topic Model (CT)
- Dynamic Topic Model (DT)
API
This library follows the tomotopy API. There are a few changes to make it more Ruby-like:
- The get_ prefix has been removed from methods (topic_words instead of get_topic_words)
- Methods that return booleans use ? instead of is_ (live_topic? instead of is_live_topic)
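When reading the tomotopy documentation, the two renaming rules above can be applied mechanically. The helper below is a hypothetical illustration of those rules only; it is not part of this library and does not check whether the resulting method exists.

```ruby
# Hypothetical helper (not part of tomoto): maps a tomotopy-style method
# name to the Ruby-style name implied by the two rules above.
def rubyize_method_name(python_name)
  if python_name.start_with?("get_")
    # Rule 1: drop the get_ prefix
    python_name.delete_prefix("get_")
  elsif python_name.start_with?("is_")
    # Rule 2: drop the is_ prefix and append ?
    "#{python_name.delete_prefix("is_")}?"
  else
    python_name
  end
end

rubyize_method_name("get_topic_words") # => "topic_words"
rubyize_method_name("is_live_topic")   # => "live_topic?"
rubyize_method_name("train")           # => "train"
```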
If a method or option you need isn’t supported, feel free to open an issue.
Examples
Tokenization
Documents are tokenized by whitespace by default, or you can perform your own tokenization.
model.add_doc(["tokens", "from", "document", "one"])
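As a rough sketch of custom tokenization, the snippet below lowercases the text, strips punctuation, and drops a few stopwords before the tokens are handed to add_doc. The tokenize helper and the stopword list are hypothetical and only handle ASCII text; a real project would likely use a proper tokenizer.

```ruby
# Hypothetical stopword list, for illustration only
STOPWORDS = %w[a an and from of the to].freeze

# Hypothetical helper (not part of tomoto): lowercases the text, keeps runs
# of ASCII letters, digits, and apostrophes, and removes stopwords.
def tokenize(text)
  text.downcase.scan(/[a-z0-9']+/).reject { |token| STOPWORDS.include?(token) }
end

tokens = tokenize("Text from Document One, with punctuation!")
# tokens is ["text", "document", "one", "with", "punctuation"]

# With a model created as in Getting Started, the tokens would be added with:
#   model.add_doc(tokens)
```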
Performance
tomoto uses AVX2, AVX, or SSE2 instructions to increase performance on machines that support them. Check which instruction set architecture it’s using with:
Tomoto.isa
Parallelism
Choose a parallelism algorithm with:
model.train(parallel: :partition)
Supported values are :default, :none, :copy_merge, and :partition.
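If the algorithm comes from configuration or user input, it may help to validate it against the supported values before training. The helper and constant below are hypothetical and not part of this library; they only mirror the list above.

```ruby
# Values listed above as supported by the parallel option
SUPPORTED_PARALLEL = %i[default none copy_merge partition].freeze

# Hypothetical helper (not part of tomoto): converts a string or symbol to
# one of the supported symbols, raising ArgumentError for anything else.
def parallel_option(value)
  sym = value.to_sym
  unless SUPPORTED_PARALLEL.include?(sym)
    raise ArgumentError, "unsupported parallel value: #{value.inspect}"
  end
  sym
end

parallel_option("partition") # => :partition

# With a model created as in Getting Started:
#   model.train(100, parallel: parallel_option("partition"))
```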
History
View the changelog
Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone --recursive https://github.com/ankane/tomoto.git
cd tomoto
bundle install
bundle exec rake compile
bundle exec rake test