fastText Ruby

fastText - efficient text classification and representation learning - for Ruby

Installation

Add this line to your application’s Gemfile:

gem 'fasttext'

Getting Started

fastText has two primary use cases: text classification and word representations.

Text Classification

Prep your data

# documents
x = [
  "text from document one",
  "text from document two",
  "text from document three"
]

# labels
y = ["ham", "ham", "spam"]

Use an array if a document has multiple labels

Train a model

model = FastText::Classifier.new
model.fit(x, y)

Get predictions

model.predict(x)

Save the model to a file

model.save_model("model.bin")

Load the model from a file

model = FastText.load_model("model.bin")

Evaluate the model

model.test(x_test, y_test)
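
The `x_test` and `y_test` arrays above are not defined elsewhere in this guide. One way to obtain them is a simple holdout split. The sketch below is plain Ruby and makes no calls into the gem; `train_test_split`, `test_ratio`, and `seed` are hypothetical names chosen for illustration, and the sample labels are made up.

```ruby
# Hypothetical helper (not part of the fasttext gem): split documents and
# labels into training and test portions while keeping each pair aligned.
def train_test_split(x, y, test_ratio: 0.2, seed: 42)
  raise ArgumentError, "x and y must have the same length" unless x.size == y.size

  # Shuffle the indices with a seeded generator so the split is reproducible
  indices = (0...x.size).to_a.shuffle(random: Random.new(seed))
  test_size = (x.size * test_ratio).round
  test_idx = indices.first(test_size)
  train_idx = indices.drop(test_size)

  [
    x.values_at(*train_idx), y.values_at(*train_idx),
    x.values_at(*test_idx), y.values_at(*test_idx)
  ]
end

x = [
  "text from document one",
  "text from document two",
  "text from document three",
  "text from document four",
  "text from document five"
]
y = ["ham", "ham", "spam", "ham", "spam"]

x_train, y_train, x_test, y_test = train_test_split(x, y)
puts "train: #{x_train.size}, test: #{x_test.size}"
```

The training portion can then go to `model.fit(x_train, y_train)` and the held-out portion to `model.test(x_test, y_test)`.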

Get words and labels

model.words
model.labels

Use include_freq: true to get their frequency

Search for the best hyperparameters

model.fit(x, y, autotune_set: [x_valid, y_valid])

Compress the model - significantly reduces size but sacrifices a little accuracy

model.quantize
model.save_model("model.ftz")

Word Representations

Prep your data

x = [
  "text from document one",
  "text from document two",
  "text from document three"
]

Train a model

model = FastText::Vectorizer.new
model.fit(x)

Get nearest neighbors

model.nearest_neighbors("asparagus")

Get analogies

model.analogies("berlin", "germany", "france")

Get a word vector

model.word_vector("carrot")

Get a sentence vector

model.sentence_vector("sentence text")
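
A common use for these vectors is comparing two of them by cosine similarity. The sketch below is plain Ruby and assumes the vectors are arrays of numbers; `cosine_similarity` is a hypothetical helper, not part of the gem, and the sample vectors are made-up stand-ins rather than real model output.

```ruby
# Hypothetical helper (not part of the fasttext gem): cosine similarity
# between two vectors given as arrays of numbers.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.size == b.size

  dot = a.zip(b).sum { |p, q| p * q }
  norm_a = Math.sqrt(a.sum { |v| v * v })
  norm_b = Math.sqrt(b.sum { |v| v * v })
  # A zero vector has no direction, so treat its similarity as 0.0
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Stand-in vectors; in practice these would come from calls such as
# model.word_vector("carrot")
carrot = [0.2, 0.1, -0.4]
parsnip = [0.4, 0.2, -0.8]

puts cosine_similarity(carrot, parsnip).round(3)
```

Values close to 1 mean the two vectors point in nearly the same direction, and values near 0 mean they are unrelated.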

Get words

model.words

Save the model to a file

model.save_model("model.bin")

Load the model from a file

model = FastText.load_model("model.bin")

Use continuous bag-of-words

model = FastText::Vectorizer.new(model: "cbow")

Parameters

Text classification

FastText::Classifier.new(
  lr: 0.1,                    # learning rate
  dim: 100,                   # size of word vectors
  ws: 5,                      # size of the context window
  epoch: 5,                   # number of epochs
  min_count: 1,               # minimal number of word occurrences
  min_count_label: 1,         # minimal number of label occurrences
  minn: 0,                    # min length of char ngram
  maxn: 0,                    # max length of char ngram
  neg: 5,                     # number of negatives sampled
  word_ngrams: 1,             # max length of word ngram
  loss: "softmax",            # loss function {ns, hs, softmax, ova}
  bucket: 2000000,            # number of buckets
  thread: 3,                  # number of threads
  lr_update_rate: 100,        # change the rate of updates for the learning rate
  t: 0.0001,                  # sampling threshold
  label_prefix: "__label__",  # label prefix
  verbose: 2,                 # verbosity level
  pretrained_vectors: nil,    # pretrained word vectors (.vec file)
  autotune_metric: "f1",      # autotune optimization metric
  autotune_predictions: 1,    # number of predictions used for autotune evaluation
  autotune_duration: 300,     # autotune search time in seconds
  autotune_model_size: nil    # autotune model size, like 2M
)

Word representations

FastText::Vectorizer.new(
  model: "skipgram",          # unsupervised fasttext model {cbow, skipgram}
  lr: 0.05,                   # learning rate
  dim: 100,                   # size of word vectors
  ws: 5,                      # size of the context window
  epoch: 5,                   # number of epochs
  min_count: 5,               # minimal number of word occurrences
  minn: 3,                    # min length of char ngram
  maxn: 6,                    # max length of char ngram
  neg: 5,                     # number of negatives sampled
  word_ngrams: 1,             # max length of word ngram
  loss: "ns",                 # loss function {ns, hs, softmax, ova}
  bucket: 2000000,            # number of buckets
  thread: 3,                  # number of threads
  lr_update_rate: 100,        # change the rate of updates for the learning rate
  t: 0.0001,                  # sampling threshold
  verbose: 2                  # verbosity level
)

Input Files

Input can be read directly from files

model.fit("train.txt", autotune_set: "valid.txt")
model.test("test.txt")

Each line should be a document

text from document one
text from document two
text from document three

For text classification, lines should start with a list of labels prefixed with __label__

__label__ham text from document one
__label__ham text from document two
__label__spam text from document three
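
If the documents and labels start out as Ruby arrays, the labeled lines shown above can be generated rather than written by hand. The sketch below is plain Ruby with no calls into the gem; `to_labeled_lines` is a hypothetical helper, and collapsing whitespace inside a document is an assumption made here so that each document stays on one line.

```ruby
# Hypothetical helper (not part of the fasttext gem): format documents and
# labels as lines in the labeled input format described above.
def to_labeled_lines(x, y, label_prefix: "__label__")
  x.zip(y).map do |text, labels|
    # Array() lets a document carry a single label or an array of labels
    prefixed = Array(labels).map { |label| "#{label_prefix}#{label}" }
    # Each document must fit on one line, so collapse newlines and extra spaces
    (prefixed + [text.gsub(/\s+/, " ").strip]).join(" ")
  end
end

x = [
  "text from document one",
  "text from document two",
  "text from document three"
]
y = ["ham", "ham", "spam"]

lines = to_labeled_lines(x, y)
puts lines
```

Writing the result with `File.write("train.txt", lines.join("\n") + "\n")` would produce a file in the layout shown above.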

Pretrained Models

There are a number of pretrained models you can download

Language Identification

Download one of the pretrained models and load it

model = FastText.load_model("lid.176.ftz")

Get language predictions

model.predict("bon appétit")

History

View the changelog

Contributing

Everyone is encouraged to help improve this project, whether by reporting bugs, fixing bugs and submitting pull requests, improving the documentation, or suggesting new features.

To get started with development:

git clone --recursive https://github.com/ankane/fastText-ruby.git
cd fastText-ruby
bundle install
bundle exec rake compile
bundle exec rake test