fastText
fastText - efficient text classification and representation learning - for Ruby
Installation
Add this line to your application’s Gemfile:
```ruby
gem 'fasttext'
```
Getting Started
fastText has two primary use cases:

- text classification
- word representations
Text Classification
Prep your data
```ruby
# documents
x = ["text from document one", "text from document two", "text from document three"]

# labels
y = ["ham", "ham", "spam"]
```
Use an array if a document has multiple labels
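For example, a quick sketch (the "important" label is purely illustrative):

```ruby
# the second document has two labels
y = ["ham", ["ham", "important"], "spam"]
```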
Train a model
```ruby
model = FastText::Classifier.new
model.fit(x, y)
```
Get predictions
```ruby
model.predict(x)
```
Save the model to a file
```ruby
model.save_model("model.bin")
```
Load the model from a file
```ruby
model = FastText.load_model("model.bin")
```
Evaluate the model
```ruby
model.test(x_test, y_test)
```
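Here `x_test` and `y_test` are a held-out set in the same format as `x` and `y`, for example:

```ruby
# held-out documents and labels, same format as the training data
x_test = ["text from a new document"]
y_test = ["ham"]
```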
Get words and labels
```ruby
model.words
model.labels
```
Use `include_freq: true` to get their frequency
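For example, assuming the option is a keyword argument to these methods:

```ruby
model.words(include_freq: true)
model.labels(include_freq: true)
```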
Search for the best hyperparameters
```ruby
model.fit(x, y, autotune_set: [x_valid, y_valid])
```
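Here `x_valid` and `y_valid` are a validation set in the same format as the training data. The search can also be bounded with the `autotune_duration` parameter listed below, for example:

```ruby
# search for up to 10 minutes instead of the default 300 seconds
model = FastText::Classifier.new(autotune_duration: 600)
model.fit(x, y, autotune_set: [x_valid, y_valid])
```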
Compress the model - significantly reduces size but sacrifices a little performance
```ruby
model.quantize
model.save_model("model.ftz")
```
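Load a compressed model the same way as a regular one:

```ruby
model = FastText.load_model("model.ftz")
```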
Word Representations
Prep your data
```ruby
x = ["text from document one", "text from document two", "text from document three"]
```
Train a model
```ruby
model = FastText::Vectorizer.new
model.fit(x)
```
Get nearest neighbors
```ruby
model.nearest_neighbors("asparagus")
```
Get analogies
```ruby
model.analogies("berlin", "germany", "france")
```
Get a word vector
```ruby
model.word_vector("carrot")
```
Get a sentence vector
```ruby
model.sentence_vector("sentence text")
```
Get words
```ruby
model.words
```
Save the model to a file
```ruby
model.save_model("model.bin")
```
Load the model from a file
```ruby
model = FastText.load_model("model.bin")
```
Use continuous bag-of-words
```ruby
model = FastText::Vectorizer.new(model: "cbow")
```
Parameters
Text classification
```ruby
FastText::Classifier.new(
  lr: 0.1,                    # learning rate
  dim: 100,                   # size of word vectors
  ws: 5,                      # size of the context window
  epoch: 5,                   # number of epochs
  min_count: 1,               # minimal number of word occurrences
  min_count_label: 1,         # minimal number of label occurrences
  minn: 0,                    # min length of char ngram
  maxn: 0,                    # max length of char ngram
  neg: 5,                     # number of negatives sampled
  word_ngrams: 1,             # max length of word ngram
  loss: "softmax",            # loss function {hs, softmax, ova}
  bucket: 2000000,            # number of buckets
  thread: 3,                  # number of threads
  lr_update_rate: 100,        # change the rate of updates for the learning rate
  t: 0.0001,                  # sampling threshold
  label_prefix: "__label__",  # label prefix
  verbose: 2,                 # verbose
  pretrained_vectors: nil,    # pretrained word vectors (.vec file)
  autotune_metric: "f1",      # autotune optimization metric
  autotune_predictions: 1,    # autotune predictions
  autotune_duration: 300,     # autotune search time in seconds
  autotune_model_size: nil    # autotune model size, like 2M
)
```
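Any of these can be overridden when creating the classifier; for example, a sketch with more epochs and a higher learning rate (illustrative values, not tuned recommendations):

```ruby
model = FastText::Classifier.new(epoch: 25, lr: 0.5)
model.fit(x, y)
```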
Word representations
```ruby
FastText::Vectorizer.new(
  model: "skipgram",     # unsupervised fasttext model {cbow, skipgram}
  lr: 0.05,              # learning rate
  dim: 100,              # size of word vectors
  ws: 5,                 # size of the context window
  epoch: 5,              # number of epochs
  min_count: 5,          # minimal number of word occurrences
  minn: 3,               # min length of char ngram
  maxn: 6,               # max length of char ngram
  neg: 5,                # number of negatives sampled
  word_ngrams: 1,        # max length of word ngram
  loss: "ns",            # loss function {ns, hs, softmax, ova}
  bucket: 2000000,       # number of buckets
  thread: 3,             # number of threads
  lr_update_rate: 100,   # change the rate of updates for the learning rate
  t: 0.0001,             # sampling threshold
  verbose: 2             # verbose
)
```
Input Files
Input can be read directly from files
```ruby
model.fit("train.txt", autotune_set: "valid.txt")
model.test("test.txt")
```
Each line should be a document
```txt
text from document one
text from document two
text from document three
```
For text classification, lines should start with a list of labels prefixed with __label__
```txt
__label__ham text from document one
__label__ham text from document two
__label__spam text from document three
```
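A minimal sketch of writing a file in this format from the `x` and `y` arrays above (assumes one label per document; the `train.txt` name matches the earlier example):

```ruby
# write one labeled document per line
File.open("train.txt", "w") do |f|
  x.zip(y) do |text, label|
    f.puts "__label__#{label} #{text}"
  end
end
```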
Pretrained Models
There are a number of pretrained models you can download
Language Identification
Download one of the pretrained models and load it
```ruby
model = FastText.load_model("lid.176.ftz")
```
Get language predictions
```ruby
model.predict("bon appétit")
```
History
View the changelog
Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
```
git clone --recursive https://github.com/ankane/fastText.git
cd fastText
bundle install
bundle exec rake compile
bundle exec rake test
```