fastText
fastText - efficient text classification and representation learning - for Ruby
Installation
Add this line to your application’s Gemfile:
```ruby
gem 'fasttext'
```
Getting Started
fastText has two primary use cases:

- text classification
- word representations
Text Classification
Prep your data
```ruby
# documents
x = ["text from document one", "text from document two", "text from document three"]

# labels
y = ["ham", "ham", "spam"]
```
Use an array if a document has multiple labels
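For example, a quick sketch (the "important" label is purely illustrative):

```ruby
# the second document has two labels
y = ["ham", ["ham", "important"], "spam"]
```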
Train a model
```ruby
model = FastText::Classifier.new
model.fit(x, y)
```
Get predictions
```ruby
model.predict(x)
```
Save the model to a file
```ruby
model.save_model("model.bin")
```
Load the model from a file
```ruby
model = FastText.load_model("model.bin")
```
Evaluate the model
```ruby
model.test(x_test, y_test)
```
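Here `x_test` and `y_test` are a held-out set in the same format as `x` and `y`, for example:

```ruby
# held-out documents and labels, same format as the training data
x_test = ["text from a new document"]
y_test = ["ham"]
```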
Get words and labels
```ruby
model.words
model.labels
```
Use `include_freq: true` to get their frequency
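For example, assuming the option is a keyword argument to these methods:

```ruby
model.words(include_freq: true)
model.labels(include_freq: true)
```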
Search for the best hyperparameters
```ruby
model.fit(x, y, autotune_set: [x_valid, y_valid])
```
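Here `x_valid` and `y_valid` are a validation set in the same format as the training data. The search can also be bounded with the `autotune_duration` parameter listed below, for example:

```ruby
# search for up to 10 minutes instead of the default 300 seconds
model = FastText::Classifier.new(autotune_duration: 600)
model.fit(x, y, autotune_set: [x_valid, y_valid])
```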
Compress the model - significantly reduces size but sacrifices a little performance
```ruby
model.quantize
model.save_model("model.ftz")
```
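Load a compressed model the same way as a regular one:

```ruby
model = FastText.load_model("model.ftz")
```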
Word Representations
Prep your data
```ruby
x = ["text from document one", "text from document two", "text from document three"]
```
Train a model
```ruby
model = FastText::Vectorizer.new
model.fit(x)
```
Get nearest neighbors
```ruby
model.nearest_neighbors("asparagus")
```
Get analogies
```ruby
model.analogies("berlin", "germany", "france")
```
Get a word vector
```ruby
model.word_vector("carrot")
```
Get a sentence vector
```ruby
model.sentence_vector("sentence text")
```
Get words
```ruby
model.words
```
Save the model to a file
```ruby
model.save_model("model.bin")
```
Load the model from a file
```ruby
model = FastText.load_model("model.bin")
```
Use continuous bag-of-words
```ruby
model = FastText::Vectorizer.new(model: "cbow")
```
Parameters
Text classification
```ruby
FastText::Classifier.new(
  lr: 0.1,                    # learning rate
  dim: 100,                   # size of word vectors
  ws: 5,                      # size of the context window
  epoch: 5,                   # number of epochs
  min_count: 1,               # minimal number of word occurrences
  min_count_label: 1,         # minimal number of label occurrences
  minn: 0,                    # min length of char ngram
  maxn: 0,                    # max length of char ngram
  neg: 5,                     # number of negatives sampled
  word_ngrams: 1,             # max length of word ngram
  loss: "softmax",            # loss function {hs, softmax, ova}
  bucket: 2000000,            # number of buckets
  thread: 3,                  # number of threads
  lr_update_rate: 100,        # change the rate of updates for the learning rate
  t: 0.0001,                  # sampling threshold
  label_prefix: "__label__",  # label prefix
  verbose: 2,                 # verbose
  pretrained_vectors: nil,    # pretrained word vectors (.vec file)
  autotune_metric: "f1",      # autotune optimization metric
  autotune_predictions: 1,    # autotune predictions
  autotune_duration: 300,     # autotune search time in seconds
  autotune_model_size: nil    # autotune model size, like 2M
)
```
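Any of these can be overridden when creating the classifier; for example, a sketch with more epochs and a higher learning rate (illustrative values, not tuned recommendations):

```ruby
model = FastText::Classifier.new(epoch: 25, lr: 0.5)
model.fit(x, y)
```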
Word representations
```ruby
FastText::Vectorizer.new(
  model: "skipgram",     # unsupervised fasttext model {cbow, skipgram}
  lr: 0.05,              # learning rate
  dim: 100,              # size of word vectors
  ws: 5,                 # size of the context window
  epoch: 5,              # number of epochs
  min_count: 5,          # minimal number of word occurrences
  minn: 3,               # min length of char ngram
  maxn: 6,               # max length of char ngram
  neg: 5,                # number of negatives sampled
  word_ngrams: 1,        # max length of word ngram
  loss: "ns",            # loss function {ns, hs, softmax, ova}
  bucket: 2000000,       # number of buckets
  thread: 3,             # number of threads
  lr_update_rate: 100,   # change the rate of updates for the learning rate
  t: 0.0001,             # sampling threshold
  verbose: 2             # verbose
)
```
Input Files
Input can be read directly from files
```ruby
model.fit("train.txt", autotune_set: "valid.txt")
model.test("test.txt")
```
Each line should be a document
```txt
text from document one
text from document two
text from document three
```
For text classification, lines should start with a list of labels prefixed with __label__
```txt
__label__ham text from document one
__label__ham text from document two
__label__spam text from document three
```
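A minimal sketch of writing a file in this format from the `x` and `y` arrays above (assumes one label per document; the `train.txt` name matches the earlier example):

```ruby
# write one labeled document per line
File.open("train.txt", "w") do |f|
  x.zip(y) do |text, label|
    f.puts "__label__#{label} #{text}"
  end
end
```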
Pretrained Models
There are a number of pretrained models you can download
Language Identification
Download one of the pretrained models and load it
```ruby
model = FastText.load_model("lid.176.ftz")
```
Get language predictions
```ruby
model.predict("bon appétit")
```
History
View the changelog
Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
```
git clone --recursive https://github.com/ankane/fastText.git
cd fastText
bundle install
bundle exec rake compile
bundle exec rake test
```