Tokenizers Ruby

:slightly_smiling_face: Fast state-of-the-art tokenizers for Ruby

Build Status

Installation

Add this line to your applicationā€™s Gemfile:

gem "tokenizers"

Getting Started

Load a pretrained tokenizer

tokenizer = Tokenizers.from_pretrained("bert-base-cased")

Encode

encoded = tokenizer.encode("I can feel the magic, can you?")
encoded.tokens
encoded.ids

Decode

tokenizer.decode(ids)

Training

Create a tokenizer

tokenizer = Tokenizers::Tokenizer.new(Tokenizers::Models::BPE.new(unk_token: "[UNK]"))

Set the pre-tokenizer

tokenizer.pre_tokenizer = Tokenizers::PreTokenizers::Whitespace.new

Train the tokenizer (example data)

trainer = Tokenizers::Trainers::BpeTrainer.new(special_tokens: ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer)

Encode

output = tokenizer.encode("Hello, y'all! How are you šŸ˜ ?")
output.tokens

Save the tokenizer to a file

tokenizer.save("tokenizer.json")

Load a tokenizer from a file

tokenizer = Tokenizers.from_file("tokenizer.json")

Check out the Quicktour and equivalent Ruby code for more info

API

This library follows the Tokenizers Python API. You can follow Python tutorials and convert the code to Ruby in many cases. Feel free to open an issue if you run into problems.

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

To get started with development:

git clone https://github.com/ankane/tokenizers-ruby.git
cd tokenizers-ruby
bundle install
bundle exec rake compile
bundle exec rake download:files
bundle exec rake test