Tokenizers Ruby

:slightly_smiling_face: Fast state-of-the-art tokenizers for Ruby

Build Status

Installation

Add this line to your application’s Gemfile:

gem "tokenizers"

Note: Rust and pkg-config are currently required for installation, and it can take 5-10 minutes to compile the extension.

Getting Started

Load a pretrained tokenizer

tokenizer = Tokenizers.from_pretrained("bert-base-cased")

Encode

encoded = tokenizer.encode("I can feel the magic, can you?")
encoded.ids
encoded.tokens

Decode

tokenizer.decode(ids)

Load a tokenizer from files

tokenizer = Tokenizers::CharBPETokenizer.new("vocab.json", "merges.txt")

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

To get started with development:

git clone https://github.com/ankane/tokenizers-ruby.git
cd tokenizers-ruby
bundle install
bundle exec ruby ext/tokenizers/extconf.rb && make
bundle exec rake download:files
bundle exec rake test