tokenizer-ruby

Ruby bindings for HuggingFace Tokenizers. Fast, Rust-powered tokenization for any HuggingFace model — GPT-2, BERT, LLaMA, Claude, and more.

Installation

gem install tokenizer-ruby

Or add to your Gemfile:

gem "tokenizer-ruby"

Note: Requires Rust toolchain for compilation. Install via rustup.

Usage

Load a tokenizer

require "tokenizer_ruby"

# From HuggingFace Hub
tokenizer = TokenizerRuby::Tokenizer.from_pretrained("gpt2")
tokenizer = TokenizerRuby::Tokenizer.from_pretrained("bert-base-uncased")

# From a local file
tokenizer = TokenizerRuby::Tokenizer.from_file("/path/to/tokenizer.json")

Encode and decode

encoding = tokenizer.encode("Hello, world!")
encoding.ids           # => [15496, 11, 995, 0]
encoding.tokens        # => ["Hello", ",", " world", "!"]
encoding.offsets       # => [[0, 5], [5, 6], [6, 12], [12, 13]]
encoding.attention_mask # => [1, 1, 1, 1]
encoding.length        # => 4

tokenizer.decode([15496, 11, 995, 0])  # => "Hello, world!"

Batch processing

encodings = tokenizer.encode_batch(["Hello", "World"])
decoded = tokenizer.decode_batch(encodings.map(&:ids))
# => ["Hello", "World"]

Token counting

tokenizer.count("Hello, world!")  # => 4

Truncation

# Truncate text to a token limit
tokenizer.truncate("This is a long sentence...", max_tokens: 5)

# Enable automatic truncation on all encodes
tokenizer.enable_truncation(max_length: 512)

Padding

tokenizer.enable_padding(length: 128, pad_token: "[PAD]")
encoding = tokenizer.encode("Hello")
encoding.ids.length        # => 128
encoding.attention_mask     # => [1, 0, 0, 0, ...]

Vocabulary

tokenizer.vocab_size           # => 50257
tokenizer.token_to_id("hello") # => 31373
tokenizer.id_to_token(31373)   # => "hello"

Requirements

  • Ruby >= 3.1
  • Rust toolchain (for building from source)

Development

bundle install
bundle exec rake compile
bundle exec rake test

License

MIT

Author

Johannes Dwi Cahyo — @johannesdwicahyo