tomos

Token-aware text chunking for RAG pipelines, powered by Rust.

Tomos wraps the text-splitter Rust crate with tiktoken tokenization and exposes two splitter classes to Ruby: Tomos::Text for plain text and Tomos::Markdown for Markdown documents. Each chunk carries its token count, byte position, and a SHA-256 content fingerprint.

Installation

gem "tomos"

Because tomos includes a native Rust extension, you'll need a Rust toolchain installed. The gem compiles on bundle install.

Usage

Splitting text into chunks

splitter = Tomos::Text.new(model: "gpt-4", capacity: 512)
chunks = splitter.chunks("A long document goes here...")

chunks.each do |chunk|
  chunk.text         # => String  — the chunk content
  chunk.token_count  # => Integer — tokens in this chunk
  chunk.byte_offset  # => Integer — start position in the original string
  chunk.byte_length  # => Integer — byte length of the chunk
  chunk.chunk_id     # => String  — 64-char SHA-256 hex digest
end

The capacity is the maximum number of tokens per chunk. An optional overlap keyword shares tokens between adjacent chunks, which helps preserve context at boundaries:

splitter = Tomos::Text.new(model: "gpt-4", capacity: 512, overlap: 50)

Splitting Markdown

Tomos::Markdown is Markdown-structure-aware — it respects headers, lists, and code fences when deciding where to split:

splitter = Tomos::Markdown.new(model: "gpt-4", capacity: 512)
chunks = splitter.chunks(File.read("document.md"))

Note: tokenization is over the raw input string regardless of splitter type; Markdown differs only in where it chooses split boundaries.

Counting tokens

Count tokens directly without constructing a splitter:

# Class method — resolves the tokenizer fresh each call
Tomos::Text.count_tokens("Hello, world!", model: "gpt-4")
# => 4

Tomos::Markdown.count_tokens("# Hello\n\nWorld", model: "gpt-4")
# => 4

If you already have a splitter instance, the instance method reuses its already-resolved tokenizer:

splitter = Tomos::Text.new(model: "gpt-4", capacity: 512)
splitter.count_tokens("Hello, world!")
# => 4

Both forms return 0 for empty input and raise ArgumentError for unrecognized model names.

Supported models

Any model name recognized by tiktoken, including:

gpt-4, gpt-4o, gpt-4.1, gpt-5
o1, o3, o4 and their versioned variants (e.g. o1-mini, gpt-4o-2024-05-13)
gpt-3.5-turbo
text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large

Unrecognized model names raise ArgumentError.

Chunk metadata

Each Tomos::Chunk exposes:

Method	Type	Description
`text`	`String`	The chunk content
`token_count`	`Integer`	Number of tokens in this chunk
`byte_offset`	`Integer`	Start byte position in the source string
`byte_length`	`Integer`	Byte length of the chunk
`chunk_id`	`String`	64-char lowercase SHA-256 hex digest of the chunk text

The byte metadata lets you map a chunk back to its exact position in the source:

source[chunk.byte_offset, chunk.byte_length] == chunk.text # => true

The chunk_id is deterministic — the same text always produces the same ID, regardless of model, capacity, or overlap.

License

MIT