Class: NanoGPT::GPT2Tokenizer

Inherits: Tokenizer

NanoGPT::GPT2Tokenizer < Tokenizer < Object

Defined in: lib/nano_gpt/tokenizer.rb

Overview

GPT-2 byte-pair encoding (BPE) tokenizer backed by the tiktoken_ruby gem, using the r50k_base encoding.

Constant Summary

GPT2_VOCAB_SIZE =

EOT_TOKEN = "<|endoftext|>"

Instance Attribute Summary

Attributes inherited from Tokenizer

#vocab_size

Instance Method Summary

#decode(ids) ⇒ Object

Decodes a list of integer token IDs back into a string.

#encode(text) ⇒ Object

Encodes a string into a list of integer token IDs.

#eot_token ⇒ Object

Returns the ID of the end-of-text token.

#initialize ⇒ GPT2Tokenizer (constructor)

Creates a new GPT2Tokenizer.

Methods inherited from Tokenizer

for_dataset

Constructor Details

#initialize ⇒ `GPT2Tokenizer`

Returns a new instance of GPT2Tokenizer.

# File 'lib/nano_gpt/tokenizer.rb', line 83

def initialize
  super()
  require "tiktoken_ruby"
  # GPT-2 uses the r50k_base encoding
  @enc = Tiktoken.get_encoding(:r50k_base)
  @vocab_size = GPT2_VOCAB_SIZE
end
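
Because the `require` call sits inside the constructor, tiktoken_ruby is loaded only when a GPT2Tokenizer is actually built. One way to make a missing gem easier to diagnose is sketched below. The `require_optional` helper and the gem name passed to it are hypothetical and are not part of lib/nano_gpt/tokenizer.rb; the sketch only illustrates the general on-demand loading pattern.

```ruby
# Hypothetical helper (not part of NanoGPT): load an optional dependency on
# demand and raise a descriptive LoadError if it is not installed.
def require_optional(feature, purpose)
  require feature
  true
rescue LoadError
  raise LoadError, "#{purpose} needs the '#{feature}' gem; add it to your Gemfile"
end

# Demonstrate the failure path with a gem name that should not exist.
begin
  require_optional("definitely_not_installed_gem", "GPT2Tokenizer")
rescue LoadError => e
  message = e.message
end

puts message
```

The same idea would apply to the constructor above by replacing the bare `require` with a call to such a helper.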

Instance Method Details

#decode(ids) ⇒ `Object`

Decodes a list of integer token IDs back into a string.




# File 'lib/nano_gpt/tokenizer.rb', line 97

def decode(ids)
  @enc.decode(ids)
end

#encode(text) ⇒ `Object`

Encodes a string into a list of integer token IDs.




# File 'lib/nano_gpt/tokenizer.rb', line 92

def encode(text)
  @enc.encode(text)
end
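
The contract implied by #encode and #decode is a round trip: decoding the IDs produced by encoding a string should return that string. The sketch below illustrates the contract with a hypothetical `ByteTokenizer` stand-in that exposes the same two methods but maps each UTF-8 byte to one ID, so it runs without tiktoken_ruby. It is not the BPE scheme GPT2Tokenizer uses, and the IDs it produces differ from GPT-2 token IDs.

```ruby
# Hypothetical stand-in with the same encode/decode interface as GPT2Tokenizer.
# It maps each UTF-8 byte to one integer ID; it is NOT GPT-2 BPE.
class ByteTokenizer
  attr_reader :vocab_size

  def initialize
    @vocab_size = 256
  end

  # String -> Array<Integer>
  def encode(text)
    text.bytes
  end

  # Array<Integer> -> String
  def decode(ids)
    ids.pack("C*").force_encoding(Encoding::UTF_8)
  end
end

tok = ByteTokenizer.new
ids = tok.encode("héllo")
round_trip = tok.decode(ids)

puts ids.inspect
puts round_trip
```

Note that the number of IDs need not equal the number of characters: here the accented character occupies two bytes and therefore two IDs.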

#eot_token ⇒ `Object`

Returns the ID of the end-of-text token.




# File 'lib/nano_gpt/tokenizer.rb', line 102

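def eot_token
  # Allow the special token explicitly so it is matched as one token
  # instead of being split as ordinary text.
  @enc.encode(EOT_TOKEN, allowed_special: [EOT_TOKEN]).first
end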
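
A common use of an end-of-text ID in GPT-style data preparation is to mark document boundaries inside one flat token stream. A minimal sketch of that pattern follows, assuming a tokenizer object that responds to `encode` and `eot_token` as GPT2Tokenizer does. `StubTokenizer` and `pack_documents` are hypothetical names, and the stub's byte-level IDs and EOT ID of 256 are illustrative values, not GPT-2 ones.

```ruby
# Hypothetical stub exposing the same methods as GPT2Tokenizer (encode, eot_token).
# IDs are raw UTF-8 bytes and EOT is 256; these are illustrative values only.
class StubTokenizer
  EOT_ID = 256

  def encode(text)
    text.bytes
  end

  def eot_token
    EOT_ID
  end
end

# Hypothetical helper: encode each document and append the EOT ID after it,
# producing one flat token stream.
def pack_documents(tokenizer, documents)
  eot = tokenizer.eot_token
  documents.flat_map { |doc| tokenizer.encode(doc) + [eot] }
end

stream = pack_documents(StubTokenizer.new, ["ab", "c"])
puts stream.inspect
```

Because the helper depends only on `encode` and `eot_token`, any object with those two methods could be passed in place of the stub.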