Class: NanoGPT::GPT2Tokenizer

Inherits:
Tokenizer
Defined in:
lib/nano_gpt/tokenizer.rb

Overview

GPT-2 byte-pair encoding (BPE) tokenizer backed by tiktoken, loaded through the tiktoken_ruby gem.

Constant Summary

GPT2_VOCAB_SIZE = 50257
EOT_TOKEN = "<|endoftext|>"
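The vocabulary size follows from how the GPT-2 vocabulary is assembled: 256 byte-level base tokens, 50,000 learned BPE merges, and one special end-of-text token. The arithmetic can be sketched as follows; the constant names in this block are illustrative only and are not defined by NanoGPT.

```ruby
# Illustrative breakdown of the GPT-2 vocabulary.
# These constant names are hypothetical; NanoGPT does not define them.
BYTE_TOKENS    = 256     # one base token per possible byte value
BPE_MERGES     = 50_000  # learned byte-pair merges
SPECIAL_TOKENS = 1       # "<|endoftext|>"

TOTAL_VOCAB = BYTE_TOKENS + BPE_MERGES + SPECIAL_TOKENS
LAST_ID     = TOTAL_VOCAB - 1 # the special token occupies the final ID

puts TOTAL_VOCAB # 50257
puts LAST_ID     # 50256
```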

Instance Attribute Summary

Attributes inherited from Tokenizer

#vocab_size

Instance Method Summary

Methods inherited from Tokenizer

for_dataset

Constructor Details

#initialize ⇒ GPT2Tokenizer

Returns a new instance of GPT2Tokenizer.



# File 'lib/nano_gpt/tokenizer.rb', line 83

def initialize
  super()
  require "tiktoken_ruby"
  # GPT-2 uses the r50k_base encoding
  @enc = Tiktoken.get_encoding(:r50k_base)
  @vocab_size = GPT2_VOCAB_SIZE
end

Instance Method Details

#decode(ids) ⇒ Object

Decodes a list of integer token IDs back into a string.



# File 'lib/nano_gpt/tokenizer.rb', line 97

def decode(ids)
  @enc.decode(ids)
end

#encode(text) ⇒ Object

Encodes a string into a list of integer token IDs.



# File 'lib/nano_gpt/tokenizer.rb', line 92

def encode(text)
  @enc.encode(text)
end
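Taken together, #encode and #decode are expected to round-trip: decoding the IDs produced for a string should return that string. The sketch below shows that contract with a hypothetical byte-level stand-in (ByteTokenizer is not part of NanoGPT) so that it runs without tiktoken_ruby. The real tokenizer would produce different, BPE-based IDs.

```ruby
# Hypothetical stand-in exposing the same encode/decode interface.
# It maps each UTF-8 byte to one ID; real GPT-2 BPE IDs will differ.
class ByteTokenizer
  # String -> Array<Integer>
  def encode(text)
    text.bytes
  end

  # Array<Integer> -> String
  def decode(ids)
    ids.pack("C*").force_encoding(Encoding::UTF_8)
  end
end

tok  = ByteTokenizer.new
ids  = tok.encode("hi")
text = tok.decode(ids)

puts ids.inspect  # [104, 105]
puts text == "hi" # true
```

With tiktoken_ruby installed, an instance of NanoGPT::GPT2Tokenizer should be usable in place of the stand-in, since it exposes the same two methods.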

#eot_token ⇒ Object

Returns the ID of the end-of-text token.



# File 'lib/nano_gpt/tokenizer.rb', line 102

def eot_token
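  # Allow the special token so it maps to its single ID rather than being split as plain text
  @enc.encode(EOT_TOKEN, allowed_special: [EOT_TOKEN]).first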
end
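A common use of the end-of-text ID in GPT-style data preparation is to mark document boundaries when several texts are concatenated into one token stream. The sketch below illustrates that idea. StubTokenizer, its EOT_ID value, and build_stream are hypothetical names introduced for this example, and whether NanoGPT prepares data this way is an assumption.

```ruby
# Hypothetical stand-in with the same encode/eot_token interface.
class StubTokenizer
  EOT_ID = 256 # assumed ID, placed just past the 0..255 byte range

  def encode(text)
    text.bytes
  end

  def eot_token
    EOT_ID
  end
end

# Concatenate documents into one token stream, ending each with EOT.
def build_stream(tokenizer, documents)
  documents.flat_map { |doc| tokenizer.encode(doc) + [tokenizer.eot_token] }
end

stream = build_stream(StubTokenizer.new, ["ab", "c"])
puts stream.inspect # [97, 98, 256, 99, 256]
```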