Class: NanoGPT::GPT2Tokenizer
- Defined in: lib/nano_gpt/tokenizer.rb
Overview
GPT-2 byte-pair-encoding (BPE) tokenizer backed by the tiktoken_ruby gem, using the r50k_base encoding.
Constant Summary
- GPT2_VOCAB_SIZE = 50257
- EOT_TOKEN = "<|endoftext|>"
Instance Attribute Summary
Attributes inherited from Tokenizer
Instance Method Summary
- #decode(ids) ⇒ Object
  Decode list of integers to string.
- #encode(text) ⇒ Object
  Encode string to list of integers.
- #eot_token ⇒ Object
  Get the end-of-text token ID.
- #initialize ⇒ GPT2Tokenizer (constructor)
  A new instance of GPT2Tokenizer.
Methods inherited from Tokenizer
Constructor Details
#initialize ⇒ GPT2Tokenizer
Returns a new instance of GPT2Tokenizer.
# File 'lib/nano_gpt/tokenizer.rb', line 83
def initialize
  super()
  require "tiktoken_ruby"
  # GPT-2 uses the r50k_base encoding
  @enc = Tiktoken.get_encoding(:r50k_base)
  @vocab_size = GPT2_VOCAB_SIZE
end
Instance Method Details
#decode(ids) ⇒ Object
Decode list of integers to string
# File 'lib/nano_gpt/tokenizer.rb', line 97
def decode(ids)
  @enc.decode(ids)
end
#encode(text) ⇒ Object
Encode string to list of integers
# File 'lib/nano_gpt/tokenizer.rb', line 92
def encode(text)
  @enc.encode(text)
end
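Since #encode and #decode are inverses of each other, a quick sanity check is to confirm that text survives a round trip. The sketch below is only an illustration. `round_trip?` and `ByteTokenizer` are hypothetical names that are not part of nano_gpt, and the byte-level stand-in exists solely so the example runs without the tiktoken_ruby gem. Any object that responds to `encode` and `decode`, presumably including `NanoGPT::GPT2Tokenizer.new`, could be passed instead.

```ruby
# Hypothetical stand-in with the same encode/decode interface as
# NanoGPT::GPT2Tokenizer. It maps each UTF-8 byte to one ID so the
# sketch runs without tiktoken_ruby. Not part of nano_gpt.
class ByteTokenizer
  # String -> Array of Integer IDs (one per byte).
  def encode(text)
    text.bytes
  end

  # Array of Integer IDs -> String (bytes reassembled as UTF-8).
  def decode(ids)
    ids.pack("C*").force_encoding(Encoding::UTF_8)
  end
end

# Returns true when decoding the encoded text reproduces the input.
def round_trip?(tokenizer, text)
  tokenizer.decode(tokenizer.encode(text)) == text
end

# Assumption: swap in NanoGPT::GPT2Tokenizer.new when the gem is installed.
tokenizer = ByteTokenizer.new
puts round_trip?(tokenizer, "Hello, world!") # prints true
```

With the real tokenizer the IDs would be BPE token IDs rather than raw bytes, but the check itself stays the same.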
#eot_token ⇒ Object
Get the end-of-text token ID
# File 'lib/nano_gpt/tokenizer.rb', line 102
def eot_token
  @enc.encode(EOT_TOKEN).first
end
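A common use of the end-of-text ID is to mark document boundaries when several texts are concatenated into one token stream. The sketch below shows that pattern under stated assumptions. `encode_documents` and `StubTokenizer` are hypothetical names that are not part of nano_gpt, and the stub's ID of 256 is a placeholder. In the GPT-2 vocabulary the end-of-text ID is 50256, the last of the 50257 entries. Whether `@enc.encode` treats the literal "<|endoftext|>" as a special token may depend on how tiktoken_ruby handles special tokens, so it is worth confirming that #eot_token actually returns 50256 before relying on it.

```ruby
# Hypothetical stand-in exposing encode and eot_token like
# NanoGPT::GPT2Tokenizer, so the sketch runs without tiktoken_ruby.
# Not part of nano_gpt.
class StubTokenizer
  EOT_ID = 256 # placeholder ID, one past the byte range; GPT-2 uses 50256

  # String -> Array of Integer IDs (one per byte).
  def encode(text)
    text.bytes
  end

  # ID used to mark the end of a document.
  def eot_token
    EOT_ID
  end
end

# Encodes each document and appends the end-of-text ID after it,
# returning one flat Array of token IDs.
def encode_documents(tokenizer, documents)
  eot = tokenizer.eot_token
  documents.flat_map { |doc| tokenizer.encode(doc) + [eot] }
end

ids = encode_documents(StubTokenizer.new, ["ab", "c"])
p ids # prints [97, 98, 256, 99, 256]
```

Looking up `eot_token` once outside the loop avoids re-encoding the marker string for every document.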