Class: NanoGPT::Tokenizer

Inherits:
Object
Defined in:
lib/nano_gpt/tokenizer.rb

Overview

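Base tokenizer interface. Subclasses are expected to implement #encode and #decode; the base implementations raise NotImplementedError.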

Direct Known Subclasses

CharTokenizer, GPT2Tokenizer

Instance Attribute Details

#vocab_size ⇒ Object (readonly)

Returns the value of attribute vocab_size.



# File 'lib/nano_gpt/tokenizer.rb', line 8

def vocab_size
  @vocab_size
end

Class Method Details

.for_dataset(dataset_dir) ⇒ Object

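Auto-detects and loads the appropriate tokenizer for a dataset. If meta.json exists in dataset_dir, the character-level tokenizer (CharTokenizer) is loaded from that file; otherwise the GPT-2 BPE tokenizer (GPT2Tokenizer) is used.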



# File 'lib/nano_gpt/tokenizer.rb', line 20

def self.for_dataset(dataset_dir)
  meta_path = File.join(dataset_dir, "meta.json")
  if File.exist?(meta_path)
    CharTokenizer.from_file(meta_path)
  else
    GPT2Tokenizer.new
  end
end
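
The detection rule above can be tried in isolation. The following is a minimal sketch, not the gem's code: the DetectionSketch module and both classes inside it are hypothetical stand-ins that only record which branch was taken, and the empty "{}" written to meta.json is a placeholder, since the real file format is not shown on this page.

```ruby
require "tmpdir"

# Hypothetical stand-ins so the sketch runs without the gem; the real
# CharTokenizer and GPT2Tokenizer are not reproduced here.
module DetectionSketch
  class CharTokenizer
    attr_reader :meta_path

    def self.from_file(meta_path)
      new(meta_path)
    end

    def initialize(meta_path)
      @meta_path = meta_path
    end
  end

  class GPT2Tokenizer; end

  # Same branching as the for_dataset listing: meta.json present means
  # character-level, absent means GPT-2 BPE.
  def self.for_dataset(dataset_dir)
    meta_path = File.join(dataset_dir, "meta.json")
    File.exist?(meta_path) ? CharTokenizer.from_file(meta_path) : GPT2Tokenizer.new
  end
end

Dir.mktmpdir do |dir|
  # No meta.json yet: the GPT-2 branch is taken.
  without_meta = DetectionSketch.for_dataset(dir)

  # Once meta.json appears, the character-level branch is taken.
  File.write(File.join(dir, "meta.json"), "{}")
  with_meta = DetectionSketch.for_dataset(dir)

  puts without_meta.class.name # DetectionSketch::GPT2Tokenizer
  puts with_meta.class.name    # DetectionSketch::CharTokenizer
end
```

Only the existence of meta.json is checked, so a dataset directory that does not exist at all also falls through to the GPT-2 branch.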

Instance Method Details

#decode(ids) ⇒ Object

Raises:

  • (NotImplementedError)


# File 'lib/nano_gpt/tokenizer.rb', line 14

def decode(ids)
  raise NotImplementedError
end

#encode(text) ⇒ Object

Raises:

  • (NotImplementedError)


# File 'lib/nano_gpt/tokenizer.rb', line 10

def encode(text)
  raise NotImplementedError
end
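
To illustrate how a subclass might satisfy this interface, here is a minimal sketch. It assumes a hypothetical InterfaceSketch namespace whose base class mirrors the listings above, and a hypothetical TinyCharTokenizer that builds its vocabulary from a sample string; it is not the gem's CharTokenizer.

```ruby
# Hypothetical namespace; the base class mirrors the listings above.
module InterfaceSketch
  class Tokenizer
    attr_reader :vocab_size

    def encode(text)
      raise NotImplementedError
    end

    def decode(ids)
      raise NotImplementedError
    end
  end

  # Illustrative subclass (an assumption, not the gem's CharTokenizer):
  # one id per distinct character of the string it was built from.
  class TinyCharTokenizer < Tokenizer
    def initialize(corpus)
      @itos = corpus.chars.uniq.sort       # id -> character
      @stoi = @itos.each_with_index.to_h   # character -> id
      @vocab_size = @itos.size
    end

    # Text to an array of integer ids; raises KeyError on unknown characters.
    def encode(text)
      text.chars.map { |ch| @stoi.fetch(ch) }
    end

    # Array of integer ids back to text.
    def decode(ids)
      ids.map { |id| @itos.fetch(id) }.join
    end
  end
end

tok = InterfaceSketch::TinyCharTokenizer.new("hello")
ids = tok.encode("hello")
p ids             # [1, 0, 2, 2, 3]
p tok.decode(ids) # "hello"
p tok.vocab_size  # 4
```

In Ruby, NotImplementedError descends from ScriptError rather than StandardError, so a bare rescue does not catch it; callers that probe the base class need to rescue NotImplementedError explicitly.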