Class: NanoGPT::Tokenizer
- Inherits: Object
  - Object
  - NanoGPT::Tokenizer
- Defined in: lib/nano_gpt/tokenizer.rb
Overview
Base tokenizer interface
Direct Known Subclasses
Instance Attribute Summary
- #vocab_size ⇒ Object (readonly)
  Returns the value of attribute vocab_size.
Class Method Summary
- .for_dataset(dataset_dir) ⇒ Object
  Auto-detect and load the appropriate tokenizer. If meta.json exists in the dataset directory, use the character-level tokenizer; otherwise use GPT-2 BPE.
Instance Method Summary
- #decode(ids) ⇒ Object
- #encode(text) ⇒ Object
Instance Attribute Details
#vocab_size ⇒ Object (readonly)
Returns the value of attribute vocab_size.
    # File 'lib/nano_gpt/tokenizer.rb', line 8
    def vocab_size
      @vocab_size
    end
Class Method Details
.for_dataset(dataset_dir) ⇒ Object
Auto-detect and load the appropriate tokenizer. If meta.json exists in the dataset directory, use the character-level tokenizer; otherwise use GPT-2 BPE.
    # File 'lib/nano_gpt/tokenizer.rb', line 20
    def self.for_dataset(dataset_dir)
      meta_path = File.join(dataset_dir, "meta.json")
      if File.exist?(meta_path)
        CharTokenizer.from_file(meta_path)
      else
        GPT2Tokenizer.new
      end
    end
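The selection rule in .for_dataset can be tried out in isolation. The following is a minimal sketch, not the gem's code: `CharStub`, `Gpt2Stub`, `pick_tokenizer`, and `demo_pick` are hypothetical stand-ins introduced here so the example runs without NanoGPT installed. It only shows that the presence of meta.json decides which branch is taken.

```ruby
require "tmpdir"

# Hypothetical stand-ins for CharTokenizer and GPT2Tokenizer (assumptions,
# not the gem's classes). They only record which branch was chosen.
CharStub = Struct.new(:meta_path)
Gpt2Stub = Class.new

# Mirrors the documented rule: meta.json present -> character-level,
# otherwise -> GPT-2 BPE.
def pick_tokenizer(dataset_dir)
  meta_path = File.join(dataset_dir, "meta.json")
  File.exist?(meta_path) ? CharStub.new(meta_path) : Gpt2Stub.new
end

# Runs the rule against a temporary directory, before and after
# creating meta.json, and returns the two chosen classes.
def demo_pick
  Dir.mktmpdir do |dir|
    without_meta = pick_tokenizer(dir).class
    File.write(File.join(dir, "meta.json"), "{}")
    with_meta = pick_tokenizer(dir).class
    [without_meta, with_meta]
  end
end
```

With the real gem loaded, the equivalent call would presumably be `NanoGPT::Tokenizer.for_dataset(dataset_dir)`, with the result responding to #encode, #decode, and #vocab_size.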
Instance Method Details
#decode(ids) ⇒ Object
    # File 'lib/nano_gpt/tokenizer.rb', line 14
    def decode(ids)
      raise NotImplementedError
    end
#encode(text) ⇒ Object
    # File 'lib/nano_gpt/tokenizer.rb', line 10
    def encode(text)
      raise NotImplementedError
    end
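Because #encode and #decode raise NotImplementedError in the base class, a concrete tokenizer is expected to override both and set @vocab_size. The following is a minimal sketch under stated assumptions: `NanoGPTSketch` is a hypothetical namespace that re-declares the documented interface so the example runs without the gem, and `ByteTokenizer` is an illustrative subclass that is not part of NanoGPT.

```ruby
module NanoGPTSketch
  # Stand-in for the documented base class (assumption: same three members).
  class Tokenizer
    attr_reader :vocab_size

    def encode(text)
      raise NotImplementedError
    end

    def decode(ids)
      raise NotImplementedError
    end
  end

  # Hypothetical subclass: each byte of the input becomes one token id.
  class ByteTokenizer < Tokenizer
    def initialize
      @vocab_size = 256 # one id per possible byte value
    end

    # String -> Array of Integers in 0..255
    def encode(text)
      text.bytes
    end

    # Array of Integers -> UTF-8 String
    def decode(ids)
      ids.pack("C*").force_encoding(Encoding::UTF_8)
    end
  end
end

tok = NanoGPTSketch::ByteTokenizer.new
ids = tok.encode("hi") # => [104, 105]
tok.decode(ids)        # => "hi"
```

In Ruby, NotImplementedError descends from ScriptError rather than StandardError, so a bare `rescue` does not catch it. Callers that want to detect an unimplemented method need `rescue NotImplementedError` explicitly.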