Class: Tokkens::Tokenizer
- Inherits:
- Object
- Defined in:
- lib/tokkens/tokenizer.rb
Overview
Converts a string to a list of token numbers.
Useful for computing with text, for example in machine learning. Before using the tokenizer, you are expected to have pre-processed the text as your application requires, for example by converting to lowercase, removing non-word characters, and transliterating accented characters.
This class then splits the string into tokens by whitespace, and removes tokens not passing the selection criteria.
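The splitting and selection step described above can be sketched roughly as follows. This is a standalone illustration, not the library's private implementation: the helper name `select_tokens` is hypothetical, and it assumes the selection criteria are exactly "at least `min_length` characters" and "not listed in `stop_words`", with defaults mirroring `MIN_LENGTH` and `STOP_WORDS`.

```ruby
# Hypothetical stand-in for the selection step; the real class may differ.
# Defaults mirror the documented constants MIN_LENGTH (2) and STOP_WORDS ([]).
def select_tokens(s, min_length: 2, stop_words: [])
  words = s.split                                       # split on runs of whitespace
  words = words.reject { |t| t.length < min_length }    # drop tokens that are too short
  words.reject { |t| stop_words.include?(t) }           # drop stop words
end

select_tokens("a quick brown fox")
# => ["quick", "brown", "fox"]
select_tokens("the quick brown fox", stop_words: ["the"])
# => ["quick", "brown", "fox"]
```

Note that nothing here lowercases or strips punctuation; as stated above, that pre-processing is left to the caller.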
Constant Summary
- MIN_LENGTH =
default minimum token length
2
- STOP_WORDS =
no default stop words to ignore
[]
Instance Attribute Summary
-
#min_length ⇒ Object
readonly
Returns the value of attribute min_length.
-
#stop_words ⇒ Array<String>
readonly
Stop words to ignore.
-
#tokens ⇒ Tokens
readonly
Object to use for obtaining tokens.
Instance Method Summary
-
#get(s, **kwargs) ⇒ Array<Fixnum>
Array of token numbers.
-
#initialize(tokens = nil, min_length: MIN_LENGTH, stop_words: STOP_WORDS) ⇒ Tokenizer
constructor
Create a new tokenizer.
Constructor Details
#initialize(tokens = nil, min_length: MIN_LENGTH, stop_words: STOP_WORDS) ⇒ Tokenizer
Create a new tokenizer.
# File 'lib/tokkens/tokenizer.rb', line 35
def initialize(tokens = nil, min_length: MIN_LENGTH, stop_words: STOP_WORDS)
  @tokens = tokens || Tokens.new
  @stop_words = stop_words
  @min_length = min_length
end
Instance Attribute Details
#min_length ⇒ Object (readonly)
Returns the value of attribute min_length.
# File 'lib/tokkens/tokenizer.rb', line 28
attr_reader :tokens, :stop_words, :min_length
#stop_words ⇒ Array<String> (readonly)
Returns stop words to ignore.
# File 'lib/tokkens/tokenizer.rb', line 28
attr_reader :tokens, :stop_words, :min_length
#tokens ⇒ Tokens (readonly)
Returns object to use for obtaining tokens.
# File 'lib/tokkens/tokenizer.rb', line 28
def tokens
  @tokens
end
Instance Method Details
#get(s, **kwargs) ⇒ Array<Fixnum>
Returns array of token numbers.
# File 'lib/tokkens/tokenizer.rb', line 42
def get(s, **kwargs)
  # Nil or blank input yields no tokens.
  return [] if !s || s.strip == ''
  # Map each token to its number; drop tokens for which nil is returned.
  tokenize(s).map {|token| @tokens.get(token, **kwargs) }.compact
end
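To illustrate how the blank-input guard and `compact` in `#get` shape the result, here is a minimal, self-contained sketch. `FakeTokens`, its `create:` keyword, and the helper `get_numbers` are all hypothetical stand-ins invented for this example; they are not the API of the real `Tokens` class, and plain `String#split` is used in place of the private `tokenize` method.

```ruby
# Hypothetical stand-in for the tokens object: returns a number, or nil
# when a token is unknown and create: false is passed (made-up keyword).
class FakeTokens
  def initialize
    @numbers = {}
  end

  def get(token, create: true)
    return @numbers[token] if @numbers.key?(token)
    return nil unless create
    @numbers[token] = @numbers.size + 1   # assign the next free number
  end
end

# Mirrors the shape of #get: guard, map each token, drop nil results.
def get_numbers(tokens, s, **kwargs)
  return [] if !s || s.strip == ''
  s.split.map { |token| tokens.get(token, **kwargs) }.compact
end

tokens = FakeTokens.new
get_numbers(tokens, "quick brown quick")          # => [1, 2, 1]
get_numbers(tokens, "quick lazy", create: false)  # => [1]
get_numbers(tokens, "   ")                        # => []
get_numbers(tokens, nil)                          # => []
```

Because keyword arguments are forwarded untouched, any option the tokens object understands can be passed through `#get`; tokens for which it returns `nil` simply disappear from the output.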