Class: Tokkens::Tokenizer

Inherits:
Object
Defined in:
lib/tokkens/tokenizer.rb

Overview

Converts a string to a list of token numbers.

Useful for computing with text, as in machine learning. Before using the tokenizer, you are expected to have pre-processed the text as your application requires, for example by converting to lowercase, removing non-word characters, and transliterating accented characters.
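The pre-processing steps listed above could be sketched as follows. This is only an illustration: the helper name preprocess is hypothetical and not part of Tokkens, and the exact cleaning rules depend on your application.

```ruby
# Hypothetical helper, not part of the Tokkens gem: one possible way to
# prepare text before handing it to the tokenizer.
def preprocess(text)
  text
    .unicode_normalize(:nfd)   # decompose accented characters (e.g. "è" becomes "e" + accent)
    .gsub(/\p{Mn}/, '')        # drop the combining accent marks
    .downcase                  # convert to lowercase
    .gsub(/[^a-z0-9\s]/, ' ')  # replace remaining non-word characters with spaces
    .squeeze(' ')              # collapse runs of spaces
    .strip
end

puts preprocess("Crème Brûlée!")  # prints "creme brulee"
```

Stripping combining marks after NFD normalization only approximates transliteration; characters without a decomposition (such as "ø") would be replaced by a space here.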

This class then splits the string into tokens by whitespace, and removes tokens not passing the selection criteria.
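A minimal sketch of the split-and-filter step, assuming the selection criteria are the minimum length and stop word list that the constructor accepts. The function select_tokens is a hypothetical illustration, not the gem's own code.

```ruby
# Hypothetical illustration of splitting on whitespace and filtering;
# assumes the criteria are minimum length and a stop word list.
def select_tokens(s, min_length: 2, stop_words: [])
  s.split.reject do |token|
    token.length < min_length || stop_words.include?(token)
  end
end

p select_tokens("the cat sat on a mat", stop_words: %w[the on])
# prints ["cat", "sat", "mat"]
```

Here "a" is dropped for being shorter than two characters, while "the" and "on" are dropped as stop words.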

Constant Summary

MIN_LENGTH = 2

    default minimum token length

STOP_WORDS = []

    no default stop words to ignore

Constructor Details

#initialize(tokens = nil, min_length: MIN_LENGTH, stop_words: STOP_WORDS) ⇒ Tokenizer

Creates a new tokenizer.

Parameters:

  • tokens (Tokens) (defaults to: nil)

    object to use for obtaining token numbers

  • min_length (Fixnum) (defaults to: MIN_LENGTH)

    minimum length for tokens

  • stop_words (Array<String>) (defaults to: STOP_WORDS)

    stop words to ignore



# File 'lib/tokkens/tokenizer.rb', line 35

def initialize(tokens = nil, min_length: MIN_LENGTH, stop_words: STOP_WORDS)
  @tokens = tokens || Tokens.new
  @stop_words = stop_words
  @min_length = min_length
end

Instance Attribute Details

#min_length ⇒ Object (readonly)

Returns the minimum length for tokens.



# File 'lib/tokkens/tokenizer.rb', line 28

attr_reader :tokens, :stop_words, :min_length

#stop_words ⇒ Array<String> (readonly)

Returns stop words to ignore.

Returns:

  • (Array<String>)

    stop words to ignore



# File 'lib/tokkens/tokenizer.rb', line 28

attr_reader :tokens, :stop_words, :min_length

#tokens ⇒ Tokens (readonly)

Returns object to use for obtaining tokens.

Returns:

  • (Tokens)

    object to use for obtaining tokens



# File 'lib/tokkens/tokenizer.rb', line 28

def tokens
  @tokens
end

Instance Method Details

#get(s, **kwargs) ⇒ Array<Fixnum>

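Tokenizes the string s and returns the array of token numbers, passing any keyword arguments on to the tokens object. Returns an empty array when s is nil or blank.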

Returns:

  • (Array<Fixnum>)

    array of token numbers



# File 'lib/tokkens/tokenizer.rb', line 42

def get(s, **kwargs)
  return [] if !s || s.strip == ''
  tokenize(s).map {|token| @tokens.get(token, **kwargs) }.compact
end
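To show how #get fits together with the tokens object, here is a self-contained sketch. SketchTokens and SketchTokenizer are hypothetical stand-ins, not the gem's classes: the numbering scheme, the freeze! method, and the body of tokenize are all assumptions made for illustration. The frozen state is one plausible reason a lookup might return nil, which is what the compact call in #get would then discard.

```ruby
# Hypothetical stand-in for the tokens object: numbers each new token
# starting at 1, and returns nil for unknown tokens once frozen.
class SketchTokens
  def initialize
    @numbers = {}
    @frozen = false
  end

  # Stop assigning numbers to new tokens (assumed behaviour).
  def freeze!
    @frozen = true
  end

  def get(token, **_opts)
    return @numbers[token] if @numbers.key?(token)
    return nil if @frozen
    @numbers[token] = @numbers.size + 1
  end
end

# Hypothetical stand-in mirroring the documented constructor and #get.
class SketchTokenizer
  attr_reader :tokens, :stop_words, :min_length

  def initialize(tokens = nil, min_length: 2, stop_words: [])
    @tokens = tokens || SketchTokens.new
    @stop_words = stop_words
    @min_length = min_length
  end

  def get(s, **kwargs)
    return [] if s.nil? || s.strip.empty?
    tokenize(s).map { |token| @tokens.get(token, **kwargs) }.compact
  end

  private

  # Assumed selection criteria: minimum length and stop words.
  def tokenize(s)
    s.split.select { |t| t.length >= @min_length && !@stop_words.include?(t) }
  end
end

tokenizer = SketchTokenizer.new(stop_words: %w[the])
p tokenizer.get("the quick fox")    # prints [1, 2]
p tokenizer.get("quick brown fox")  # prints [1, 3, 2]
tokenizer.tokens.freeze!
p tokenizer.get("lazy fox")         # prints [2]
```

In the last call, "lazy" is unknown to the frozen stand-in, so its nil result is removed and only the number for "fox" remains.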