Class: Tokenizer::WhitespaceTokenizer

Inherits: Object
Defined in: lib/tokenizer/tokenizer.rb

Overview

A simple whitespace-based tokenizer with configurable punctuation detection.
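For orientation, a minimal usage sketch (illustrative, not part of the generated docs; assumes the tokenizer gem is installed):

require 'tokenizer'

# Default language is :de; splittable punctuation becomes separate tokens.
de_tokenizer = Tokenizer::WhitespaceTokenizer.new
de_tokenizer.tokenize('Eine Welt, ein Netz.')
#=> ["Eine", "Welt", ",", "ein", "Netz", "."]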

Direct Known Subclasses

Tokenizer

Constant Summary

FS = Regexp.new('[[:blank:]]+')
    Default whitespace separator (a run of POSIX blanks: spaces and tabs).

SIMPLE_PRE = ['¿', '¡']
    Characters only in the role of splittable prefixes.

SIMPLE_POST = ['!', '?', ',', ':', ';', '.']
    Characters only in the role of splittable suffixes.

PAIR_PRE = ['(', '{', '[', '<', '«', '„']
    Characters as splittable prefixes with an optional matching suffix.

PAIR_POST = [')', '}', ']', '>', '»', '“']
    Characters as splittable suffixes with an optional matching prefix.

PRE_N_POST = ['"', "'"]
    Characters which can act as both prefixes AND suffixes.

Instance Method Summary

Constructor Details

#initialize(lang = :de, options = {}) ⇒ WhitespaceTokenizer

Returns a new instance of WhitespaceTokenizer.

Parameters:

  • lang (Symbol) (defaults to: :de)

    Language identifier.

  • options (Hash) (defaults to: {})

    Additional options.

Options Hash (options):

  • :pre (Array)

    Array of splittable prefix characters.

  • :post (Array)

    Array of splittable suffix characters.

  • :pre_n_post (Array)

    Array of characters with suffix AND prefix functions.



# File 'lib/tokenizer/tokenizer.rb', line 35

def initialize(lang = :de, options = {})
  @lang = lang
  @options = {
    pre: SIMPLE_PRE + PAIR_PRE,
    post: SIMPLE_POST + PAIR_POST,
    pre_n_post: PRE_N_POST
  }.merge(options)
end
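A hedged construction sketch with illustrative option values. Note that in the #tokenize listing below the splittable-character pattern is built from the class constants, so in the version shown the merged options are stored on the instance:

require 'tokenizer'

# English tokenizer with an extended (illustrative) suffix list.
en_tokenizer = Tokenizer::WhitespaceTokenizer.new(
  :en,
  post: Tokenizer::WhitespaceTokenizer::SIMPLE_POST + ['%']
)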

Instance Method Details

#sanitize_input(str) ⇒ String (private)

Returns a new string with the trailing newline and surrounding whitespace removed.

Parameters:

  • str (String)

    User defined string to be tokenized.

Returns:

  • (String)

    A new string with the trailing newline and surrounding whitespace removed.



# File 'lib/tokenizer/tokenizer.rb', line 69

def sanitize_input(str)
  str.chomp.strip
end
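The effect of the two chained calls, sketched on a sample input (the method is private, hence the send):

t = Tokenizer::WhitespaceTokenizer.new

# chomp removes a trailing newline, strip the surrounding blanks.
t.send(:sanitize_input, "  Hallo Welt! \n")
#=> "Hallo Welt!"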

#tokenize(str) ⇒ Array<String> Also known as: process

Returns an array of tokens.

Parameters:

  • str (String)

    String to be tokenized.

Returns:

  • (Array<String>)

    Array of tokens.



# File 'lib/tokenizer/tokenizer.rb', line 46

def tokenize(str)
  tokens = sanitize_input(str).split(FS)
  return [''] if tokens.empty?

  splittables = SIMPLE_PRE + SIMPLE_POST + PAIR_PRE + PAIR_POST + PRE_N_POST
  pattern = Regexp.new("[^#{Regexp.escape(splittables.join)}]+")
  output = []
  tokens.each do |token|
    prefix, stem, suffix = token.partition(pattern)
    output << prefix.split('') unless prefix.empty?
    output << stem unless stem.empty?
    output << suffix.split('') unless suffix.empty?
  end

  output.flatten
end
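A hedged walk-through of the listing above on one input (expected output reconstructed from the shown code, exercising both pair and simple splittables):

t = Tokenizer::WhitespaceTokenizer.new(:en)

# Step 1: split on FS -> ["«Wow!»,", "she", "said."]
# Step 2: partition each token at its first run of non-splittable
# characters; prefix and suffix are then split into single characters.
t.tokenize('«Wow!», she said.')
#=> ["«", "Wow", "!", "»", ",", "she", "said", "."]

# The #process alias behaves identically.
t.process('Gut.')  #=> ["Gut", "."]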