Module: Greeb::Tokenizer

Extended by:
Tokenizer
Included in:
Tokenizer
Defined in:
lib/greeb/tokenizer.rb

Overview

Greeb's tokenization facilities. Use 'em with love.

The Unicode character categories have been obtained from <www.fileformat.info/info/unicode/category/index.htm>.

Constant Summary

LETTERS =

Letters. The pattern uses `\p{L}`, so it matches letters of any script, including English and Russian.

/[\p{L}]+/u
FLOATS =

Floating point values.

/(\d+)[.,](\d+)/u
INTEGERS =

Integer values.

/\d+/u
SENTENCE_PUNCTUATIONS =

In-sentence punctuation characters (e.g. “,” or “-”).

/(\,|\-|:|;|\p{Ps}|\p{Pi}|\p{Pf}|\p{Pe})+/u
PUNCTUATIONS =

Sentence-ending punctuation characters (e.g. “.” or “!”).

/[(\.|\!|\?)]+/u
SEPARATORS =

In-subsentence separators (e.g. “*” or “=”).

/[\p{Nl}\p{No}\p{Pd}\p{Pc}\p{Po}\p{Sm}\p{So}\p{Sc}\p{Zl}\p{Zp}]+/u
SPACES =

Spaces (e.g. “ ” or a non-breaking space).

/[\p{Zs}\t]+/u
BREAKS =

Line breaks.

/(\r\n|\n|\r)+/u
RESIDUALS =

Residuals.

/([\p{C}\p{M}\p{Sk}]|[\p{Nd}&&[^\d]])+/u
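
To get a feel for how these patterns behave, the sketch below copies a few of the constants above into a standalone script and classifies short strings. The `PATTERNS` table and the `classify` helper are hypothetical names written for illustration; they are not part of Greeb.

```ruby
# Copies of a few constants from the listing above, so the script runs
# without Greeb installed.
LETTERS  = /[\p{L}]+/u
FLOATS   = /(\d+)[.,](\d+)/u
INTEGERS = /\d+/u
SPACES   = /[\p{Zs}\t]+/u

# Hypothetical lookup table, in the order the patterns are tried.
PATTERNS = { letter: LETTERS, float: FLOATS, integer: INTEGERS, space: SPACES }

# Hypothetical helper: return the name of the first pattern that matches
# the whole string, or nil when none does.
def classify(string)
  PATTERNS.each do |name, pattern|
    return name if string.match?(/\A(?:#{pattern})\z/u)
  end
  nil
end

p classify("привет")  # => :letter
p classify("3,14")    # => :float
p classify("42")      # => :integer
p classify("\u00A0")  # => :space (non-breaking space is in \p{Zs})
```

Note that `FLOATS` accepts both “.” and “,” as the decimal separator, and that the order of the table matters: `INTEGERS` would also match the leading digits of a float if it were tried first.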

Instance Method Summary

Instance Method Details

#split(token) ⇒ Array<String>

Split a string into an array of characters, merging each run of consecutive identical characters into a single element.

For instance, `"a bnnnc"` is transformed into the following array: `["a", " ", "b", "nnn", "c"]`.

Parameters:

  • token (String)

    the token to be split.

Returns:

  • (Array<String>)

    the resulting runs of characters.


# File 'lib/greeb/tokenizer.rb', line 81

def split(token)
  token.scan(/((.|\n)\2*)/).map!(&:first)
end
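
As a quick standalone check of the expression above, the sketch below repeats the method body in a top-level method and prints the runs it produces. The name `split_runs` is an assumption chosen for this example only.

```ruby
# Same expression as the #split listing above, reproduced as a top-level
# method so it can be tried without loading Greeb.
def split_runs(token)
  # Group 2 captures one character; \2* then consumes its immediate repeats.
  # String#scan returns [group1, group2] pairs, so keep only group 1,
  # which holds the whole run.
  token.scan(/((.|\n)\2*)/).map!(&:first)
end

p split_runs("a bnnnc")  # => ["a", " ", "b", "nnn", "c"]
p split_runs("!!?")      # => ["!!", "?"]
```

The explicit `\n` alternative is needed because `.` does not match a newline unless the regular expression is compiled with the `m` flag.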

#tokenize(text) ⇒ Array<Greeb::Span>

Perform the tokenization process.

Parameters:

  • text (String)

    a text to be tokenized.

Returns:

  • (Array<Greeb::Span>)

    the spans found in the text.


# File 'lib/greeb/tokenizer.rb', line 51

def tokenize text
  scanner = Greeb::StringScanner.new(text)
  tokens = []
  while !scanner.eos?
    parse! scanner, tokens, LETTERS, :letter or
    parse! scanner, tokens, FLOATS, :float or
    parse! scanner, tokens, INTEGERS, :integer or
    split_parse! scanner, tokens, SENTENCE_PUNCTUATIONS, :spunct or
    split_parse! scanner, tokens, PUNCTUATIONS, :punct or
    split_parse! scanner, tokens, SEPARATORS, :separ or
    split_parse! scanner, tokens, SPACES, :space or
    split_parse! scanner, tokens, BREAKS, :break or
    parse! scanner, tokens, RESIDUALS, :residual or
    raise Greeb::UnknownSpan.new(text, scanner.char_pos)
  end
  tokens
ensure
  scanner.terminate
end
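
The loop above tries each pattern in priority order at the current scanner position and raises when none of them matches. The helpers `parse!` and `split_parse!` are not shown on this page, so the sketch below does not try to reproduce them: it is a minimal approximation of the same dispatch idea using the standard library's `StringScanner`, with plain arrays in place of `Greeb::Span`. The names `try_pattern` and `simple_tokenize`, the reduced pattern set, and the `[type, text, from, to]` layout are all assumptions made for illustration.

```ruby
require "strscan"

# Simplified stand-ins for three of the patterns listed on this page.
LETTERS  = /\p{L}+/u
INTEGERS = /\d+/u
SPACES   = /[\p{Zs}\t]+/u

# Hypothetical helper: try one pattern at the current position and, on a
# match, record [type, text, from, to] with character offsets. Returns nil
# when nothing matched, which lets the caller chain attempts with `or`.
def try_pattern(scanner, tokens, pattern, type)
  from = scanner.charpos
  text = scanner.scan(pattern) or return nil
  tokens << [type, text, from, scanner.charpos]
end

# Hypothetical tokenizer: the first pattern that matches wins; input that
# no pattern recognizes is reported as an error instead of looping forever.
def simple_tokenize(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    try_pattern(scanner, tokens, LETTERS, :letter) or
      try_pattern(scanner, tokens, INTEGERS, :integer) or
      try_pattern(scanner, tokens, SPACES, :space) or
      raise ArgumentError, "unknown span at #{scanner.charpos}"
  end
  tokens
end

p simple_tokenize("abc 42")
# => [[:letter, "abc", 0, 3], [:space, " ", 3, 4], [:integer, "42", 4, 6]]
```

Chaining with `or` works because `StringScanner#scan` only matches at the current position and returns `nil` without advancing when the pattern does not apply, so a failed attempt leaves the scanner untouched for the next one.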