Class: Boilerpipe::UnicodeTokenizer

Inherits:
Object
Defined in:
lib/boilerpipe/util/unicode_tokenizer.rb

Constant Summary

INVISIBLE_SEPARATOR =
"\u2063"
WORD_BOUNDARY =
Regexp.new('\b')
NOT_WORD_BOUNDARY =
Regexp.new("[\u2063]*([\\\"'\\.,\\!\\@\\-\\:\\;\\$\\?\\(\\)\/])[\u2063]*")

Class Method Summary

Class Method Details

.tokenize(text) ⇒ Object

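Splits text into tokens in five steps:

  1. Replace every word boundary with the invisible separator (U+2063).
  2. Strip the invisible separators adjacent to the punctuation matched by NOT_WORD_BOUNDARY, so that punctuation stays attached to the neighbouring word.
  3. Replace each run of spaces or invisible separators with a single space.
  4. Trim leading and trailing whitespace.
  5. Split the result into words on single spaces.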



# File 'lib/boilerpipe/util/unicode_tokenizer.rb', line 13

def self.tokenize(text)
  text.gsub(WORD_BOUNDARY, INVISIBLE_SEPARATOR)
    .gsub(NOT_WORD_BOUNDARY, '\1')
    .gsub(/[ \u2063]+/, ' ')
    .strip
    .split(/[ ]+/)
end
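
The pipeline is easier to follow with the intermediate strings visible. The following is a minimal standalone sketch, not part of Boilerpipe: the module name `TokenizerSketch` and the method `trace` are hypothetical, and the constants are copied from the listing above so the example runs without the gem installed. It applies the same chain of calls as `.tokenize` but returns the result of each stage.

```ruby
# Hypothetical module for illustration only; it is not part of Boilerpipe.
module TokenizerSketch
  # Constants copied from Boilerpipe::UnicodeTokenizer.
  INVISIBLE_SEPARATOR = "\u2063"
  WORD_BOUNDARY = Regexp.new('\b')
  NOT_WORD_BOUNDARY =
    Regexp.new("[\u2063]*([\\\"'\\.,\\!\\@\\-\\:\\;\\$\\?\\(\\)\/])[\u2063]*")

  # Same steps as .tokenize, but keeps every intermediate string.
  def self.trace(text)
    marked  = text.gsub(WORD_BOUNDARY, INVISIBLE_SEPARATOR) # mark boundaries
    cleaned = marked.gsub(NOT_WORD_BOUNDARY, '\1')          # re-attach punctuation
    spaced  = cleaned.gsub(/[ \u2063]+/, ' ')               # collapse to one space
    tokens  = spaced.strip.split(/[ ]+/)                    # trim and split
    { marked: marked, cleaned: cleaned, spaced: spaced, tokens: tokens }
  end
end

result = TokenizerSketch.trace('Hello, world!')

# Render the invisible separator as '|' so it shows up when printed.
puts result[:marked].gsub(TokenizerSketch::INVISIBLE_SEPARATOR, '|')
# => |Hello|, |world|!

puts result[:tokens].inspect
# => ["Hello,", "world!"]

# '-' is listed in NOT_WORD_BOUNDARY, so the hyphenated word stays whole;
# '+' is not listed, so it becomes a token of its own.
puts TokenizerSketch.trace('foo-bar')[:tokens].inspect # => ["foo-bar"]
puts TokenizerSketch.trace('a+b')[:tokens].inspect     # => ["a", "+", "b"]
```

Note that the last two steps only treat the space character and U+2063 as delimiters, so in this sketch tabs and newlines are not split points unless they happen to sit next to a word boundary.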