Class: Tokenizer::WhitespaceTokenizer
Inherits: Object
Defined in: lib/tokenizer/tokenizer.rb
Overview
Simple whitespace-based tokenizer with configurable punctuation detection.
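A minimal usage sketch (the sample sentence and variable name are illustrative, not taken from the gem's own documentation):

  require 'tokenizer'

  de_tokenizer = Tokenizer::WhitespaceTokenizer.new
  de_tokenizer.tokenize('Eine Frage!')
  # => ["Eine", "Frage", "!"]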
Constant Summary

- FS = Regexp.new('[[:blank:]]+')
  Default whitespace separator.
- SIMPLE_PRE = ['¿', '¡']
  Characters only in the role of splittable prefixes.
- SIMPLE_POST = ['!', '?', ',', ':', ';', '.']
  Characters only in the role of splittable suffixes.
- PAIR_PRE = ['(', '{', '[', '<', '«', '„']
  Characters as splittable prefixes with an optional matching suffix.
- PAIR_POST = [')', '}', ']', '>', '»', '“']
  Characters as splittable suffixes with an optional matching prefix.
- PRE_N_POST = ['"', "'"]
  Characters which can be both prefixes AND suffixes.
Instance Method Summary

- #initialize(lang = :de, options = {}) ⇒ WhitespaceTokenizer (constructor)
  A new instance of WhitespaceTokenizer.
- #sanitize_input(str) ⇒ String (private)
  A new modified string.
- #tokenize(str) ⇒ Array<String> (also: #process)
  Array of tokens.
Constructor Details
#initialize(lang = :de, options = {}) ⇒ WhitespaceTokenizer
Returns a new instance of WhitespaceTokenizer.
# File 'lib/tokenizer/tokenizer.rb', line 35

def initialize(lang = :de, options = {})
  @lang = lang
  @options = {
    pre: SIMPLE_PRE + PAIR_PRE,
    post: SIMPLE_POST + PAIR_POST,
    pre_n_post: PRE_N_POST
  }.merge(options)
end
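A hedged construction sketch; the :en symbol and the reduced suffix list are illustrative values, not defaults of the gem:

  require 'tokenizer'

  # Default German tokenizer with the punctuation sets built from the constants.
  t = Tokenizer::WhitespaceTokenizer.new

  # Options passed in are merged over those defaults.
  custom = Tokenizer::WhitespaceTokenizer.new(:en, post: ['!', '?'])

Note that the #tokenize source shown below builds its split pattern from the constants, so in this snippet the merged options are stored but do not alter the splitting behaviour.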
Instance Method Details
#sanitize_input(str) ⇒ String (private)
Returns a new modified string.
# File 'lib/tokenizer/tokenizer.rb', line 69

def sanitize_input(str)
  str.chomp.strip
end
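Since the method is private, the sketch below calls it via Object#send purely for illustration; the input string is made up:

  require 'tokenizer'

  t = Tokenizer::WhitespaceTokenizer.new
  t.send(:sanitize_input, "  Ein Satz.\n")
  # => "Ein Satz."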
#tokenize(str) ⇒ Array<String> Also known as: process
Returns an array of tokens.
# File 'lib/tokenizer/tokenizer.rb', line 46

def tokenize(str)
  tokens = sanitize_input(str).split(FS)
  return [''] if tokens.empty?

  splittables = SIMPLE_PRE + SIMPLE_POST + PAIR_PRE + PAIR_POST + PRE_N_POST
  pattern = Regexp.new("[^#{Regexp.escape(splittables.join)}]+")
  output = []
  tokens.each do |token|
    prefix, stem, suffix = token.partition(pattern)
    output << prefix.split('') unless prefix.empty?
    output << stem unless stem.empty?
    output << suffix.split('') unless suffix.empty?
  end

  output.flatten
end
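A short sketch of the splitting behaviour and the #process alias (sample inputs are invented):

  require 'tokenizer'

  t = Tokenizer::WhitespaceTokenizer.new
  t.tokenize('(Hallo, Welt!)')
  # => ["(", "Hallo", ",", "Welt", "!", ")"]

  # #process is an alias for #tokenize; an all-whitespace input yields [''].
  t.process('   ')
  # => [""]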