Class: TactfulTokenizer::Frag
- Inherits:
-
Object
- Object
- TactfulTokenizer::Frag
- Defined in:
- lib/tactful_tokenizer.rb
Overview
A fragment is a candidate sentence whose boundary is determined only by the presence of a period. For example, the text “Here in the U.S. Senate we prefer to devour our friends.” will be split into the two fragments “Here in the U.S.” and “Senate we prefer to devour our friends.”, even though it is a single sentence.
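As a rough illustration of period-based splitting, the sketch below reproduces the example from the overview. The helper `naive_fragments` is hypothetical and is not part of TactfulTokenizer; the library's own splitting logic is not shown on this page and may differ.

```ruby
# Hypothetical helper (not part of TactfulTokenizer): split the text at
# whitespace that directly follows a period.
def naive_fragments(text)
  text.split(/(?<=\.)\s+/)
end

fragments = naive_fragments('Here in the U.S. Senate we prefer to devour our friends.')
# fragments => ["Here in the U.S.", "Senate we prefer to devour our friends."]
```

The period inside "U.S." is not followed by whitespace until the final "S.", so the only split point is between "U.S." and "Senate".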
Instance Attribute Summary
-
#cleaned ⇒ Object
cleaned = Array of the fragment’s words after cleaning.
-
#features ⇒ Object
features = Array of the fragment’s features.
-
#next ⇒ Object
next = The next word following the fragment.
-
#orig ⇒ Object
orig = The original text of the fragment.
-
#pred ⇒ Object
pred = Probability that the fragment is a sentence.
Instance Method Summary
-
#clean(s) ⇒ Object
Normalizes numbers and discards ambiguous punctuation.
-
#initialize(orig = '') ⇒ Frag
constructor
Create a new fragment.
- #push_w1_features(w1, model) ⇒ Object
- #push_w2_features(w2, model) ⇒ Object
Constructor Details
#initialize(orig = '') ⇒ Frag
Create a new fragment.
# File 'lib/tactful_tokenizer.rb', line 181
def initialize(orig='')
  @orig = orig
  clean(orig)
  @next, @pred, @features = nil, nil, nil
end
Instance Attribute Details
#cleaned ⇒ Object
cleaned = Array of the fragment’s words after cleaning.
# File 'lib/tactful_tokenizer.rb', line 178
def cleaned
  @cleaned
end
#features ⇒ Object
features = Array of the fragment’s features.
# File 'lib/tactful_tokenizer.rb', line 178
def features
  @features
end
#next ⇒ Object
next = The next word following the fragment.
# File 'lib/tactful_tokenizer.rb', line 178
def next
  @next
end
#orig ⇒ Object
orig = The original text of the fragment.
# File 'lib/tactful_tokenizer.rb', line 178
def orig
  @orig
end
#pred ⇒ Object
pred = Probability that the fragment is a sentence.
# File 'lib/tactful_tokenizer.rb', line 178
def pred
  @pred
end
Instance Method Details
#clean(s) ⇒ Object
Normalizes numbers and discards ambiguous punctuation, then splits the result into an array of words, since in practice only the first and last words are ever accessed.
# File 'lib/tactful_tokenizer.rb', line 189
def clean(s)
  @cleaned = String.new(s)
  tokenize(@cleaned)
  @cleaned.gsub!(/[.,\d]*\d/, '<NUM>')
  @cleaned.gsub!(/[^[[:upper:][:lower:]]\d[:space:],!?.;:<>\-'\/$% ]/u, '')
  @cleaned.gsub!('--', ' ')
  @cleaned = @cleaned.split
end
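To make the effect of the substitutions concrete, here is a standalone approximation of the cleaning steps. The name `clean_sketch` is hypothetical, and the `tokenize` call is left out because its source does not appear on this page, so real output from `Frag#clean` may differ.

```ruby
# Hypothetical standalone approximation of Frag#clean.
# Assumption: the tokenize step is skipped, since it is not shown here.
def clean_sketch(text)
  cleaned = String.new(text)
  # Replace every run of digits (with embedded '.' or ',') by a <NUM> marker.
  cleaned.gsub!(/[.,\d]*\d/, '<NUM>')
  # Drop any character outside the allowed set (letters, digits, whitespace,
  # and a short list of punctuation marks).
  cleaned.gsub!(/[^[[:upper:][:lower:]]\d[:space:],!?.;:<>\-'\/$% ]/u, '')
  # Treat a double hyphen as a word separator.
  cleaned.gsub!('--', ' ')
  cleaned.split
end

words = clean_sketch('It cost 1,234.50 (approx.) -- really')
# words => ["It", "cost", "<NUM>", "approx.", "really"]
```

The parentheses are removed because they are not in the allowed character set, while the angle brackets of `<NUM>` survive because `<` and `>` are.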
#push_w1_features(w1, model) ⇒ Object
# File 'lib/tactful_tokenizer.rb', line 198
def push_w1_features w1, model
  if w1.chop.is_alphabetic?
    features.push "w1length_#{[10, w1.length].min}", "w1abbr_#{model.non_abbrs[w1.chop]}"
  end
end
#push_w2_features(w2, model) ⇒ Object
# File 'lib/tactful_tokenizer.rb', line 204
def push_w2_features w2, model
  if w2.chop.is_alphabetic?
    features.push "w2cap_#{w2[0,1].is_upper_case?}", "w2lower_#{model.lower_words[w2.downcase]}"
  end
end
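The feature strings built by `push_w1_features` and `push_w2_features` can be sketched outside the class as follows. Several things here are assumptions: `SketchModel` is a hypothetical stand-in for the real model, the counts are made up, `alphabetic?` and `upper_case?` only guess at what the library's `is_alphabetic?` and `is_upper_case?` do, and w1 and w2 are taken to be the fragment's last word and the word that follows it.

```ruby
# Hypothetical stand-in for the trained model; the real class is not shown here.
SketchModel = Struct.new(:non_abbrs, :lower_words)

# Assumed meaning of String#is_alphabetic?: the string contains only letters.
def alphabetic?(str)
  str.match?(/\A[[:alpha:]]+\z/)
end

# Assumed meaning of String#is_upper_case?: only upper-case letters.
def upper_case?(str)
  str.match?(/\A[[:upper:]]+\z/)
end

# Features for w1: capped length, plus a model lookup on w1 without its last character.
def w1_features(w1, model)
  stem = w1.chop
  return [] unless alphabetic?(stem)

  ["w1length_#{[10, w1.length].min}", "w1abbr_#{model.non_abbrs[stem]}"]
end

# Features for w2: whether it starts with a capital, plus a lookup of its lower-case form.
def w2_features(w2, model)
  return [] unless alphabetic?(w2.chop)

  ["w2cap_#{upper_case?(w2[0, 1])}", "w2lower_#{model.lower_words[w2.downcase]}"]
end

non_abbrs = Hash.new(0)
non_abbrs['Senate'] = 3    # made-up count
lower_words = Hash.new(0)
lower_words['senate'] = 2  # made-up count
model = SketchModel.new(non_abbrs, lower_words)

w1 = w1_features('Senate.', model)   # => ["w1length_7", "w1abbr_3"]
w2 = w2_features('Senate', model)    # => ["w2cap_true", "w2lower_2"]
skipped = w1_features('U.S.', model) # => [] because "U.S" contains a period
```

Unlike the original methods, which push onto the fragment's `features` array, these helpers return the strings so the result is easy to inspect.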