Class: TactfulTokenizer::Frag
- Inherits: Object
- Defined in: lib/tactful_tokenizer.rb
Overview
A fragment is a candidate sentence whose boundary is determined solely by the presence of a period. For example, the text “Here in the U.S. Senate we prefer to devour our friends.” is split into the two fragments “Here in the U.S.” and “Senate we prefer to devour our friends.”
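The period-based split described above can be illustrated with a minimal sketch. The helper `naive_fragments` is hypothetical and is not part of TactfulTokenizer; it assumes a fragment ends at any period followed by whitespace, which is enough to reproduce the example.

```ruby
# Hypothetical helper, not the library's own code: break the text after
# every period that is followed by whitespace.
def naive_fragments(text)
  # The lookbehind keeps the period attached to the preceding fragment.
  text.split(/(?<=\.)\s+/)
end

text = 'Here in the U.S. Senate we prefer to devour our friends.'
fragments = naive_fragments(text)
# fragments => ["Here in the U.S.", "Senate we prefer to devour our friends."]
```

The first fragment is not a real sentence, which is why each fragment carries a prediction (`pred`) of whether it actually ends one.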
Instance Attribute Summary
- #cleaned ⇒ Object
  Array of the fragment’s words after cleaning.
- #features ⇒ Object
  Array of the fragment’s features.
- #next ⇒ Object
  The next word following the fragment.
- #orig ⇒ Object
  The original text of the fragment.
- #pred ⇒ Object
  Probability that the fragment is a sentence.
Instance Method Summary
- #clean(s) ⇒ Object
  Normalizes numbers and discards ambiguous punctuation.
- #initialize(orig = '') ⇒ Frag (constructor)
  Create a new fragment.
- #push_w1_features(w1, model) ⇒ Object
- #push_w2_features(w2, model) ⇒ Object
Constructor Details
#initialize(orig = '') ⇒ Frag
Create a new fragment.
# File 'lib/tactful_tokenizer.rb', line 182
def initialize(orig='')
  @orig = orig
  clean(orig)
  @next, @pred, @features = nil, nil, nil
end
Instance Attribute Details
#cleaned ⇒ Object
orig = The original text of the fragment.
next = The next word following the fragment.
cleaned = Array of the fragment’s words after cleaning.
pred = Probability that the fragment is a sentence.
features = Array of the fragment’s features.
# File 'lib/tactful_tokenizer.rb', line 179
def cleaned
  @cleaned
end
#features ⇒ Object
features = Array of the fragment’s features.
# File 'lib/tactful_tokenizer.rb', line 179
def features
  @features
end
#next ⇒ Object
next = The next word following the fragment.
# File 'lib/tactful_tokenizer.rb', line 179
def next
  @next
end
#orig ⇒ Object
orig = The original text of the fragment.
# File 'lib/tactful_tokenizer.rb', line 179
def orig
  @orig
end
#pred ⇒ Object
pred = Probability that the fragment is a sentence.
# File 'lib/tactful_tokenizer.rb', line 179
def pred
  @pred
end
Instance Method Details
#clean(s) ⇒ Object
Normalizes numbers and discards ambiguous punctuation, then splits the result into an array of words. An array suffices because, in practice, only the first and last words are ever accessed.
# File 'lib/tactful_tokenizer.rb', line 190
def clean(s)
  @cleaned = String.new(s)
  tokenize(@cleaned)
  # Replace each run of digits, periods and commas that ends in a digit with <NUM>.
  @cleaned.gsub!(/[.,\d]*\d/, '<NUM>')
  # Delete every character outside the allowed letters, digits, whitespace and punctuation.
  @cleaned.gsub!(/[^[[:upper:][:lower:]]\d[:space:],!?.;:<>\-'\/$% ]/u, '')
  # Treat a double hyphen as a word separator.
  @cleaned.gsub!('--', ' ')
  @cleaned = @cleaned.split
end
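To see what the three substitutions do to a concrete string, here is a standalone sketch. `clean_sketch` is a hypothetical name; it copies the `gsub!` calls and the final `split` from the listing above but deliberately omits the `tokenize` step, which is defined elsewhere in the library, so its output may differ from what `Frag#clean` stores in `@cleaned`.

```ruby
# Hypothetical standalone version of the cleaning steps. The library's
# tokenize call is omitted here, so this is an approximation only.
def clean_sketch(s)
  cleaned = String.new(s)
  # "1,234.50" is a run of digits, commas and periods ending in a digit.
  cleaned.gsub!(/[.,\d]*\d/, '<NUM>')
  # Parentheses are not in the allowed set, so they are deleted.
  cleaned.gsub!(/[^[[:upper:][:lower:]]\d[:space:],!?.;:<>\-'\/$% ]/u, '')
  # The double hyphen becomes a space and therefore a word boundary.
  cleaned.gsub!('--', ' ')
  cleaned.split
end

words = clean_sketch('It cost $1,234.50 (roughly)--a bargain.')
# words => ["It", "cost", "$<NUM>", "roughly", "a", "bargain."]
```

The trailing period on the last word is kept, since a lone period is not followed by a digit and `.` is in the allowed set.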
#push_w1_features(w1, model) ⇒ Object
# File 'lib/tactful_tokenizer.rb', line 199
def push_w1_features w1, model
  if w1.chop.is_alphabetic?
    features.push "w1length_#{[10, w1.length].min}", "w1abbr_#{model.non_abbrs[w1.chop]}"
  end
end
#push_w2_features(w2, model) ⇒ Object
# File 'lib/tactful_tokenizer.rb', line 205
def push_w2_features w2, model
  if w2.chop.is_alphabetic?
    features.push "w2cap_#{w2[0,1].is_upper_case?}", "w2lower_#{model.lower_words[w2.downcase]}"
  end
end
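The feature strings built by the two methods above can be sketched without the library. Everything in this example is a stand-in: `SketchModel`, `alphabetic?`, `w1_features` and `w2_features` are hypothetical names, the counts are made up, and `alphabetic?` only assumes what the library's `String#is_alphabetic?` extension checks. Only the string templates are taken from the listings.

```ruby
# Hypothetical stand-in for the model: two hashes that default to 0.
SketchModel = Struct.new(:non_abbrs, :lower_words)

# Assumed behaviour of the library's String#is_alphabetic? extension.
def alphabetic?(s)
  s.match?(/\A[[:alpha:]]+\z/)
end

# Features for the last word of a fragment (w1), e.g. "Mr.".
def w1_features(w1, model)
  return [] unless alphabetic?(w1.chop)
  ["w1length_#{[10, w1.length].min}", "w1abbr_#{model.non_abbrs[w1.chop]}"]
end

# Features for the first word of the following fragment (w2).
def w2_features(w2, model)
  return [] unless alphabetic?(w2.chop)
  ["w2cap_#{w2[0, 1].match?(/[[:upper:]]/)}", "w2lower_#{model.lower_words[w2.downcase]}"]
end

model = SketchModel.new(Hash.new(0), Hash.new(0))
model.lower_words['smith'] = 2 # made-up count, for illustration only

features = w1_features('Mr.', model) + w2_features('Smith', model)
# features => ["w1length_3", "w1abbr_0", "w2cap_true", "w2lower_2"]
```

The `w1length` feature is capped at 10, so every longer word maps to the same feature value.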