Class: TactfulTokenizer::Frag

Inherits:
Object
Defined in:
lib/tactful_tokenizer.rb

Overview

A fragment is a candidate sentence whose boundary is determined only by the presence of a period. For example, the text “Here in the U.S. Senate we prefer to devour our friends.” is split into the fragments “Here in the U.S.” and “Senate we prefer to devour our friends.”
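
As a rough illustration of period-based fragmenting, the sketch below splits text after each period that is followed by whitespace. The `naive_fragments` helper is hypothetical and is not the library's own splitting code; it only reproduces the split described above.

```ruby
# Hypothetical helper: cut the text after every period that is
# followed by whitespace. Not the library's actual implementation.
def naive_fragments(text)
  text.split(/(?<=\.)\s+/)
end

text = "Here in the U.S. Senate we prefer to devour our friends."
fragments = naive_fragments(text)
fragments.each { |fragment| puts fragment }
# Here in the U.S.
# Senate we prefer to devour our friends.
```

The first fragment is not a complete sentence, which is why each fragment also carries a probability (pred) of being one.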

Constructor Details

#initialize(orig = '') ⇒ Frag

Creates a new fragment from the original text, cleans it, and initializes next, pred, and features to nil.



# File 'lib/tactful_tokenizer.rb', line 181

def initialize(orig='')
  @orig = orig
  clean(orig)
  @next, @pred, @features = nil, nil, nil
end

Instance Attribute Details

#cleaned ⇒ Object

Array of the fragment’s words after cleaning.



# File 'lib/tactful_tokenizer.rb', line 178

def cleaned
  @cleaned
end

#features ⇒ Object

Array of the fragment’s features.



# File 'lib/tactful_tokenizer.rb', line 178

def features
  @features
end

#next ⇒ Object

The next word following the fragment.



# File 'lib/tactful_tokenizer.rb', line 178

def next
  @next
end

#orig ⇒ Object

The original text of the fragment.



# File 'lib/tactful_tokenizer.rb', line 178

def orig
  @orig
end

#pred ⇒ Object

Probability that the fragment is a sentence.



# File 'lib/tactful_tokenizer.rb', line 178

def pred
  @pred
end

Instance Method Details

#clean(s) ⇒ Object

Normalizes numbers and discards ambiguous punctuation, then splits the result into an array of words, since in practice only the first and last words are ever accessed.



# File 'lib/tactful_tokenizer.rb', line 189

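def clean(s)
  # Work on a copy so the original string is not mutated.
  @cleaned = String.new(s)
  tokenize(@cleaned)
  # Replace every number, including embedded commas and periods, with <NUM>.
  @cleaned.gsub!(/[.,\d]*\d/, '<NUM>')
  # Drop every character other than letters, digits, whitespace, and the listed punctuation.
  @cleaned.gsub!(/[^[[:upper:][:lower:]]\d[:space:],!?.;:<>\-'\/$% ]/u, '')
  # Treat a double hyphen as a word separator.
  @cleaned.gsub!('--', ' ')
  # Store the result as an array of words.
  @cleaned = @cleaned.split
end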
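
To make the normalization steps easier to follow, here is a stand-alone sketch that applies the same three substitutions and the final split. The method name `normalize_words` is hypothetical, and the `tokenize` call is left out because its behavior is not shown in this section, so results may differ from what #clean produces.

```ruby
# Hypothetical stand-alone version of the substitutions in #clean.
# The tokenize step is intentionally omitted (an assumption of this sketch).
def normalize_words(s)
  cleaned = String.new(s)
  # Numbers such as "1,234.50" collapse to a single <NUM> token.
  cleaned.gsub!(/[.,\d]*\d/, '<NUM>')
  # Characters outside the allowed set, such as parentheses, are removed.
  cleaned.gsub!(/[^[[:upper:][:lower:]]\d[:space:],!?.;:<>\-'\/$% ]/u, '')
  # A double hyphen becomes a word separator.
  cleaned.gsub!('--', ' ')
  cleaned.split
end

words = normalize_words("It cost 1,234.50 dollars--roughly (I think)")
p words
# => ["It", "cost", "<NUM>", "dollars", "roughly", "I", "think"]
```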

#push_w1_features(w1, model) ⇒ Object



# File 'lib/tactful_tokenizer.rb', line 198

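def push_w1_features(w1, model)
  # Only words that are alphabetic once the final character is removed contribute features.
  if w1.chop.is_alphabetic?
    # Word length capped at 10, plus the model's non_abbrs entry for the chopped word.
    features.push "w1length_#{[10, w1.length].min}", "w1abbr_#{model.non_abbrs[w1.chop]}"
  end
end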
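
The feature strings this method pushes can be sketched without the library. In the example below, `alphabetic?` is a hypothetical stand-in for `String#is_alphabetic?` (the library's helper may be more permissive), a plain Hash with made-up counts stands in for `model.non_abbrs`, and `w1_features` returns the strings instead of pushing them onto features.

```ruby
# Hypothetical stand-in for the library's String#is_alphabetic? helper.
def alphabetic?(str)
  str.match?(/\A[[:alpha:]]+\z/)
end

# Builds the same two feature strings as #push_w1_features,
# using a plain Hash in place of model.non_abbrs (an assumption).
def w1_features(w1, non_abbrs)
  stem = w1.chop # drop the final character, normally the period
  return [] unless alphabetic?(stem)
  ["w1length_#{[10, w1.length].min}", "w1abbr_#{non_abbrs[stem]}"]
end

non_abbrs = Hash.new(0) # made-up counts for illustration
non_abbrs['friends'] = 3

p w1_features('friends.', non_abbrs)              # => ["w1length_8", "w1abbr_3"]
p w1_features('internationalization.', non_abbrs) # => ["w1length_10", "w1abbr_0"]
p w1_features('3.', non_abbrs)                    # => []
```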

#push_w2_features(w2, model) ⇒ Object



# File 'lib/tactful_tokenizer.rb', line 204

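def push_w2_features(w2, model)
  # Only words that are alphabetic once the final character is removed contribute features.
  if w2.chop.is_alphabetic?
    # Whether the first character is upper case, plus the model's lower_words entry for the downcased word.
    features.push "w2cap_#{w2[0,1].is_upper_case?}", "w2lower_#{model.lower_words[w2.downcase]}"
  end
end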
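
A comparable sketch for the second word follows. Here the regular expressions are assumed approximations of `is_alphabetic?` and `is_upper_case?`, a plain Hash with made-up counts replaces `model.lower_words`, and the hypothetical `w2_features` returns the strings rather than pushing them.

```ruby
# Builds the same two feature strings as #push_w2_features.
# The regexps approximate the library's String helpers (an assumption).
def w2_features(w2, lower_words)
  return [] unless w2.chop.match?(/\A[[:alpha:]]+\z/)
  capitalized = w2[0, 1].match?(/[[:upper:]]/)
  ["w2cap_#{capitalized}", "w2lower_#{lower_words[w2.downcase]}"]
end

lower_words = Hash.new(0) # made-up counts for illustration
lower_words['senate'] = 5

p w2_features('Senate', lower_words) # => ["w2cap_true", "w2lower_5"]
p w2_features('we', lower_words)     # => ["w2cap_false", "w2lower_0"]
```

Note that the lookup key is the whole downcased word, while the alphabetic check runs on the word with its final character removed.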