Class: TactfulTokenizer::Frag

Inherits: Object

Defined in: lib/tactful_tokenizer.rb

Overview

A fragment is a potential sentence, but is based only on the existence of a period. The text “Here in the U.S. Senate we prefer to devour our friends.” will be split into “Here in the U.S.” and “Senate we prefer to devour our friends.”
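
The period-based split described above can be sketched in a few lines. This is only an illustration of the idea, not the library's own splitting code: `naive_fragments` is a hypothetical helper, and it assumes that a fragment ends at any whitespace-separated word whose last character is a period.

```ruby
# Hypothetical helper (not part of TactfulTokenizer): cut the text into
# candidate fragments, closing a fragment at every word that ends in a period.
def naive_fragments(text)
  fragments = []
  current = []
  text.split.each do |word|
    current << word
    next unless word.end_with?('.')

    fragments << current.join(' ')
    current = []
  end
  # Keep any trailing words that were not closed by a period.
  fragments << current.join(' ') unless current.empty?
  fragments
end

frags = naive_fragments('Here in the U.S. Senate we prefer to devour our friends.')
p frags
# prints ["Here in the U.S.", "Senate we prefer to devour our friends."]
```

The first fragment is not a real sentence, which is why each fragment is only a candidate until it has been scored.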

Instance Attribute Summary

#cleaned ⇒ Object

Array of the fragment’s words after cleaning.
#features ⇒ Object

Array of the fragment’s features.
#next ⇒ Object

The next word following the fragment.
#orig ⇒ Object

The original text of the fragment.
#pred ⇒ Object

Probability that the fragment is a sentence.

Instance Method Summary

#clean(s) ⇒ Object

Normalizes numbers and discards ambiguous punctuation.
#initialize(orig = '') ⇒ Frag constructor

Create a new fragment.
#push_w1_features(w1, model) ⇒ Object
#push_w2_features(w2, model) ⇒ Object

Constructor Details

#initialize(orig = '') ⇒ `Frag`

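Creates a new fragment: stores the original text, computes its cleaned word array, and initializes next, pred, and features to nil.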

# File 'lib/tactful_tokenizer.rb', line 182

def initialize(orig='')
  @orig = orig
  clean(orig)
  @next, @pred, @features = nil, nil, nil
end

Instance Attribute Details

#cleaned ⇒ `Object`

orig = The original text of the fragment.
next = The next word following the fragment.
cleaned = Array of the fragment’s words after cleaning.
pred = Probability that the fragment is a sentence.
features = Array of the fragment’s features.




# File 'lib/tactful_tokenizer.rb', line 179

def cleaned
  @cleaned
end

#features ⇒ `Object`




# File 'lib/tactful_tokenizer.rb', line 179

def features
  @features
end

#next ⇒ `Object`




# File 'lib/tactful_tokenizer.rb', line 179

def next
  @next
end

#orig ⇒ `Object`




# File 'lib/tactful_tokenizer.rb', line 179

def orig
  @orig
end

#pred ⇒ `Object`




# File 'lib/tactful_tokenizer.rb', line 179

def pred
  @pred
end

Instance Method Details

#clean(s) ⇒ `Object`

Normalizes numbers and discards ambiguous punctuation, then splits the result into an array of words, since in practice only the first and last words are ever accessed.

# File 'lib/tactful_tokenizer.rb', line 190

def clean(s)
  @cleaned = String.new(s)
  tokenize(@cleaned)
  @cleaned.gsub!(/[.,\d]*\d/, '<NUM>')
  @cleaned.gsub!(/[^[[:upper:][:lower:]]\d[:space:],!?.;:<>\-'\/$% ]/u, '')
  @cleaned.gsub!('--', ' ')
  @cleaned = @cleaned.split
end
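
To see the effect of these steps on a concrete string, here is a minimal standalone sketch. It is not the library method: `sketch_clean` is a hypothetical name, the `tokenize` call is omitted because that helper is not shown on this page, and the character filter is a simplified approximation of the original pattern.

```ruby
# Hypothetical stand-in for Frag#clean. Assumptions: no tokenize step,
# and a simplified character filter in place of the library's pattern.
def sketch_clean(text)
  cleaned = String.new(text)
  # Collapse digit runs, including embedded commas and periods, into a placeholder.
  cleaned.gsub!(/[.,\d]*\d/, '<NUM>')
  # Drop characters outside a small allowed set (letters, digits,
  # whitespace, and common punctuation).
  cleaned.gsub!(/[^[:alpha:]\d\s,!?.;:<>\-'\/$%]/, '')
  # Treat a double hyphen as a word separator.
  cleaned.gsub!('--', ' ')
  cleaned.split
end

words = sketch_clean('It cost *1,234.50* dollars--really.')
p words
# prints ["It", "cost", "<NUM>", "dollars", "really."]
```

The asterisks are removed by the filter, the number becomes a single `<NUM>` token, and the final period stays attached to the last word.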

#push_w1_features(w1, model) ⇒ `Object`

If w1 without its final character is alphabetic, pushes two features: the length of w1 capped at 10, and the model's non_abbrs entry for w1 without its final character.

# File 'lib/tactful_tokenizer.rb', line 199

def push_w1_features w1, model
  if w1.chop.is_alphabetic?
    features.push "w1length_#{[10, w1.length].min}", "w1abbr_#{model.non_abbrs[w1.chop]}"
  end
end

#push_w2_features(w2, model) ⇒ `Object`

If w2 without its final character is alphabetic, pushes two features: whether the first character of w2 is upper case, and the model's lower_words entry for the downcased w2.

# File 'lib/tactful_tokenizer.rb', line 205

def push_w2_features w2, model
  if w2.chop.is_alphabetic?
    features.push "w2cap_#{w2[0,1].is_upper_case?}", "w2lower_#{model.lower_words[w2.downcase]}"
  end
end
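
The feature strings built by the two methods above can be reproduced outside the library with a rough sketch. Everything named here is an assumption made for illustration: `SketchModel`, `alphabetic?`, `upper_case?`, and `sketch_features` are hypothetical stand-ins, the real `is_alphabetic?` and `is_upper_case?` helpers may behave differently, and the model tables are assumed to be hashes that default to 0.

```ruby
# Hypothetical stand-in for the model; the real class is not shown on this page.
SketchModel = Struct.new(:non_abbrs, :lower_words)

# Assumed behaviour of the library's String helpers (may differ in practice).
def alphabetic?(str)
  str.match?(/\A[[:alpha:]]+\z/)
end

def upper_case?(str)
  str.match?(/[[:upper:]]/)
end

# Builds the same feature strings as push_w1_features and push_w2_features.
def sketch_features(w1, w2, model)
  features = []
  if alphabetic?(w1.chop)
    features.push "w1length_#{[10, w1.length].min}", "w1abbr_#{model.non_abbrs[w1.chop]}"
  end
  if alphabetic?(w2.chop)
    features.push "w2cap_#{upper_case?(w2[0, 1])}", "w2lower_#{model.lower_words[w2.downcase]}"
  end
  features
end

model = SketchModel.new(Hash.new(0), Hash.new(0))
model.lower_words['senate'] = 2 # made-up count for illustration

feats = sketch_features('Mr.', 'Senate', model)
p feats
# prints ["w1length_3", "w1abbr_0", "w2cap_true", "w2lower_2"]
```

Each feature is a plain string combining a name and a value, so the set of features for a fragment is simply an array of such strings.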
