Class: TactfulTokenizer::Model

Inherits:
  Object
Defined in:
lib/tactful_tokenizer.rb

Overview

A model stores normalized probabilities of different features occurring.

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(feats = "#{File.dirname(__FILE__)}/models/features.mar", lower_words = "#{File.dirname(__FILE__)}/models/lower_words.mar", non_abbrs = "#{File.dirname(__FILE__)}/models/non_abbrs.mar") ⇒ Model

Initialize the model. feats, lower_words, and non_abbrs indicate the locations of the respective Marshal dumps.



# File 'lib/tactful_tokenizer.rb', line 51

def initialize(feats="#{File.dirname(__FILE__)}/models/features.mar", lower_words="#{File.dirname(__FILE__)}/models/lower_words.mar", non_abbrs="#{File.dirname(__FILE__)}/models/non_abbrs.mar")
  @feats, @lower_words, @non_abbrs = [feats, lower_words, non_abbrs].map do |file|
    File.open(file) do |f|
      Marshal.load(f.read)
    end
  end
  @p0 = @feats["<prior>"] ** 4  
end
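The constructor's loading step can be sketched in isolation: each model file is a Marshal dump read back into a hash, and `@p0` is the prior raised to the fourth power. The feature hash and temporary file below are hypothetical stand-ins for the bundled .mar dumps:

```ruby
require 'tempfile'

# Hypothetical feature hash standing in for models/features.mar.
feats = { "<prior>" => 0.5, "w1_Mr." => 0.001 }

p0 = nil
Tempfile.create("features.mar") do |f|
  f.binmode
  f.write(Marshal.dump(feats))   # how a .mar dump is produced
  f.rewind
  loaded = Marshal.load(f.read)  # what the constructor does for each file
  p0 = loaded["<prior>"] ** 4    # the prior raised to the fourth power
end
p0  # => 0.0625
```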

Instance Attribute Details

#feats ⇒ Object

  • feats: normalized probability of each feature.

  • lower_words: log count of occurrences in lower case.

  • non_abbrs: log count of occurrences when not an abbreviation.



# File 'lib/tactful_tokenizer.rb', line 63

def feats
  @feats
end

#lower_words ⇒ Object

lower_words: log count of occurrences in lower case.



# File 'lib/tactful_tokenizer.rb', line 63

def lower_words
  @lower_words
end

#non_abbrs ⇒ Object

non_abbrs: log count of occurrences when not an abbreviation.



# File 'lib/tactful_tokenizer.rb', line 63

def non_abbrs
  @non_abbrs
end

Instance Method Details

#classify(doc) ⇒ Object

Assign a prediction (a probability, to be precise) to each sentence fragment. For each feature in each fragment we look up the normalized probability and multiply it in. This is a fairly straightforward Bayesian probabilistic algorithm.



# File 'lib/tactful_tokenizer.rb', line 79

def classify(doc)
  # Pre-declare the block variables (a Ruby 1.8 scoping idiom).
  frag, probs, feat = nil, nil, nil
  doc.frags.each do |frag|
    # Start from the prior, then multiply in each feature's probability.
    probs = @p0
    frag.features.each do |feat|
      probs *= @feats[feat]
    end
    # Convert the accumulated odds into a probability.
    frag.pred = probs / (probs + 1)
  end
end
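The arithmetic can be traced with a toy model. The feature hash below is hypothetical; `fetch` with a default of 1.0 is used here only so an unseen feature leaves the product unchanged, not a claim about the real `@feats` lookup:

```ruby
# Hypothetical normalized feature probabilities.
feats = { "<prior>" => 0.5, "w1_Mr." => 0.1, "w2_Smith" => 0.2 }
p0 = feats["<prior>"] ** 4         # prior raised to the fourth, as in #initialize

probs = p0
["w1_Mr.", "w2_Smith"].each do |feat|
  probs *= feats.fetch(feat, 1.0)  # multiply in each feature's probability
end
pred = probs / (probs + 1)         # squash the odds into a probability in (0, 1)
```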

#featurize(doc) ⇒ Object

Get the features of every fragment.



# File 'lib/tactful_tokenizer.rb', line 91

def featurize(doc)
  frag = nil
  doc.frags.each do |frag|
    get_features(frag, self)
  end
end

#get_features(frag, model) ⇒ Object

Finds the features in a text fragment of the form: … w1. (sb?) w2 … Features listed in rough order of importance:

  • w1: a word that includes a period.

  • w2: the next word, if it exists.

  • w1length: the number of alphabetic characters in w1.

  • both: w1 and w2 taken together.

  • w1abbr: logarithmic count of w1 occurring without a period.

  • w2lower: logarithmic count of w2 occurring lowercased.



# File 'lib/tactful_tokenizer.rb', line 107

def get_features(frag, model)
  w1 = (frag.cleaned.last or '')
  w2 = (frag.next or '')

  frag.features = ["w1_#{w1}", "w2_#{w2}", "both_#{w1}_#{w2}"]

  unless w2.empty?
    frag.push_w1_features(w1, model)
    frag.push_w2_features(w2, model)
  end
end
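For a hypothetical fragment ending in "Mr." followed by "Smith", the first step above produces these feature strings:

```ruby
# Hypothetical words from a fragment of the form "... Mr. Smith ...".
w1 = "Mr."
w2 = "Smith"
features = ["w1_#{w1}", "w2_#{w2}", "both_#{w1}_#{w2}"]
# => ["w1_Mr.", "w2_Smith", "both_Mr._Smith"]
```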

#tokenize_text(text) ⇒ Object

This method is the only one most users will call directly.

m = TactfulTokenizer::Model.new
m.tokenize_text("Hey, are these two sentences? I bet they should be.")
#=> ["Hey, are these two sentences?", "I bet they should be."]



# File 'lib/tactful_tokenizer.rb', line 69

def tokenize_text(text)
  data = Doc.new(text)
  featurize(data)
  classify(data)
  return data.segment
end