Class: TactfulTokenizer::Model

Inherits:
Object
Defined in:
lib/tactful_tokenizer.rb

Overview

A model stores normalized probabilities of different features occurring.

Constructor Details

#initialize(feats = "#{File.dirname(__FILE__)}/models/features.mar", lower_words = "#{File.dirname(__FILE__)}/models/lower_words.mar", non_abbrs = "#{File.dirname(__FILE__)}/models/non_abbrs.mar") ⇒ Model

Initialize the model. feats, lower_words, and non_abbrs indicate the locations of the respective Marshal dumps.


# File 'lib/tactful_tokenizer.rb', line 52

def initialize(feats="#{File.dirname(__FILE__)}/models/features.mar", lower_words="#{File.dirname(__FILE__)}/models/lower_words.mar", non_abbrs="#{File.dirname(__FILE__)}/models/non_abbrs.mar")
  @feats, @lower_words, @non_abbrs = [feats, lower_words, non_abbrs].map do |file|
    File.open(file) do |f|
      Marshal.load(f.read)
    end
  end
  @p0 = @feats["<prior>"] ** 4  
end
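
The loading step can be tried in isolation. The following is a minimal sketch, not the library's own code: the feature table and its values are hypothetical stand-ins for the real features.mar, and the file is opened in binary mode here as a precaution, whereas the constructor above uses a plain File.open.

```ruby
require "tempfile"

# Hypothetical feature table; the real features.mar ships with the gem.
sample_feats = { "<prior>" => 0.5, "w1_dr" => 0.2 }

# Write the hash as a Marshal dump, then read it back the way the
# constructor does: open the file, read it whole, Marshal.load the bytes.
loaded = Tempfile.create("features") do |tmp|
  tmp.binmode
  tmp.write(Marshal.dump(sample_feats))
  tmp.flush
  File.open(tmp.path, "rb") { |f| Marshal.load(f.read) }
end

# As in the constructor, the starting value is the prior raised to the
# fourth power.
p0 = loaded["<prior>"] ** 4
puts p0
```

Marshal.load can instantiate arbitrary Ruby objects, so it should only be pointed at dumps from a trusted source.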

Instance Attribute Details

#feats ⇒ Object

feats maps each feature to its normalized probability.
lower_words maps a word to the log count of its occurrences in lower case.
non_abbrs maps a word to the log count of its occurrences when it is not an abbreviation.


# File 'lib/tactful_tokenizer.rb', line 64

def feats
  @feats
end

#lower_words ⇒ Object

lower_words maps a word to the log count of its occurrences in lower case.


# File 'lib/tactful_tokenizer.rb', line 64

def lower_words
  @lower_words
end

#non_abbrs ⇒ Object

non_abbrs maps a word to the log count of its occurrences when it is not an abbreviation.


# File 'lib/tactful_tokenizer.rb', line 64

def non_abbrs
  @non_abbrs
end

Instance Method Details

#classify(doc) ⇒ Object

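Assign a prediction (a probability, to be precise) to each sentence fragment. Starting from the prior @p0, we look up the normalized probability of each feature in the fragment and multiply them together. The running product probs is then mapped into the range 0 to 1 as probs / (probs + 1) and stored in frag.pred. This is a fairly straightforward Bayesian probabilistic algorithm.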


# File 'lib/tactful_tokenizer.rb', line 80

def classify(doc)
  frag, probs, feat = nil, nil, nil
  doc.frags.each do |frag|
    probs = @p0
    frag.features.each do |feat|
      probs *= @feats[feat]
    end
    frag.pred = probs / (probs + 1)
  end
end
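
The arithmetic in classify can be followed with a small standalone sketch. The feature names and values below are hypothetical and chosen only to make the numbers easy to check; the helper name score is also an assumption and does not exist in the library.

```ruby
# Hypothetical feature table: every value acts as a multiplicative weight.
feats = { "<prior>" => 1.0, "w1_mr" => 0.1, "w2_Smith" => 0.5 }

# Mirrors the loop in classify: start from prior ** 4, multiply in the
# weight of each feature, then squash the product with x / (x + 1).
def score(feats, features)
  probs = feats["<prior>"] ** 4
  features.each { |feat| probs *= feats[feat] }
  probs / (probs + 1)
end

pred = score(feats, ["w1_mr", "w2_Smith"])
puts pred
```

With these made-up weights the product is 1.0 * 0.1 * 0.5 = 0.05, so the prediction is 0.05 / 1.05. The sketch only uses features present in the table; how the real model handles an unseen feature is not shown in this section.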

#featurize(doc) ⇒ Object

Get the features of every fragment.


# File 'lib/tactful_tokenizer.rb', line 92

def featurize(doc)
  frag = nil
  doc.frags.each do |frag|
    get_features(frag, self)
  end
end

#get_features(frag, model) ⇒ Object

Finds the features in a text fragment of the form:

  … w1. (sb?) w2 …

Features, listed in rough order of importance:

  • w1: a word that includes a period.

  • w2: the next word, if it exists.

  • w1length: the number of alphabetic characters in w1.

  • both: w1 and w2 taken together.

  • w1abbr: logarithmic count of w1 occurring without a period.

  • w2lower: logarithmic count of w2 occurring lowercased.


# File 'lib/tactful_tokenizer.rb', line 108

def get_features(frag, model)
  w1 = (frag.cleaned.last or '')
  w2 = (frag.next or '')

  frag.features = ["w1_#{w1}", "w2_#{w2}", "both_#{w1}_#{w2}"]

  unless w2.empty?
    frag.push_w1_features(w1, model)
    frag.push_w2_features(w2, model)
  end
end
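
The three base feature strings are easy to reproduce outside the library. This is a minimal sketch with a hypothetical helper, base_features, that takes the two words directly instead of a fragment object; the additional features pushed by push_w1_features and push_w2_features are not covered because their implementations are not shown here.

```ruby
# Hypothetical helper: builds the same three strings that get_features
# assigns to frag.features, falling back to '' when a word is missing.
def base_features(w1, w2)
  w1 ||= ""
  w2 ||= ""
  ["w1_#{w1}", "w2_#{w2}", "both_#{w1}_#{w2}"]
end

with_next = base_features("dr", "Smith")
# => ["w1_dr", "w2_Smith", "both_dr_Smith"]

without_next = base_features("end", nil)
# => ["w1_end", "w2_", "both_end_"]

puts with_next.inspect
puts without_next.inspect
```

When w2 is empty, as in the second call, get_features stops at these three strings and skips the w1 and w2 specific features.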

#tokenize_text(text) ⇒ Object

This function is the only one that'll end up being used.

  m = TactfulTokenizer::Model.new
  m.tokenize_text("Hey, are these two sentences? I bet they should be.")
  # => ["Hey, are these two sentences?", "I bet they should be."]


# File 'lib/tactful_tokenizer.rb', line 70

def tokenize_text(text)
  data = Doc.new(text)
  featurize(data)
  classify(data)
  return data.segment
end