Class: TactfulTokenizer::Model
Inherits: Object
Defined in: lib/tactful_tokenizer.rb
Overview
A model stores normalized probabilities of different features occurring.
Instance Attribute Summary
-
#feats ⇒ Object
feats = {feature => normalized probability of feature}, lower_words = {word => log count of occurrences in lower case}, non_abbrs = {word => log count of occurrences when not an abbreviation}.
-
#lower_words ⇒ Object
feats = {feature => normalized probability of feature}, lower_words = {word => log count of occurrences in lower case}, non_abbrs = {word => log count of occurrences when not an abbreviation}.
-
#non_abbrs ⇒ Object
feats = {feature => normalized probability of feature}, lower_words = {word => log count of occurrences in lower case}, non_abbrs = {word => log count of occurrences when not an abbreviation}.
Instance Method Summary
-
#classify(doc) ⇒ Object
Assign a prediction (probability, to be precise) to each sentence fragment.
-
#featurize(doc) ⇒ Object
Get the features of every fragment.
-
#get_features(frag, model) ⇒ Object
Finds the features in a text fragment of the form: …
-
#initialize(feats = "#{File.dirname(__FILE__)}/models/features.mar", lower_words = "#{File.dirname(__FILE__)}/models/lower_words.mar", non_abbrs = "#{File.dirname(__FILE__)}/models/non_abbrs.mar") ⇒ Model
constructor
Initialize the model.
-
#tokenize_text(text) ⇒ Object
This function is the only one that’ll end up being used.
Constructor Details
#initialize(feats = "#{File.dirname(__FILE__)}/models/features.mar", lower_words = "#{File.dirname(__FILE__)}/models/lower_words.mar", non_abbrs = "#{File.dirname(__FILE__)}/models/non_abbrs.mar") ⇒ Model
Initialize the model. feats, lower_words, and non_abbrs indicate the locations of the respective Marshal dumps.
# File 'lib/tactful_tokenizer.rb', line 52

def initialize(feats = "#{File.dirname(__FILE__)}/models/features.mar",
               lower_words = "#{File.dirname(__FILE__)}/models/lower_words.mar",
               non_abbrs = "#{File.dirname(__FILE__)}/models/non_abbrs.mar")
  @feats, @lower_words, @non_abbrs = [feats, lower_words, non_abbrs].map do |file|
    File.open(file) do |f|
      Marshal.load(f.read)
    end
  end
  @p0 = @feats["<prior>"] ** 4
end
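The .mar files are plain Marshal dumps, so the constructor's read path can be traced with a simple round-trip. The hash contents below are invented for illustration; the real model files ship with the gem and are produced from training data:

```ruby
require 'tmpdir'

# Invented, illustrative data -- not the gem's actual model contents.
data = { "<prior>" => 0.5, "w1_Mr." => 0.1 }
path = File.join(Dir.mktmpdir, "features.mar")

# Write a dump the same way the constructor reads it back.
File.open(path, "wb") { |f| f.write Marshal.dump(data) }
loaded = File.open(path) { |f| Marshal.load(f.read) }

loaded["<prior>"] ** 4  # => 0.0625, the @p0 prior used by #classify
```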
Instance Attribute Details
#feats ⇒ Object
feats = {feature => normalized probability of feature}, lower_words = {word => log count of occurrences in lower case}, non_abbrs = {word => log count of occurrences when not an abbreviation}.
# File 'lib/tactful_tokenizer.rb', line 64

def feats
  @feats
end
#lower_words ⇒ Object
feats = {feature => normalized probability of feature}, lower_words = {word => log count of occurrences in lower case}, non_abbrs = {word => log count of occurrences when not an abbreviation}.
# File 'lib/tactful_tokenizer.rb', line 64

def lower_words
  @lower_words
end
#non_abbrs ⇒ Object
feats = {feature => normalized probability of feature}, lower_words = {word => log count of occurrences in lower case}, non_abbrs = {word => log count of occurrences when not an abbreviation}.
# File 'lib/tactful_tokenizer.rb', line 64

def non_abbrs
  @non_abbrs
end
Instance Method Details
#classify(doc) ⇒ Object
Assign a prediction (probability, to be precise) to each sentence fragment. For each feature in each fragment we hunt up the normalized probability and multiply. This is a fairly straightforward Bayesian probabilistic algorithm.
# File 'lib/tactful_tokenizer.rb', line 80

def classify(doc)
  frag = nil
  probs = 1
  feat = ''
  doc.frags.each do |frag|
    probs = @p0
    frag.features.each do |feat|
      probs *= @feats[feat]
    end
    frag.pred = probs / (probs + 1)
  end
end
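The per-fragment arithmetic can be traced by hand with made-up feature probabilities. The values below are invented for illustration; real ones come from the Marshal'd model loaded in the constructor:

```ruby
# Invented, illustrative probabilities -- not values from the real model.
feats = Hash.new(1.0)
feats["<prior>"]    = 0.5
feats["w1_Mr."]     = 0.1  # "Mr." rarely ends a sentence
feats["w2cap_true"] = 2.0  # a following capitalized word favors a boundary

p0    = feats["<prior>"] ** 4                       # 0.0625, as in @p0
probs = p0 * feats["w1_Mr."] * feats["w2cap_true"]  # 0.0125
pred  = probs / (probs + 1)  # ~0.0123: almost certainly not a sentence boundary
```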
#featurize(doc) ⇒ Object
Get the features of every fragment.
# File 'lib/tactful_tokenizer.rb', line 94

def featurize(doc)
  frag = nil
  doc.frags.each do |frag|
    get_features(frag, self)
  end
end
#get_features(frag, model) ⇒ Object
Finds the features in a text fragment of the form: … w1. (sb?) w2 … Features listed in rough order of importance:
-
w1: a word that includes a period.
-
w2: the next word, if it exists.
-
w1length: the number of alphabetic characters in w1.
-
both: w1 and w2 taken together.
-
w1abbr: logarithmic count of w1 occurring without a period.
-
w2lower: logarithmic count of w2 occurring lowercased.
# File 'lib/tactful_tokenizer.rb', line 110

def get_features(frag, model)
  w1 = (frag.cleaned.last or '')
  w2 = (frag.next or '')
  frag.features = ["w1_#{w1}", "w2_#{w2}", "both_#{w1}_#{w2}"]

  if not w2.empty?
    if w1.chop.is_alphabetic?
      frag.features.push "w1length_#{[10, w1.length].min}"
      frag.features.push "w1abbr_#{model.non_abbrs[w1.chop]}"
    end

    if w2.chop.is_alphabetic?
      frag.features.push "w2cap_#{w2[0].is_upper_case?}"
      frag.features.push "w2lower_#{model.lower_words[w2.downcase]}"
    end
  end
end
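A standalone sketch of the feature set produced for a fragment ending in "Mr." followed by "Smith". The gem's String#is_alphabetic? and String#is_upper_case? extensions are stood in for by regexes here, and the count hashes are empty stand-ins for the real model:

```ruby
# Stand-ins for the gem's model hashes (default 0 ~ "no recorded count").
w1, w2 = "Mr.", "Smith"
non_abbrs   = Hash.new(0)
lower_words = Hash.new(0)

features = ["w1_#{w1}", "w2_#{w2}", "both_#{w1}_#{w2}"]
unless w2.empty?
  if w1.chop =~ /\A[a-zA-Z]+\z/       # stand-in for is_alphabetic?
    features << "w1length_#{[10, w1.length].min}"
    features << "w1abbr_#{non_abbrs[w1.chop]}"
  end
  if w2.chop =~ /\A[a-zA-Z]+\z/
    features << "w2cap_#{!!(w2[0] =~ /[A-Z]/)}"  # stand-in for is_upper_case?
    features << "w2lower_#{lower_words[w2.downcase]}"
  end
end
# features == ["w1_Mr.", "w2_Smith", "both_Mr._Smith",
#              "w1length_3", "w1abbr_0", "w2cap_true", "w2lower_0"]
```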
#tokenize_text(text) ⇒ Object
This function is the only one that'll end up being used.

m = TactfulTokenizer::Model.new
m.tokenize_text("Hey, are these two sentences? I bet they should be.")
# => ["Hey, are these two sentences?", "I bet they should be."]
# File 'lib/tactful_tokenizer.rb', line 70

def tokenize_text(text)
  data = Doc.new(text)
  featurize(data)
  classify(data)
  return data.segment
end