Class: TactfulTokenizer::Model

Inherits:
Object
  • Object
show all
Defined in:
lib/tactful_tokenizer.rb

Overview

A model stores normalized probabilities of different features occuring.

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(feats = "#{File.dirname(__FILE__)}/models/features.mar", lower_words = "#{File.dirname(__FILE__)}/models/lower_words.mar", non_abbrs = "#{File.dirname(__FILE__)}/models/non_abbrs.mar") ⇒ Model

Initialize the model. feats, lower_words, and non_abbrs indicate the locations of the respective Marshal dumps.



52
53
54
55
56
57
58
59
# File 'lib/tactful_tokenizer.rb', line 52

def initialize(feats="#{File.dirname(__FILE__)}/models/features.mar", lower_words="#{File.dirname(__FILE__)}/models/lower_words.mar", non_abbrs="#{File.dirname(__FILE__)}/models/non_abbrs.mar")
    @feats, @lower_words, @non_abbrs = [feats, lower_words, non_abbrs].map do |file|
        File.open(file) do |f|
            Marshal.load(f.read)
        end
    end
    @p0 = @feats["<prior>"] ** 4  
end

Instance Attribute Details

#featsObject

feats = => normalized probability of feature lower_words = => log count of occurences in lower case non_abbrs = => log count of occurences when not an abbrv.



64
65
66
# File 'lib/tactful_tokenizer.rb', line 64

def feats
  @feats
end

#lower_wordsObject

feats = => normalized probability of feature lower_words = => log count of occurences in lower case non_abbrs = => log count of occurences when not an abbrv.



64
65
66
# File 'lib/tactful_tokenizer.rb', line 64

def lower_words
  @lower_words
end

#non_abbrsObject

feats = => normalized probability of feature lower_words = => log count of occurences in lower case non_abbrs = => log count of occurences when not an abbrv.



64
65
66
# File 'lib/tactful_tokenizer.rb', line 64

def non_abbrs
  @non_abbrs
end

Instance Method Details

#classify(doc) ⇒ Object

Assign a prediction (probability, to be precise) to each sentence fragment. For each feature in each fragment we hunt up the normalized probability and multiply. This is a fairly straightforward Bayesian probabilistic algorithm.



80
81
82
83
84
85
86
87
88
89
90
91
# File 'lib/tactful_tokenizer.rb', line 80

def classify(doc)
    frag = nil
    probs = 1
    feat = ''
    doc.frags.each do |frag|
        probs = @p0
        frag.features.each do |feat|
            probs *= @feats[feat]
        end
        frag.pred = probs / (probs + 1)
    end
end

#featurize(doc) ⇒ Object

Get the features of every fragment.



94
95
96
97
98
99
# File 'lib/tactful_tokenizer.rb', line 94

def featurize(doc)
    frag = nil
    doc.frags.each do |frag|
        get_features(frag, self)
    end
end

#get_features(frag, model) ⇒ Object

Finds the features in a text fragment of the form: … w1. (sb?) w2 … Features listed in rough order of importance:

  • w1: a word that includes a period.

  • w2: the next word, if it exists.

  • w1length: the number of alphabetic characters in w1.

  • both: w1 and w2 taken together.

  • w1abbr: logarithmic count of w1 occuring without a period.

  • w2lower: logarithmiccount of w2 occuring lowercased.



110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
# File 'lib/tactful_tokenizer.rb', line 110

def get_features(frag, model)
    w1 = (frag.cleaned.last or '')
    w2 = (frag.next or '')

    frag.features = ["w1_#{w1}", "w2_#{w2}", "both_#{w1}_#{w2}"]

    if not w2.empty?
        if w1.chop.is_alphabetic? 
            frag.features.push "w1length_#{[10, w1.length].min}"
            frag.features.push "w1abbr_#{model.non_abbrs[w1.chop]}"
        end

        if w2.chop.is_alphabetic?
            frag.features.push "w2cap_#{w2[0].is_upper_case?}"
            frag.features.push "w2lower_#{model.lower_words[w2.downcase]}"
        end
    end
end

#tokenize_text(text) ⇒ Object

This function is the only one that’ll end up being used. m = TactfulTokenizer::Model.new m.tokenize_text(“Hey, are these two sentences? I bet they should be.”)

> [“Hey, are these two sentences?”, “I bet they should be.”]



70
71
72
73
74
75
# File 'lib/tactful_tokenizer.rb', line 70

def tokenize_text(text)
    data = Doc.new(text)
    featurize(data)
    classify(data)
    return data.segment
end