Class: TactfulTokenizer::Model
Inherits: Object
Defined in: lib/tactful_tokenizer.rb
Overview
A model stores normalized probabilities of different features occurring.
Instance Attribute Summary
-
#feats ⇒ Object
feats = {feature => normalized probability of feature}, lower_words = {word => log count of occurrences in lower case}, non_abbrs = {word => log count of occurrences when not an abbreviation}.
-
#lower_words ⇒ Object
feats = {feature => normalized probability of feature}, lower_words = {word => log count of occurrences in lower case}, non_abbrs = {word => log count of occurrences when not an abbreviation}.
-
#non_abbrs ⇒ Object
feats = {feature => normalized probability of feature}, lower_words = {word => log count of occurrences in lower case}, non_abbrs = {word => log count of occurrences when not an abbreviation}.
Instance Method Summary
-
#classify(doc) ⇒ Object
Assign a prediction (probability, to be precise) to each sentence fragment.
-
#featurize(doc) ⇒ Object
Get the features of every fragment.
-
#get_features(frag, model) ⇒ Object
Finds the features in a text fragment of the form: …
-
#initialize(feats = "#{File.dirname(__FILE__)}/models/features.mar", lower_words = "#{File.dirname(__FILE__)}/models/lower_words.mar", non_abbrs = "#{File.dirname(__FILE__)}/models/non_abbrs.mar") ⇒ Model
constructor
Initialize the model.
-
#tokenize_text(text) ⇒ Object
This function is the only one that’ll end up being used.
Constructor Details
#initialize(feats = "#{File.dirname(__FILE__)}/models/features.mar", lower_words = "#{File.dirname(__FILE__)}/models/lower_words.mar", non_abbrs = "#{File.dirname(__FILE__)}/models/non_abbrs.mar") ⇒ Model
Initialize the model. feats, lower_words, and non_abbrs indicate the locations of the respective Marshal dumps.
# File 'lib/tactful_tokenizer.rb', line 52

def initialize(feats = "#{File.dirname(__FILE__)}/models/features.mar",
               lower_words = "#{File.dirname(__FILE__)}/models/lower_words.mar",
               non_abbrs = "#{File.dirname(__FILE__)}/models/non_abbrs.mar")
  @feats, @lower_words, @non_abbrs = [feats, lower_words, non_abbrs].map do |file|
    File.open(file) do |f|
      Marshal.load(f.read)
    end
  end
  @p0 = @feats["<prior>"] ** 4
end
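The .mar files are plain Marshal dumps, so the constructor's read path can be traced with a simple round-trip. The hash contents below are invented for illustration; the real model files ship with the gem and are produced from training data:

```ruby
require 'tmpdir'

# Invented, illustrative data -- not the gem's actual model contents.
data = { "<prior>" => 0.5, "w1_Mr." => 0.1 }
path = File.join(Dir.mktmpdir, "features.mar")

# Write a dump the same way the constructor reads it back.
File.open(path, "wb") { |f| f.write Marshal.dump(data) }
loaded = File.open(path) { |f| Marshal.load(f.read) }

loaded["<prior>"] ** 4  # => 0.0625, the @p0 prior used by #classify
```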
Instance Attribute Details
#feats ⇒ Object
feats = {feature => normalized probability of feature}, lower_words = {word => log count of occurrences in lower case}, non_abbrs = {word => log count of occurrences when not an abbreviation}.
# File 'lib/tactful_tokenizer.rb', line 64

def feats
  @feats
end
#lower_words ⇒ Object
feats = {feature => normalized probability of feature}, lower_words = {word => log count of occurrences in lower case}, non_abbrs = {word => log count of occurrences when not an abbreviation}.
# File 'lib/tactful_tokenizer.rb', line 64

def lower_words
  @lower_words
end
#non_abbrs ⇒ Object
feats = {feature => normalized probability of feature}, lower_words = {word => log count of occurrences in lower case}, non_abbrs = {word => log count of occurrences when not an abbreviation}.
# File 'lib/tactful_tokenizer.rb', line 64

def non_abbrs
  @non_abbrs
end
Instance Method Details
#classify(doc) ⇒ Object
Assign a prediction (probability, to be precise) to each sentence fragment. For each feature in each fragment we hunt up the normalized probability and multiply. This is a fairly straightforward Bayesian probabilistic algorithm.
# File 'lib/tactful_tokenizer.rb', line 80

def classify(doc)
  frag = nil
  probs = 1
  feat = ''
  doc.frags.each do |frag|
    probs = @p0
    frag.features.each do |feat|
      probs *= @feats[feat]
    end
    frag.pred = probs / (probs + 1)
  end
end
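The per-fragment arithmetic can be traced by hand with made-up feature probabilities. The values below are invented for illustration; real ones come from the Marshal'd model loaded in the constructor:

```ruby
# Invented, illustrative probabilities -- not values from the real model.
feats = Hash.new(1.0)
feats["<prior>"]    = 0.5
feats["w1_Mr."]     = 0.1  # "Mr." rarely ends a sentence
feats["w2cap_true"] = 2.0  # a following capitalized word favors a boundary

p0    = feats["<prior>"] ** 4                       # 0.0625, as in @p0
probs = p0 * feats["w1_Mr."] * feats["w2cap_true"]  # 0.0125
pred  = probs / (probs + 1)  # ~0.0123: almost certainly not a sentence boundary
```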
#featurize(doc) ⇒ Object
Get the features of every fragment.
# File 'lib/tactful_tokenizer.rb', line 94

def featurize(doc)
  frag = nil
  doc.frags.each do |frag|
    get_features(frag, self)
  end
end
#get_features(frag, model) ⇒ Object
Finds the features in a text fragment of the form: … w1. (sb?) w2 … Features listed in rough order of importance:
-
w1: a word that includes a period.
-
w2: the next word, if it exists.
-
w1length: the number of alphabetic characters in w1.
-
both: w1 and w2 taken together.
-
w1abbr: logarithmic count of w1 occurring without a period.
-
w2lower: logarithmic count of w2 occurring lowercased.
# File 'lib/tactful_tokenizer.rb', line 110

def get_features(frag, model)
  w1 = (frag.cleaned.last or '')
  w2 = (frag.next or '')
  frag.features = ["w1_#{w1}", "w2_#{w2}", "both_#{w1}_#{w2}"]

  if not w2.empty?
    if w1.chop.is_alphabetic?
      frag.features.push "w1length_#{[10, w1.length].min}"
      frag.features.push "w1abbr_#{model.non_abbrs[w1.chop]}"
    end

    if w2.chop.is_alphabetic?
      frag.features.push "w2cap_#{w2[0].is_upper_case?}"
      frag.features.push "w2lower_#{model.lower_words[w2.downcase]}"
    end
  end
end
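A standalone sketch of the feature set produced for a fragment ending in "Mr." followed by "Smith". The gem's String#is_alphabetic? and String#is_upper_case? extensions are stood in for by regexes here, and the count hashes are empty stand-ins for the real model:

```ruby
# Stand-ins for the gem's model hashes (default 0 ~ "no recorded count").
w1, w2 = "Mr.", "Smith"
non_abbrs   = Hash.new(0)
lower_words = Hash.new(0)

features = ["w1_#{w1}", "w2_#{w2}", "both_#{w1}_#{w2}"]
unless w2.empty?
  if w1.chop =~ /\A[a-zA-Z]+\z/       # stand-in for is_alphabetic?
    features << "w1length_#{[10, w1.length].min}"
    features << "w1abbr_#{non_abbrs[w1.chop]}"
  end
  if w2.chop =~ /\A[a-zA-Z]+\z/
    features << "w2cap_#{!!(w2[0] =~ /[A-Z]/)}"  # stand-in for is_upper_case?
    features << "w2lower_#{lower_words[w2.downcase]}"
  end
end
# features == ["w1_Mr.", "w2_Smith", "both_Mr._Smith",
#              "w1length_3", "w1abbr_0", "w2cap_true", "w2lower_0"]
```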
#tokenize_text(text) ⇒ Object
This function is the only one that'll end up being used.

m = TactfulTokenizer::Model.new
m.tokenize_text("Hey, are these two sentences? I bet they should be.")
# => ["Hey, are these two sentences?", "I bet they should be."]
# File 'lib/tactful_tokenizer.rb', line 70

def tokenize_text(text)
  data = Doc.new(text)
  featurize(data)
  classify(data)
  return data.segment
end