Class: Reclassifier::Bayes

Inherits:
Object
Includes:
WordHash
Defined in:
lib/reclassifier/bayes.rb

Overview

Bayesian classifier for arbitrary text.

The implementation is translated from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008, ISBN 0521865719.

Derived quantities are cached to improve performance of repeated #classify calls.

Constant Summary

Constants included from WordHash

WordHash::CORPUS_SKIP_WORDS

Instance Method Summary

Methods included from WordHash

#clean_word_hash, #word_hash, #word_hash_for_words

Constructor Details

#initialize(classifications = [], options = {}) ⇒ Bayes

Can be created with zero or more classifications, specified as an array of symbols. Each one is registered through #add_classification and can then be trained with #train. Options are specified in a hash.

Options:

  • :clean - If false, punctuation will be included in the classifier. Otherwise, punctuation will be omitted. Default is true.

b = Reclassifier::Bayes.new([:interesting, :uninteresting, :spam], :clean => true)


# File 'lib/reclassifier/bayes.rb', line 24

def initialize(classifications = [], options = {})
  @classifications = {}
  @docs_in_classification_count = {}
  @options = options

  classifications.each {|classification| add_classification(classification)}
end

Instance Method Details

#add_classification(classification) ⇒ Object

Adds the classification to the classifier. Has no effect if the classification already existed. Returns the classification.

b.add_classification(:not_spam)
=>  :not_spam


# File 'lib/reclassifier/bayes.rb', line 127

def add_classification(classification)
  @classifications[classification] ||= {}

  @docs_in_classification_count[classification] ||= 0

  classification
end
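The "no effect if it already existed" behaviour follows from the `||=` operator, which assigns only when the key is missing or nil. A minimal standalone sketch of that idiom, using plain hashes rather than the classifier itself (the names `word_counts`, `doc_counts` and `add` are illustrative and not part of the library):

```ruby
# Standalone illustration of the `||=` idiom; not the library's code.
word_counts = {}
doc_counts = {}

# Hypothetical helper with the same shape as add_classification.
add = lambda do |name|
  word_counts[name] ||= {}   # keeps any existing word counts
  doc_counts[name] ||= 0     # keeps any existing document count
  name
end

add.call(:spam)
word_counts[:spam]["offer"] = 3
doc_counts[:spam] = 1

# A second call leaves the stored data untouched and returns the name.
result = add.call(:spam)
```

Because the second call returns the name without resetting anything, adding a classification that is already trained should be safe under this pattern.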

#cache_set? ⇒ Boolean

Returns true if the cache has been set (i.e. #classify has been run). Returns false otherwise.

classifier = Reclassifier::Bayes.new([:one, :other])

classifier.cache_set?
=>  false

classifier.train(:one, 'bbb')
classifier.train(:other, 'aaa')

classifier.classify('aaa')

classifier.cache_set?
=>  true

Returns:

  • (Boolean)


# File 'lib/reclassifier/bayes.rb', line 191

def cache_set?
  @cache.present?
end

#calculate_scores(text) ⇒ Object

Returns the scores of the specified text for each classification.

b.calculate_scores("I hate bad words and you")
=>  {:uninteresting=>-12.6997928013932, :interesting=>-18.4206807439524}

The scores are log probabilities, so they are negative. The largest of them (the one closest to 0) is the one picked out by #classify.



# File 'lib/reclassifier/bayes.rb', line 76

def calculate_scores(text)
  scores = {}

  @cache[:total_docs_classified_log] ||= Math.log(@docs_in_classification_count.values.reduce(:+))
  @cache[:words_classified] ||= @classifications.values.reduce(Set.new) {|set, word_counts| set.merge(word_counts.keys)}

  @classifications.each do |classification, classification_word_counts|
    # prior
    scores[classification] = Math.log(@docs_in_classification_count[classification])
    scores[classification] -= @cache[:total_docs_classified_log]

    # likelihood
    classification_word_count = classification_word_counts.values.reduce(:+).to_i
    smart_word_hash(text).each do |word, count|
      if @cache[:words_classified].include?(word)
        scores[classification] += count * Math.log((classification_word_counts[word] || 0) + 1)

        scores[classification] -= count * Math.log(classification_word_count + @cache[:words_classified].count)
      end
    end
  end

  scores
end
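Read as a formula, the listing above computes, for each classification c, log(docs in c / all docs) plus, for every known word w in the text, count(w) * log((occurrences of w in c + 1) / (words in c + vocabulary size)). The following is a rough standalone sketch of that calculation on toy data. It assumes the text has already been reduced to a word => count hash, and the helper name `toy_scores` and the sample counts are invented for illustration:

```ruby
require 'set'

# Toy training data: per-classification word counts and document counts.
word_counts = {
  spam: { "buy" => 2, "now" => 1 },
  ham:  { "hello" => 1, "now" => 1 }
}
doc_counts = { spam: 1, ham: 1 }

# Hypothetical helper: log-score of `words` (word => count) per classification.
def toy_scores(word_counts, doc_counts, words)
  total_docs_log = Math.log(doc_counts.values.sum)
  vocabulary = word_counts.values.reduce(Set.new) { |set, counts| set.merge(counts.keys) }

  word_counts.each_with_object({}) do |(name, counts), scores|
    # Prior: log(docs in this classification / all docs).
    score = Math.log(doc_counts[name]) - total_docs_log
    total_words = counts.values.sum

    words.each do |word, count|
      next unless vocabulary.include?(word) # words never seen in training are ignored
      # Likelihood with add-one smoothing.
      score += count * Math.log((counts[word] || 0) + 1)
      score -= count * Math.log(total_words + vocabulary.size)
    end

    scores[name] = score
  end
end

scores = toy_scores(word_counts, doc_counts, { "buy" => 1, "now" => 1 })
```

With this toy data the `:spam` score is higher than the `:ham` score, since "buy" appears only in the spam counts.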

#classifications ⇒ Object

Provides a list of classification names.

b.classifications
=>   [:this, :that, :the_other]


# File 'lib/reclassifier/bayes.rb', line 116

def classifications
  @classifications.keys
end

#classify(text) ⇒ Object

Returns the classification of the specified text: the one of the classifier's current classifications with the highest score from #calculate_scores.

b.classify("I hate bad words and you")
=>  :uninteresting


# File 'lib/reclassifier/bayes.rb', line 107

def classify(text)
  calculate_scores(text.to_s).max_by {|classification| classification[1]}[0]
end
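The selection step is an arg-max over the score hash. A small standalone sketch of it, reusing the numbers from the #calculate_scores example (the variable names are illustrative):

```ruby
# Scores as returned by a call like calculate_scores, keyed by classification.
scores = { uninteresting: -12.6997928013932, interesting: -18.4206807439524 }

# max_by yields [key, value] pairs; keep the pair with the largest value.
best_name, best_score = scores.max_by { |_name, score| score }
```

Note that `max_by` returns nil for an empty hash, so calling the same expression followed by `[0]` on a hash with no entries would raise a NoMethodError.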

#invalidate_cache ⇒ Object

Invalidates the cache.

classifier = Reclassifier::Bayes.new([:one, :other])

classifier.train(:one, 'bbb')
classifier.train(:other, 'aaa')

classifier.classify('aaa')

classifier.cache_set?
=>  true

classifier.invalidate_cache

classifier.cache_set?
=>  false


# File 'lib/reclassifier/bayes.rb', line 171

def invalidate_cache
  @cache = {}
end
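The cache follows a common memoize-and-reset pattern: derived values are stored with `||=` on first use and dropped by replacing the hash. The sketch below imitates that lifecycle with a hypothetical `ToyCache` class in plain Ruby. It uses `empty?` in place of ActiveSupport's `present?`, and resetting the cache whenever data is added is an assumption of the sketch, not something the listings here confirm about the library:

```ruby
# Hypothetical stand-in class; it only mimics the caching pattern.
class ToyCache
  def initialize
    @cache = {}
    @values = []
  end

  def add(value)
    @values << value
    invalidate_cache                 # stored data changed, derived values are stale
  end

  def total
    @cache[:total] ||= @values.sum   # computed once, then reused
  end

  def invalidate_cache
    @cache = {}
  end

  def cache_set?
    !@cache.empty?                   # plain-Ruby substitute for present?
  end
end

toy = ToyCache.new
toy.add(2)
toy.add(3)
before = toy.cache_set?   # nothing has been computed yet
sum = toy.total           # fills the cache
after = toy.cache_set?
toy.invalidate_cache
reset = toy.cache_set?
```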

#remove_classification(classification) ⇒ Object

Removes the classification from the classifier. Returns the classification if it existed, else nil.

b.remove_classification(:not_spam)
=>  :not_spam


# File 'lib/reclassifier/bayes.rb', line 142

def remove_classification(classification)
  return_value = if @classifications.include?(classification)
                   classification
                 else
                   nil
                 end

  @classifications.delete(classification)

  return_value
end

#train(classification, text) ⇒ Object

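Provides a general training method for any classification, such as those passed to Bayes.new. Each call counts the text as one more document for the classification and adds the text's word counts to it.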

b = Reclassifier::Bayes.new([:this, :that])
b.train(:this, "This text")
b.train(:that, "That text")


# File 'lib/reclassifier/bayes.rb', line 39

def train(classification, text)
  ensure_classification_exists(classification)

  update_doc_count(classification, 1)

  smart_word_hash(text).each do |word, count|
    @classifications[classification][word] ||= 0

    @classifications[classification][word] += count
  end
end
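The loop in the listing accumulates per-word counts for one classification. A minimal standalone sketch of that accumulation follows. The tokenizer `toy_word_hash` is a hypothetical stand-in: the library's `smart_word_hash` is not shown here and may clean, stem or filter words differently.

```ruby
# Hypothetical tokenizer: lowercases the text and counts runs of letters.
def toy_word_hash(text)
  text.downcase.scan(/[a-z]+/).each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
end

# Word counts for a single classification, built up as in the train loop.
counts = {}
toy_word_hash("This text, this TEXT again").each do |word, count|
  counts[word] ||= 0
  counts[word] += count
end
```

Training on further texts would keep adding to the same `counts` hash, which is why repeated words weigh more heavily in later scoring.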

#untrain(classification, text) ⇒ Object

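Untrains a (classification, text) pair by subtracting the text's word counts from the classification. Be very careful with this method: as the source below shows, it does not check that the pair was trained earlier and does not stop the word counts at zero, so untraining text that was never trained can leave negative counts.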

b = Reclassifier::Bayes.new([:this, :that])
b.train(:this, "This text")
b.untrain(:this, "This text")


# File 'lib/reclassifier/bayes.rb', line 59

def untrain(classification, text)
  ensure_classification_exists(classification)

  update_doc_count(classification, -1)

  smart_word_hash(text).each do |word, count|
    @classifications[classification][word] -= count if @classifications[classification].include?(word)
  end
end
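To make the risk concrete, here is a standalone sketch of the subtraction loop applied twice to the same text. It works on a plain hash, and the helper name `untrain_counts` is invented for illustration:

```ruby
counts = { "this" => 1, "text" => 1 }

# Hypothetical helper with the same subtraction logic as the untrain loop.
untrain_counts = lambda do |word_hash|
  word_hash.each do |word, count|
    counts[word] -= count if counts.include?(word)
  end
end

untrain_counts.call({ "this" => 1, "text" => 1 })  # counts drop back to zero
untrain_counts.call({ "this" => 1, "text" => 1 })  # same text again: counts go negative

# With a count of -1, the add-one smoothed term becomes log(0).
hazard = Math.log(counts["this"] + 1)
```

In Ruby `Math.log(0)` evaluates to negative infinity, and a count below -1 would make the argument negative, which raises Math::DomainError. Untraining only pairs that were actually trained avoids both cases.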