Class: Reclassifier::Bayes
- Inherits:
- Object
- Includes:
- WordHash
- Defined in:
- lib/reclassifier/bayes.rb
Overview
Bayesian classifier for arbitrary text.
Implementation is translated from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press. 2008, ISBN 0521865719.
Derived quantities are cached to improve performance of repeated #classify calls.
Constant Summary
Constants included from WordHash
Instance Method Summary
-
#add_classification(classification) ⇒ Object
Adds the classification to the classifier.
-
#cache_set? ⇒ Boolean
Returns true if the cache has been set (i.e. #classify has been run).
-
#calculate_scores(text) ⇒ Object
Returns the scores of the specified text for each classification.
-
#classifications ⇒ Object
Provides a list of classification names.
-
#classify(text) ⇒ Object
Returns the classification of the specified text, which is one of the classifications given in the initializer.
-
#initialize(classifications = [], options = {}) ⇒ Bayes
constructor
Can be created with zero or more classifications, each of which will be initialized and given a training method.
-
#invalidate_cache ⇒ Object
Invalidates the cache.
-
#remove_classification(classification) ⇒ Object
Removes the classification from the classifier.
-
#train(classification, text) ⇒ Object
Provides a general training method for all classifications specified in Bayes#new.
-
#untrain(classification, text) ⇒ Object
Untrain a (classification, text) pair.
Methods included from WordHash
#clean_word_hash, #word_hash, #word_hash_for_words
Constructor Details
#initialize(classifications = [], options = {}) ⇒ Bayes
Can be created with zero or more classifications, each of which will be initialized and given a training method. The classifications are specified as an array of symbols. Options are specified in a hash.
Options:
-
:clean - If false, punctuation will be included in the classifier. Otherwise, punctuation will be omitted. Default is true.
b = Reclassifier::Bayes.new([:interesting, :uninteresting, :spam], :clean => true)
# File 'lib/reclassifier/bayes.rb', line 24
def initialize(classifications = [], options = {})
  @classifications = {}
  @docs_in_classification_count = {}
  @options = options

  classifications.each {|classification| add_classification(classification)}
end
Instance Method Details
#add_classification(classification) ⇒ Object
Adds the classification to the classifier. Has no effect if the classification already existed. Returns the classification.
b.add_classification(:not_spam)
=> :not_spam
# File 'lib/reclassifier/bayes.rb', line 127
def add_classification(classification)
  @classifications[classification] ||= {}

  @docs_in_classification_count[classification] ||= 0

  classification
end
#cache_set? ⇒ Boolean
Returns true if the cache has been set (i.e. #classify has been run). Returns false otherwise.
classifier = Reclassifier::Bayes.new([:one, :other])
classifier.cache_set?
=> false
classifier.train(:one, 'bbb')
classifier.train(:other, 'aaa')
classifier.classify('aaa')
classifier.cache_set?
=> true
# File 'lib/reclassifier/bayes.rb', line 191
def cache_set?
  @cache.present?
end
#calculate_scores(text) ⇒ Object
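Returns the scores of the specified text for each classification. Each score is the log prior of the classification plus the add-one-smoothed log likelihood of every word in the text that was seen during training; unseen words are ignored.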
b.calculate_scores("I hate bad words and you")
=> {:uninteresting=>-12.6997928013932, :interesting=>-18.4206807439524}
The largest of these scores (the one closest to 0) is the one picked out by #classify.
# File 'lib/reclassifier/bayes.rb', line 76
def calculate_scores(text)
  scores = {}

  @cache[:total_docs_classified_log] ||= Math.log(@docs_in_classification_count.values.reduce(:+))
  @cache[:words_classified] ||= @classifications.values.reduce(Set.new) {|set, word_counts| set.merge(word_counts.keys)}

  @classifications.each do |classification, classification_word_counts|
    # prior
    scores[classification] = Math.log(@docs_in_classification_count[classification])
    scores[classification] -= @cache[:total_docs_classified_log]

    # likelihood
    classification_word_count = classification_word_counts.values.reduce(:+).to_i
    smart_word_hash(text).each do |word, count|
      if @cache[:words_classified].include?(word)
        scores[classification] += count * Math.log((classification_word_counts[word] || 0) + 1)
        scores[classification] -= count * Math.log(classification_word_count + @cache[:words_classified].count)
      end
    end
  end

  scores
end
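The scoring in the listing above amounts to a log prior plus add-one (Laplace) smoothed log likelihoods. The following is a minimal standalone sketch of that calculation, not the library's code: the method name sketch_scores and the word_counts and doc_counts hashes are hypothetical stand-ins for the classifier's internal state, and the text is assumed to be already reduced to a word => count hash.

```ruby
require 'set'

# Hypothetical stand-in for the scoring step; not part of Reclassifier.
# word_counts: { classification => { word => count } }
# doc_counts:  { classification => number of training documents }
# text_words:  { word => count } for the text being scored
def sketch_scores(word_counts, doc_counts, text_words)
  total_docs_log = Math.log(doc_counts.values.reduce(:+))
  vocabulary = word_counts.values.reduce(Set.new) { |set, counts| set.merge(counts.keys) }

  word_counts.each_with_object({}) do |(classification, counts), scores|
    # prior: log(documents in classification / total documents)
    score = Math.log(doc_counts[classification]) - total_docs_log

    # likelihood: add-one smoothing over the known vocabulary;
    # words never seen in training are skipped
    total_words = counts.values.reduce(0, :+)
    text_words.each do |word, count|
      next unless vocabulary.include?(word)
      score += count * Math.log((counts[word] || 0) + 1)
      score -= count * Math.log(total_words + vocabulary.size)
    end

    scores[classification] = score
  end
end

# Illustrative training state (made up for this sketch).
word_counts = { :spam => { 'buy' => 2, 'now' => 1 }, :ham => { 'hello' => 1, 'now' => 1 } }
doc_counts  = { :spam => 1, :ham => 1 }

scores = sketch_scores(word_counts, doc_counts, { 'buy' => 1, 'unseen' => 1 })
best = scores.max_by { |_, score| score }.first # :spam
```

Working in log space turns the product of many small probabilities into a sum, which avoids floating-point underflow on long texts.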
#classifications ⇒ Object
Provides a list of classification names.
b.classifications
=> [:this, :that, :the_other]
# File 'lib/reclassifier/bayes.rb', line 116
def classifications
  @classifications.keys
end
#classify(text) ⇒ Object
Returns the classification of the specified text, which is one of the classifications given in the initializer.
b.classify("I hate bad words and you")
=> :uninteresting
# File 'lib/reclassifier/bayes.rb', line 107
def classify(text)
  calculate_scores(text.to_s).max_by {|classification| classification[1]}[0]
end
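The selection step relies on Enumerable#max_by, which yields [classification, score] pairs when called on a hash. A small sketch with made-up scores (the values are illustrative, not output from the library):

```ruby
# Made-up log scores; closer to 0 means more likely.
scores = { :interesting => -18.4, :uninteresting => -12.7 }

# max_by yields [key, value] pairs: pair[1] is the score, pair[0] the classification.
best_pair = scores.max_by { |pair| pair[1] }
best = best_pair[0] # :uninteresting
```

On an empty hash max_by returns nil, so indexing the result with [0] raises NoMethodError; a classifier with no classifications would presumably hit this case.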
#invalidate_cache ⇒ Object
Invalidates the cache.
classifier = Reclassifier::Bayes.new([:one, :other])
classifier.train(:one, 'bbb')
classifier.train(:other, 'aaa')
classifier.classify('aaa')
classifier.cache_set?
=> true
classifier.invalidate_cache
classifier.cache_set?
=> false
# File 'lib/reclassifier/bayes.rb', line 171
def invalidate_cache
  @cache = {}
end
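The cache methods follow a memoize-and-invalidate pattern. The sketch below illustrates it with a hypothetical CachedTotals class that is not part of Reclassifier. It uses plain Ruby checks in place of present?, which comes from ActiveSupport rather than core Ruby.

```ruby
# Hypothetical class illustrating memoize-and-invalidate; not library code.
class CachedTotals
  def initialize
    @counts = Hash.new(0)
    @cache = {}
  end

  # Any change to the underlying data makes cached values stale.
  def add(key, amount)
    @counts[key] += amount
    invalidate_cache
  end

  # Computed once, then served from the cache until invalidated.
  def total
    @cache[:total] ||= @counts.values.reduce(0, :+)
  end

  def invalidate_cache
    @cache = {}
  end

  # Plain-Ruby equivalent of ActiveSupport's @cache.present? for a hash.
  def cache_set?
    !@cache.nil? && !@cache.empty?
  end
end

totals = CachedTotals.new
totals.add(:one, 2)
before = totals.cache_set? # false: nothing computed yet
totals.total               # fills the cache
after = totals.cache_set?  # true
totals.add(:other, 3)      # changes the data and invalidates the cache
stale = totals.cache_set?  # false again
```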
#remove_classification(classification) ⇒ Object
Removes the classification from the classifier. Returns the classification if it existed, else nil.
b.remove_classification(:not_spam)
=> :not_spam
# File 'lib/reclassifier/bayes.rb', line 142
def remove_classification(classification)
  return_value = if @classifications.include?(classification)
    classification
  else
    nil
  end

  @classifications.delete(classification)

  return_value
end
#train(classification, text) ⇒ Object
Provides a general training method for all classifications specified in Bayes#new.
b = Reclassifier::Bayes.new([:this, :that])
b.train(:this, "This text")
b.train(:that, "That text")
# File 'lib/reclassifier/bayes.rb', line 39
def train(classification, text)
  ensure_classification_exists(classification)

  update_doc_count(classification, 1)

  smart_word_hash(text).each do |word, count|
    @classifications[classification][word] ||= 0
    @classifications[classification][word] += count
  end
end
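Training accumulates per-word counts for a classification. Here is a minimal sketch of that accumulation, assuming a naive whitespace tokenizer named naive_word_hash as a stand-in for the library's smart_word_hash, whose actual behavior (punctuation handling, for example) is not shown in this section. Both method names are hypothetical.

```ruby
# Hypothetical stand-in for smart_word_hash: lowercases and splits on whitespace.
def naive_word_hash(text)
  text.downcase.split.each_with_object(Hash.new(0)) { |word, hash| hash[word] += 1 }
end

# Adds the text's word counts to the running counts for one classification.
def accumulate(counts, text)
  naive_word_hash(text).each do |word, count|
    counts[word] ||= 0
    counts[word] += count
  end
  counts
end

counts = {}
accumulate(counts, 'This text')
accumulate(counts, 'more text')
# counts is now { 'this' => 1, 'text' => 2, 'more' => 1 }
```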
#untrain(classification, text) ⇒ Object
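Untrains a (classification, text) pair. Be very careful with this method: it subtracts the text's word counts from the classification without checking that the pair was ever trained.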
b = Reclassifier::Bayes.new([:this, :that])
b.train(:this, "This text")
b.untrain(:this, "This text")
# File 'lib/reclassifier/bayes.rb', line 59
def untrain(classification, text)
  ensure_classification_exists(classification)

  update_doc_count(classification, -1)

  smart_word_hash(text).each do |word, count|
    @classifications[classification][word] -= count if @classifications[classification].include?(word)
  end
end
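Because the listing subtracts counts without checking that they were ever added, a stored count can reach zero or go negative. One way to guard against that is to clamp at zero and drop exhausted words. The sketch below is a hypothetical variant operating on a bare hash, named safe_untrain for illustration; it is not the library's behavior.

```ruby
# Hypothetical helper: subtracts word counts but never lets one go negative.
# counts:     { word => count } for one classification
# text_words: { word => count } for the text being untrained
def safe_untrain(counts, text_words)
  text_words.each do |word, count|
    next unless counts.include?(word)
    remaining = counts[word] - count
    if remaining > 0
      counts[word] = remaining
    else
      counts.delete(word) # exhausted: remove rather than store 0 or a negative
    end
  end
  counts
end

counts = { 'this' => 1, 'text' => 2 }
safe_untrain(counts, { 'this' => 1, 'text' => 1, 'other' => 4 })
# counts is now { 'text' => 1 }
```

The guard matters for scoring: in Ruby, Math.log(0) evaluates to -Infinity and Math.log of a negative number raises Math::DomainError, so a negative stored count could break the likelihood term computed by calculate_scores.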