Class: Basset::NaiveBayes
- Inherits: Object
  - Object
  - Basset::NaiveBayes
- Includes: YamlSerialization
- Defined in: lib/basset/naive_bayes.rb
Overview
A class for running Naive Bayes classification. Training documents are added to the classifier with #add_document. Once they are added, the classifier can be used to classify new documents with #classify.
Defined Under Namespace
Classes: FeatureCount
Instance Attribute Summary
- #feature_counts ⇒ Object (readonly)
  Returns the value of attribute feature_counts.
- #total_docs ⇒ Object (readonly)
  Returns the value of attribute total_docs.
- #total_docs_in_class ⇒ Object (readonly)
  Returns the value of attribute total_docs_in_class.
Instance Method Summary
- #==(other) ⇒ Object
- #add_document(classification, feature_vector) ⇒ Object
  Takes a classification (for example, a string) and a vector of features, and adds them to the training counts.
- #classes ⇒ Object
- #classify(feature_vectors, opts = {:normalize_classes=>true}) ⇒ Object
  Returns the most likely class for a vector of features, as a [score, classification] pair.
- #initialize ⇒ NaiveBayes (constructor)
  A new instance of NaiveBayes.
- #occurrences_of_all_features_in_class(classification) ⇒ Object
  The total number of times all features occur in a given class.
- #probability_of_vector_for_class(feature_vector, classification) ⇒ Object
  Returns the log10 of the smoothed probability of a single feature given the class.
- #probability_of_vectors_for_class(feature_vectors, classification, opts = {:normalize=>false}) ⇒ Object
  Gives a score for the probability of feature_vectors being in class classification.
Methods included from YamlSerialization
Constructor Details
#initialize ⇒ NaiveBayes
Returns a new instance of NaiveBayes.
    # File 'lib/basset/naive_bayes.rb', line 13
    def initialize
      @total_docs = 0
      @total_docs_in_class = Hash.new(0)
      @feature_counts = {}
      @occurrences_of_all_features_in_class = {}
    end
Instance Attribute Details
#feature_counts ⇒ Object (readonly)
Returns the value of attribute feature_counts.
    # File 'lib/basset/naive_bayes.rb', line 11
    def feature_counts
      @feature_counts
    end
#total_docs ⇒ Object (readonly)
Returns the value of attribute total_docs.
    # File 'lib/basset/naive_bayes.rb', line 11
    def total_docs
      @total_docs
    end
#total_docs_in_class ⇒ Object (readonly)
Returns the value of attribute total_docs_in_class.
    # File 'lib/basset/naive_bayes.rb', line 11
    def total_docs_in_class
      @total_docs_in_class
    end
Instance Method Details
#==(other) ⇒ Object
    # File 'lib/basset/naive_bayes.rb', line 98
    def ==(other)
      other.is_a?(self.class) &&
        other.total_docs == total_docs &&
        other.total_docs_in_class == total_docs_in_class &&
        other.feature_counts == feature_counts
    end
#add_document(classification, feature_vector) ⇒ Object
Takes a classification (for example, a string) and a vector of features. Each feature must respond to name and value. The document count for the class and the overall document count are incremented, and each feature's value is added to that feature's count for the class.
    # File 'lib/basset/naive_bayes.rb', line 22
    def add_document(classification, feature_vector)
      reset_cached_probabilities

      @total_docs_in_class[classification] += 1
      @total_docs += 1

      feature_vector.each do |feature|
        @feature_counts[feature.name] ||= FeatureCount.new(feature.name)
        @feature_counts[feature.name].add_count_for_class(feature.value, classification)
      end
    end
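The bookkeeping in add_document can be sketched in isolation. The source of FeatureCount and of the feature objects is not shown on this page, so SketchFeatureCount and SketchFeature below are hypothetical stand-ins that assume a per-class running total and a name/value pair. The tally helper is also an assumption made for illustration, and the call to reset_cached_probabilities is left out.

```ruby
# Hypothetical stand-ins: the real feature and FeatureCount classes are not
# shown on this page, so their behavior here is assumed.
SketchFeature = Struct.new(:name, :value)

class SketchFeatureCount
  def initialize(name)
    @name = name
    @counts = Hash.new(0) # class label => accumulated count
  end

  def add_count_for_class(count, classification)
    @counts[classification] += count
  end

  def count_for_class(classification)
    @counts[classification]
  end
end

# Mirrors the body of add_document over a list of [label, features] pairs.
def tally(training)
  total_docs = 0
  total_docs_in_class = Hash.new(0)
  feature_counts = {}

  training.each do |classification, feature_vector|
    total_docs_in_class[classification] += 1
    total_docs += 1
    feature_vector.each do |feature|
      feature_counts[feature.name] ||= SketchFeatureCount.new(feature.name)
      feature_counts[feature.name].add_count_for_class(feature.value, classification)
    end
  end

  [total_docs, total_docs_in_class, feature_counts]
end

total, per_class, counts = tally([
  [:spam, [SketchFeature.new("offer", 2), SketchFeature.new("hello", 1)]],
  [:ham,  [SketchFeature.new("hello", 3)]]
])
puts total                                  # 2
puts per_class[:spam]                       # 1
puts counts["hello"].count_for_class(:ham)  # 3
```

Note that the feature's value, not a fixed 1, is what gets added to the count, so a feature with value 3 in one document weighs as much as three documents that each carry it with value 1.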
#classes ⇒ Object
    # File 'lib/basset/naive_bayes.rb', line 34
    def classes
      @total_docs_in_class.keys
    end
#classify(feature_vectors, opts = {:normalize_classes=>true}) ⇒ Object
Returns the most likely class for a vector of features. The return value is a [score, classification] pair, not the bare class; the score is a sum of log10 values. When opts[:normalize_classes] is truthy, the log10 of the class probability is added to each score. Because the default hash is replaced wholesale, passing an opts hash without :normalize_classes skips that term.
    # File 'lib/basset/naive_bayes.rb', line 39
    def classify(feature_vectors, opts={:normalize_classes=>true})
      class_probabilities = []

      classes.each do |classification|
        class_probability = 0
        class_probability += Math.log10(probability_of_class(classification)) if opts[:normalize_classes]
        class_probability += probability_of_vectors_for_class(feature_vectors, classification)
        class_probabilities << [class_probability, classification]
      end

      # this next bit picks a random item first
      # this covers the case that all the class probabilities are equal and we need to randomly select a class
      max = class_probabilities.pick_random
      class_probabilities.each do |cp|
        max = cp if cp.first > max.first
      end
      max
    end
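The selection step at the end of classify can be sketched on its own. pick_random is not part of core Ruby and its definition is not shown on this page; the sketch assumes it returns a random element and substitutes Array#sample. The helper name best_class_pair is hypothetical.

```ruby
# Selects the highest-scoring [score, classification] pair.
# Array#sample stands in for pick_random, which is assumed to return a
# random element of the array.
def best_class_pair(class_probabilities)
  max = class_probabilities.sample
  class_probabilities.each do |cp|
    # Strict comparison: an equal score never displaces the current pick.
    max = cp if cp.first > max.first
  end
  max
end

scores = [[-4.2, :spam], [-1.7, :ham], [-3.0, :other]]
score, label = best_class_pair(scores)
puts label  # ham
puts score  # -1.7
```

Callers that only need the class can destructure the pair as shown, or take its last element.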
#occurrences_of_all_features_in_class(classification) ⇒ Object
The total number of times all features occur in a given class. The result is cached per class after the first call.
    # File 'lib/basset/naive_bayes.rb', line 87
    def occurrences_of_all_features_in_class(classification)
      # return the cached value, if there is one
      return @occurrences_of_all_features_in_class[classification] if @occurrences_of_all_features_in_class[classification]

      @feature_counts.each_value do |feature_count|
        @occurrences_of_all_features_in_class[classification] ||= 0
        @occurrences_of_all_features_in_class[classification] += feature_count.count_for_class(classification)
      end
      @occurrences_of_all_features_in_class[classification]
    end
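The caching pattern used here can be sketched with plain hashes. OccurrenceCache is a hypothetical class, and the nested hash stands in for the FeatureCount objects, whose source is not shown on this page. One deliberate difference: the sketch returns 0 when no features have been counted, whereas the method above returns nil in that case.

```ruby
# A minimal memoized per-class total, assuming counts are stored as
# feature name => { class label => count }.
class OccurrenceCache
  def initialize(feature_counts)
    @feature_counts = feature_counts
    @cache = {}
  end

  def occurrences_of_all_features_in_class(classification)
    # 0 is truthy in Ruby, so a zero total is cached like any other value.
    @cache[classification] ||=
      @feature_counts.values.sum { |counts| counts.fetch(classification, 0) }
  end
end

cache = OccurrenceCache.new(
  "hello" => { spam: 1, ham: 3 },
  "offer" => { spam: 4 }
)
puts cache.occurrences_of_all_features_in_class(:spam)  # 5
puts cache.occurrences_of_all_features_in_class(:ham)   # 3
```

A cache like this goes stale whenever counts change, which is presumably why add_document calls reset_cached_probabilities before updating them.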
#probability_of_vector_for_class(feature_vector, classification) ⇒ Object
Returns the log10 of the smoothed probability of a single feature given the class, scaled by the feature's value. Despite the parameter name, feature_vector is one feature that responds to name and value. Features the classifier has never seen are given a count of 0, and 0.1 is added to every count as smoothing.
    # File 'lib/basset/naive_bayes.rb', line 79
    def probability_of_vector_for_class(feature_vector, classification)
      # the reason the rescue 0 is in there is tricky
      # because of the removal of redundant unigrams, it's possible that one of the features is never used/initialized
      decimal_probability = (((@feature_counts[feature_vector.name].count_for_class(classification) rescue 0) + 0.1)/ occurrences_of_all_features_in_class(classification).to_f) * feature_vector.value
      Math.log10(decimal_probability)
    end
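The arithmetic in this method can be checked with plain numbers. The helper name smoothed_log_score and the sample counts below are hypothetical; only the formula is taken from the method above.

```ruby
# log10 of ((count + 0.1) / total occurrences in class) * feature value.
# The 0.1 is the additive smoothing constant from the method above.
def smoothed_log_score(count_in_class, total_in_class, feature_value)
  decimal_probability =
    ((count_in_class + 0.1) / total_in_class.to_f) * feature_value
  Math.log10(decimal_probability)
end

# A feature never seen in the class still gets a finite score:
# (0 + 0.1) / 100 = 0.001, so the result is close to -3.
unseen = smoothed_log_score(0, 100, 1)

# A feature seen 5 times out of 100 scores higher (less negative).
seen = smoothed_log_score(5, 100, 1)

puts seen > unseen  # true
```

Two edge cases follow from the formula: a feature value of 0 yields negative infinity, and a class with no counted features makes the denominator 0.0.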
#probability_of_vectors_for_class(feature_vectors, classification, opts = {:normalize=>false}) ⇒ Object
Gives a score for the probability of feature_vectors being in class classification. The score is the sum of the per-feature log10 scores.
This score can be normalized to the number of feature vectors by passing :normalize => true in the third argument.
The score is not normalized for the relative probabilities of each class.
    # File 'lib/basset/naive_bayes.rb', line 66
    def probability_of_vectors_for_class(feature_vectors, classification, opts={:normalize=>false})
      probability = 0
      feature_vectors.each do |feature_vector|
        probability += probability_of_vector_for_class(feature_vector, classification)
      end
      if opts[:normalize]
        probability / feature_vectors.count.to_f
      else
        probability
      end
    end
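The effect of :normalize can be sketched with a list of per-feature log scores. The helper name combined_score is hypothetical, and it takes a keyword argument rather than the opts hash used above, purely to keep the sketch short.

```ruby
# Sums per-feature log10 scores; with normalize: true the sum is divided
# by the number of features, giving a per-feature average.
def combined_score(log_scores, normalize: false)
  total = log_scores.sum
  normalize ? total / log_scores.size.to_f : total
end

scores = [-1.0, -2.0, -3.0]
puts combined_score(scores)                   # -6.0
puts combined_score(scores, normalize: true)  # -2.0
```

Averaging makes scores comparable between documents with different numbers of features, since an unnormalized sum of log scores grows more negative as features are added.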