Class: Basset::NaiveBayes

Inherits:
Object
Includes:
YamlSerialization
Defined in:
lib/basset/naive_bayes.rb

Overview

A class for running Naive Bayes classification. Documents are added to the classifier. Once they are added it can be used to classify new documents.

Defined Under Namespace

Classes: FeatureCount

Methods included from YamlSerialization

included, #save_to_file

Constructor Details

#initialize ⇒ NaiveBayes

Returns a new instance of NaiveBayes.



# File 'lib/basset/naive_bayes.rb', line 13

def initialize
  @total_docs = 0
  @total_docs_in_class = Hash.new(0)
  @feature_counts = {}
  @occurrences_of_all_features_in_class = {}
end

Instance Attribute Details

#feature_counts ⇒ Object (readonly)

Returns the value of attribute feature_counts.



# File 'lib/basset/naive_bayes.rb', line 11

def feature_counts
  @feature_counts
end

#total_docs ⇒ Object (readonly)

Returns the value of attribute total_docs.



# File 'lib/basset/naive_bayes.rb', line 11

def total_docs
  @total_docs
end

#total_docs_in_class ⇒ Object (readonly)

Returns the value of attribute total_docs_in_class.



# File 'lib/basset/naive_bayes.rb', line 11

def total_docs_in_class
  @total_docs_in_class
end

Instance Method Details

#==(other) ⇒ Object



# File 'lib/basset/naive_bayes.rb', line 98

def ==(other)
  other.is_a?(self.class) && other.total_docs == total_docs && 
  other.total_docs_in_class == total_docs_in_class && other.feature_counts == feature_counts
end

#add_document(classification, feature_vector) ⇒ Object

Takes a classification (for example, a string) and a vector of features, and updates the document and feature counts for that class.



# File 'lib/basset/naive_bayes.rb', line 22

def add_document(classification, feature_vector)
  reset_cached_probabilities

  @total_docs_in_class[classification] += 1
  @total_docs += 1
  
  feature_vector.each do |feature|
    @feature_counts[feature.name] ||= FeatureCount.new(feature.name)
    @feature_counts[feature.name].add_count_for_class(feature.value, classification)
  end
end
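
The bookkeeping that add_document performs can be sketched without the library. This is an illustration only, under stated assumptions: Feature is a hypothetical Struct standing in for Basset's feature objects (anything that responds to name and value), tally_documents is a hypothetical helper, and nested hashes stand in for FeatureCount, whose internals are not shown on this page. The sketch assumes add_count_for_class accumulates the feature value per class.

```ruby
# Hypothetical stand-in for a Basset feature: responds to #name and #value.
Feature = Struct.new(:name, :value)

# Hypothetical helper: tallies documents and feature values per class,
# mirroring the counters that add_document updates.
def tally_documents(training)
  total_docs = 0
  total_docs_in_class = Hash.new(0)
  # feature name => { classification => accumulated value };
  # an assumed stand-in for FeatureCount.
  feature_counts = Hash.new { |hash, name| hash[name] = Hash.new(0) }

  training.each do |classification, feature_vector|
    total_docs_in_class[classification] += 1
    total_docs += 1
    feature_vector.each do |feature|
      feature_counts[feature.name][classification] += feature.value
    end
  end

  { total_docs: total_docs,
    total_docs_in_class: total_docs_in_class,
    feature_counts: feature_counts }
end

training = [
  [:spam, [Feature.new("viagra", 2), Feature.new("hello", 1)]],
  [:ham,  [Feature.new("hello", 3)]]
]
p tally_documents(training)
```

Each document increments both the global and the per-class document count exactly once, while each feature contributes its value to the count for its own name and class.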

#classes ⇒ Object



# File 'lib/basset/naive_bayes.rb', line 34

def classes
  @total_docs_in_class.keys
end

#classify(feature_vectors, opts = {:normalize_classes=>true}) ⇒ Object

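Returns the most likely class given a vector of features, as a [score, classification] pair. The score is a sum of base-10 logarithms; the class prior is included only when opts[:normalize_classes] is truthy, which is the default.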



# File 'lib/basset/naive_bayes.rb', line 39

def classify(feature_vectors, opts={:normalize_classes=>true})
  class_probabilities = []
  
  classes.each do |classification|
    class_probability = 0
    class_probability += Math.log10(probability_of_class(classification)) if opts[:normalize_classes]
    class_probability += probability_of_vectors_for_class(feature_vectors, classification)
    class_probabilities << [class_probability, classification]
  end
  
  # this next bit picks a random item first
  # this covers the case that all the class probabilities are equal and we need to randomly select a class
  max = class_probabilities.pick_random
  class_probabilities.each do |cp|
    max = cp if cp.first > max.first
  end
  max
end
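
The scoring and selection steps in classify can be sketched in isolation. The helper names score_classes and pick_best are hypothetical, the priors and log-likelihoods are made-up inputs, and Array#sample from core Ruby is used as a stand-in for pick_random, which is not a core Ruby method and is assumed to be an extension supplied elsewhere in Basset.

```ruby
# Builds [score, classification] pairs in log10 space.
# priors: classification => P(class); log_likelihoods: classification => summed log10 feature scores.
def score_classes(priors, log_likelihoods, normalize_classes: true)
  priors.keys.map do |classification|
    score = 0
    score += Math.log10(priors[classification]) if normalize_classes
    score += log_likelihoods[classification]
    [score, classification]
  end
end

# Starts from a random pair so that, when every score is equal,
# the returned class is chosen at random rather than by position.
def pick_best(class_probabilities)
  best = class_probabilities.sample # stand-in for pick_random
  class_probabilities.each do |pair|
    best = pair if pair.first > best.first
  end
  best
end

priors = { spam: 0.25, ham: 0.75 }
log_likelihoods = { spam: -2.0, ham: -2.3 }

p pick_best(score_classes(priors, log_likelihoods))
p pick_best(score_classes(priors, log_likelihoods, normalize_classes: false))
```

With these inputs the prior tips the decision toward ham, while dropping the prior leaves spam ahead on feature evidence alone.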

#occurrences_of_all_features_in_class(classification) ⇒ Object

The total number of times all features occur in a given class. The result is cached per class.



# File 'lib/basset/naive_bayes.rb', line 87

def occurrences_of_all_features_in_class(classification)
  # return the cached value, if there is one
  return @occurrences_of_all_features_in_class[classification] if @occurrences_of_all_features_in_class[classification]

  @feature_counts.each_value do |feature_count|
    @occurrences_of_all_features_in_class[classification] ||= 0
    @occurrences_of_all_features_in_class[classification] += feature_count.count_for_class(classification)
  end
  @occurrences_of_all_features_in_class[classification]
end
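
The per-class caching used here can be sketched as a small standalone class. OccurrenceCache is a hypothetical name and is not part of Basset; nested hashes again stand in for FeatureCount. Unlike the method above, this sketch returns 0 rather than nil when no features have been recorded.

```ruby
# Toy illustration of per-class memoization; not part of Basset.
class OccurrenceCache
  # counts_by_feature: feature name => { classification => count }
  def initialize(counts_by_feature)
    @counts_by_feature = counts_by_feature
    @cache = {}
  end

  # Sums the counts of every feature for one class, computing the sum
  # only on the first call for that class.
  def total_for(classification)
    @cache[classification] ||= @counts_by_feature.each_value.inject(0) do |sum, per_class|
      sum + per_class.fetch(classification, 0)
    end
  end

  # Must be called whenever the underlying counts change.
  def reset
    @cache = {}
  end
end

counts = { "hello" => { spam: 1, ham: 3 }, "viagra" => { spam: 2 } }
cache = OccurrenceCache.new(counts)
p cache.total_for(:spam)
```

A cached total goes stale as soon as new counts arrive, which is presumably why add_document calls reset_cached_probabilities before updating anything.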

#probability_of_vector_for_class(feature_vector, classification) ⇒ Object

Returns the base-10 logarithm of the smoothed probability of a single feature given the class, weighted by the feature's value.



# File 'lib/basset/naive_bayes.rb', line 79

def probability_of_vector_for_class(feature_vector, classification)
  # the reason the rescue 0 is in there is tricky
  # because of the removal of redundant unigrams, it's possible that one of the features is never used/initialized
  decimal_probability = (((@feature_counts[feature_vector.name].count_for_class(classification) rescue 0) + 0.1)/ occurrences_of_all_features_in_class(classification).to_f) * feature_vector.value
  Math.log10(decimal_probability)
end
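
The expression above is easier to follow with plain numbers. The following sketch restates it as a standalone function; smoothed_log_probability is a hypothetical name and the counts are invented for illustration.

```ruby
# log10 of ((count + 0.1) / total occurrences in the class) * feature value.
# The 0.1 keeps a never-seen feature from producing a zero probability.
def smoothed_log_probability(count_in_class, total_in_class, value)
  Math.log10(((count_in_class + 0.1) / total_in_class.to_f) * value)
end

seen   = smoothed_log_probability(4, 41, 1)   # (4 + 0.1) / 41, roughly 0.1
unseen = smoothed_log_probability(0, 100, 1)  # (0 + 0.1) / 100, roughly 0.001

puts seen
puts unseen
```

An unseen feature still gets a finite, strongly negative score instead of ruling the class out entirely. Note that a feature value of 0 makes the product 0, and Math.log10 then returns -Infinity.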

#probability_of_vectors_for_class(feature_vectors, classification, opts = {:normalize=>false}) ⇒ Object

Gives a score for the probability of feature_vectors being in class classification.

This score can be normalized to the number of feature vectors by passing :normalize => true in the third argument.

The score is not normalized for the relative probabilities of each class.



# File 'lib/basset/naive_bayes.rb', line 66

def probability_of_vectors_for_class(feature_vectors, classification, opts={:normalize=>false})
  probability = 0
  feature_vectors.each do |feature_vector|
    probability += probability_of_vector_for_class(feature_vector, classification)
  end
  if opts[:normalize]
    probability / feature_vectors.count.to_f
  else
    probability
  end
end
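
The effect of the :normalize option can be seen on precomputed per-feature log scores. In this sketch combined_score is a hypothetical helper and the log values are invented; it only mirrors the sum-or-mean choice made above.

```ruby
# Sums per-feature log10 scores; with normalize: true, divides by the
# number of features so documents of different lengths are comparable.
def combined_score(log_probabilities, normalize: false)
  total = log_probabilities.inject(0) { |sum, log_probability| sum + log_probability }
  normalize ? total / log_probabilities.count.to_f : total
end

short_doc = [-1.0, -2.0]
long_doc  = [-1.0, -2.0, -1.0, -2.0]

p combined_score(short_doc)                   # raw sum
p combined_score(long_doc)                    # raw sum, twice as negative
p combined_score(short_doc, normalize: true)  # mean per feature
p combined_score(long_doc, normalize: true)   # same mean as short_doc
```

Without normalization a longer document always scores lower simply because it has more terms; the per-feature mean removes that length effect.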