Class: NaiveBayesClassifier

Inherits: Object

Defined in: lib/unsupervised-language-detection/naive-bayes-classifier.rb

Instance Attribute Summary

#category_names ⇒ Object

Returns the value of attribute category_names.
#num_categories ⇒ Object readonly

Returns the value of attribute num_categories.
#prior_category_counts ⇒ Object readonly

Returns the value of attribute prior_category_counts.
#prior_token_count ⇒ Object readonly

Returns the value of attribute prior_token_count.

Class Method Summary

.train_em(max_epochs, training_examples) ⇒ Object

Trains a two-category Naive Bayes classifier on unlabeled examples using expectation-maximization (EM).

Instance Method Summary

#classify(tokens) ⇒ Object

Returns the index (not the name) of the category the tokens are classified under.
#get_posterior_category_probabilities(tokens) ⇒ Object

Returns p(category | tokens) for each category, as an array.
#get_prior_category_probability(category_index) ⇒ Object

Returns p(category).
#get_token_probability(token, category_index) ⇒ Object

Returns p(token | category).
#initialize(options = {}) ⇒ NaiveBayesClassifier constructor

Creates a classifier from an options hash (e.g. num_categories: the number of categories we want to classify).
#train(example, category_index, probability = 1) ⇒ Object

Given a labeled training example (i.e., an array of tokens and its probability of belonging to a certain category), update the parameters of the Naive Bayes model.

Constructor Details

#initialize(options = {}) ⇒ `NaiveBayesClassifier`

Parameters

num_categories: number of categories we want to classify. Defaults to 2.

prior_category_counts: array of parameters for a Dirichlet prior that we place on the prior probabilities of each category. (In other words, these are “virtual counts” of the number of times we have seen each category previously.) Set the array to all 0’s if you want to use maximum likelihood estimates. Defaults to values drawn uniformly at random from the unit interval if nothing is set.

prior_token_count: parameter for a beta prior that we place on p(token | category). (In other words, this is a “virtual count” of the number of times we have seen each token previously.) Set to 0 if you want to use maximum likelihood estimates. Defaults to 0.0001.

category_names: names of the categories. Defaults to the category indices converted to strings.
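
To make the "virtual counts" idea concrete, here is a minimal standalone sketch of how a Dirichlet prior smooths the category probabilities. The helper name `smoothed_category_probabilities` is hypothetical and is not part of the library; it only mirrors the arithmetic described above.

```ruby
# Hypothetical helper (not part of the library): category probabilities
# smoothed by Dirichlet "virtual counts".
# observed: (weighted) number of training examples seen per category
# virtual:  virtual counts per category (the prior_category_counts idea)
def smoothed_category_probabilities(observed, virtual)
  totals = observed.zip(virtual).map { |seen, prior| seen + prior }
  denom = totals.sum
  # With no real or virtual counts at all there is nothing to estimate.
  return Array.new(observed.size, 0.0) if denom.zero?
  totals.map { |t| t.to_f / denom }
end

# All-zero virtual counts give the maximum likelihood estimate.
mle = smoothed_category_probabilities([3, 1], [0, 0])
# One virtual count per category pulls the estimate toward uniform.
smoothed = smoothed_category_probabilities([3, 1], [1, 1])
```

With three examples in the first category and one in the second, the maximum likelihood estimate is 3/4 and 1/4, while one virtual count per category gives 4/6 and 2/6.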

# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 10

def initialize(options = {})
  options = {:num_categories => 2,
             :prior_token_count => 0.0001}.merge(options)

  @num_categories = options[:num_categories]
  @prior_token_count = options[:prior_token_count]
  @prior_category_counts = options[:prior_category_counts] || Array.new(@num_categories) { rand }
  @category_names = options[:category_names] || (0..num_categories-1).map(&:to_s).to_a
  
  # `@token_counts[category][token]` is the (weighted) number of times we have seen `token` with this category.
  @token_counts = Array.new(@num_categories) do
    Hash.new { |h, token| h[token] = 0 }
  end
  
  # `@total_token_counts[category]` is always equal to `@token_counts[category].values.sum`.
  @total_token_counts = Array.new(@num_categories, 0)
  
  # `@category_counts[category]` is the (weighted) number of training examples we have seen with this category.
  @category_counts = Array.new(@num_categories, 0)
end

Instance Attribute Details

#category_names ⇒ `Object`

Returns the value of attribute category_names.



# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 3

def category_names
  @category_names
end

#num_categories ⇒ `Object` (readonly)

Returns the value of attribute num_categories.



# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 2

def num_categories
  @num_categories
end

#prior_category_counts ⇒ `Object` (readonly)

Returns the value of attribute prior_category_counts.



# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 2

def prior_category_counts
  @prior_category_counts
end

#prior_token_count ⇒ `Object` (readonly)

Returns the value of attribute prior_token_count.



# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 2

def prior_token_count
  @prior_token_count
end

Class Method Details

.train_em(max_epochs, training_examples) ⇒ `Object`

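Trains a two-category Naive Bayes classifier on unlabeled examples using expectation-maximization (EM). Runs max_epochs passes over training_examples (an array of token arrays) and returns the classifier built in the final pass.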

# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 47

def self.train_em(max_epochs, training_examples)
  prev_classifier = NaiveBayesClassifier.new
  max_epochs.times do
    classifier = NaiveBayesClassifier.new
  
    # E-M training
    training_examples.each do |example|
      # E-step: for each training example, recompute its classification probabilities.
      posterior_category_probs = prev_classifier.get_posterior_category_probabilities(example) 
            
      # M-step: for each category, recompute the probability of generating each token.
      posterior_category_probs.each_with_index do |p, category|
        classifier.train(example, category, p) 
      end
    end    
    prev_classifier = classifier    
    # TODO: add a convergence check, so we can break out early if we want.
  end
  return prev_classifier
end
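
The TODO above mentions a convergence check. One possible approach, sketched here under the assumption that the caller keeps each epoch's posteriors as an array of per-example probability arrays, is to stop once no posterior moves by more than a small tolerance. The helper `posteriors_converged?` and its `tolerance` parameter are hypothetical and not part of the library.

```ruby
# Hypothetical helper (not part of the library): true when no posterior
# changed by more than `tolerance` between two consecutive epochs.
# previous, current: arrays of per-example posterior arrays.
def posteriors_converged?(previous, current, tolerance = 1e-6)
  previous.zip(current).all? do |prev_probs, curr_probs|
    prev_probs.zip(curr_probs).all? { |a, b| (a - b).abs <= tolerance }
  end
end

epoch_1 = [[0.60, 0.40], [0.30, 0.70]]
epoch_2 = [[0.90, 0.10], [0.05, 0.95]]
epoch_3 = [[0.9000001, 0.0999999], [0.05, 0.95]]

still_moving = posteriors_converged?(epoch_1, epoch_2) # large change
settled = posteriors_converged?(epoch_2, epoch_3)      # change below tolerance
```

Comparing posteriors is only one choice; tracking the change in total log-likelihood between epochs would serve the same purpose.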

Instance Method Details

#classify(tokens) ⇒ `Object`

Returns the index (not the name) of the category the tokens are classified under.

# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 69

def classify(tokens)
  max_prob, max_category = -1, -1

  if tokens.empty?      
    # If the example is empty, find the category with the highest prior probability.
    (0..@num_categories - 1).each do |i|
      prior_prob = get_prior_category_probability(i)
      max_prob, max_category = prior_prob, i if prior_prob > max_prob
    end
  else
    # Otherwise, find the category with the highest posterior probability.
    get_posterior_category_probabilities(tokens).each_with_index do |prob, category|
      max_prob, max_category = prob, category if prob > max_prob
    end
  end
  
  return max_category
end
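
Because `classify` returns an index rather than a name, callers typically map the result back through `category_names`. The standalone sketch below shows the same pick-the-largest step on a plain array; the probabilities and the names are made-up illustration values, not output from the library.

```ruby
# Made-up posterior probabilities and category names, for illustration only.
probabilities = [0.2, 0.7, 0.1]
names = %w[en fr de]

# Index of the first largest probability, like the strict `>` comparison
# in #classify, which keeps the earliest category on ties.
best_index = probabilities.index(probabilities.max)
best_name = names[best_index]
```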

#get_posterior_category_probabilities(tokens) ⇒ `Object`

Returns p(category | tokens) for each category, as an array.

# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 89

def get_posterior_category_probabilities(tokens)
  unnormalized_posterior_probs = (0..@num_categories-1).map do |category|
    # Start the product at 1.0 so an empty token list yields 1.0 instead of nil.
    p = tokens.map { |token| get_token_probability(token, category) }.reduce(1.0, :*) # p(tokens | category)
    p * get_prior_category_probability(category) # p(tokens | category) * p(category)
  end
  normalization = unnormalized_posterior_probs.reduce(:+)
  normalization = 1 if normalization == 0
  return unnormalized_posterior_probs.map{ |p| p / normalization }
end
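
Multiplying many small token probabilities can underflow to zero for long token lists. A common workaround, sketched here as a standalone alternative rather than as what the library does, is to add log probabilities and normalize after subtracting the maximum. The helper name `posteriors_from_log_scores` is hypothetical.

```ruby
# Hypothetical helper (not part of the library): turn per-category log scores
# into normalized posteriors. Each score is assumed to be
#   log p(category) + sum over tokens of log p(token | category).
def posteriors_from_log_scores(log_scores)
  max = log_scores.max
  # Subtracting the maximum keeps the largest exponent at exp(0) = 1.0.
  weights = log_scores.map { |score| Math.exp(score - max) }
  total = weights.sum
  weights.map { |w| w / total }
end

# 2000 tokens, each with probability 0.01 under category 0 and 0.011 under category 1.
token_count = 2000
log_scores = [
  Math.log(0.5) + token_count * Math.log(0.01),
  Math.log(0.5) + token_count * Math.log(0.011)
]

direct_product = 0.01**token_count # underflows in floating point
posteriors = posteriors_from_log_scores(log_scores)
```

Here the direct product underflows to 0.0, while the log-space version still yields posteriors that sum to 1 and favor the second category.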

#get_prior_category_probability(category_index) ⇒ `Object`

Returns p(category).

# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 110

def get_prior_category_probability(category_index)
  denom = @category_counts.reduce(:+) + @prior_category_counts.reduce(:+)
  if denom == 0
    return 0
  else
    return (@category_counts[category_index] + @prior_category_counts[category_index]).to_f / denom
  end
end

#get_token_probability(token, category_index) ⇒ `Object`

Returns p(token | category).

# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 100

def get_token_probability(token, category_index)
  denom = @total_token_counts[category_index] + @token_counts[category_index].size * @prior_token_count    
  if denom == 0
    return 0
  else
    return ((@token_counts[category_index][token] || 0) + @prior_token_count).to_f / denom
  end
end
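
The smoothing formula above can be checked by hand on a small count table. The sketch below is a standalone re-statement of that arithmetic on a plain Hash; the helper name `smoothed_token_probability` is hypothetical, and it uses `fetch` so that looking up an unseen token does not add a key to the table.

```ruby
# Hypothetical helper (not part of the library): p(token | category) with a
# virtual count added for every token already seen in that category.
# counts: Hash of token => (weighted) count for a single category.
def smoothed_token_probability(counts, token, prior_token_count)
  denom = counts.values.sum + counts.size * prior_token_count
  return 0.0 if denom.zero?
  (counts.fetch(token, 0) + prior_token_count).to_f / denom
end

counts = { "the" => 3, "cat" => 1 }
seen = smoothed_token_probability(counts, "the", 0.5)   # (3 + 0.5) / (4 + 2 * 0.5)
unseen = smoothed_token_probability(counts, "dog", 0.5) # (0 + 0.5) / (4 + 2 * 0.5)
```

Note that the denominator only counts tokens already seen in the category, so an unseen token still receives a small non-zero probability on top of the seen tokens' mass.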

#train(example, category_index, probability = 1) ⇒ `Object`

Given a labeled training example (i.e., an array of tokens and its probability of belonging to a certain category), update the parameters of the Naive Bayes model.

Parameters

example: an array of tokens.

category_index: the index of the category this example belongs to.

probability: the probability that the example belongs to the category. Defaults to 1.

# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 38

def train(example, category_index, probability = 1)
  example.each do |token|
    @token_counts[category_index][token] += probability
    @total_token_counts[category_index] += probability
  end
  @category_counts[category_index] += probability
end
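
When `probability` is fractional, as in the EM loop of `train_em`, each example contributes partial counts to every category. The standalone sketch below reproduces that bookkeeping with plain arrays and hashes; the `soft_update` lambda and the example posteriors are hypothetical stand-ins for calling `#train` once per category.

```ruby
# Per-category token counts and example counts, mirroring the shape of the
# classifier's internal tables (two categories assumed for illustration).
token_counts = Array.new(2) { Hash.new(0.0) }
category_counts = Array.new(2, 0.0)

# Hypothetical stand-in for calling #train once per category with the
# example's posterior probability as the weight.
soft_update = lambda do |tokens, posteriors|
  posteriors.each_with_index do |weight, category|
    tokens.each { |token| token_counts[category][token] += weight }
    category_counts[category] += weight
  end
end

# An example believed to belong to category 0 with probability 0.8.
soft_update.call(%w[a b a], [0.8, 0.2])
```

After this single update, category 0 holds 1.6 counts of "a" and 0.8 of "b", and category 1 holds 0.4 and 0.2, so the fractional counts across categories add up to the raw token counts.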
