Class: Ai4r::Clusterers::KMeans

Inherits:

Clusterer

Object
Clusterer
Ai4r::Clusterers::KMeans

show all

Defined in:: lib/ai4r/clusterers/k_means.rb

Overview

The k-means algorithm is an algorithm to cluster n objects based on attributes into k partitions, with k < n.

More about K Means algorithm: en.wikipedia.org/wiki/K-means_algorithm

Direct Known Subclasses

BisectingKMeans

Instance Attribute Summary collapse

#centroids ⇒ Object readonly

Returns the value of attribute centroids.
#clusters ⇒ Object readonly

Returns the value of attribute clusters.
#data_set ⇒ Object readonly

Returns the value of attribute data_set.
#history ⇒ Object readonly

Returns the value of attribute history.
#iterations ⇒ Object readonly

Returns the value of attribute iterations.
#number_of_clusters ⇒ Object readonly

Returns the value of attribute number_of_clusters.

Instance Method Summary collapse

#build(data_set, number_of_clusters) ⇒ Object

Build a new clusterer, using data examples found in data_set.
#distance(a, b) ⇒ Object

This function calculates the distance between 2 different instances.
#eval(data_item) ⇒ Object

Classifies the given data item, returning the cluster index it belongs to (0-based).
#initialize ⇒ Object constructor
#sse ⇒ Object

Sum of squared distances of all points to their respective centroids.

Methods inherited from Clusterer

#supports_eval?

Methods included from Data::Parameterizable

#get_parameters, included, #set_parameters

Constructor Details

#initialize ⇒ `Object`

# File 'lib/ai4r/clusterers/k_means.rb', line 57

def initialize
  super()
  @distance_function = nil
  @max_iterations = nil
  @centroid_function = lambda do |data_sets|
    data_sets.collect(&:get_mean_or_mode)
  end
  @centroid_indices = []
  @on_empty = 'eliminate' # default if none specified
  @random_seed = nil
  @rng = nil
  @init_method = :random
  @restarts = 1
  @track_history = false
end

Instance Attribute Details

#centroids ⇒ `Object` (readonly)

Returns the value of attribute centroids.



24
25
26

# File 'lib/ai4r/clusterers/k_means.rb', line 24

def centroids
  @centroids
end

#clusters ⇒ `Object` (readonly)

Returns the value of attribute clusters.



24
25
26

# File 'lib/ai4r/clusterers/k_means.rb', line 24

def clusters
  @clusters
end

#data_set ⇒ `Object` (readonly)

Returns the value of attribute data_set.



24
25
26

# File 'lib/ai4r/clusterers/k_means.rb', line 24

def data_set
  @data_set
end

#history ⇒ `Object` (readonly)

Returns the value of attribute history.



24
25
26

# File 'lib/ai4r/clusterers/k_means.rb', line 24

def history
  @history
end

#iterations ⇒ `Object` (readonly)

Returns the value of attribute iterations.



24
25
26

# File 'lib/ai4r/clusterers/k_means.rb', line 24

def iterations
  @iterations
end

#number_of_clusters ⇒ `Object` (readonly)

Returns the value of attribute number_of_clusters.



24
25
26

# File 'lib/ai4r/clusterers/k_means.rb', line 24

def number_of_clusters
  @number_of_clusters
end

Instance Method Details

#build(data_set, number_of_clusters) ⇒ `Object`

Build a new clusterer, using data examples found in data_set. Items will be clustered in “number_of_clusters” different clusters.

Parameters:

data_set (Object)
number_of_clusters (Object)

Returns:

(Object)

Raises:

(ArgumentError)

# File 'lib/ai4r/clusterers/k_means.rb', line 79

def build(data_set, number_of_clusters)
  @data_set = data_set
  @number_of_clusters = number_of_clusters
  raise ArgumentError, 'Number of clusters larger than data items' if @number_of_clusters > @data_set.data_items.length

  unless @centroid_indices.empty? || @centroid_indices.length == @number_of_clusters
    raise ArgumentError,
          'Length of centroid indices array differs from the specified number of clusters'
  end
  unless @on_empty == 'eliminate' || @on_empty == 'terminate' || @on_empty == 'random' || @on_empty == 'outlier'
    raise ArgumentError,
          'Invalid value for on_empty'
  end

  seed_base = @random_seed
  best_sse = nil
  best_centroids = nil
  best_clusters = nil
  best_iterations = nil

  (@restarts || 1).times do |i|
    @random_seed = seed_base.nil? ? nil : seed_base + i
    @rng = @random_seed.nil? ? Random.new : Random.new(@random_seed)
    @iterations = 0
    @history = [] if @track_history
    calc_initial_centroids
    until stop_criteria_met
      calculate_membership_clusters
      if @track_history
        @history << {
          centroids: @centroids.collect(&:dup),
          assignments: @assignments.dup
        }
      end
      recompute_centroids
    end
    current_sse = sse
    next unless best_sse.nil? || current_sse < best_sse

    best_sse = current_sse
    best_centroids = Marshal.load(Marshal.dump(@centroids))
    best_clusters = Marshal.load(Marshal.dump(@clusters))
    best_iterations = @iterations
  end

  @random_seed = seed_base
  @rng = @random_seed.nil? ? Random.new : Random.new(@random_seed)
  @centroids = best_centroids
  @clusters = best_clusters
  @iterations = best_iterations
  self
end

#distance(a, b) ⇒ `Object`

This function calculates the distance between 2 different instances. By default, it returns the euclidean distance to the power of 2. You can provide a more convenient distance implementation:

1- Overwriting this method

2- Providing a closure to the :distance_function parameter

Parameters:

a (Object)
b (Object)

Returns:

(Object)

# File 'lib/ai4r/clusterers/k_means.rb', line 167

def distance(a, b)
  return @distance_function.call(a, b) if @distance_function

  Ai4r::Data::Proximity.squared_euclidean_distance(
    a.select { |att_a| att_a.is_a? Numeric },
    b.select { |att_b| att_b.is_a? Numeric }
  )
end

#eval(data_item) ⇒ `Object`

Classifies the given data item, returning the cluster index it belongs to (0-based).

Parameters:

data_item (Object)

Returns:

(Object)

# File 'lib/ai4r/clusterers/k_means.rb', line 136

def eval(data_item)
  get_min_index(@centroids.collect do |centroid|
    distance(data_item, centroid)
  end)
end

#sse ⇒ `Object`

Sum of squared distances of all points to their respective centroids. It can be used as a measure of cluster compactness (SSE).

Returns:

(Object)

# File 'lib/ai4r/clusterers/k_means.rb', line 145

def sse
  sum = 0.0
  @clusters.each_with_index do |cluster, i|
    centroid = @centroids[i]
    cluster.data_items.each do |item|
      sum += distance(item, centroid)
    end
  end
  sum
end

Class: Ai4r::Clusterers::KMeans

Overview

Direct Known Subclasses

Instance Attribute Summary collapse

Instance Method Summary collapse

Methods inherited from Clusterer

Methods included from Data::Parameterizable

Constructor Details

#initialize ⇒ Object

Instance Attribute Details

#centroids ⇒ Object (readonly)

#clusters ⇒ Object (readonly)

#data_set ⇒ Object (readonly)

#history ⇒ Object (readonly)

#iterations ⇒ Object (readonly)

#number_of_clusters ⇒ Object (readonly)

Instance Method Details

#build(data_set, number_of_clusters) ⇒ Object

#distance(a, b) ⇒ Object

#eval(data_item) ⇒ Object

#sse ⇒ Object

#initialize ⇒ `Object`

#centroids ⇒ `Object` (readonly)

#clusters ⇒ `Object` (readonly)

#data_set ⇒ `Object` (readonly)

#history ⇒ `Object` (readonly)

#iterations ⇒ `Object` (readonly)

#number_of_clusters ⇒ `Object` (readonly)

#build(data_set, number_of_clusters) ⇒ `Object`

#distance(a, b) ⇒ `Object`

#eval(data_item) ⇒ `Object`

#sse ⇒ `Object`