Class: Ai4r::Clusterers::KMeans

Inherits:
Clusterer show all
Defined in:
lib/ai4r/clusterers/k_means.rb

Overview

The k-means algorithm is an algorithm to cluster n objects based on attributes into k partitions, with k < n.

More about K Means algorithm: en.wikipedia.org/wiki/K-means_algorithm

Direct Known Subclasses

BisectingKMeans

Instance Attribute Summary collapse

Instance Method Summary collapse

Methods inherited from Clusterer

#supports_eval?

Methods included from Data::Parameterizable

#get_parameters, included, #set_parameters

Constructor Details

#initializeObject



57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
# File 'lib/ai4r/clusterers/k_means.rb', line 57

def initialize
  super()
  @distance_function = nil
  @max_iterations = nil
  @centroid_function = lambda do |data_sets|
    data_sets.collect(&:get_mean_or_mode)
  end
  @centroid_indices = []
  @on_empty = 'eliminate' # default if none specified
  @random_seed = nil
  @rng = nil
  @init_method = :random
  @restarts = 1
  @track_history = false
end

Instance Attribute Details

#centroidsObject (readonly)

Returns the value of attribute centroids.



24
25
26
# File 'lib/ai4r/clusterers/k_means.rb', line 24

def centroids
  @centroids
end

#clustersObject (readonly)

Returns the value of attribute clusters.



24
25
26
# File 'lib/ai4r/clusterers/k_means.rb', line 24

def clusters
  @clusters
end

#data_setObject (readonly)

Returns the value of attribute data_set.



24
25
26
# File 'lib/ai4r/clusterers/k_means.rb', line 24

def data_set
  @data_set
end

#historyObject (readonly)

Returns the value of attribute history.



24
25
26
# File 'lib/ai4r/clusterers/k_means.rb', line 24

def history
  @history
end

#iterationsObject (readonly)

Returns the value of attribute iterations.



24
25
26
# File 'lib/ai4r/clusterers/k_means.rb', line 24

def iterations
  @iterations
end

#number_of_clustersObject (readonly)

Returns the value of attribute number_of_clusters.



24
25
26
# File 'lib/ai4r/clusterers/k_means.rb', line 24

def number_of_clusters
  @number_of_clusters
end

Instance Method Details

#build(data_set, number_of_clusters) ⇒ Object

Build a new clusterer, using data examples found in data_set. Items will be clustered in “number_of_clusters” different clusters.

Parameters:

  • data_set (Object)
  • number_of_clusters (Object)

Returns:

  • (Object)

Raises:

  • (ArgumentError)


79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
# File 'lib/ai4r/clusterers/k_means.rb', line 79

def build(data_set, number_of_clusters)
  @data_set = data_set
  @number_of_clusters = number_of_clusters
  raise ArgumentError, 'Number of clusters larger than data items' if @number_of_clusters > @data_set.data_items.length

  unless @centroid_indices.empty? || @centroid_indices.length == @number_of_clusters
    raise ArgumentError,
          'Length of centroid indices array differs from the specified number of clusters'
  end
  unless @on_empty == 'eliminate' || @on_empty == 'terminate' || @on_empty == 'random' || @on_empty == 'outlier'
    raise ArgumentError,
          'Invalid value for on_empty'
  end

  seed_base = @random_seed
  best_sse = nil
  best_centroids = nil
  best_clusters = nil
  best_iterations = nil

  (@restarts || 1).times do |i|
    @random_seed = seed_base.nil? ? nil : seed_base + i
    @rng = @random_seed.nil? ? Random.new : Random.new(@random_seed)
    @iterations = 0
    @history = [] if @track_history
    calc_initial_centroids
    until stop_criteria_met
      calculate_membership_clusters
      if @track_history
        @history << {
          centroids: @centroids.collect(&:dup),
          assignments: @assignments.dup
        }
      end
      recompute_centroids
    end
    current_sse = sse
    next unless best_sse.nil? || current_sse < best_sse

    best_sse = current_sse
    best_centroids = Marshal.load(Marshal.dump(@centroids))
    best_clusters = Marshal.load(Marshal.dump(@clusters))
    best_iterations = @iterations
  end

  @random_seed = seed_base
  @rng = @random_seed.nil? ? Random.new : Random.new(@random_seed)
  @centroids = best_centroids
  @clusters = best_clusters
  @iterations = best_iterations
  self
end

#distance(a, b) ⇒ Object

This function calculates the distance between 2 different instances. By default, it returns the euclidean distance to the power of 2. You can provide a more convenient distance implementation:

1- Overwriting this method

2- Providing a closure to the :distance_function parameter

Parameters:

  • a (Object)
  • b (Object)

Returns:

  • (Object)


167
168
169
170
171
172
173
174
# File 'lib/ai4r/clusterers/k_means.rb', line 167

def distance(a, b)
  return @distance_function.call(a, b) if @distance_function

  Ai4r::Data::Proximity.squared_euclidean_distance(
    a.select { |att_a| att_a.is_a? Numeric },
    b.select { |att_b| att_b.is_a? Numeric }
  )
end

#eval(data_item) ⇒ Object

Classifies the given data item, returning the cluster index it belongs to (0-based).

Parameters:

  • data_item (Object)

Returns:

  • (Object)


136
137
138
139
140
# File 'lib/ai4r/clusterers/k_means.rb', line 136

def eval(data_item)
  get_min_index(@centroids.collect do |centroid|
    distance(data_item, centroid)
  end)
end

#sseObject

Sum of squared distances of all points to their respective centroids. It can be used as a measure of cluster compactness (SSE).

Returns:

  • (Object)


145
146
147
148
149
150
151
152
153
154
# File 'lib/ai4r/clusterers/k_means.rb', line 145

def sse
  sum = 0.0
  @clusters.each_with_index do |cluster, i|
    centroid = @centroids[i]
    cluster.data_items.each do |item|
      sum += distance(item, centroid)
    end
  end
  sum
end