Class: Ai4r::Clusterers::KMeans

Inherits:
Clusterer
Defined in:
lib/ai4r/clusterers/k_means.rb

Overview

The k-means algorithm clusters n objects into k partitions based on their attributes, with k < n.

More about the k-means algorithm: en.wikipedia.org/wiki/K-means_algorithm

Direct Known Subclasses

BisectingKMeans

Instance Attribute Summary

Instance Method Summary

Methods included from Data::Parameterizable

#get_parameters, included, #set_parameters

Constructor Details

#initialize ⇒ KMeans

Returns a new instance of KMeans


# File 'lib/ai4r/clusterers/k_means.rb', line 48

def initialize
  @distance_function = nil
  @max_iterations = nil
  @centroid_function = lambda do |data_sets| 
    data_sets.collect{ |data_set| data_set.get_mean_or_mode}
  end
  @centroid_indices = []
  @on_empty = 'eliminate' # default if none specified
end
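
The default centroid function reduces each cluster to one representative item by calling get_mean_or_mode on it. As a rough, library-independent sketch of that idea, assuming the mean is taken for numeric attributes and the most frequent value for non-numeric ones (the helper name mean_or_mode is hypothetical and not part of ai4r):

```ruby
# Hypothetical helper, not part of ai4r: builds one representative item
# from a list of items (each item is an array of attribute values).
def mean_or_mode(items)
  # Transpose so that each column holds every value of one attribute.
  items.transpose.map do |column|
    if column.all? { |value| value.is_a?(Numeric) }
      # Numeric attribute: arithmetic mean.
      column.sum.to_f / column.size
    else
      # Non-numeric attribute: most frequent value.
      column.group_by(&:itself).max_by { |_value, group| group.size }.first
    end
  end
end

cluster = [[1, 10, 'red'], [3, 20, 'red'], [5, 30, 'blue']]
mean_or_mode(cluster) # numeric means 3.0 and 20.0, mode 'red'
```

A custom lambda with the same shape (a list of clusters in, one centroid per cluster out) could replace the default when a different notion of "center" is needed.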

Instance Attribute Details

#centroids ⇒ Object (readonly)

Returns the value of attribute centroids


# File 'lib/ai4r/clusterers/k_means.rb', line 25

def centroids
  @centroids
end

#clusters ⇒ Object (readonly)

Returns the value of attribute clusters


# File 'lib/ai4r/clusterers/k_means.rb', line 25

def clusters
  @clusters
end

#data_set ⇒ Object (readonly)

Returns the value of attribute data_set


# File 'lib/ai4r/clusterers/k_means.rb', line 24

def data_set
  @data_set
end

#iterations ⇒ Object (readonly)

Returns the value of attribute iterations


# File 'lib/ai4r/clusterers/k_means.rb', line 25

def iterations
  @iterations
end

#number_of_clusters ⇒ Object (readonly)

Returns the value of attribute number_of_clusters


# File 'lib/ai4r/clusterers/k_means.rb', line 24

def number_of_clusters
  @number_of_clusters
end

Instance Method Details

#build(data_set, number_of_clusters) ⇒ Object

Builds a new clusterer from the data examples in data_set. Items are grouped into number_of_clusters clusters. Returns the clusterer itself.

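Raises:

  • (ArgumentError) if centroid_indices is non-empty and its length differs from number_of_clusters, or if on_empty is not one of 'eliminate', 'terminate', 'random' or 'outlier'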

# File 'lib/ai4r/clusterers/k_means.rb', line 62

def build(data_set, number_of_clusters)
  @data_set = data_set
  @number_of_clusters = number_of_clusters
  raise ArgumentError, 'Length of centroid indices array differs from the specified number of clusters' unless @centroid_indices.empty? || @centroid_indices.length == @number_of_clusters
  raise ArgumentError, 'Invalid value for on_empty' unless @on_empty == 'eliminate' || @on_empty == 'terminate' || @on_empty == 'random' || @on_empty == 'outlier'
  @iterations = 0
  
  calc_initial_centroids
  while(not stop_criteria_met)
    calculate_membership_clusters
    recompute_centroids
  end
  
  return self
end
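
The loop in build follows the usual k-means cycle: choose initial centroids, assign each item to its nearest centroid, recompute the centroids, and repeat until a stop criterion holds. The following is a minimal, library-independent sketch of that cycle, not the ai4r implementation. The names squared_distance and simple_k_means are hypothetical; the sketch assumes numeric attributes only, takes its initial centroids from caller-supplied item indices, simply drops empty clusters (the library's on_empty setting is not reproduced here), and stops when the centroids no longer change or after max_iterations passes.

```ruby
# Hypothetical sketch of the k-means cycle; not the ai4r implementation.
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

def simple_k_means(items, initial_indices, max_iterations = 100)
  # Initial centroids are copies of the items at the given indices.
  centroids = initial_indices.map { |i| items[i].map(&:to_f) }
  clusters = []
  max_iterations.times do
    # Assignment step: put each item in the cluster of its nearest centroid.
    clusters = Array.new(centroids.size) { [] }
    items.each do |item|
      nearest = centroids.each_index.min_by { |i| squared_distance(item, centroids[i]) }
      clusters[nearest] << item
    end
    # Drop empty clusters (one possible policy among several).
    clusters.reject!(&:empty?)
    # Update step: each centroid becomes the mean of its cluster.
    new_centroids = clusters.map do |cluster|
      cluster.transpose.map { |column| column.sum.to_f / column.size }
    end
    # Stop once the centroids are stable.
    break if new_centroids == centroids
    centroids = new_centroids
  end
  [centroids, clusters]
end

points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
centroids, clusters = simple_k_means(points, [0, 3])
clusters.map(&:size) # two clusters of three points each
```

Fixing the initial indices keeps the sketch deterministic; with random initialization the resulting partition can differ between runs.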

#distance(a, b) ⇒ Object

Calculates the distance between two instances. By default, it returns the squared Euclidean distance, computed over the numeric attributes only. You can supply a different distance implementation by:

1. Overriding this method in a subclass

2. Providing a closure to the :distance_function parameter


# File 'lib/ai4r/clusterers/k_means.rb', line 93

def distance(a, b)
  return @distance_function.call(a, b) if @distance_function
  return Ai4r::Data::Proximity.squared_euclidean_distance(
           a.select {|att_a| att_a.is_a? Numeric} , 
           b.select {|att_b| att_b.is_a? Numeric})
end
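
To illustrate the default behavior and the closure alternative without depending on the library, the sketch below reimplements a numeric-only squared Euclidean distance and defines a Manhattan distance lambda of the kind that could be supplied as :distance_function. Both names are hypothetical, and how the closure is handed to the clusterer (presumably through set_parameters) should be checked against the library.

```ruby
# Hypothetical stand-in for the default: squared Euclidean distance
# computed over numeric attributes only.
def squared_euclidean_numeric(a, b)
  nums_a = a.select { |att| att.is_a?(Numeric) }
  nums_b = b.select { |att| att.is_a?(Numeric) }
  nums_a.zip(nums_b).sum { |x, y| (x - y)**2 }
end

# A closure with the same (a, b) signature, here a Manhattan distance.
# It assumes every attribute is numeric.
manhattan = lambda do |a, b|
  a.zip(b).sum { |x, y| (x - y).abs }
end

squared_euclidean_numeric([0, 0, 'red'], [3, 4, 'blue']) # 9 + 16 = 25
manhattan.call([0, 0], [3, 4])                            # 3 + 4 = 7
```

Because the squared distance preserves the ordering of the plain Euclidean distance, skipping the square root does not change which centroid is nearest.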

#eval(data_item) ⇒ Object

Classifies the given data item, returning the cluster index it belongs to (0-based).


# File 'lib/ai4r/clusterers/k_means.rb', line 80

def eval(data_item)
  get_min_index(@centroids.collect {|centroid| 
      distance(data_item, centroid)})
end
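
Classification amounts to a nearest-centroid lookup. A minimal sketch under the assumption of numeric attributes and squared Euclidean distance (the method name nearest_centroid_index is hypothetical and stands in for the combination of distance and get_min_index):

```ruby
# Hypothetical sketch of nearest-centroid classification.
def nearest_centroid_index(item, centroids)
  distances = centroids.map do |centroid|
    item.zip(centroid).sum { |x, y| (x - y)**2 }
  end
  # 0-based index of the smallest distance (the first one on ties).
  distances.index(distances.min)
end

centroids = [[1.0, 1.0], [8.0, 8.0]]
nearest_centroid_index([2, 1], centroids) # nearer to the first centroid: 0
nearest_centroid_index([7, 9], centroids) # nearer to the second centroid: 1
```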