Class: Ai4r::Clusterers::KMeans
- Defined in:
- lib/ai4r/clusterers/k_means.rb
Overview
The k-means algorithm is an algorithm to cluster n objects based on attributes into k partitions, with k < n.
More about K Means algorithm: en.wikipedia.org/wiki/K-means_algorithm
Direct Known Subclasses
Instance Attribute Summary collapse
-
#centroids ⇒ Object
readonly
Returns the value of attribute centroids.
-
#clusters ⇒ Object
readonly
Returns the value of attribute clusters.
-
#data_set ⇒ Object
readonly
Returns the value of attribute data_set.
-
#history ⇒ Object
readonly
Returns the value of attribute history.
-
#iterations ⇒ Object
readonly
Returns the value of attribute iterations.
-
#number_of_clusters ⇒ Object
readonly
Returns the value of attribute number_of_clusters.
Instance Method Summary collapse
-
#build(data_set, number_of_clusters) ⇒ Object
Build a new clusterer, using data examples found in data_set.
-
#distance(a, b) ⇒ Object
This function calculates the distance between 2 different instances.
-
#eval(data_item) ⇒ Object
Classifies the given data item, returning the cluster index it belongs to (0-based).
- #initialize ⇒ Object constructor
-
#sse ⇒ Object
Sum of squared distances of all points to their respective centroids.
Methods inherited from Clusterer
Methods included from Data::Parameterizable
#get_parameters, included, #set_parameters
Constructor Details
#initialize ⇒ Object
57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 |
# File 'lib/ai4r/clusterers/k_means.rb', line 57 def initialize super() @distance_function = nil @max_iterations = nil @centroid_function = lambda do |data_sets| data_sets.collect(&:get_mean_or_mode) end @centroid_indices = [] @on_empty = 'eliminate' # default if none specified @random_seed = nil @rng = nil @init_method = :random @restarts = 1 @track_history = false end |
Instance Attribute Details
#centroids ⇒ Object (readonly)
Returns the value of attribute centroids.
24 25 26 |
# File 'lib/ai4r/clusterers/k_means.rb', line 24 def centroids @centroids end |
#clusters ⇒ Object (readonly)
Returns the value of attribute clusters.
24 25 26 |
# File 'lib/ai4r/clusterers/k_means.rb', line 24 def clusters @clusters end |
#data_set ⇒ Object (readonly)
Returns the value of attribute data_set.
24 25 26 |
# File 'lib/ai4r/clusterers/k_means.rb', line 24 def data_set @data_set end |
#history ⇒ Object (readonly)
Returns the value of attribute history.
24 25 26 |
# File 'lib/ai4r/clusterers/k_means.rb', line 24 def history @history end |
#iterations ⇒ Object (readonly)
Returns the value of attribute iterations.
24 25 26 |
# File 'lib/ai4r/clusterers/k_means.rb', line 24 def iterations @iterations end |
#number_of_clusters ⇒ Object (readonly)
Returns the value of attribute number_of_clusters.
24 25 26 |
# File 'lib/ai4r/clusterers/k_means.rb', line 24 def number_of_clusters @number_of_clusters end |
Instance Method Details
#build(data_set, number_of_clusters) ⇒ Object
Build a new clusterer, using data examples found in data_set. Items will be clustered in “number_of_clusters” different clusters.
79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 |
# File 'lib/ai4r/clusterers/k_means.rb', line 79 def build(data_set, number_of_clusters) @data_set = data_set @number_of_clusters = number_of_clusters raise ArgumentError, 'Number of clusters larger than data items' if @number_of_clusters > @data_set.data_items.length unless @centroid_indices.empty? || @centroid_indices.length == @number_of_clusters raise ArgumentError, 'Length of centroid indices array differs from the specified number of clusters' end unless @on_empty == 'eliminate' || @on_empty == 'terminate' || @on_empty == 'random' || @on_empty == 'outlier' raise ArgumentError, 'Invalid value for on_empty' end seed_base = @random_seed best_sse = nil best_centroids = nil best_clusters = nil best_iterations = nil (@restarts || 1).times do |i| @random_seed = seed_base.nil? ? nil : seed_base + i @rng = @random_seed.nil? ? Random.new : Random.new(@random_seed) @iterations = 0 @history = [] if @track_history calc_initial_centroids until stop_criteria_met calculate_membership_clusters if @track_history @history << { centroids: @centroids.collect(&:dup), assignments: @assignments.dup } end recompute_centroids end current_sse = sse next unless best_sse.nil? || current_sse < best_sse best_sse = current_sse best_centroids = Marshal.load(Marshal.dump(@centroids)) best_clusters = Marshal.load(Marshal.dump(@clusters)) best_iterations = @iterations end @random_seed = seed_base @rng = @random_seed.nil? ? Random.new : Random.new(@random_seed) @centroids = best_centroids @clusters = best_clusters @iterations = best_iterations self end |
#distance(a, b) ⇒ Object
This function calculates the distance between 2 different instances. By default, it returns the euclidean distance to the power of 2. You can provide a more convenient distance implementation:
1- Overwriting this method
2- Providing a closure to the :distance_function parameter
167 168 169 170 171 172 173 174 |
# File 'lib/ai4r/clusterers/k_means.rb', line 167 def distance(a, b) return @distance_function.call(a, b) if @distance_function Ai4r::Data::Proximity.squared_euclidean_distance( a.select { |att_a| att_a.is_a? Numeric }, b.select { |att_b| att_b.is_a? Numeric } ) end |
#eval(data_item) ⇒ Object
Classifies the given data item, returning the cluster index it belongs to (0-based).
136 137 138 139 140 |
# File 'lib/ai4r/clusterers/k_means.rb', line 136 def eval(data_item) get_min_index(@centroids.collect do |centroid| distance(data_item, centroid) end) end |
#sse ⇒ Object
Sum of squared distances of all points to their respective centroids. It can be used as a measure of cluster compactness (SSE).
145 146 147 148 149 150 151 152 153 154 |
# File 'lib/ai4r/clusterers/k_means.rb', line 145 def sse sum = 0.0 @clusters.each_with_index do |cluster, i| centroid = @centroids[i] cluster.data_items.each do |item| sum += distance(item, centroid) end end sum end |