Class: ClusterKit::Clustering::KMeans

Inherits:
Object
Defined in:
lib/clusterkit/clustering.rb

Overview

K-means clustering algorithm

Constructor Details

#initialize(k:, max_iter: 300, random_seed: nil) ⇒ KMeans

Initialize K-means clusterer

Parameters:

  • k (Integer)

    Number of clusters

  • max_iter (Integer) (defaults to: 300)

    Maximum number of iterations

  • random_seed (Integer) (defaults to: nil)

    Random seed for reproducibility (optional)

Raises:

  • (ArgumentError)

    If k is not positive


# File 'lib/clusterkit/clustering.rb', line 18

def initialize(k:, max_iter: 300, random_seed: nil)
  raise ArgumentError, "k must be positive" unless k > 0
  @k = k
  @max_iter = max_iter
  @random_seed = random_seed
  @fitted = false
end

Instance Attribute Details

#centroids ⇒ Object (readonly)

Returns the value of attribute centroids.



# File 'lib/clusterkit/clustering.rb', line 12

def centroids
  @centroids
end

#inertia ⇒ Float (readonly)

Get the sum of squared distances of samples to their closest cluster center

Returns:

  • (Float)

    Inertia value



# File 'lib/clusterkit/clustering.rb', line 71

def inertia
  @inertia
end
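
To make the definition of inertia concrete, here is a minimal pure-Ruby sketch that computes the same quantity by hand. The helper names squared_distance and inertia_for are hypothetical and are not part of ClusterKit; the library computes inertia inside its Rust extension, whose code is not shown on this page.

```ruby
# Hypothetical helpers for illustration only; not part of ClusterKit.

# Squared Euclidean distance between two points of equal dimension.
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Sum, over all points, of the squared distance to the nearest centroid.
def inertia_for(data, centroids)
  data.sum do |point|
    centroids.map { |c| squared_distance(point, c) }.min
  end
end

data      = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centroids = [[0.0, 1.0], [10.0, 1.0]]

# Every point lies exactly 1.0 from its nearest centroid, so the sum is 4.0.
puts inertia_for(data, centroids)  # prints 4.0
```

Lower inertia means points sit closer to their assigned centers. It generally decreases as k grows, which is why it is compared across k values rather than minimized outright.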

#k ⇒ Object (readonly)

Returns the value of attribute k.



# File 'lib/clusterkit/clustering.rb', line 12

def k
  @k
end

#labels ⇒ Object (readonly)

Returns the value of attribute labels.



# File 'lib/clusterkit/clustering.rb', line 12

def labels
  @labels
end

#max_iter ⇒ Object (readonly)

Returns the value of attribute max_iter.



# File 'lib/clusterkit/clustering.rb', line 12

def max_iter
  @max_iter
end

Class Method Details

.detect_optimal_k(elbow_results, fallback_k: 3) ⇒ Integer

Detect optimal k from elbow method results by choosing the k that follows the largest single drop in inertia

Parameters:

  • elbow_results (Hash)

    Mapping of k to inertia values (from elbow_method)

  • fallback_k (Integer) (defaults to: 3)

    k to return when elbow_results is nil or empty

Returns:

  • (Integer)

    Optimal number of clusters



# File 'lib/clusterkit/clustering.rb', line 98

def detect_optimal_k(elbow_results, fallback_k: 3)
  return fallback_k if elbow_results.nil? || elbow_results.empty?
  
  k_values = elbow_results.keys.sort
  return k_values.first if k_values.size == 1
  
  # Find the k with the largest drop in inertia
  max_drop = 0
  optimal_k = k_values.first
  
  k_values.each_cons(2) do |k1, k2|
    drop = elbow_results[k1] - elbow_results[k2]
    if drop > max_drop
      max_drop = drop
      optimal_k = k2  # Use k after the drop
    end
  end
  
  optimal_k
end
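
The selection rule above can be traced on a small input. The sketch below is a standalone copy of the same logic under the hypothetical name largest_drop_k, so it runs without the gem; the inertia values are made up for illustration.

```ruby
# Standalone copy of the selection rule, for illustration only.
# `largest_drop_k` is a hypothetical name, not a ClusterKit method.
def largest_drop_k(elbow_results, fallback_k: 3)
  return fallback_k if elbow_results.nil? || elbow_results.empty?

  k_values = elbow_results.keys.sort
  return k_values.first if k_values.size == 1

  max_drop = 0
  best_k = k_values.first
  # Compare each pair of consecutive k values and keep the k after
  # the largest decrease in inertia.
  k_values.each_cons(2) do |k1, k2|
    drop = elbow_results[k1] - elbow_results[k2]
    if drop > max_drop
      max_drop = drop
      best_k = k2
    end
  end
  best_k
end

# Made-up inertia values: drops are 60.0 (2 to 3), 5.0 (3 to 4), 2.0 (4 to 5).
sample = { 2 => 100.0, 3 => 40.0, 4 => 35.0, 5 => 33.0 }

p largest_drop_k(sample)         # => 3
p largest_drop_k({})             # => 3 (fallback_k)
p largest_drop_k({ 4 => 12.5 })  # => 4 (only one candidate)
```

Because inertia usually falls fastest at small k, this rule tends to select a value near the low end of the range. It is worth checking the full hash from elbow_method before relying on the result.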

.elbow_method(data, k_range: 2..10, max_iter: 300) ⇒ Hash

Fit a model for each k in k_range and record its inertia, for use with the elbow method

Parameters:

  • data (Array)

    2D array of data points

  • k_range (Range) (defaults to: 2..10)

    Range of k values to try

  • max_iter (Integer) (defaults to: 300)

    Maximum iterations per k

Returns:

  • (Hash)

    Mapping of k to inertia values



# File 'lib/clusterkit/clustering.rb', line 82

def elbow_method(data, k_range: 2..10, max_iter: 300)
  results = {}
  
  k_range.each do |k|
    kmeans = new(k: k, max_iter: max_iter)
    kmeans.fit(data)
    results[k] = kmeans.inertia
  end
  
  results
end

.optimal_k(data, k_range: 2..10, max_iter: 300) ⇒ Integer

Run elbow_method over k_range and return the k selected by detect_optimal_k

Parameters:

  • data (Array)

    2D array of data points

  • k_range (Range) (defaults to: 2..10)

    Range of k values to try

  • max_iter (Integer) (defaults to: 300)

    Maximum iterations per k

Returns:

  • (Integer)

    Optimal number of clusters



# File 'lib/clusterkit/clustering.rb', line 124

def optimal_k(data, k_range: 2..10, max_iter: 300)
  elbow_results = elbow_method(data, k_range: k_range, max_iter: max_iter)
  detect_optimal_k(elbow_results)
end

Instance Method Details

#cluster_centers ⇒ Array

Get cluster centers (returns the same value as #centroids)

Returns:

  • (Array)

    2D array of cluster centers



# File 'lib/clusterkit/clustering.rb', line 65

def cluster_centers
  @centroids
end

#fit(data) ⇒ self

Fit the K-means model

Parameters:

  • data (Array)

    2D array of data points

Returns:

  • (self)

    Returns self for method chaining



# File 'lib/clusterkit/clustering.rb', line 29

def fit(data)
  validate_data(data)
  
  # Call Rust implementation with optional seed
  @labels, @centroids, @inertia = Clustering.kmeans_rust(data, @k, @max_iter, @random_seed)
  @fitted = true
  
  self
end

#fit_predict(data) ⇒ Array

Fit the model and return labels

Parameters:

  • data (Array)

    2D array of data points

Returns:

  • (Array)

    Cluster labels



# File 'lib/clusterkit/clustering.rb', line 52

def fit_predict(data)
  fit(data)
  @labels
end

#fitted? ⇒ Boolean

Check if model has been fitted

Returns:

  • (Boolean)

    True if fitted



# File 'lib/clusterkit/clustering.rb', line 59

def fitted?
  @fitted
end

#predict(data) ⇒ Array

Predict cluster labels for new data

Parameters:

  • data (Array)

    2D array of data points

Returns:

  • (Array)

    Cluster labels

Raises:

  • (RuntimeError)

    If the model has not been fitted


# File 'lib/clusterkit/clustering.rb', line 42

def predict(data)
  raise RuntimeError, "Model must be fitted before predict" unless fitted?
  validate_data(data)
  
  Clustering.kmeans_predict_rust(data, @centroids)
end
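
The Rust function behind predict is not shown on this page. As a rough sketch of what prediction in K-means conventionally does, the pure-Ruby code below assigns each point the index of its nearest centroid by squared Euclidean distance. The helper names, the zero-based labels, and the distance metric are all assumptions for illustration and may differ from what kmeans_predict_rust actually does.

```ruby
# Hypothetical helpers for illustration only; not part of ClusterKit.
# Assumes labels are zero-based indices into the centroid array.

# Squared Euclidean distance between two points of equal dimension.
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# For each point, return the index of the closest centroid.
def nearest_centroid_labels(data, centroids)
  data.map do |point|
    centroids.each_index.min_by { |i| squared_distance(point, centroids[i]) }
  end
end

centroids  = [[0.0, 0.0], [10.0, 10.0]]
new_points = [[1.0, -1.0], [9.0, 8.0], [4.0, 4.0]]

p nearest_centroid_labels(new_points, centroids)  # => [0, 1, 0]
```

Prediction of this kind leaves the centroids unchanged, which matches the requirement above that the model be fitted before predict is called.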