Class: Omoikane::KMeans

Inherits:
Object
Defined in:
lib/omoikane/clustering.rb

Overview

This class is the interface with the “outside world”. Once you create a KMeans object, it can train itself, add or remove training examples, change the norm used to measure the distance between elements, and initialize the cluster positions in different ways.

Usage:

cluster = Omoikane::KMeans.new(dataset, 2, :initial => :random)
cluster.train!
cluster.centroids # => NMatrix with the resulting centroids.

Constant Summary

MAX_ITERATIONS = 100

The maximum number of iterations made by the clusterer.

Constructor Details

#initialize(dataset, k, initial: :random) ⇒ KMeans

Creates a new clusterer.

  • Arguments:

    • dataset -> NMatrix in which each row corresponds to an element and each column is an attribute.

    • k -> The number of clusters to be trained.

    • initial -> The method used to initialize the centroids. The constructor currently handles only :random.


# File 'lib/omoikane/clustering.rb', line 26

def initialize(dataset, k, initial: :random)
  @dataset = dataset
  @n = dataset.rows
  @d = dataset.cols
  @k = k

  # Store the data vector -> centroid relationship. Each index corresponds
  # to a training sample and the value to a cluster index.
  @clusters = NMatrix.zeros([@dataset.rows, 1], dtype: :int64)
  @old_clusters = @clusters

  # I'll implement different methods in the future. I promise.
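  case initial
  when :random
    @centroids = NMatrix.random([@k, @dataset.cols])
  else
    # Fail early instead of silently leaving @centroids as nil.
    raise ArgumentError, "unknown initialization method: #{initial.inspect}"
  end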
end
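
The :random branch above builds starting centroids without looking at the dataset. A common alternative, which the source does not implement, is to pick k distinct rows of the data as the initial centroids. The following is only a plain-Ruby sketch of that idea: `sample_initial_centroids` and its `rng` parameter are hypothetical names, and rows are plain arrays rather than NMatrix rows.

```ruby
# Hypothetical helper, not part of Omoikane: choose k distinct rows of the
# dataset to serve as the starting centroids.
def sample_initial_centroids(rows, k, rng: Random.new)
  unless k.between?(1, rows.size)
    raise ArgumentError, "k must be between 1 and #{rows.size}"
  end

  # Array#sample picks k elements at distinct indices; dup each row so that
  # later centroid updates cannot modify the dataset itself.
  rows.sample(k, random: rng).map(&:dup)
end

rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
starting_centroids = sample_initial_centroids(rows, 2, rng: Random.new(42))
```

Passing a seeded `Random` makes the choice reproducible, which helps when comparing runs.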

Instance Attribute Details

#centroids ⇒ Object

Returns the value of attribute centroids


# File 'lib/omoikane/clustering.rb', line 13

def centroids
  @centroids
end

#clusters ⇒ Object

Returns the value of attribute clusters


# File 'lib/omoikane/clustering.rb', line 13

def clusters
  @clusters
end

#d ⇒ Object (readonly)

Returns the value of attribute d


# File 'lib/omoikane/clustering.rb', line 14

def d
  @d
end

#dataset ⇒ Object

Returns the value of attribute dataset


# File 'lib/omoikane/clustering.rb', line 13

def dataset
  @dataset
end

#k ⇒ Object (readonly)

Returns the value of attribute k


# File 'lib/omoikane/clustering.rb', line 14

def k
  @k
end

#n ⇒ Object (readonly)

Returns the value of attribute n


# File 'lib/omoikane/clustering.rb', line 14

def n
  @n
end

#old_clusters ⇒ Object

Returns the value of attribute old_clusters


# File 'lib/omoikane/clustering.rb', line 13

def old_clusters
  @old_clusters
end

Instance Method Details

#classify(element) ⇒ Object

call-seq:

classify(element) -> Integer

Return the index of the cluster most similar to the given element.

  • Arguments:

    • element -> A [d,1] (or [1,d]) NMatrix that represents the element to be classified.


# File 'lib/omoikane/clustering.rb', line 88

def classify(element)
  dists = []
  @centroids.rows.times do |centroid_idx|
    dists << distance(@centroids.row(centroid_idx), element)
  end

  dists.index(dists.min)
end
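
The nearest-centroid lookup above can be illustrated without NMatrix. This is a minimal sketch, assuming centroids and elements are plain Ruby arrays; `nearest_centroid` is a hypothetical name, and the distance is computed inline instead of going through Measurable.

```ruby
# Hypothetical plain-Ruby version of the lookup done by classify: return the
# index of the centroid with the smallest Euclidean distance to the element.
def nearest_centroid(centroids, element)
  dists = centroids.map do |centroid|
    Math.sqrt(centroid.zip(element).sum { |a, b| (a - b)**2 })
  end

  # On a tie, Array#index returns the first (lowest) matching index.
  dists.index(dists.min)
end

centroids = [[0.0, 0.0], [10.0, 10.0]]
nearest_centroid(centroids, [9.0, 8.0]) # => 1
```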

#distance(u, v) ⇒ Object

call-seq:

distance(u, v) -> Float

Calculate the Euclidean distance between elements u and v.

  • Arguments:

    • u, v -> [d,1] (or [1,d])-shaped NMatrices.


# File 'lib/omoikane/clustering.rb', line 104

def distance(u, v)
  Measurable.euclidean(u, v)
end
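
For reference, the Euclidean distance between u and v is the square root of the sum of squared differences of their components. The sketch below is a plain-Ruby stand-in that works on arrays; `euclidean_distance` is a hypothetical name and is not how the Measurable gem is implemented.

```ruby
# Hypothetical stand-in for the distance computation, using plain arrays.
def euclidean_distance(u, v)
  raise ArgumentError, "vectors must have the same length" unless u.size == v.size

  # Sum the squared component-wise differences, then take the square root.
  Math.sqrt(u.zip(v).sum { |a, b| (a - b)**2 })
end

euclidean_distance([0.0, 0.0], [3.0, 4.0]) # => 5.0
```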

#train! ⇒ Object

call-seq:

train! -> NMatrix

Cluster the dataset and return the centroids.


# File 'lib/omoikane/clustering.rb', line 48

def train!
  # Repeat until convergence.
  MAX_ITERATIONS.times do
    @old_clusters = @clusters.clone

    # Assignment step: Assign each data point to the most similar cluster.
    @dataset.rows.times do |i|
      @clusters[i] = classify(@dataset.row(i))
    end

    # Update step: Calculate the mean of each cluster to get the new
    # centroids. The sums start from zero so that the previous centroids
    # are not mixed into the new means.
    sums = NMatrix.zeros([@k, @d], dtype: :float64)
    items_in_cluster = NMatrix.zeros([@k, @d], dtype: :int64)

    # Count how many examples are in each cluster and sum their attributes.
    @dataset.rows.times do |example_idx|
      centroid_idx = @clusters[example_idx]

      items_in_cluster[centroid_idx, 0...@d] += 1

      sums[centroid_idx, 0...@d] += @dataset[example_idx, 0...@d]
    end

    # Divide each attribute sum by the cluster size to get the new centroids.
    # A cluster with no members still divides by zero here.
    @centroids = sums / items_in_cluster

    # Stop if the clusters remain the same.
    break if @old_clusters == @clusters
  end
  @centroids
end
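
The assignment and update steps that train! alternates between can be sketched in plain Ruby. This is an illustration under stated assumptions, not the library's code: `kmeans` is a hypothetical function, points and centroids are arrays of floats, squared distances are compared (which gives the same nearest centroid), and an empty cluster simply keeps its previous centroid.

```ruby
# Hypothetical plain-Ruby sketch of the k-means loop (Lloyd's algorithm).
def kmeans(points, centroids, max_iterations: 100)
  assignments = Array.new(points.size, 0)

  max_iterations.times do
    old_assignments = assignments.dup

    # Assignment step: attach each point to its nearest centroid.
    assignments = points.map do |point|
      dists = centroids.map { |c| c.zip(point).sum { |a, b| (a - b)**2 } }
      dists.index(dists.min)
    end

    # Update step: each centroid becomes the mean of its members.
    centroids = centroids.each_with_index.map do |centroid, idx|
      members = points.each_index.select { |i| assignments[i] == idx }
                      .map { |i| points[i] }
      next centroid if members.empty? # keep the old centroid

      members.transpose.map { |column| column.sum / members.size.to_f }
    end

    # Stop once the assignments no longer change.
    break if assignments == old_assignments
  end

  [centroids, assignments]
end

points = [[1.0, 1.0], [1.0, 2.0], [9.0, 9.0], [10.0, 9.0]]
centroids, assignments = kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
# centroids   => [[1.0, 1.5], [9.5, 9.0]]
# assignments => [0, 0, 1, 1]
```

Each update starts from the members of the cluster alone, so the previous centroid only matters when a cluster is empty.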