Flock

Ruby bindings to Cluster 3.0

Description

Provides bindings to clustering methods in Cluster 3.0.

* K-Means
* Kohonen Self-Organizing Maps
* Tree Cluster or Hierarchical Clustering

Synopsis

Specify vectors explicitly

require 'pp'
require 'flock'

data     = Array.new(13) {[]}
mask     = Array.new(13) {[]}
weights  = Array.new(13) {1.0}

data[ 0][ 0]=0.1; data[ 0][ 1]=0.0; data[ 0][ 2]=9.6; data[ 0][ 3] = 5.6;
data[ 1][ 0]=1.4; data[ 1][ 1]=1.3; data[ 1][ 2]=0.0; data[ 1][ 3] = 3.8;
data[ 2][ 0]=1.2; data[ 2][ 1]=2.5; data[ 2][ 2]=0.0; data[ 2][ 3] = 4.8;
data[ 3][ 0]=2.3; data[ 3][ 1]=1.5; data[ 3][ 2]=9.2; data[ 3][ 3] = 4.3;
data[ 4][ 0]=1.7; data[ 4][ 1]=0.7; data[ 4][ 2]=9.6; data[ 4][ 3] = 3.4;
data[ 5][ 0]=0.0; data[ 5][ 1]=3.9; data[ 5][ 2]=9.8; data[ 5][ 3] = 5.1;
data[ 6][ 0]=6.7; data[ 6][ 1]=3.9; data[ 6][ 2]=5.5; data[ 6][ 3] = 4.8;
data[ 7][ 0]=0.0; data[ 7][ 1]=6.3; data[ 7][ 2]=5.7; data[ 7][ 3] = 4.3;
data[ 8][ 0]=5.7; data[ 8][ 1]=6.9; data[ 8][ 2]=5.6; data[ 8][ 3] = 4.3;
data[ 9][ 0]=0.0; data[ 9][ 1]=2.2; data[ 9][ 2]=5.4; data[ 9][ 3] = 0.0;
data[10][ 0]=3.8; data[10][ 1]=3.5; data[10][ 2]=5.5; data[10][ 3] = 9.6;
data[11][ 0]=0.0; data[11][ 1]=2.3; data[11][ 2]=3.6; data[11][ 3] = 8.5;
data[12][ 0]=4.1; data[12][ 1]=4.5; data[12][ 2]=5.8; data[12][ 3] = 7.6;

mask[ 0][ 0]=1; mask[ 0][ 1]=1; mask[ 0][ 2]=1; mask[ 0][ 3] = 1;
mask[ 1][ 0]=1; mask[ 1][ 1]=1; mask[ 1][ 2]=0; mask[ 1][ 3] = 1;
mask[ 2][ 0]=1; mask[ 2][ 1]=1; mask[ 2][ 2]=0; mask[ 2][ 3] = 1;
mask[ 3][ 0]=1; mask[ 3][ 1]=1; mask[ 3][ 2]=1; mask[ 3][ 3] = 1;
mask[ 4][ 0]=1; mask[ 4][ 1]=1; mask[ 4][ 2]=1; mask[ 4][ 3] = 1;
mask[ 5][ 0]=0; mask[ 5][ 1]=1; mask[ 5][ 2]=1; mask[ 5][ 3] = 1;
mask[ 6][ 0]=1; mask[ 6][ 1]=1; mask[ 6][ 2]=1; mask[ 6][ 3] = 1;
mask[ 7][ 0]=0; mask[ 7][ 1]=1; mask[ 7][ 2]=1; mask[ 7][ 3] = 1;
mask[ 8][ 0]=1; mask[ 8][ 1]=1; mask[ 8][ 2]=1; mask[ 8][ 3] = 1;
mask[ 9][ 0]=1; mask[ 9][ 1]=1; mask[ 9][ 2]=1; mask[ 9][ 3] = 0;
mask[10][ 0]=1; mask[10][ 1]=1; mask[10][ 2]=1; mask[10][ 3] = 1;
mask[11][ 0]=0; mask[11][ 1]=1; mask[11][ 2]=1; mask[11][ 3] = 1;
mask[12][ 0]=1; mask[12][ 1]=1; mask[12][ 2]=1; mask[12][ 3] = 1;

pp Flock.kcluster(6, data, mask: mask)

# method: (kcluster)
#    - Flock::METHOD_AVERAGE (kmeans, this is the default)
#    - Flock::METHOD_MEDIAN  (kmedians)
# method: (treecluster)
#    - Flock::METHOD_AVERAGE_LINKAGE (default)
#    - Flock::METHOD_SINGLE_LINKAGE
#    - Flock::METHOD_MAXIMUM_LINKAGE
#    - Flock::METHOD_CENTROID_LINKAGE
# metric:
#    - Flock::METRIC_EUCLIDIAN (default)
#    - Flock::METRIC_CITY_BLOCK
#    - Flock::METRIC_CORRELATION
#    - Flock::METRIC_ABSOLUTE_CORRELATION
#    - Flock::METRIC_UNCENTERED_CORRELATION
#    - Flock::METRIC_ABSOLUTE_UNCENTERED_CORRELATION
#    - Flock::METRIC_SPEARMAN
#    - Flock::METRIC_KENDALL
# seed: (initial cluster assignment)
#    - Flock::SEED_RANDOM            (uniform random, this is the default)
#    - Flock::SEED_KMEANS_PLUSPLUS   (kmeans++ - initial cluster centers chosen weighted by distance from closest center)
#    - Flock::SEED_SPREADOUT         (similar to kmeans++ but deterministic, spreads out cluster centers)

pp Flock.kcluster(
  6,
  data,
  mask:      mask,
  method:    Flock::METHOD_AVERAGE,
  metric:    Flock::METRIC_EUCLIDIAN,
  transpose: 0,
  weights:   Array.new(13) {1.0},
  seed:      Flock::SEED_RANDOM
)

pp Flock.treecluster(
  6,
  data,
  mask:      mask,
  method:    Flock::METHOD_AVERAGE,
  metric:    Flock::METRIC_EUCLIDIAN,
  transpose: 0,
  weights:   Array.new(13) {1.0},
)

Sparse data and clustering string labels

require 'pp'
require 'flock'

data = []

# keys don't need to be numeric
data << { 1 => 0.5, 2 => 0.5 }
data << { 3 => 1, 4 => 1 }
data << { 4 => 1, 5 => 0.3 }
data << { 2 => 0.75 }
data << { 1 => 0.60 }

pp Flock.kcluster(2, data, sparse: true)

data = []

# a much simpler way to cluster text labels.
data << %w(apple orange)
data << %w(black white)
data << %w(white cyan)
data << %w(orange)
data << %w(apple)

# additional options such as metric, iterations can be passed in a hash.
pp Flock.kcluster(2, data, sparse: true)
pp Flock.treecluster(2, data, sparse: true)

Self-Organizing Map

Self-Organizing Maps (SOM) require that you specify a 2D grid on which data points can cluster. Some of the grid points may be empty and others might have clusters mapped to them. There is no need to provide a fixed cluster size.

require 'pp'
require 'flock'

data = []

# a much simpler way to cluster text
data << %w(apple orange)
data << %w(black white)
data << %w(white cyan)
data << %w(orange)
data << %w(apple)

# nxgrid, nygrid, data are required.
# additional options such as metric, iterations can be passed in a hash.

# cluster upto 4 groups in a 2x2 grid.
pp Flock.self_organizing_map(2, 2, data, sparse: true)

Note: SOM clustering provides the 2D grid coordinate for each vector instead of an integer cluster value for each vector like kcluster and treecluster.

Changes from 0.4.x to 0.5.0

Deprecated methods

  • kmeans: use kcluster instead

  • sparse_kmeans: use kcluster with option sparse: true

  • sparse_treecluster: use treecluster with option sparse: true

  • sparse_self_organizing_map: use self_organizing_map with option sparse: true

Method signature

  • kmeans, treecluster and self_organizing_map no longer take mask as a parameter.

  • mask needs to passed along with other optional parameters in the options hash.

TODO

  • K-Tree clustering

  • Use Sparse Matrix instead of converting sparse data into dense matrices.

  • BIRCH hierarchical clustering.

  • EM clustering.

  • kcluster auto-suggest cluster size.

License

Creative Commons Attribution - CC BY