Flock
Ruby bindings to Cluster 3.0
Description
Provides bindings to clustering methods in Cluster 3.0.
* K-Means
* Kohonen Self-Organizing Maps
* Tree Cluster or Hierarchical Clustering
Synopsis
Specify vectors explicitly
require 'pp'
require 'flock'
data = Array.new(13) {[]}
mask = Array.new(13) {[]}
weights = Array.new(13) {1.0}
data[ 0][ 0]=0.1; data[ 0][ 1]=0.0; data[ 0][ 2]=9.6; data[ 0][ 3] = 5.6;
data[ 1][ 0]=1.4; data[ 1][ 1]=1.3; data[ 1][ 2]=0.0; data[ 1][ 3] = 3.8;
data[ 2][ 0]=1.2; data[ 2][ 1]=2.5; data[ 2][ 2]=0.0; data[ 2][ 3] = 4.8;
data[ 3][ 0]=2.3; data[ 3][ 1]=1.5; data[ 3][ 2]=9.2; data[ 3][ 3] = 4.3;
data[ 4][ 0]=1.7; data[ 4][ 1]=0.7; data[ 4][ 2]=9.6; data[ 4][ 3] = 3.4;
data[ 5][ 0]=0.0; data[ 5][ 1]=3.9; data[ 5][ 2]=9.8; data[ 5][ 3] = 5.1;
data[ 6][ 0]=6.7; data[ 6][ 1]=3.9; data[ 6][ 2]=5.5; data[ 6][ 3] = 4.8;
data[ 7][ 0]=0.0; data[ 7][ 1]=6.3; data[ 7][ 2]=5.7; data[ 7][ 3] = 4.3;
data[ 8][ 0]=5.7; data[ 8][ 1]=6.9; data[ 8][ 2]=5.6; data[ 8][ 3] = 4.3;
data[ 9][ 0]=0.0; data[ 9][ 1]=2.2; data[ 9][ 2]=5.4; data[ 9][ 3] = 0.0;
data[10][ 0]=3.8; data[10][ 1]=3.5; data[10][ 2]=5.5; data[10][ 3] = 9.6;
data[11][ 0]=0.0; data[11][ 1]=2.3; data[11][ 2]=3.6; data[11][ 3] = 8.5;
data[12][ 0]=4.1; data[12][ 1]=4.5; data[12][ 2]=5.8; data[12][ 3] = 7.6;
mask[ 0][ 0]=1; mask[ 0][ 1]=1; mask[ 0][ 2]=1; mask[ 0][ 3] = 1;
mask[ 1][ 0]=1; mask[ 1][ 1]=1; mask[ 1][ 2]=0; mask[ 1][ 3] = 1;
mask[ 2][ 0]=1; mask[ 2][ 1]=1; mask[ 2][ 2]=0; mask[ 2][ 3] = 1;
mask[ 3][ 0]=1; mask[ 3][ 1]=1; mask[ 3][ 2]=1; mask[ 3][ 3] = 1;
mask[ 4][ 0]=1; mask[ 4][ 1]=1; mask[ 4][ 2]=1; mask[ 4][ 3] = 1;
mask[ 5][ 0]=0; mask[ 5][ 1]=1; mask[ 5][ 2]=1; mask[ 5][ 3] = 1;
mask[ 6][ 0]=1; mask[ 6][ 1]=1; mask[ 6][ 2]=1; mask[ 6][ 3] = 1;
mask[ 7][ 0]=0; mask[ 7][ 1]=1; mask[ 7][ 2]=1; mask[ 7][ 3] = 1;
mask[ 8][ 0]=1; mask[ 8][ 1]=1; mask[ 8][ 2]=1; mask[ 8][ 3] = 1;
mask[ 9][ 0]=1; mask[ 9][ 1]=1; mask[ 9][ 2]=1; mask[ 9][ 3] = 0;
mask[10][ 0]=1; mask[10][ 1]=1; mask[10][ 2]=1; mask[10][ 3] = 1;
mask[11][ 0]=0; mask[11][ 1]=1; mask[11][ 2]=1; mask[11][ 3] = 1;
mask[12][ 0]=1; mask[12][ 1]=1; mask[12][ 2]=1; mask[12][ 3] = 1;
pp Flock.kcluster(6, data, mask: mask)
# method: (kcluster)
# - Flock::METHOD_AVERAGE (kmeans, this is the default)
# - Flock::METHOD_MEDIAN (kmedians)
# method: (treecluster)
# - Flock::METHOD_AVERAGE_LINKAGE (default)
# - Flock::METHOD_SINGLE_LINKAGE
# - Flock::METHOD_MAXIMUM_LINKAGE
# - Flock::METHOD_CENTROID_LINKAGE
# metric:
# - Flock::METRIC_EUCLIDIAN (default)
# - Flock::METRIC_CITY_BLOCK
# - Flock::METRIC_CORRELATION
# - Flock::METRIC_ABSOLUTE_CORRELATION
# - Flock::METRIC_UNCENTERED_CORRELATION
# - Flock::METRIC_ABSOLUTE_UNCENTERED_CORRELATION
# - Flock::METRIC_SPEARMAN
# - Flock::METRIC_KENDALL
# seed: (initial cluster assignment)
# - Flock::SEED_RANDOM (uniform random, this is the default)
# - Flock::SEED_KMEANS_PLUSPLUS (kmeans++ - initial cluster centers chosen weighted by distance from closest center)
# - Flock::SEED_SPREADOUT (similar to kmeans++ but deterministic, spreads out cluster centers)
pp Flock.kcluster(
6,
data,
mask: mask,
method: Flock::METHOD_AVERAGE,
metric: Flock::METRIC_EUCLIDIAN,
transpose: 0,
weights: Array.new(13) {1.0},
seed: Flock::SEED_RANDOM
)
pp Flock.treecluster(
6,
data,
mask: mask,
method: Flock::METHOD_AVERAGE_LINKAGE,
metric: Flock::METRIC_EUCLIDIAN,
transpose: 0,
weights: Array.new(13) {1.0},
)
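The element-by-element assignments above can also be written as nested array literals, with the mask derived from the data. The following is a minimal sketch, not part of Flock itself: it assumes a hypothetical `rows` literal in which `nil` marks a missing observation, and it only reproduces the first four rows of the synopsis data.

```ruby
# Hypothetical convention for this sketch: nil marks a missing observation.
# Only the first four rows of the synopsis data are shown.
rows = [
  [0.1, 0.0, 9.6, 5.6],
  [1.4, 1.3, nil, 3.8],
  [1.2, 2.5, nil, 4.8],
  [2.3, 1.5, 9.2, 4.3]
]

# mask holds 1 where a value is present and 0 where it is missing.
mask = rows.map { |row| row.map { |value| value.nil? ? 0 : 1 } }

# data replaces each missing value with a 0.0 placeholder, as in the synopsis.
data = rows.map { |row| row.map { |value| value.nil? ? 0.0 : value } }

# The matrices can then be passed as in the synopsis, for example:
#   Flock.kcluster(2, data, mask: mask)
```

Keeping the missing-value markers in one literal avoids the data and mask matrices drifting out of sync when rows are edited.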
Sparse data and clustering string labels
require 'pp'
require 'flock'
data = []
# keys don't need to be numeric
data << { 1 => 0.5, 2 => 0.5 }
data << { 3 => 1, 4 => 1 }
data << { 4 => 1, 5 => 0.3 }
data << { 2 => 0.75 }
data << { 1 => 0.60 }
pp Flock.kcluster(2, data, sparse: true)
data = []
# a much simpler way to cluster text labels.
data << %w(apple orange)
data << %w(black white)
data << %w(white cyan)
data << %w(orange)
data << %w(apple)
# additional options, such as metric and iterations, can be passed in a hash.
pp Flock.kcluster(2, data, sparse: true)
pp Flock.treecluster(2, data, sparse: true)
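The TODO list at the end of this README notes that sparse data is currently converted into dense matrices. The sketch below illustrates one way such a conversion could work in plain Ruby. It is not Flock's actual implementation: `densify` is a hypothetical helper, and filling absent keys with 0.0 and giving each label a weight of 1 are assumptions made for illustration.

```ruby
# Hypothetical helper, not part of Flock: expand sparse rows into dense vectors.
def densify(rows)
  # Treat a label array such as %w(apple orange) as a hash with weight 1 per label.
  hashes = rows.map { |row| row.is_a?(Hash) ? row : row.to_h { |label| [label, 1] } }

  # One column per distinct key, in order of first appearance.
  keys = hashes.flat_map(&:keys).uniq

  # Assumption: a key that is absent from a row contributes 0.0.
  dense = hashes.map { |hash| keys.map { |key| hash.fetch(key, 0).to_f } }
  [keys, dense]
end

keys, dense = densify([%w(apple orange), %w(black white), %w(white cyan), %w(orange), %w(apple)])
```

Each distinct key or label becomes one column, which is why the keys do not need to be numeric.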
Self-Organizing Map
Self-Organizing Maps (SOM) require that you specify a 2D grid on which data points can cluster. Some grid points may end up empty while others have clusters mapped to them, so there is no need to fix the number of clusters in advance.
require 'pp'
require 'flock'
data = []
# a much simpler way to cluster text
data << %w(apple orange)
data << %w(black white)
data << %w(white cyan)
data << %w(orange)
data << %w(apple)
# nxgrid, nygrid and data are required.
# additional options, such as metric and iterations, can be passed in a hash.
# cluster up to 4 groups in a 2x2 grid.
pp Flock.self_organizing_map(2, 2, data, sparse: true)
Note: Unlike kcluster and treecluster, which assign an integer cluster value to each vector, SOM clustering returns a 2D grid coordinate for each vector.
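Because each vector is identified by a grid coordinate rather than a cluster number, a common follow-up step is to group the inputs by grid cell. The sketch below assumes the per-vector coordinates have already been extracted into an array of `[x, y]` pairs. The `coords` values are hypothetical and are not real Flock output, and the exact shape of the return value is not shown in this README.

```ruby
labels = [%w(apple orange), %w(black white), %w(white cyan), %w(orange), %w(apple)]

# Hypothetical coordinates, one [x, y] pair per input vector; not real Flock output.
coords = [[0, 0], [1, 1], [1, 1], [0, 0], [0, 1]]

# Pair each input with its grid cell, then group the inputs by cell.
cells = labels.zip(coords)
              .group_by(&:last)
              .transform_values { |pairs| pairs.map(&:first) }

# Grid cells with no vectors mapped to them are simply absent from the hash.
empty_cell = cells.fetch([1, 0], [])
```

In this made-up example, three of the four cells in the 2x2 grid are occupied and one stays empty.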
Changes from 0.4.x to 0.5.0
Deprecated methods
- kmeans: use kcluster instead
- sparse_kmeans: use kcluster with option sparse: true
- sparse_treecluster: use treecluster with option sparse: true
- sparse_self_organizing_map: use self_organizing_map with option sparse: true
Method signature
- kmeans, treecluster and self_organizing_map no longer take mask as a parameter.
- mask needs to be passed along with the other optional parameters in the options hash.
TODO
- Use Sparse Matrix instead of converting sparse data into dense matrices.
- BIRCH hierarchical clustering.
- EM clustering.
- kcluster auto-suggest cluster size.