Module: Flock

Defined in:
lib/flock.rb,
ext/flock.c

Overview

Ruby bindings to the data clustering algorithms provided by Cluster 3.0.

Algorithms implemented

  • K-Means, K-Medians, K-Means++

  • Self-Organizing Maps

  • Tree Cluster or Hierarchical Clustering

Synopsis

require 'pp'
require 'flock'

# sparse data.
data = []
data << %w(apple orange)
data << %w(black white)
data << %w(white cyan)
data << %w(apple orange)
data << %w(apple)

pp Flock.kcluster(2, data, sparse: true, seed: Flock::SEED_RANDOM)
pp Flock.kcluster(2, data, sparse: true, seed: Flock::SEED_KMEANS_PLUSPLUS)
pp Flock.kcluster(2, data, sparse: true, seed: Flock::SEED_SPREADOUT)

# dense data.
data     = Array.new(7) {[]}
mask     = Array.new(7) {[]}
weights  = Array.new(7) {1.0}

data[0][0] = 0.1; data[0][1] = 0.0;
data[1][0] = 1.4; data[1][1] = 1.3;
data[2][0] = 1.2; data[2][1] = 2.5;
data[3][0] = 2.3; data[3][1] = 1.5;
data[4][0] = 1.7; data[4][1] = 0.7;
data[5][0] = 0.0; data[5][1] = 3.9;
data[6][0] = 6.7; data[6][1] = 3.9;

mask[0][0] = 1;   mask[0][1] = 1;
mask[1][0] = 1;   mask[1][1] = 1;
mask[2][0] = 1;   mask[2][1] = 1;
mask[3][0] = 1;   mask[3][1] = 1;
mask[4][0] = 1;   mask[4][1] = 1;
mask[5][0] = 0;   mask[5][1] = 1;
mask[6][0] = 1;   mask[6][1] = 1;

pp Flock.kcluster(2, data, mask: mask, weights: weights)

See

  • examples/* for more examples.

  • README.rdoc for more details.

  • API.rdoc is a public API overview.

Constant Summary

METHOD_AVERAGE =

kcluster method - K-Means

INT2NUM('a')
METHOD_MEDIAN =

kcluster method - K-Medians

INT2NUM('m')
METHOD_SINGLE_LINKAGE =

treecluster method - pairwise single-linkage clustering

INT2NUM('s')
METHOD_MAXIMUM_LINKAGE =

treecluster method - pairwise maximum- (or complete-) linkage clustering

INT2NUM('m')
METHOD_AVERAGE_LINKAGE =

treecluster method - pairwise average-linkage clustering

INT2NUM('a')
METHOD_CENTROID_LINKAGE =

treecluster method - pairwise centroid-linkage clustering

INT2NUM('c')
METRIC_EUCLIDIAN =
INT2NUM('e')
METRIC_CITY_BLOCK =
INT2NUM('b')
METRIC_CORRELATION =
INT2NUM('c')
METRIC_ABSOLUTE_CORRELATION =
INT2NUM('a')
METRIC_UNCENTERED_CORRELATION =
INT2NUM('u')
METRIC_ABSOLUTE_UNCENTERED_CORRELATION =
INT2NUM('x')
METRIC_SPEARMAN =
INT2NUM('s')
METRIC_KENDALL =
INT2NUM('k')
SEED_RANDOM =

Randomly assign data points to clusters using a uniform distribution.

INT2NUM(0)
SEED_KMEANS_PLUSPLUS =

K-Means++ style initialization, where data points are probabilistically assigned to clusters based on their distance from the closest cluster.

INT2NUM(1)
SEED_SPREADOUT =

Deterministic cluster assignment by spreading out initial clusters as far away from each other as possible.

INT2NUM(2)
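
The seeding constants only select how the initial clusters are chosen. As a rough illustration of the K-Means++ idea, the plain-Ruby sketch below picks initial centers with probability proportional to squared distance from the closest center already chosen. This is the textbook formulation, not Flock's C implementation; `squared_distance` and `kmeanspp_centers` are hypothetical helper names.

```ruby
# Squared Euclidean distance between two numeric vectors.
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Pick k initial centers: the first uniformly at random, each later one with
# probability proportional to its squared distance from the closest center so far.
def kmeanspp_centers(points, k, rng = Random.new)
  centers = [points[rng.rand(points.size)]]
  while centers.size < k
    dists = points.map { |p| centers.map { |c| squared_distance(p, c) }.min }
    total = dists.sum
    break if total.zero? # every point coincides with a chosen center

    threshold  = rng.rand * total
    cumulative = 0.0
    index = dists.each_index.find { |i| (cumulative += dists[i]) > threshold }
    # Fall back to the last point with non-zero distance if rounding left no match.
    centers << points[index || dists.rindex { |d| d > 0 }]
  end
  centers
end

points  = [[0.1, 0.0], [1.4, 1.3], [1.2, 2.5], [6.7, 3.9]]
centers = kmeanspp_centers(points, 2, Random.new(42))
```

Points that are already centers have distance zero and are never picked again, which is what pushes the initial centers apart.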

Class Method Details

.absolute_correlation_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object

Absolute correlation distance measure

Parameters:

  • vector1 (Array)

    Numeric vector

  • vector2 (Array)

    Numeric vector

  • mask1 (Array) (defaults to: identity)

    Optional mask for vector1

  • mask2 (Array) (defaults to: identity)

    Optional mask for vector2



# File 'ext/flock.c', line 513

VALUE rb_acorrelation(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, acorrelation);
}

.absolute_uncentered_correlation_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object

Absolute uncentered correlation distance measure

Parameters:

  • vector1 (Array)

    Numeric vector

  • vector2 (Array)

    Numeric vector

  • mask1 (Array) (defaults to: identity)

    Optional mask for vector1

  • mask2 (Array) (defaults to: identity)

    Optional mask for vector2



# File 'ext/flock.c', line 528

VALUE rb_uacorrelation(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, uacorrelation);
}

.cityblock_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object

City-block (Manhattan) distance measure

Parameters:

  • vector1 (Array)

    Numeric vector

  • vector2 (Array)

    Numeric vector

  • mask1 (Array) (defaults to: identity)

    Optional mask for vector1

  • mask2 (Array) (defaults to: identity)

    Optional mask for vector2



# File 'ext/flock.c', line 468

VALUE rb_cityblock(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, cityblock);
}

.correlation_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object

Correlation distance measure

Parameters:

  • vector1 (Array)

    Numeric vector

  • vector2 (Array)

    Numeric vector

  • mask1 (Array) (defaults to: identity)

    Optional mask for vector1

  • mask2 (Array) (defaults to: identity)

    Optional mask for vector2



# File 'ext/flock.c', line 483

VALUE rb_correlation(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, correlation);
}
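
For readers unfamiliar with correlation-based distances, here is a minimal plain-Ruby sketch. It assumes the conventional definition 1 - r (Pearson correlation), with 1 - |r| for the absolute variant, and it assumes an element is used only when both masks are 1. The name `correlation_distance_sketch` and the handling of empty or constant input are illustrative choices, not taken from Flock's source.

```ruby
# Distance 1 - r (or 1 - |r| when absolute: true), where r is the Pearson
# correlation over positions whose mask entries are both 1.
def correlation_distance_sketch(v1, v2, m1 = nil, m2 = nil, absolute: false)
  m1 ||= Array.new(v1.size, 1)
  m2 ||= Array.new(v2.size, 1)
  pairs = v1.zip(v2).each_with_index
            .select { |_, i| m1[i] == 1 && m2[i] == 1 }
            .map(&:first)
  return 0.0 if pairs.empty? # nothing to compare (illustrative choice)

  n     = pairs.size.to_f
  mean1 = pairs.sum { |a, _| a } / n
  mean2 = pairs.sum { |_, b| b } / n
  cov   = pairs.sum { |a, b| (a - mean1) * (b - mean2) }
  var1  = pairs.sum { |a, _| (a - mean1)**2 }
  var2  = pairs.sum { |_, b| (b - mean2)**2 }
  return 1.0 if var1.zero? || var2.zero? # a constant vector has no defined correlation

  r = cov / Math.sqrt(var1 * var2)
  1.0 - (absolute ? r.abs : r)
end

same     = correlation_distance_sketch([1, 2, 3], [2, 4, 6])
opposite = correlation_distance_sketch([1, 2, 3], [3, 2, 1])
abs_opp  = correlation_distance_sketch([1, 2, 3], [3, 2, 1], absolute: true)
```

Perfectly correlated vectors are at distance 0 and perfectly anti-correlated vectors at distance 2; the absolute variant treats both as distance 0.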

.euclidian_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object

Euclidean distance measure

Examples:

Flock.euclidian_distance([0, 0], [1, 1])
Flock.euclidian_distance([0, 0, 0], [1, 1, 1], [1, 1, 0], [1, 1, 0]) # with mask

Parameters:

  • vector1 (Array)

    Numeric vector

  • vector2 (Array)

    Numeric vector

  • mask1 (Array) (defaults to: identity)

    Optional mask for vector1

  • mask2 (Array) (defaults to: identity)

    Optional mask for vector2



# File 'ext/flock.c', line 453

VALUE rb_euclid(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, euclid);
}
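
The masked call in the examples can be read as "compare only the positions where both masks are 1". The sketch below spells that out in plain Ruby. It assumes the convention of the C Clustering Library behind Cluster 3.0, where the Euclidean measure is the mean of the squared differences with no square root taken. This page does not confirm that Flock returns exactly this value, and `masked_euclid_sketch` is a hypothetical name.

```ruby
# Mean squared difference over the positions where both masks are 1.
# Assumption: no square root and division by the number of compared positions.
def masked_euclid_sketch(v1, v2, m1 = nil, m2 = nil)
  m1 ||= Array.new(v1.size, 1)
  m2 ||= Array.new(v2.size, 1)
  used = (0...v1.size).select { |i| m1[i] == 1 && m2[i] == 1 }
  return 0.0 if used.empty?

  used.sum { |i| (v1[i] - v2[i])**2 }.to_f / used.size
end

unmasked = masked_euclid_sketch([0, 0], [3, 4])                           # (9 + 16) / 2
masked   = masked_euclid_sketch([0, 0, 0], [1, 1, 1], [1, 1, 0], [1, 1, 0]) # third element ignored
```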

.kcluster(size, data, options = {}) ⇒ Hash

Clusters data using k-means or k-medians.

Examples:


data = []
data << %w(apple orange)
data << %w(black white)
data << %w(white cyan)
data << %w(apple orange)
data << %w(apple)
result = Flock.kcluster(2, data, sparse: true, seed: Flock::SEED_RANDOM)

Parameters:

  • size (Fixnum)

    Number of clusters the data points are grouped into.

  • data (Array)

    An array of arrays of sparse or dense data, or an array of hashes of sparse data. Dense data should always be in numeric form. Sparse data values are converted to a dense row format by looking at the unique values and then converting each data point into a numeric vector that represents the presence or absence of a value in that data point.

  • options (Hash) (defaults to: {})

    a customizable set of options

Options Hash (options):

  • :mask (Array)

    An array of arrays of 1s and 0s denoting whether each element of a data point is used when computing distance (defaults to: all-1 vectors).

  • :weights (Array)

    Numeric weight for each data point (defaults to: all 1 vector).

  • :transpose (true, false)

    Transpose the dense data matrix (defaults to: false).

  • :iterations (Fixnum)

    Number of iterations to be run (defaults to: 100).

  • :method (Fixnum)

    Clustering method

    • Flock::METHOD_AVERAGE (default)

    • Flock::METHOD_MEDIAN

  • :metric (Fixnum)

    Distance measure, one of the following

    • Flock::METRIC_EUCLIDIAN (default)

    • Flock::METRIC_CITY_BLOCK

    • Flock::METRIC_CORRELATION

    • Flock::METRIC_ABSOLUTE_CORRELATION

    • Flock::METRIC_UNCENTERED_CORRELATION

    • Flock::METRIC_ABSOLUTE_UNCENTERED_CORRELATION

    • Flock::METRIC_SPEARMAN

    • Flock::METRIC_KENDALL

  • :seed (Fixnum)

    Initial seeding of clusters

    • Flock::SEED_RANDOM (default)

    • Flock::SEED_KMEANS_PLUSPLUS

    • Flock::SEED_SPREADOUT

Returns:

  • (Hash)

    :cluster  => [Array],
    :centroid => [Array<Array>],
    :error    => [Numeric],
    :repeated => [Fixnum]
    



# File 'lib/flock.rb', line 104

def self.kcluster size, data, options = {}
  options[:sparse] = true if sparse?(data[0])
  if options[:sparse]
    data, options[:weights] = densify(data, options[:weights])
    options[:mask]          = nil
  end
  do_kcluster(size, data, options)
end
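
The sparse-to-dense conversion described for the data parameter can be pictured with the sketch below. It is a stand-in for the private densify helper, covering only the array-of-arrays case; the name `densify_sketch`, the sorted column order, and the returned vocabulary are assumptions made for illustration.

```ruby
# Build one presence/absence column per unique value seen in the sparse rows.
def densify_sketch(rows)
  vocabulary = rows.flatten.uniq.sort
  dense = rows.map do |row|
    vocabulary.map { |term| row.include?(term) ? 1 : 0 }
  end
  [vocabulary, dense]
end

data = [%w(apple orange), %w(black white), %w(white cyan), %w(apple orange), %w(apple)]
vocabulary, dense = densify_sketch(data)
# vocabulary => ["apple", "black", "cyan", "orange", "white"]
# dense[0]   => [1, 0, 0, 1, 0]
# dense[4]   => [1, 0, 0, 0, 0]
```

Rows that share values end up with overlapping 1s, which is what lets a numeric distance measure treat them as close.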

.kendall_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object

Kendall distance measure

Parameters:

  • vector1 (Array)

    Numeric vector

  • vector2 (Array)

    Numeric vector

  • mask1 (Array) (defaults to: identity)

    Optional mask for vector1

  • mask2 (Array) (defaults to: identity)

    Optional mask for vector2



# File 'ext/flock.c', line 558

VALUE rb_kendall(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, kendall);
}

.kmeans(size, data, options = {}) ⇒ Object

Deprecated.

Use kcluster instead.



# File 'lib/flock.rb', line 185

def self.kmeans size, data, options = {}
  kcluster(size, data, options)
end

.self_organizing_map(nx, ny, data, options = {}) ⇒ Hash

Arranges data points on a 2D grid without requiring a fixed number of clusters. In theory you can end up with at most nx × ny clusters, one per grid cell.

Examples:


data = []
data << %w(apple orange)
data << %w(black white)
data << %w(white cyan)
data << %w(apple orange)
data << %w(apple)
result = Flock.self_organizing_map(2, 2, data, sparse: true)

Parameters:

  • nx (Fixnum)

    Grid size in 1st dimension (x)

  • ny (Fixnum)

    Grid size in 2nd dimension (y)

  • data (Array)

    See Flock#kcluster

  • options (Hash) (defaults to: {})

    a customizable set of options

Options Hash (options):

  • :mask (Array)

    See Flock#kcluster

  • :transpose (true, false)

    See Flock#kcluster

  • :iterations (Fixnum)

    See Flock#kcluster

  • :metric (Fixnum)

    See Flock#kcluster

  • :tau (Numeric)

    Initial tau value for distance metric.

Returns:

  • (Hash)

    :cluster  => [Array<Array>],
    :centroid => [Array<Array>]
    



# File 'lib/flock.rb', line 139

def self.self_organizing_map nx, ny, data, options = {}
  options[:sparse] = true if sparse?(data[0])
  if options[:sparse]
    data, options[:weights] = densify(data, options[:weights])
    options[:mask]          = nil
  end
  do_self_organizing_map(nx, ny, data, options)
end
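
To make the "at most nx × ny clusters" point concrete, the sketch below groups data points by grid cell. The `result` hash is hand-written sample data shaped like the documented return value, not real Flock output, and it assumes that :cluster holds one [x, y] cell per data point.

```ruby
# Hypothetical result; the assumption is one [x, y] grid cell per data point.
result = { cluster: [[0, 0], [1, 1], [1, 1], [0, 0], [0, 1]] }
nx, ny = 2, 2

# Group data point indices by the grid cell they landed in.
cells = result[:cluster].each_with_index
                        .group_by { |cell, _| cell }
                        .transform_values { |pairs| pairs.map(&:last) }

cells.each { |(x, y), members| puts "cell (#{x}, #{y}): points #{members.inspect}" }
puts "#{cells.size} of #{nx * ny} cells occupied"
```

Each occupied cell acts as one cluster, so the number of clusters falls out of the data rather than being fixed up front.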

.sparse_kmeans(size, data, options = {}) ⇒ Object

Deprecated.

Use kcluster(size, data, sparse: true, …) instead.



# File 'lib/flock.rb', line 190

def self.sparse_kmeans size, data, options = {}
  kcluster(size, data, options.merge(sparse: true))
end

.sparse_self_organizing_map(nx, ny, data, options = {}) ⇒ Object

Deprecated.

Use self_organizing_map(nx, ny, data, sparse: true, …) instead.



# File 'lib/flock.rb', line 200

def self.sparse_self_organizing_map nx, ny, data, options = {}
  self_organizing_map(nx, ny, data, options.merge(sparse: true))
end

.sparse_treecluster(size, data, options = {}) ⇒ Object

Deprecated.

Use treecluster(size, data, sparse: true, …) instead.



# File 'lib/flock.rb', line 195

def self.sparse_treecluster size, data, options = {}
  treecluster(size, data, options.merge(sparse: true))
end

.spearman_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object

Spearman distance measure

Parameters:

  • vector1 (Array)

    Numeric vector

  • vector2 (Array)

    Numeric vector

  • mask1 (Array) (defaults to: identity)

    Optional mask for vector1

  • mask2 (Array) (defaults to: identity)

    Optional mask for vector2



# File 'ext/flock.c', line 543

VALUE rb_spearman(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, spearman);
}

.treecluster(size, data, options = {}) ⇒ Hash

Clusters data into hierarchies and then returns the clusters required using cut-tree.

Examples:


data = []
data << %w(apple orange)
data << %w(black white)
data << %w(white cyan)
data << %w(apple orange)
data << %w(apple)
result = Flock.treecluster(2, data, sparse: true)

Parameters:

  • size (Fixnum)

    Number of clusters required. (See Flock#kcluster)

  • data (Array)

    See Flock#kcluster

  • options (Hash) (defaults to: {})

    a customizable set of options

Options Hash (options):

  • :mask (Array)

    See Flock#kcluster

  • :transpose (true, false)

    See Flock#kcluster

  • :iterations (Fixnum)

    See Flock#kcluster

  • :metric (Fixnum)

    See Flock#kcluster

  • :method (Fixnum)

    Method to use for treecluster

    • Flock::METHOD_SINGLE_LINKAGE

    • Flock::METHOD_MAXIMUM_LINKAGE

    • Flock::METHOD_AVERAGE_LINKAGE (default)

    • Flock::METHOD_CENTROID_LINKAGE

Returns:

  • (Hash)

    :cluster => [Array]
    



# File 'lib/flock.rb', line 175

def self.treecluster size, data, options = {}
  options[:sparse] = true if sparse?(data[0])
  if options[:sparse]
    data, options[:weights] = densify(data, options[:weights])
    options[:mask]          = nil
  end
  do_treecluster(size, data, options)
end
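
The "build a hierarchy, then cut it" idea can be sketched in plain Ruby as follows. This version uses single linkage (the counterpart of Flock::METHOD_SINGLE_LINKAGE, not the average-linkage default) with ordinary Euclidean distance, and stops merging once the requested number of clusters remains. It illustrates the general technique, not Flock's implementation; `euclidean` and `single_linkage_labels` are hypothetical names.

```ruby
# Ordinary Euclidean distance between two numeric vectors.
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Merge the two closest clusters (single linkage) until only k remain,
# then return one cluster label per data point.
def single_linkage_labels(points, k)
  clusters = points.each_index.map { |i| [i] }
  while clusters.size > [k, 1].max
    a, b = clusters.combination(2).min_by do |c1, c2|
      # Single linkage: cluster distance is the closest pair of members.
      c1.product(c2).map { |i, j| euclidean(points[i], points[j]) }.min
    end
    clusters.delete(b)
    a.concat(b)
  end
  labels = Array.new(points.size)
  clusters.each_with_index { |members, label| members.each { |i| labels[i] = label } }
  labels
end

points = [[0, 0], [0, 1], [5, 5], [5, 6], [0.5, 0.5]]
labels = single_linkage_labels(points, 2)
```

Stopping at k clusters is equivalent to cutting the full merge tree at the level where k branches remain.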

.uncentered_correlation_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object

Uncentered correlation distance measure

Parameters:

  • vector1 (Array)

    Numeric vector

  • vector2 (Array)

    Numeric vector

  • mask1 (Array) (defaults to: identity)

    Optional mask for vector1

  • mask2 (Array) (defaults to: identity)

    Optional mask for vector2



# File 'ext/flock.c', line 498

VALUE rb_ucorrelation(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, ucorrelation);
}