Module: Flock
- Defined in: lib/flock.rb, ext/flock.c
Overview
Ruby bindings to data clustering algorithms provided by Cluster 3.0
Algorithms implemented
- K-Means, K-Medians, K-Means++
- Self-Organizing Maps
- Tree Cluster or Hierarchical Clustering
Synopsis
require 'pp'
require 'flock'
# sparse data.
data = []
data << %w(apple orange)
data << %w(black white)
data << %w(white cyan)
data << %w(apple orange)
data << %w(apple)
pp Flock.kcluster(2, data, sparse: true, seed: Flock::SEED_RANDOM)
pp Flock.kcluster(2, data, sparse: true, seed: Flock::SEED_KMEANS_PLUSPLUS)
pp Flock.kcluster(2, data, sparse: true, seed: Flock::SEED_SPREADOUT)
# dense data.
# only seven rows are filled in below, so allocate seven rows.
data = Array.new(7) {[]}
mask = Array.new(7) {[]}
weights = Array.new(7) {1.0}
data[0][0] = 0.1; data[0][1] = 0.0;
data[1][0] = 1.4; data[1][1] = 1.3;
data[2][0] = 1.2; data[2][1] = 2.5;
data[3][0] = 2.3; data[3][1] = 1.5;
data[4][0] = 1.7; data[4][1] = 0.7;
data[5][0] = 0.0; data[5][1] = 3.9;
data[6][0] = 6.7; data[6][1] = 3.9;
mask[0][0] = 1; mask[0][1] = 1;
mask[1][0] = 1; mask[1][1] = 1;
mask[2][0] = 1; mask[2][1] = 1;
mask[3][0] = 1; mask[3][1] = 1;
mask[4][0] = 1; mask[4][1] = 1;
mask[5][0] = 0; mask[5][1] = 1;
mask[6][0] = 1; mask[6][1] = 1;
pp Flock.kcluster(2, data, mask: mask, weights: weights)
See
- examples/* for more examples.
- README.rdoc for more details.
- API.rdoc is a public API overview.
Constant Summary
- METHOD_AVERAGE =
kcluster method - K-Means
INT2NUM('a')
- METHOD_MEDIAN =
kcluster method - K-Medians
INT2NUM('m')
- METHOD_SINGLE_LINKAGE =
treecluster method - pairwise single-linkage clustering
INT2NUM('s')
- METHOD_MAXIMUM_LINKAGE =
treecluster method - pairwise maximum- (or complete-) linkage clustering
INT2NUM('m')
- METHOD_AVERAGE_LINKAGE =
treecluster method - pairwise average-linkage clustering
INT2NUM('a')
- METHOD_CENTROID_LINKAGE =
treecluster method - pairwise centroid-linkage clustering
INT2NUM('c')
- METRIC_EUCLIDIAN =
INT2NUM('e')
- METRIC_CITY_BLOCK =
INT2NUM('b')
- METRIC_CORRELATION =
INT2NUM('c')
- METRIC_ABSOLUTE_CORRELATION =
INT2NUM('a')
- METRIC_UNCENTERED_CORRELATION =
INT2NUM('u')
- METRIC_ABSOLUTE_UNCENTERED_CORRELATION =
INT2NUM('x')
- METRIC_SPEARMAN =
INT2NUM('s')
- METRIC_KENDALL =
INT2NUM('k')
- SEED_RANDOM =
Randomly assign data points to clusters using a uniform distribution.
INT2NUM(0)
- SEED_KMEANS_PLUSPLUS =
K-Means++ style initialization where data points are probabilistically assigned to clusters based on their distance from the closest cluster.
INT2NUM(1)
- SEED_SPREADOUT =
Deterministic cluster assignment by spreading out initial clusters as far away from each other as possible.
INT2NUM(2)
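One way to read the "spreading out" described for SEED_SPREADOUT is a farthest-first traversal: start from one point and repeatedly add the point that lies farthest from every center chosen so far. The sketch below is pure Ruby and only illustrates that idea. The helper names (squared_distance, spread_out_centers) and the choice of the first row as the starting point are assumptions, not Flock's actual C implementation.

```ruby
# Hypothetical helpers, not part of Flock.
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Pick k initial centers by farthest-first traversal over dense rows.
def spread_out_centers(points, k)
  centers = [points.first] # assumption: seed with the first row
  while centers.size < k
    # Add the point whose nearest already-chosen center is farthest away.
    centers << points.max_by do |point|
      centers.map { |center| squared_distance(point, center) }.min
    end
  end
  centers
end

points  = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [9.0, 0.0]]
centers = spread_out_centers(points, 3)
```

Because no random numbers are involved, the same input always yields the same centers, which matches the "deterministic" wording above.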
Class Method Summary
- .absolute_correlation_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
Absolute correlation distance measure.
- .absolute_uncentered_correlation_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
Absolute uncentered correlation distance measure.
- .cityblock_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
City-block distance measure.
- .correlation_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
Correlation distance measure.
- .euclidian_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
Euclidean distance measure.
- .kcluster(size, data, options = {}) ⇒ Hash
Cluster using k-means and k-medians.
- .kendall_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
Kendall distance measure.
- .kmeans(size, data, options = {}) ⇒ Object
Deprecated. Use Flock.kcluster instead.
- .self_organizing_map(nx, ny, data, options = {}) ⇒ Hash
Arranges data points on a 2D grid without having to specify a fixed cluster size.
- .sparse_kmeans(size, data, options = {}) ⇒ Object
Deprecated. Use Flock.kcluster(size, data, sparse: true, …) instead.
- .sparse_self_organizing_map(nx, ny, data, options = {}) ⇒ Object
Deprecated. Use Flock.self_organizing_map(nx, ny, data, sparse: true, …) instead.
- .sparse_treecluster(size, data, options = {}) ⇒ Object
Deprecated. Use Flock.treecluster(size, data, sparse: true, …) instead.
- .spearman_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
Spearman distance measure.
- .treecluster(size, data, options = {}) ⇒ Hash
Clusters data into hierarchies and then returns the clusters required using cut-tree.
- .uncentered_correlation_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
Uncentered correlation distance measure.
Class Method Details
.absolute_correlation_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
Absolute correlation distance measure
# File 'ext/flock.c', line 513
VALUE rb_acorrelation(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, acorrelation);
}
.absolute_uncentered_correlation_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
Absolute uncentered correlation distance measure
# File 'ext/flock.c', line 528
VALUE rb_uacorrelation(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, uacorrelation);
}
.cityblock_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
City-block distance measure.
# File 'ext/flock.c', line 468
VALUE rb_cityblock(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, cityblock);
}
.correlation_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
Correlation distance measure
# File 'ext/flock.c', line 483
VALUE rb_correlation(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, correlation);
}
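For intuition about the correlation-based measures, the textbook definitions are 1 - r for the correlation distance and 1 - |r| for the absolute variant, where r is the Pearson correlation of the two vectors. The pure-Ruby sketch below follows those definitions. The helper names are hypothetical, masks are ignored for brevity, and the C extension may treat edge cases (such as zero variance) differently.

```ruby
# Hypothetical helpers illustrating the textbook formulas; not Flock's code.
def pearson(a, b)
  n = a.size.to_f
  mean_a = a.sum / n
  mean_b = b.sum / n
  covariance = a.zip(b).sum { |x, y| (x - mean_a) * (y - mean_b) }
  variance_a = a.sum { |x| (x - mean_a)**2 }
  variance_b = b.sum { |y| (y - mean_b)**2 }
  covariance / Math.sqrt(variance_a * variance_b)
end

# 0 for perfectly correlated vectors, 2 for perfectly anti-correlated ones.
def correlation_distance_sketch(a, b)
  1.0 - pearson(a, b)
end

# Treats correlation and anti-correlation as equally "close".
def absolute_correlation_distance_sketch(a, b)
  1.0 - pearson(a, b).abs
end

up   = [1.0, 2.0, 3.0]
down = [3.0, 2.0, 1.0]
plain    = correlation_distance_sketch(up, down)
absolute = absolute_correlation_distance_sketch(up, down)
```

The two vectors here are perfectly anti-correlated, so the plain distance sits at its maximum while the absolute distance is zero.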
.euclidian_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
Euclidean distance measure.
# File 'ext/flock.c', line 453
VALUE rb_euclid(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, euclid);
}
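The distance methods all accept two optional masks. A minimal sketch of how masks could enter a distance computation is shown below, assuming that a 0 in either mask marks the element as missing and that the default "identity" mask means every element is present. The helper names are hypothetical, and the formulas are the textbook ones: the underlying C library may weight or normalize the result differently.

```ruby
# Hypothetical helpers, not Flock's C implementation.
# Returns the [x, y] pairs that are present in both vectors.
def masked_pairs(v1, v2, m1 = nil, m2 = nil)
  m1 ||= Array.new(v1.size, 1) # assumed identity mask: all present
  m2 ||= Array.new(v2.size, 1)
  present = v1.each_index.select { |i| m1[i] != 0 && m2[i] != 0 }
  present.map { |i| [v1[i], v2[i]] }
end

def euclidean_sketch(v1, v2, m1 = nil, m2 = nil)
  Math.sqrt(masked_pairs(v1, v2, m1, m2).sum { |x, y| (x - y)**2 })
end

def cityblock_sketch(v1, v2, m1 = nil, m2 = nil)
  masked_pairs(v1, v2, m1, m2).sum { |x, y| (x - y).abs }
end

a = [0.0, 1.0]
b = [3.0, 5.0]
full_euclidean   = euclidean_sketch(a, b)          # both elements used
masked_euclidean = euclidean_sketch(a, b, [0, 1])  # first element of a is missing
full_cityblock   = cityblock_sketch(a, b)
```

Masking the first element removes its contribution entirely, which is the same effect the mask array has in the dense kcluster example in the synopsis.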
.kcluster(size, data, options = {}) ⇒ Hash
Cluster using k-means and k-medians.
# File 'lib/flock.rb', line 104
def self.kcluster size, data, options = {}
  options[:sparse] = true if sparse?(data[0])
  if options[:sparse]
    data, options[:weights] = densify(data, options[:weights])
    options[:mask] = nil
  end
  do_kcluster(size, data, options)
end
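The sparse branch above converts the input with densify before clustering. That helper's source is not shown on this page, so the following is only a guess at the general idea: turn each list of tokens into a fixed-width 0/1 vector with one column per distinct token. The name densify_sketch, the sorted column order, and the default of one weight per column are all assumptions.

```ruby
# Hypothetical stand-in for the densify helper; the real one may differ.
def densify_sketch(rows, weights = nil)
  vocabulary = rows.flatten.uniq.sort # one column per distinct token
  dense = rows.map do |row|
    vocabulary.map { |token| row.include?(token) ? 1.0 : 0.0 }
  end
  # Assumption: default to a weight of 1.0 for every column.
  [dense, weights || Array.new(vocabulary.size, 1.0)]
end

rows = [%w(apple orange), %w(black white), %w(apple)]
dense, weights = densify_sketch(rows)
# Columns are apple, black, orange, white (sorted order).
```

Returning the data and the weights together mirrors the two-value assignment in the listing above.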
.kendall_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
Kendall distance measure
# File 'ext/flock.c', line 558
VALUE rb_kendall(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, kendall);
}
.kmeans(size, data, options = {}) ⇒ Object
Deprecated. Use kcluster instead.
# File 'lib/flock.rb', line 185
def self.kmeans size, data, options = {}
  kcluster(size, data, options)
end
.self_organizing_map(nx, ny, data, options = {}) ⇒ Hash
Arranges data points on a 2D grid without having to specify a fixed cluster size, so in theory you could have up to nx × ny clusters.
# File 'lib/flock.rb', line 139
def self.self_organizing_map nx, ny, data, options = {}
  options[:sparse] = true if sparse?(data[0])
  if options[:sparse]
    data, options[:weights] = densify(data, options[:weights])
    options[:mask] = nil
  end
  do_self_organizing_map(nx, ny, data, options)
end
.sparse_kmeans(size, data, options = {}) ⇒ Object
Deprecated. Use kcluster(size, data, sparse: true, …) instead.
# File 'lib/flock.rb', line 190
def self.sparse_kmeans size, data, options = {}
  kcluster(size, data, options.merge(sparse: true))
end
.sparse_self_organizing_map(nx, ny, data, options = {}) ⇒ Object
Deprecated. Use self_organizing_map(nx, ny, data, sparse: true, …) instead.
# File 'lib/flock.rb', line 200
def self.sparse_self_organizing_map nx, ny, data, options = {}
  self_organizing_map(nx, ny, data, options.merge(sparse: true))
end
.sparse_treecluster(size, data, options = {}) ⇒ Object
Deprecated. Use treecluster(size, data, sparse: true, …) instead.
# File 'lib/flock.rb', line 195
def self.sparse_treecluster size, data, options = {}
  treecluster(size, data, options.merge(sparse: true))
end
.spearman_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
Spearman distance measure
# File 'ext/flock.c', line 543
VALUE rb_spearman(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, spearman);
}
.treecluster(size, data, options = {}) ⇒ Hash
Clusters data into hierarchies and then returns the clusters required using cut-tree.
# File 'lib/flock.rb', line 175
def self.treecluster size, data, options = {}
  options[:sparse] = true if sparse?(data[0])
  if options[:sparse]
    data, options[:weights] = densify(data, options[:weights])
    options[:mask] = nil
  end
  do_treecluster(size, data, options)
end
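To make "clusters data into hierarchies and then cuts the tree" concrete, here is a small pure-Ruby sketch of agglomerative single-linkage clustering. It repeatedly merges the two closest clusters and stops once size clusters remain, which gives the same grouping as building the full tree and cutting it at that level. The function names and the list-of-row-indices return shape are hypothetical and unrelated to the Hash that Flock.treecluster returns.

```ruby
def euclid(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Hypothetical sketch, not Flock's implementation.
# Returns `size` clusters, each an array of row indices into `data`.
def single_linkage_sketch(size, data)
  clusters = data.each_index.map { |i| [i] } # start with one cluster per row
  while clusters.size > size
    # Single linkage: cluster distance is the smallest pairwise row distance.
    pair = clusters.combination(2).min_by do |c1, c2|
      c1.product(c2).map { |i, j| euclid(data[i], data[j]) }.min
    end
    clusters -= pair               # remove the two closest clusters
    clusters << pair.flatten.sort  # and add their union
  end
  clusters.sort
end

data   = [[0.1, 0.0], [1.4, 1.3], [6.7, 3.9], [6.5, 4.0]]
groups = single_linkage_sketch(2, data)
```

This brute-force version recomputes all pairwise distances on every merge, so it is only suitable for tiny inputs. It is meant to show the merging order, not to be efficient.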
.uncentered_correlation_distance(vector1, vector2, mask1 = identity, mask2 = identity) ⇒ Object
Uncentered correlation distance measure
# File 'ext/flock.c', line 498
VALUE rb_ucorrelation(int argc, VALUE *argv, VALUE self) {
    VALUE v1, v2, m1, m2;
    rb_scan_args(argc, argv, "22", &v1, &v2, &m1, &m2);
    return rb_distance(v1, m1, v2, m2, ucorrelation);
}