Module: ClusterKit

Defined in:
lib/clusterkit.rb,
lib/clusterkit/hnsw.rb,
lib/clusterkit/utils.rb,
lib/clusterkit/silence.rb,
lib/clusterkit/version.rb,
lib/clusterkit/clustering.rb,
lib/clusterkit/configuration.rb,
lib/clusterkit/preprocessing.rb,
lib/clusterkit/data_validator.rb,
lib/clusterkit/dimensionality.rb,
lib/clusterkit/clustering/hdbscan.rb,
lib/clusterkit/dimensionality/pca.rb,
lib/clusterkit/dimensionality/svd.rb,
lib/clusterkit/hdbscan_api_design.rb,
lib/clusterkit/dimensionality/umap.rb

Overview

API Design for HDBSCAN to match KMeans pattern

Defined Under Namespace

Modules: Clustering, DataValidator, Dimensionality, Preprocessing, Silence, Utils

Classes: Configuration, ConvergenceError, DataError, DimensionError, DisconnectedGraphError, Error, HNSW, InsufficientDataError, InvalidParameterError, IsolatedPointError

Constant Summary

VERSION = "0.2.6"

Class Attribute Details

.configuration ⇒ Object

Returns the value of attribute configuration.



# File 'lib/clusterkit/configuration.rb', line 5

def configuration
  @configuration
end

Class Method Details

.configure {|configuration| ... } ⇒ Object

Yields:

  • (Configuration)

    The module-level configuration object, created on the first call

# File 'lib/clusterkit/configuration.rb', line 8

def self.configure
  self.configuration ||= Configuration.new
  yield(configuration) if block_given?
end
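The configure-with-a-block pattern above can be tried without the gem installed. The following is a minimal sketch on a stand-in module: `ConfigDemo` and the `verbose` option are hypothetical names chosen for illustration and are not taken from ClusterKit. It shows that the configuration object is built once and then reused by later calls.

```ruby
# Hypothetical stand-in for the pattern; ConfigDemo and `verbose` are
# illustrative names, not part of ClusterKit.
module ConfigDemo
  class Configuration
    attr_accessor :verbose

    def initialize
      @verbose = false
    end
  end

  class << self
    attr_accessor :configuration

    # Same shape as the documented method: memoize, then yield if a block is given.
    def configure
      self.configuration ||= Configuration.new
      yield(configuration) if block_given?
    end
  end
end

# First call creates the configuration and lets the block mutate it.
ConfigDemo.configure { |config| config.verbose = true }

# A later call without a block leaves the existing object in place.
ConfigDemo.configure
```

Because of the `||=`, settings made in one `configure` block survive subsequent calls; nothing is reset unless the attribute is assigned directly.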

.estimate_dimension(data, k: 10) ⇒ Float

Estimate intrinsic dimension of data

Parameters:

  • data (Array, Numo::NArray)

    Input data

  • k (Integer) (defaults to: 10)

    Number of neighbors to consider

Returns:

  • (Float)

    Estimated intrinsic dimension



# File 'lib/clusterkit.rb', line 67

def estimate_dimension(data, k: 10)
  Utils.estimate_intrinsic_dimension(data, k_neighbors: k)
end

.kmeans(data, k: nil, k_range: 2..10, **options) ⇒ Array

Quick K-means with automatic k detection

Parameters:

  • data (Array)

    Input data

  • k (Integer, nil) (defaults to: nil)

    Number of clusters (auto-detect if nil)

  • k_range (Range) (defaults to: 2..10)

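    Range of candidate k values searched when k is nil

  • options (Hash)

    Additional keyword arguments passed to Clustering::KMeans.new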

Returns:

  • (Array)

    Cluster labels



# File 'lib/clusterkit.rb', line 86

def kmeans(data, k: nil, k_range: 2..10, **options)
  k ||= Clustering::KMeans.optimal_k(data, k_range: k_range)
  kmeans = Clustering::KMeans.new(k: k, **options)
  kmeans.fit_predict(data)
end
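The `k ||= ...` line means an explicit `k` always wins and detection runs only when `k` is nil. A minimal sketch of that fallback follows, using a stand-in module: `KMeansDemo`, `pick_k`, and `resolve_k` are hypothetical names, and `pick_k` simply returns the first value of the range rather than reproducing whatever `Clustering::KMeans.optimal_k` actually computes.

```ruby
# Hypothetical stand-in; none of these names belong to ClusterKit.
module KMeansDemo
  # Placeholder selector: returns the smallest candidate in the range.
  # The real selection logic is not shown in this document.
  def self.pick_k(_data, k_range:)
    k_range.first
  end

  # Mirrors the documented fallback: use k if given, otherwise detect it.
  def self.resolve_k(data, k: nil, k_range: 2..10)
    k || pick_k(data, k_range: k_range)
  end
end

data = [[0.0, 0.1], [0.2, 0.1], [5.0, 5.1]]

explicit = KMeansDemo.resolve_k(data, k: 3)            # explicit k is used as given
detected = KMeansDemo.resolve_k(data, k_range: 4..6)   # nil k falls back to the selector
```

Passing `k:` explicitly therefore skips the detection step entirely, and `k_range` has no effect in that case.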

.pca(data, n_components: 2) ⇒ Array

Quick PCA

Parameters:

  • data (Array)

    Input data

  • n_components (Integer) (defaults to: 2)

    Number of dimensions in output

Returns:

  • (Array)

    Transformed data



# File 'lib/clusterkit.rb', line 52

def pca(data, n_components: 2)
  pca = Dimensionality::PCA.new(n_components: n_components)
  pca.fit_transform(data)
end

.svd(matrix, k, n_iter: 2) ⇒ Array

Perform SVD

Parameters:

  • matrix (Array)

    Input matrix

  • k (Integer)

    Number of components

  • n_iter (Integer) (defaults to: 2)

    Number of iterations for randomized algorithm

Returns:

  • (Array)

    U, S, V matrices



# File 'lib/clusterkit.rb', line 76

def svd(matrix, k, n_iter: 2)
  svd = Dimensionality::SVD.new(n_components: k, n_iter: n_iter)
  svd.fit_transform(matrix)
end

.tsne(data, n_components: 2, **options) ⇒ Object

Deprecated. Not implemented; use UMAP instead.

t-SNE is not yet implemented. Calling this method always raises NotImplementedError.

Raises:

  • (NotImplementedError)


# File 'lib/clusterkit.rb', line 59

def tsne(data, n_components: 2, **options)
  raise NotImplementedError, "t-SNE is not yet implemented. Please use UMAP instead, which provides similar dimensionality reduction capabilities."
end
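One practical consequence for callers: in Ruby, `NotImplementedError` inherits from `ScriptError`, not `StandardError`, so a bare `rescue` or `rescue StandardError` does not catch it. The sketch below demonstrates this on a stand-in method with the same raising behavior; `TsneDemo` and `try_tsne` are hypothetical names, and the error message is shortened.

```ruby
# Hypothetical stand-in that raises the same exception class as the
# documented method; TsneDemo is not part of ClusterKit.
module TsneDemo
  def self.tsne(data, n_components: 2, **options)
    raise NotImplementedError, "t-SNE is not yet implemented. Please use UMAP instead."
  end
end

def try_tsne(data)
  TsneDemo.tsne(data)
rescue StandardError
  # Never reached for NotImplementedError, which is a ScriptError.
  :caught_as_standard_error
rescue NotImplementedError => e
  # Only an explicit clause like this one sees the exception.
  e.message
end

outcome = try_tsne([[1.0, 2.0], [3.0, 4.0]])
```

Code that wants to fall back to another method when t-SNE is requested needs to name `NotImplementedError` explicitly in its `rescue` clause.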

.umap(data, n_components: 2, **options) ⇒ Array

Quick UMAP embedding

Parameters:

  • data (Array)

    Input data

  • n_components (Integer) (defaults to: 2)

    Number of dimensions in output

  • options (Hash)

    Additional keyword arguments passed to Dimensionality::UMAP.new

Returns:

  • (Array)

    Embedded data



# File 'lib/clusterkit.rb', line 43

def umap(data, n_components: 2, **options)
  umap = Dimensionality::UMAP.new(n_components: n_components, **options)
  umap.fit_transform(data)
end