Class: ClusterKit::Dimensionality::UMAP

Inherits:
Object
Defined in:
lib/clusterkit/dimensionality/umap.rb

Constructor Details

#initialize(n_components: 2, n_neighbors: 15, random_seed: nil, nb_grad_batch: 10, nb_sampling_by_edge: 8) ⇒ UMAP

Initialize a new UMAP instance

Parameters:

  • n_components (Integer) (defaults to: 2)

    Target number of dimensions in the embedding.

  • n_neighbors (Integer) (defaults to: 15)

    Number of neighbors used for manifold approximation.

  • random_seed (Integer, nil) (defaults to: nil)

    Random seed for reproducibility.

  • nb_grad_batch (Integer) (defaults to: 10)

    Number of gradient descent batches. Controls the number of training iterations: lower values are faster but less accurate.

  • nb_sampling_by_edge (Integer) (defaults to: 8)

    Number of negative samples per edge. Controls sampling quality: lower values are faster but less accurate.



# File 'lib/clusterkit/dimensionality/umap.rb', line 22

def initialize(n_components: 2, n_neighbors: 15, random_seed: nil,
               nb_grad_batch: 10, nb_sampling_by_edge: 8)
  @n_components = n_components
  @n_neighbors = n_neighbors
  @random_seed = random_seed
  @nb_grad_batch = nb_grad_batch
  @nb_sampling_by_edge = nb_sampling_by_edge
  @fitted = false
  # Don't create RustUMAP yet - will be created in fit/fit_transform with adjusted parameters
  @rust_umap = nil
end

Instance Attribute Details

#n_components ⇒ Object (readonly)

Returns the value of attribute n_components.



# File 'lib/clusterkit/dimensionality/umap.rb', line 12

def n_components
  @n_components
end

#n_neighbors ⇒ Object (readonly)

Returns the value of attribute n_neighbors.



# File 'lib/clusterkit/dimensionality/umap.rb', line 12

def n_neighbors
  @n_neighbors
end

#nb_grad_batch ⇒ Object (readonly)

Returns the value of attribute nb_grad_batch.



# File 'lib/clusterkit/dimensionality/umap.rb', line 12

def nb_grad_batch
  @nb_grad_batch
end

#nb_sampling_by_edge ⇒ Object (readonly)

Returns the value of attribute nb_sampling_by_edge.



# File 'lib/clusterkit/dimensionality/umap.rb', line 12

def nb_sampling_by_edge
  @nb_sampling_by_edge
end

#random_seed ⇒ Object (readonly)

Returns the value of attribute random_seed.



# File 'lib/clusterkit/dimensionality/umap.rb', line 12

def random_seed
  @random_seed
end

Class Method Details

.load_data(path) ⇒ Array<Array<Float>>

Load transformed data from a JSON file

Parameters:

  • path (String)

    Path to the saved data

Returns:

  • (Array<Array<Float>>)

    The loaded data

Raises:

  • (ArgumentError)

    If the file doesn’t exist



# File 'lib/clusterkit/dimensionality/umap.rb', line 153

def self.load_data(path)
  raise ArgumentError, "File not found: #{path}" unless File.exist?(path)
  JSON.parse(File.read(path))
end

.load_model(path) ⇒ UMAP

Load a fitted model from a file

Parameters:

  • path (String)

    Path to the saved model

Returns:

  • (UMAP)

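    A new UMAP instance with the loaded model. Its n_components, n_neighbors, and random_seed attributes are nil, because the loader does not restore them from the model file.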

Raises:

  • (ArgumentError)

    If the file doesn’t exist



# File 'lib/clusterkit/dimensionality/umap.rb', line 123

def self.load_model(path)
  raise ArgumentError, "File not found: #{path}" unless File.exist?(path)

  # Load the Rust model (access private constant)
  rust_umap = ::ClusterKit.const_get(:RustUMAP).load_model(path)

  # Create a new UMAP instance with the loaded model
  instance = allocate
  instance.instance_variable_set(:@rust_umap, rust_umap)
  instance.instance_variable_set(:@fitted, true)
  # The model file should contain these parameters, but for now we don't have access
  instance.instance_variable_set(:@n_components, nil)
  instance.instance_variable_set(:@n_neighbors, nil)
  instance.instance_variable_set(:@random_seed, nil)

  instance
end
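
The loader above builds the instance with `allocate`, which skips `initialize`, and then fills in state with `instance_variable_set`. The sketch below shows that pattern on a hypothetical `Reducer` class (a stand-in for illustration, not part of ClusterKit), to make the consequence visible: any attribute the loader does not assign reads as nil rather than as its constructor default.

```ruby
# Hypothetical stand-in class, used only to illustrate allocate + instance_variable_set.
class Reducer
  attr_reader :n_components, :nb_grad_batch

  def initialize(n_components: 2, nb_grad_batch: 10)
    @n_components = n_components
    @nb_grad_batch = nb_grad_batch
    @fitted = false
  end

  def fitted?
    @fitted
  end

  # Mirrors the shape of load_model: bypass initialize and set state directly.
  def self.restore
    instance = allocate # initialize is not called, so no defaults are applied
    instance.instance_variable_set(:@fitted, true)
    instance.instance_variable_set(:@n_components, nil)
    instance # @nb_grad_batch is never assigned
  end
end

RESTORED = Reducer.restore
puts RESTORED.fitted?               # true
puts RESTORED.n_components.inspect  # nil
puts RESTORED.nb_grad_batch.inspect # nil (unset instance variables read as nil)
```

Code that inspects hyperparameters on a loaded model should therefore be prepared for nil values.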

.save_data(data, path) ⇒ Object

Save transformed data to a JSON file

Parameters:

  • data (Array<Array<Float>>)

    Transformed data to save

  • path (String)

    Path where to save the data



# File 'lib/clusterkit/dimensionality/umap.rb', line 144

def self.save_data(data, path)
  FileUtils.mkdir_p(File.dirname(path)) unless File.dirname(path) == '.'
  File.write(path, JSON.pretty_generate(data))
end
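
A round trip through these two helpers can be sketched without ClusterKit installed. The methods `save_embedding` and `load_embedding` below are stand-alone copies of the bodies shown above, renamed for illustration; `round_trip` and the sample values are assumptions made for the example.

```ruby
require "json"
require "fileutils"
require "tmpdir"

# Stand-alone copies of the two helper bodies, so the sketch runs without ClusterKit.
def save_embedding(data, path)
  FileUtils.mkdir_p(File.dirname(path)) unless File.dirname(path) == "."
  File.write(path, JSON.pretty_generate(data))
end

def load_embedding(path)
  raise ArgumentError, "File not found: #{path}" unless File.exist?(path)
  JSON.parse(File.read(path))
end

# Write a small 2-D embedding into a nested directory that does not exist yet,
# then read it back. The temporary directory is removed when the block exits.
def round_trip(data)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "results", "embedding.json")
    save_embedding(data, path)
    load_embedding(path)
  end
end

EMBEDDING = [[0.5, -1.25], [2.0, 3.75]].freeze
puts round_trip(EMBEDDING) == EMBEDDING # true
```

Because the data is stored as plain JSON, the saved file can also be read by tools outside Ruby.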

Instance Method Details

#fit(data) ⇒ self

Note:

UMAP’s training process inherently produces embeddings. Because the underlying Rust implementation doesn’t separate training from transformation, fit runs fit_transform internally and discards the embeddings. Use fit_transform if you need both the fitted model and the transformed data.

Fit the model to the data (training)

Parameters:

  • data (Array<Array<Numeric>>)

    Training data as 2D array

Returns:

  • (self)

    Returns self for method chaining



# File 'lib/clusterkit/dimensionality/umap.rb', line 41

def fit(data)
  validate_input(data)

  # Always recreate RustUMAP for fit to ensure fresh fit
  @rust_umap = nil
  create_rust_umap_with_adjusted_params(data)

  # UMAP doesn't separate training from transformation internally,
  # so we call fit_transform but discard the result
  begin
    Silence.maybe_silence do
      @rust_umap.fit_transform(data)
    end
    @fitted = true
    self
  rescue StandardError => e
    handle_umap_error(e, data)
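  rescue => e
    # NOTE: a bare `rescue` is equivalent to `rescue StandardError`, so this clause
    # is unreachable after the one above and does not catch non-StandardError exceptions
    handle_umap_error(RuntimeError.new(e.message), data)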
  end
end
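
The error handling in this listing has two rescue clauses, and the second is a bare `rescue`. In Ruby a bare `rescue` matches `StandardError`, so it cannot catch anything the first clause has not already caught. The sketch below is a self-contained illustration of that rule; `which_clause` is a hypothetical helper, not ClusterKit code.

```ruby
# Returns a symbol naming which rescue clause caught the exception.
def which_clause(exception)
  raise exception
rescue StandardError
  :first_clause
rescue # bare rescue means rescue StandardError, so it never matches here
  :second_clause
end

puts which_clause(RuntimeError.new("boom"))  # first_clause
puts which_clause(ArgumentError.new("bad"))  # first_clause

# NotImplementedError descends from ScriptError, not StandardError,
# so neither clause catches it and it propagates to the caller.
ESCAPED = begin
  which_clause(NotImplementedError.new("native failure"))
rescue ScriptError => e
  e.class
end
puts ESCAPED # NotImplementedError
```

Catching exceptions outside `StandardError` would require naming them explicitly, which is a design decision for the library rather than something this sketch prescribes.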

#fit_transform(data) ⇒ Array<Array<Float>>

Fit the model and transform the data in one step

Parameters:

  • data (Array<Array<Numeric>>)

    Training data as 2D array

Returns:

  • (Array<Array<Float>>)

    Transformed data in reduced dimensions



# File 'lib/clusterkit/dimensionality/umap.rb', line 79

def fit_transform(data)
  validate_input(data)

  # Always recreate RustUMAP for fit_transform to ensure fresh fit
  @rust_umap = nil
  create_rust_umap_with_adjusted_params(data)

  begin
    result = Silence.maybe_silence do
      @rust_umap.fit_transform(data)
    end
    @fitted = true
    result
  rescue StandardError => e
    handle_umap_error(e, data)
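  rescue => e
    # NOTE: a bare `rescue` is equivalent to `rescue StandardError`, so this clause
    # is unreachable after the one above and does not catch non-StandardError exceptions
    handle_umap_error(RuntimeError.new(e.message), data)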
  end
end
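
A typical call might look like the sketch below. It assumes the gem is loaded with `require "clusterkit"` (the require name is a guess based on the file path) and that 100 samples are enough to pass input validation, whose threshold is not shown here. The helpers `sample_points` and `embed` are hypothetical names for this example. When the gem is not installed, `embed` returns nil instead of failing.

```ruby
begin
  require "clusterkit" # assumed require name; adjust to match your installation
rescue LoadError
  # ClusterKit is not installed; the sketch then only builds the sample data.
end

# Random points in [0, 1), seeded so the input is reproducible.
def sample_points(count: 100, dims: 5, seed: 42)
  rng = Random.new(seed)
  Array.new(count) { Array.new(dims) { rng.rand } }
end

# Returns the embedding, or nil when ClusterKit is unavailable.
def embed(points)
  return nil unless defined?(ClusterKit::Dimensionality::UMAP)

  umap = ClusterKit::Dimensionality::UMAP.new(n_components: 2, n_neighbors: 15, random_seed: 42)
  umap.fit_transform(points)
end

points = sample_points
embedding = embed(points)
puts embedding ? "embedded #{embedding.length} points" : "ClusterKit not available"
```

Each call to fit_transform recreates the underlying Rust object, so calling it again on new data discards the previous fit.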

#fitted? ⇒ Boolean

Check if the model has been fitted

Returns:

  • (Boolean)

    true if model is fitted, false otherwise



# File 'lib/clusterkit/dimensionality/umap.rb', line 102

def fitted?
  @fitted
end

#save_model(path) ⇒ Object

Save the fitted model to a file

Parameters:

  • path (String)

    Path where to save the model

Raises:

  • (RuntimeError)

    If model hasn’t been fitted yet



# File 'lib/clusterkit/dimensionality/umap.rb', line 109

def save_model(path)
  raise RuntimeError, "No model to save. Call fit or fit_transform first." unless fitted?

  # Ensure directory exists
  dir = File.dirname(path)
  FileUtils.mkdir_p(dir) unless dir == '.' || dir == '/'

  @rust_umap.save_model(path)
end
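
The directory guard in this method depends on what `File.dirname` returns for different kinds of paths. The sketch below reproduces only that check; `needs_mkdir?` is a hypothetical helper name and the file names are made up for illustration.

```ruby
# Mirrors the guard in save_model: is there a directory to create for this path?
def needs_mkdir?(path)
  dir = File.dirname(path)
  !(dir == "." || dir == "/")
end

puts File.dirname("umap.bin")            # .
puts File.dirname("/umap.bin")           # /
puts File.dirname("models/v1/umap.bin")  # models/v1
puts needs_mkdir?("umap.bin")            # false
puts needs_mkdir?("models/v1/umap.bin")  # true
```

A bare file name or a file directly under the root skips directory creation; any other path has its parent directories created with `FileUtils.mkdir_p` before the model is written.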

#transform(data) ⇒ Array<Array<Float>>

Transform data using the fitted model

Parameters:

  • data (Array<Array<Numeric>>)

    Data to transform

Returns:

  • (Array<Array<Float>>)

    Transformed data in reduced dimensions

Raises:

  • (RuntimeError)

    If model hasn’t been fitted yet



# File 'lib/clusterkit/dimensionality/umap.rb', line 68

def transform(data)
  raise RuntimeError, "Model must be fitted before transform. Call fit or fit_transform first." unless fitted?
  validate_input(data, check_min_samples: false)
  Silence.maybe_silence do
    @rust_umap.transform(data)
  end
end
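
The fitted-state guard and the `fit` return value combine into a simple calling pattern: transform fails until a fit has happened, and because fit returns self the two calls can be chained. The sketch below demonstrates the pattern with a hypothetical `ToyCenterer` class that merely subtracts column means. It is a stand-in so the example runs without ClusterKit and says nothing about how UMAP itself transforms data.

```ruby
# Hypothetical stand-in (not ClusterKit): "fitting" records the column means and
# "transforming" subtracts them. Only the guard and chaining pattern matter here.
class ToyCenterer
  def initialize
    @fitted = false
    @means = nil
  end

  def fitted?
    @fitted
  end

  def fit(data)
    @means = data.transpose.map { |col| col.sum / col.length.to_f }
    @fitted = true
    self # returning self allows fit(...).transform(...)
  end

  def transform(data)
    raise RuntimeError, "Model must be fitted before transform. Call fit first." unless fitted?

    data.map { |row| row.zip(@means).map { |value, mean| value - mean } }
  end
end

TRAIN = [[1.0, 2.0], [3.0, 6.0]].freeze

begin
  ToyCenterer.new.transform(TRAIN)
rescue RuntimeError => e
  puts e.message # Model must be fitted before transform. Call fit first.
end

p ToyCenterer.new.fit(TRAIN).transform([[2.0, 4.0]]) # [[0.0, 0.0]]
```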