Class: ClusterKit::HNSW
- Inherits:
-
Object
- Object
- ClusterKit::HNSW
- Defined in:
- lib/clusterkit/hnsw.rb
Overview
HNSW (Hierarchical Navigable Small World) index for fast approximate nearest neighbor search
Defined Under Namespace
Classes: Builder
Class Method Summary collapse
-
.from_embedding(embeddings, **kwargs) ⇒ HNSW
Create an index from embeddings produced by UMAP or other dimensionality reduction.
Instance Method Summary collapse
-
#<<(vector) ⇒ self
Add a vector using the << operator.
-
#batch_search(queries, k: 10, parallel: true) ⇒ Array<Array>
Batch search for multiple queries.
-
#clear! ⇒ self
Clear all elements from the index.
-
#empty? ⇒ Boolean
Check if index is empty.
-
#fit(data, labels: nil) ⇒ self
Fit the index with training data (alias for add_batch).
-
#fit_transform(data) ⇒ self
Fit and return transformed data (for compatibility with sklearn-like interface).
-
#include?(label) ⇒ Boolean
Check if a label exists in the index.
-
#knn_query(query, k: 10, ef: nil) ⇒ Array<Array>
Alias for search that always includes distances.
-
#range_search(query, radius:, limit: nil) ⇒ Array<Hash>
Range search - find all points within a given radius.
-
#recall(test_queries, ground_truth, k: 10) ⇒ Float
Get recall rate for a test set.
Class Method Details
.from_embedding(embeddings, **kwargs) ⇒ HNSW
Create an index from embeddings produced by UMAP or other dimensionality reduction
179 180 181 182 183 184 185 186 |
# File 'lib/clusterkit/hnsw.rb', line 179 def self.(, **kwargs) = ensure_array() dim = .first.size index = new(dim: dim, **kwargs) index.fit() index end |
Instance Method Details
#<<(vector) ⇒ self
Add a vector using the << operator
65 66 67 68 |
# File 'lib/clusterkit/hnsw.rb', line 65 def <<(vector) add_item(vector, {}) self end |
#batch_search(queries, k: 10, parallel: true) ⇒ Array<Array>
Batch search for multiple queries
86 87 88 89 90 91 92 93 94 95 96 97 98 |
# File 'lib/clusterkit/hnsw.rb', line 86 def batch_search(queries, k: 10, parallel: true) queries = ensure_array(queries) if parallel && queries.size > 1 require 'parallel' Parallel.map(queries) { |query| search(query, k: k) } else queries.map { |query| search(query, k: k) } end rescue LoadError # Parallel gem not available, fall back to sequential queries.map { |query| search(query, k: k) } end |
#clear! ⇒ self
Clear all elements from the index
127 128 129 130 |
# File 'lib/clusterkit/hnsw.rb', line 127 def clear! # Would need to recreate the index raise NotImplementedError, "Clear not yet implemented" end |
#empty? ⇒ Boolean
Check if index is empty
120 121 122 |
# File 'lib/clusterkit/hnsw.rb', line 120 def empty? size == 0 end |
#fit(data, labels: nil) ⇒ self
Fit the index with training data (alias for add_batch)
47 48 49 50 |
# File 'lib/clusterkit/hnsw.rb', line 47 def fit(data, labels: nil) add_batch(data, labels: labels) self end |
#fit_transform(data) ⇒ self
Fit and return transformed data (for compatibility with sklearn-like interface)
56 57 58 59 |
# File 'lib/clusterkit/hnsw.rb', line 56 def fit_transform(data) fit(data) self end |
#include?(label) ⇒ Boolean
Check if a label exists in the index
136 137 138 139 140 |
# File 'lib/clusterkit/hnsw.rb', line 136 def include?(label) # This would need to be implemented in Rust # For now, return false false end |
#knn_query(query, k: 10, ef: nil) ⇒ Array<Array>
Alias for search that always includes distances
76 77 78 |
# File 'lib/clusterkit/hnsw.rb', line 76 def knn_query(query, k: 10, ef: nil) search(query, k: k, ef: ef, include_distances: true) end |
#range_search(query, radius:, limit: nil) ⇒ Array<Hash>
Range search - find all points within a given radius
106 107 108 109 110 111 112 113 114 115 116 |
# File 'lib/clusterkit/hnsw.rb', line 106 def range_search(query, radius:, limit: nil) # Get a large number of candidates k = limit || size k = [k, size].min results = (query, k: k) # Filter by radius results.select { |r| r[:distance] <= radius } .take(limit || results.size) end |
#recall(test_queries, ground_truth, k: 10) ⇒ Float
Get recall rate for a test set
148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 |
# File 'lib/clusterkit/hnsw.rb', line 148 def recall(test_queries, ground_truth, k: 10) test_queries = ensure_array(test_queries) require 'set' total_correct = 0 total_possible = 0 test_queries.each_with_index do |query, i| predicted = Set.new(search(query, k: k)) actual = Set.new(ground_truth[i].take(k)) total_correct += (predicted & actual).size total_possible += [k, actual.size].min end total_possible > 0 ? total_correct.to_f / total_possible : 0.0 end |