Class: ClusterKit::HNSW

Inherits:
Object
  • Object
show all
Defined in:
lib/clusterkit/hnsw.rb

Overview

HNSW (Hierarchical Navigable Small World) index for fast approximate nearest neighbor search

Examples:

Basic usage

index = ClusterKit::HNSW.new(dim: 128, space: :euclidean)
index.add_batch(vectors, labels: labels)
neighbors = index.search(query_vector, k: 10)

With metadata

index = ClusterKit::HNSW.new(dim: 768, space: :cosine)
index.add_item(vector, label: "doc_1", metadata: { title: "Introduction", date: "2024-01-01" })
results = index.(query, k: 5)
# => [{ label: "doc_1", distance: 0.23, metadata: { title: "...", date: "..." } }, ...]

Defined Under Namespace

Classes: Builder

Class Method Summary collapse

Instance Method Summary collapse

Class Method Details

.from_embedding(embeddings, **kwargs) ⇒ HNSW

Create an index from embeddings produced by UMAP or other dimensionality reduction

Parameters:

  • embeddings (Array<Array>, Numo::NArray)

    Embedding vectors

  • kwargs (Hash)

    Additional options for HNSW initialization

Returns:

  • (HNSW)

    New HNSW instance



179
180
181
182
183
184
185
186
# File 'lib/clusterkit/hnsw.rb', line 179

def self.from_embedding(embeddings, **kwargs)
  embeddings = ensure_array(embeddings)
  
  dim = embeddings.first.size
  index = new(dim: dim, **kwargs)
  index.fit(embeddings)
  index
end

Instance Method Details

#<<(vector) ⇒ self

Add a vector using the << operator

Parameters:

  • vector (Array, Numo::NArray)

    Vector to add

Returns:

  • (self)


65
66
67
68
# File 'lib/clusterkit/hnsw.rb', line 65

def <<(vector)
  add_item(vector, {})
  self
end

#batch_search(queries, k: 10, parallel: true) ⇒ Array<Array>

Batch search for multiple queries

Parameters:

  • queries (Array<Array>, Numo::NArray)

    Multiple query vectors

  • k (Integer) (defaults to: 10)

    Number of neighbors per query

  • parallel (Boolean) (defaults to: true)

    Process queries in parallel

Returns:

  • (Array<Array>)

    Results for each query



86
87
88
89
90
91
92
93
94
95
96
97
98
# File 'lib/clusterkit/hnsw.rb', line 86

def batch_search(queries, k: 10, parallel: true)
  queries = ensure_array(queries)
  
  if parallel && queries.size > 1
    require 'parallel'
    Parallel.map(queries) { |query| search(query, k: k) }
  else
    queries.map { |query| search(query, k: k) }
  end
rescue LoadError
  # Parallel gem not available, fall back to sequential
  queries.map { |query| search(query, k: k) }
end

#clear!self

Clear all elements from the index

Returns:

  • (self)

Raises:

  • (NotImplementedError)


127
128
129
130
# File 'lib/clusterkit/hnsw.rb', line 127

def clear!
  # Would need to recreate the index
  raise NotImplementedError, "Clear not yet implemented"
end

#empty?Boolean

Check if index is empty

Returns:

  • (Boolean)


120
121
122
# File 'lib/clusterkit/hnsw.rb', line 120

def empty?
  size == 0
end

#fit(data, labels: nil) ⇒ self

Fit the index with training data (alias for add_batch)

Parameters:

  • data (Array<Array>, Numo::NArray)

    Training vectors

  • labels (Array, nil) (defaults to: nil)

    Optional labels for vectors

Returns:

  • (self)


47
48
49
50
# File 'lib/clusterkit/hnsw.rb', line 47

def fit(data, labels: nil)
  add_batch(data, labels: labels)
  self
end

#fit_transform(data) ⇒ self

Fit and return transformed data (for compatibility with sklearn-like interface)

Parameters:

  • data (Array<Array>, Numo::NArray)

    Training vectors

Returns:

  • (self)


56
57
58
59
# File 'lib/clusterkit/hnsw.rb', line 56

def fit_transform(data)
  fit(data)
  self
end

#include?(label) ⇒ Boolean

Check if a label exists in the index

Parameters:

  • label (String, Integer)

    Label to check

Returns:

  • (Boolean)


136
137
138
139
140
# File 'lib/clusterkit/hnsw.rb', line 136

def include?(label)
  # This would need to be implemented in Rust
  # For now, return false
  false
end

#knn_query(query, k: 10, ef: nil) ⇒ Array<Array>

Alias for search that always includes distances

Parameters:

  • query (Array, Numo::NArray)

    Query vector

  • k (Integer) (defaults to: 10)

    Number of neighbors

  • ef (Integer, nil) (defaults to: nil)

    Search parameter (higher = better quality, slower)

Returns:

  • (Array<Array>)

    Array of [indices, distances]



76
77
78
# File 'lib/clusterkit/hnsw.rb', line 76

def knn_query(query, k: 10, ef: nil)
  search(query, k: k, ef: ef, include_distances: true)
end

#range_search(query, radius:, limit: nil) ⇒ Array<Hash>

Range search - find all points within a given radius

Parameters:

  • query (Array, Numo::NArray)

    Query vector

  • radius (Float)

    Search radius

  • limit (Integer, nil) (defaults to: nil)

    Maximum number of results

Returns:

  • (Array<Hash>)

    Results within radius



106
107
108
109
110
111
112
113
114
115
116
# File 'lib/clusterkit/hnsw.rb', line 106

def range_search(query, radius:, limit: nil)
  # Get a large number of candidates
  k = limit || size
  k = [k, size].min
  
  results = (query, k: k)
  
  # Filter by radius
  results.select { |r| r[:distance] <= radius }
         .take(limit || results.size)
end

#recall(test_queries, ground_truth, k: 10) ⇒ Float

Get recall rate for a test set

Parameters:

  • test_queries (Array<Array>)

    Query vectors

  • ground_truth (Array<Array>)

    True nearest neighbors for each query

  • k (Integer) (defaults to: 10)

    Number of neighbors to evaluate

Returns:

  • (Float)

    Recall rate (0.0 to 1.0)



148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
# File 'lib/clusterkit/hnsw.rb', line 148

def recall(test_queries, ground_truth, k: 10)
  test_queries = ensure_array(test_queries)
  
  require 'set'
  total_correct = 0
  total_possible = 0
  
  test_queries.each_with_index do |query, i|
    predicted = Set.new(search(query, k: k))
    actual = Set.new(ground_truth[i].take(k))
    
    total_correct += (predicted & actual).size
    total_possible += [k, actual.size].min
  end
  
  total_possible > 0 ? total_correct.to_f / total_possible : 0.0
end