Class: ClusterKit::HNSW

Inherits:

Object

Object
ClusterKit::HNSW

show all

Defined in:: lib/clusterkit/hnsw.rb

Overview

HNSW (Hierarchical Navigable Small World) index for fast approximate nearest neighbor search

Examples:

Basic usage

index = ClusterKit::HNSW.new(dim: 128, space: :euclidean)
index.add_batch(vectors, labels: labels)
neighbors = index.search(query_vector, k: 10)

With metadata

index = ClusterKit::HNSW.new(dim: 768, space: :cosine)
index.add_item(vector, label: "doc_1", metadata: { title: "Introduction", date: "2024-01-01" })
results = index.search_with_metadata(query, k: 5)
# => [{ label: "doc_1", distance: 0.23, metadata: { title: "...", date: "..." } }, ...]

Defined Under Namespace

Classes: Builder

Class Method Summary collapse

.from_embedding(embeddings, **kwargs) ⇒ HNSW

Create an index from embeddings produced by UMAP or other dimensionality reduction.

Instance Method Summary collapse

#<<(vector) ⇒ self

Add a vector using the << operator.
#batch_search(queries, k: 10, parallel: true) ⇒ Array<Array>

Batch search for multiple queries.
#clear! ⇒ self

Clear all elements from the index.
#empty? ⇒ Boolean

Check if index is empty.
#fit(data, labels: nil) ⇒ self

Fit the index with training data (alias for add_batch).
#fit_transform(data) ⇒ self

Fit and return transformed data (for compatibility with sklearn-like interface).
#include?(label) ⇒ Boolean

Check if a label exists in the index.
#knn_query(query, k: 10, ef: nil) ⇒ Array<Array>

Alias for search that always includes distances.
#range_search(query, radius:, limit: nil) ⇒ Array<Hash>

Range search - find all points within a given radius.
#recall(test_queries, ground_truth, k: 10) ⇒ Float

Get recall rate for a test set.

Class Method Details

.from_embedding(embeddings, **kwargs) ⇒ `HNSW`

Create an index from embeddings produced by UMAP or other dimensionality reduction

Parameters:

embeddings (Array<Array>, Numo::NArray) —

Embedding vectors
kwargs (Hash) —

Additional options for HNSW initialization

Returns:

(HNSW) —

New HNSW instance

# File 'lib/clusterkit/hnsw.rb', line 179

def self.from_embedding(embeddings, **kwargs)
  embeddings = ensure_array(embeddings)
  
  dim = embeddings.first.size
  index = new(dim: dim, **kwargs)
  index.fit(embeddings)
  index
end

Instance Method Details

#<<(vector) ⇒ `self`

Add a vector using the << operator

Parameters:

vector (Array, Numo::NArray) —

Vector to add

Returns:

(self)

# File 'lib/clusterkit/hnsw.rb', line 65

def <<(vector)
  add_item(vector, {})
  self
end

#batch_search(queries, k: 10, parallel: true) ⇒ `Array<Array>`

Batch search for multiple queries

Parameters:

queries (Array<Array>, Numo::NArray) —

Multiple query vectors
k (Integer) (defaults to: 10) —

Number of neighbors per query
parallel (Boolean) (defaults to: true) —

Process queries in parallel

Returns:

(Array<Array>) —

Results for each query

# File 'lib/clusterkit/hnsw.rb', line 86

def batch_search(queries, k: 10, parallel: true)
  queries = ensure_array(queries)
  
  if parallel && queries.size > 1
    require 'parallel'
    Parallel.map(queries) { |query| search(query, k: k) }
  else
    queries.map { |query| search(query, k: k) }
  end
rescue LoadError
  # Parallel gem not available, fall back to sequential
  queries.map { |query| search(query, k: k) }
end

#clear! ⇒ `self`

Clear all elements from the index

Returns:

(self)

Raises:

(NotImplementedError)

# File 'lib/clusterkit/hnsw.rb', line 127

def clear!
  # Would need to recreate the index
  raise NotImplementedError, "Clear not yet implemented"
end

#empty? ⇒ `Boolean`

Check if index is empty

Returns:

(Boolean)



120
121
122

# File 'lib/clusterkit/hnsw.rb', line 120

def empty?
  size == 0
end

#fit(data, labels: nil) ⇒ `self`

Fit the index with training data (alias for add_batch)

Parameters:

data (Array<Array>, Numo::NArray) —

Training vectors
labels (Array, nil) (defaults to: nil) —

Optional labels for vectors

Returns:

(self)

# File 'lib/clusterkit/hnsw.rb', line 47

def fit(data, labels: nil)
  add_batch(data, labels: labels)
  self
end

#fit_transform(data) ⇒ `self`

Fit and return transformed data (for compatibility with sklearn-like interface)

Parameters:

data (Array<Array>, Numo::NArray) —

Training vectors

Returns:

(self)

# File 'lib/clusterkit/hnsw.rb', line 56

def fit_transform(data)
  fit(data)
  self
end

#include?(label) ⇒ `Boolean`

Check if a label exists in the index

Parameters:

label (String, Integer) —

Label to check

Returns:

(Boolean)

# File 'lib/clusterkit/hnsw.rb', line 136

def include?(label)
  # This would need to be implemented in Rust
  # For now, return false
  false
end

#knn_query(query, k: 10, ef: nil) ⇒ `Array<Array>`

Alias for search that always includes distances

Parameters:

query (Array, Numo::NArray) —

Query vector
k (Integer) (defaults to: 10) —

Number of neighbors
ef (Integer, nil) (defaults to: nil) —

Search parameter (higher = better quality, slower)

Returns:

(Array<Array>) —

Array of [indices, distances]



76
77
78

# File 'lib/clusterkit/hnsw.rb', line 76

def knn_query(query, k: 10, ef: nil)
  search(query, k: k, ef: ef, include_distances: true)
end

#range_search(query, radius:, limit: nil) ⇒ `Array<Hash>`

Range search - find all points within a given radius

Parameters:

query (Array, Numo::NArray) —

Query vector
radius (Float) —

Search radius
limit (Integer, nil) (defaults to: nil) —

Maximum number of results

Returns:

(Array<Hash>) —

Results within radius

# File 'lib/clusterkit/hnsw.rb', line 106

def range_search(query, radius:, limit: nil)
  # Get a large number of candidates
  k = limit || size
  k = [k, size].min
  
  results = search_with_metadata(query, k: k)
  
  # Filter by radius
  results.select { |r| r[:distance] <= radius }
         .take(limit || results.size)
end

#recall(test_queries, ground_truth, k: 10) ⇒ `Float`

Get recall rate for a test set

Parameters:

test_queries (Array<Array>) —

Query vectors
ground_truth (Array<Array>) —

True nearest neighbors for each query
k (Integer) (defaults to: 10) —

Number of neighbors to evaluate

Returns:

(Float) —

Recall rate (0.0 to 1.0)

# File 'lib/clusterkit/hnsw.rb', line 148

def recall(test_queries, ground_truth, k: 10)
  test_queries = ensure_array(test_queries)
  
  require 'set'
  total_correct = 0
  total_possible = 0
  
  test_queries.each_with_index do |query, i|
    predicted = Set.new(search(query, k: k))
    actual = Set.new(ground_truth[i].take(k))
    
    total_correct += (predicted & actual).size
    total_possible += [k, actual.size].min
  end
  
  total_possible > 0 ? total_correct.to_f / total_possible : 0.0
end

Class: ClusterKit::HNSW

Overview

Examples:

Basic usage

With metadata

Defined Under Namespace

Class Method Summary collapse

Instance Method Summary collapse

Class Method Details

.from_embedding(embeddings, **kwargs) ⇒ HNSW

Instance Method Details

#<<(vector) ⇒ self

#batch_search(queries, k: 10, parallel: true) ⇒ Array<Array>

#clear! ⇒ self

#empty? ⇒ Boolean

#fit(data, labels: nil) ⇒ self

#fit_transform(data) ⇒ self

#include?(label) ⇒ Boolean

#knn_query(query, k: 10, ef: nil) ⇒ Array<Array>

#range_search(query, radius:, limit: nil) ⇒ Array<Hash>

#recall(test_queries, ground_truth, k: 10) ⇒ Float

.from_embedding(embeddings, **kwargs) ⇒ `HNSW`

#<<(vector) ⇒ `self`

#batch_search(queries, k: 10, parallel: true) ⇒ `Array<Array>`

#clear! ⇒ `self`

#empty? ⇒ `Boolean`

#fit(data, labels: nil) ⇒ `self`

#fit_transform(data) ⇒ `self`

#include?(label) ⇒ `Boolean`

#knn_query(query, k: 10, ef: nil) ⇒ `Array<Array>`

#range_search(query, radius:, limit: nil) ⇒ `Array<Hash>`

#recall(test_queries, ground_truth, k: 10) ⇒ `Float`