Class: Minhash::Minhash

Inherits:
Object
Defined in:
lib/doc_sim/minhash.rb

Overview

Generates Minhash signatures for sets of strings.

Constant Summary

HASH_MAX = (2**32) + 1

Hashes will always be <= 2**32

Constructor Details

#initialize(n_hashes = 1, seed_root = rand(2**32)) ⇒ Minhash

Returns a new instance of Minhash.



# File 'lib/doc_sim/minhash.rb', line 13

def initialize(n_hashes = 1, seed_root = rand(2**32))
  @seed_root = seed_root
  @hashes = Array.new(n_hashes) do |seed|
    ->(x) { MurmurHash3::V32.str_hash(x, seed_root + seed) }
  end
end
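
The constructor derives one hash function per index by offsetting a single seed_root. A minimal sketch of that idea follows; it is an illustration, not the library's code. The names seeded_hash and build_hashes are hypothetical, and Digest::MD5 stands in for MurmurHash3::V32.str_hash only so that the snippet runs with the standard library.

```ruby
require "digest"

# Hypothetical stand-in for MurmurHash3::V32.str_hash (assumption):
# takes the first 4 bytes of an MD5 digest of "seed:string" as an
# unsigned 32-bit integer.
def seeded_hash(str, seed)
  Digest::MD5.digest("#{seed}:#{str}").unpack1("N")
end

# Builds n_hashes lambdas; the i-th one hashes with seed seed_root + i,
# following the same pattern as the constructor above.
def build_hashes(n_hashes, seed_root)
  Array.new(n_hashes) do |i|
    ->(x) { seeded_hash(x, seed_root + i) }
  end
end

hashes = build_hashes(3, 42)
values = hashes.map { |h| h.call("shingle") }
```

Because every seed is computed from seed_root, two instances built with the same seed_root and n_hashes end up with identical hash functions, which is what makes their signatures comparable.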

Instance Attribute Details

#seed_root ⇒ Object (readonly)

Returns the value of attribute seed_root.



# File 'lib/doc_sim/minhash.rb', line 8

def seed_root
  @seed_root
end

Instance Method Details

#signature(set) ⇒ Array[Integer]

Produces the Minhash signature for a given Set

Parameters:

  • set (Set[String])

    the set to produce the signature for

Returns:

  • (Array[Integer])

    array of 32-bit integers of length n_hashes, holding the minimum hash value per hash function



# File 'lib/doc_sim/minhash.rb', line 25

def signature(set)
  counter = Array.new(@hashes.length, Minhash::HASH_MAX)
  set.each do |elem|
    @hashes.each_with_index do |hash_func, i|
      counter[i] = [counter[i], hash_func.call(elem)].min
    end
  end
  counter
end
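
Minhash signatures are typically compared position by position: the fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying sets. The sketch below shows this under stated assumptions. SketchMinhash and estimated_jaccard are hypothetical names that are not part of this library, and Digest::MD5 replaces MurmurHash3::V32.str_hash so the example needs no extra gems.

```ruby
require "digest"
require "set"

# Hypothetical stand-in that mirrors the class documented above.
class SketchMinhash
  HASH_MAX = (2**32) + 1

  def initialize(n_hashes = 1, seed_root = rand(2**32))
    @hashes = Array.new(n_hashes) do |i|
      # Assumption: MD5-based 32-bit hash instead of MurmurHash3.
      ->(x) { Digest::MD5.digest("#{seed_root + i}:#{x}").unpack1("N") }
    end
  end

  # Keeps, for each hash function, the smallest hash seen over the set.
  def signature(set)
    mins = Array.new(@hashes.length, HASH_MAX)
    set.each do |elem|
      @hashes.each_with_index do |hash_func, i|
        mins[i] = [mins[i], hash_func.call(elem)].min
      end
    end
    mins
  end
end

# Fraction of positions where both signatures hold the same value.
def estimated_jaccard(sig_a, sig_b)
  matches = sig_a.zip(sig_b).count { |a, b| a == b }
  matches.to_f / sig_a.length
end

minhash = SketchMinhash.new(64, 7)
sig_a = minhash.signature(Set.new(%w[the quick brown fox]))
sig_b = minhash.signature(Set.new(%w[the quick brown dog]))
estimate = estimated_jaccard(sig_a, sig_b)
```

Both signatures must come from the same instance, or from instances sharing seed_root and n_hashes. Note that an empty set leaves every position at HASH_MAX, since no element ever lowers the initial value.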