Class: Minhash::Algorithm

Inherits:
Object
Defined in:
lib/minhash.rb

Overview

The Minhash signature algorithm.

See section 3.3 of the book Mining of Massive Datasets (www.mmds.org): infolab.stanford.edu/~ullman/mmds/ch3.pdf

Each hash function is a simple XOR of its input with a random integer bit mask.
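
As a rough illustration of the XOR scheme, the sketch below builds two hash functions from fixed masks and takes the minimum hashed value per function. The mask values and the token list are made up for the example; only the `i ^ mask` form comes from the source shown on this page. XOR with a fixed mask maps distinct integers to distinct integers, so each mask reorders the tokens rather than colliding them.

```ruby
# Illustrative masks; the library draws these at random.
masks = [0b1010, 0b0110]

# One hash function per mask: h(i) = i ^ mask.
hash_functions = masks.map { |mask| ->(i) { i ^ mask } }

# Tokens must be integers, because they are XORed with the masks.
tokens = [1, 4, 9]

# Hash every token with every function.
hashed = hash_functions.map { |f| tokens.map(&f) }
# => [[11, 14, 3], [7, 2, 15]]

# The signature keeps only the minimum per hash function.
minimums = hashed.map(&:min)
# => [3, 2]
```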

Constructor Details

#initialize(masks) ⇒ Algorithm

Creates a new instance of the algorithm, with the given bit masks.



# File 'lib/minhash.rb', line 90

def initialize(masks)
  @masks = masks.freeze
  @hash_functions ||= @masks.map {|mask| lambda {|i| i ^ mask } }
end

Instance Attribute Details

#masks ⇒ Object (readonly)

Returns the bit masks used to implement the hash functions.



# File 'lib/minhash.rb', line 86

def masks
  @masks
end

Class Method Details

.create(length) ⇒ Object

Creates a new instance of the algorithm with the given number (length) of random bit masks.



# File 'lib/minhash.rb', line 97

def self.create(length)
  new length.times.map { rand(2 ** 32 -1) }
end
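
Signatures are only comparable when they were computed with the same masks, so it can help to generate masks reproducibly or to store the masks of an existing instance. The sketch below is one possible approach, not part of the library: `random_masks` is a hypothetical helper that mirrors the range used by `create` but draws from an explicitly seeded `Random` instance instead of `Kernel#rand`.

```ruby
# Hypothetical helper (not part of the library): draws `length` masks from
# the same range as `create`, using a caller-supplied random generator.
def random_masks(length, rng = Random.new)
  Array.new(length) { rng.rand(2**32 - 1) }
end

# The same seed yields the same masks, and therefore comparable signatures.
masks_a = random_masks(4, Random.new(42))
masks_b = random_masks(4, Random.new(42))
```

The resulting array could then be passed to the constructor documented above in place of calling `create`.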

Instance Method Details

#signature(tokens) ⇒ Object

Returns the minhash signature for a set of integer tokens: one value per hash function, namely the minimum hash over all tokens.



# File 'lib/minhash.rb', line 102

def signature(tokens)
  @hash_functions.map {|f| tokens.map(&f).min }
end
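
A typical use of such signatures is estimating the Jaccard similarity of two sets as the fraction of signature positions that agree. The sketch below assumes a stand-in class, `SketchMinhash`, that copies the constructor and `signature` method shown on this page so the example runs without the gem; `estimated_similarity` is a hypothetical helper, and the token sets are made up. Because XOR masks are very simple hash functions, the estimate may deviate noticeably from the true similarity.

```ruby
# Stand-in that mirrors the documented class (an assumption for this example).
class SketchMinhash
  attr_reader :masks

  def initialize(masks)
    @masks = masks.freeze
    @hash_functions = @masks.map { |mask| ->(i) { i ^ mask } }
  end

  # One minimum per hash function.
  def signature(tokens)
    @hash_functions.map { |f| tokens.map(&f).min }
  end
end

# Hypothetical helper: fraction of positions where two signatures agree.
def estimated_similarity(sig_a, sig_b)
  matches = sig_a.zip(sig_b).count { |x, y| x == y }
  matches.to_f / sig_a.length
end

rng = Random.new(1)
algorithm = SketchMinhash.new(Array.new(200) { rng.rand(2**32 - 1) })

set_a = [1, 2, 3, 4, 5, 6, 7, 8]
set_b = [5, 6, 7, 8, 9, 10, 11, 12]

sig_a = algorithm.signature(set_a)
sig_b = algorithm.signature(set_b)

# Both signatures must come from the same instance (same masks).
similarity = estimated_similarity(sig_a, sig_b)
```

Longer signatures cost more time and memory per set; with well-behaved hash functions they generally tighten the estimate.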