Module: Simhilarity::Bits

Defined in:
lib/simhilarity/bits.rb

Constant Summary collapse

HAMMING8 =
(0..0xff).map { |i| Bits.hamming(0, i) }
HAMMING16 =
(0..0xffff).map { |i| HAMMING8[(i >> 8) & 0xff] + HAMMING8[(i >> 0) & 0xff] }

Class Method Summary collapse

Class Method Details

.hamming(a, b) ⇒ Object

Calculate the hamming distance between two integers. Not particularly fast.



8
9
10
11
12
13
14
15
# File 'lib/simhilarity/bits.rb', line 8

def self.hamming(a, b)
  x, d = 0, a ^ b
  while d > 0
    x += 1
    d &= d - 1
  end
  x
end

.hamming32(a, b) ⇒ Object

Calculate the hamming distance between two 32 bit integers using a lookup table. This is fast.



23
24
25
26
27
28
# File 'lib/simhilarity/bits.rb', line 23

def self.hamming32(a, b)
  x = a ^ b
  a = (x >> 16) & 0xffff
  b = (x >>  0) & 0xffff
  HAMMING16[a] + HAMMING16[b]
end

.nhash(ngram) ⇒ Object

can’t rely on ruby hash, because it’s not consistent across sessions. Let’s just use MD5.



32
33
34
35
# File 'lib/simhilarity/bits.rb', line 32

def self.nhash(ngram)
  @hashes ||= { }
  @hashes[ngram] ||= Digest::MD5.hexdigest(ngram).to_i(16)
end

.simhash32(freq, ngrams) ⇒ Object

Calculate the frequency weighted simhash of the ngrams.



40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
# File 'lib/simhilarity/bits.rb', line 40

def self.simhash32(freq, ngrams)
  # array of bit sums
  bits = Array.new(32, 0)

  # walk bits of ngram's hash, increase/decrease bit sums
  ngrams.each do |ngram|
    f = freq[ngram]
    hash = nhash(ngram)
    (0...32).each do |i|
      bits[i] += (((hash >> i) & 1) == 1) ? f : -f
    end
  end

  # calculate simhash based on whether bit sums are negative or
  # positive
  simhash = 0
  (0...32).each do |bit|
    simhash |= (1 << bit) if bits[bit] > 0
  end
  simhash
end