Module: Simhilarity::Bits
- Defined in:
- lib/simhilarity/bits.rb
Constant Summary collapse
- HAMMING8 =
(0..0xff).map { |i| Bits.hamming(0, i) }
- HAMMING16 =
(0..0xffff).map { |i| HAMMING8[(i >> 8) & 0xff] + HAMMING8[(i >> 0) & 0xff] }
Class Method Summary collapse
-
.hamming(a, b) ⇒ Object
Calculate the hamming distance between two integers.
-
.hamming32(a, b) ⇒ Object
Calculate the hamming distance between two 32 bit integers using a lookup table.
-
.nhash(ngram) ⇒ Object
can’t rely on ruby hash, because it’s not consistent across sessions.
-
.simhash32(freq, ngrams) ⇒ Object
Calculate the frequency weighted simhash of the
ngrams
.
Class Method Details
.hamming(a, b) ⇒ Object
Calculate the hamming distance between two integers. Not particularly fast.
8 9 10 11 12 13 14 15 |
# File 'lib/simhilarity/bits.rb', line 8 def self.hamming(a, b) x, d = 0, a ^ b while d > 0 x += 1 d &= d - 1 end x end |
.hamming32(a, b) ⇒ Object
Calculate the hamming distance between two 32 bit integers using a lookup table. This is fast.
23 24 25 26 27 28 |
# File 'lib/simhilarity/bits.rb', line 23 def self.hamming32(a, b) x = a ^ b a = (x >> 16) & 0xffff b = (x >> 0) & 0xffff HAMMING16[a] + HAMMING16[b] end |
.nhash(ngram) ⇒ Object
can’t rely on ruby hash, because it’s not consistent across sessions. Let’s just use MD5.
32 33 34 35 |
# File 'lib/simhilarity/bits.rb', line 32 def self.nhash(ngram) @hashes ||= { } @hashes[ngram] ||= Digest::MD5.hexdigest(ngram).to_i(16) end |
.simhash32(freq, ngrams) ⇒ Object
Calculate the frequency weighted simhash of the ngrams
.
40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 |
# File 'lib/simhilarity/bits.rb', line 40 def self.simhash32(freq, ngrams) # array of bit sums bits = Array.new(32, 0) # walk bits of ngram's hash, increase/decrease bit sums ngrams.each do |ngram| f = freq[ngram] hash = nhash(ngram) (0...32).each do |i| bits[i] += (((hash >> i) & 1) == 1) ? f : -f end end # calculate simhash based on whether bit sums are negative or # positive simhash = 0 (0...32).each do |bit| simhash |= (1 << bit) if bits[bit] > 0 end simhash end |