Module: Minhash

Defined in:
lib/minhash.rb

Defined Under Namespace

Modules: StringExtensions Classes: Algorithm

Class Method Summary collapse

Class Method Details

.approximate_similarity(a, b) ⇒ Object

Returns the approximate jaccard similarity of 2 sets, given their signatures.



56
57
58
# File 'lib/minhash.rb', line 56

def self.approximate_similarity(a, b)
  a.length.times.select {|i| a[i] == b[i] }.size / a.length.to_f
end

.jaccard_similarity(a, b) ⇒ Object

Returns the jaccard similarity between two sets of shingles / tokens.



50
51
52
# File 'lib/minhash.rb', line 50

def self.jaccard_similarity(a, b)
  (a & b).size / (a | b).size.to_f
end

.k_shingles(text, k) ⇒ Object

Generates the k-shingles for a String.

For example, with the string “12345” and k = 3 the shingles are “123”, “234” and “345”.

k_shingles("ABCDE", 3)
# => ["ABC", "BCD", "CDE"]

Normalizes all whitespace to be single character spaces.

You want to select k to be large enough to be able to distinguish chunks of text adequately - a value of around 9 is sensible for reasonable sized chunks of standard written text.



18
19
20
21
22
23
24
25
26
27
28
# File 'lib/minhash.rb', line 18

def self.k_shingles(text, k)
  if text.size < k
    [text]
  else
    text.tr("\t\n\r", "   ").
      squeeze(" ").
      each_char.
      each_cons(k).
      map(&:join)
  end
end

.tokenized_k_shingles(text, k, &hash) ⇒ Object

Generates a Set of tokenized 32-bit integer k-shingles.

tokenized_k_shingles("ABCDE", 3)
# =>  #<Set: {1136772405, 3561005568, 944681077}>

MurmurHash3 is used by default; if you want to use a different hash function, pass a block:

tokenized_k_shingles("ABCDE", 3) do |shingle|
  # obviously a terrible hash function, just an example.
  shingle[0].ord
end
# => #<Set: {65, 66, 67}>


43
44
45
46
# File 'lib/minhash.rb', line 43

def self.tokenized_k_shingles(text, k, &hash)
  hash ||= lambda {|s| MurmurHash3::Native32.murmur3_32_str_hash(s) }
  k_shingles(text, k).map {|shingle| hash.call(shingle) }.to_set
end