Module: Minhash
- Defined in:
- lib/minhash.rb
Defined Under Namespace
Modules: StringExtensions Classes: Algorithm
Class Method Summary collapse
-
.approximate_similarity(a, b) ⇒ Object
Returns the approximate jaccard similarity of 2 sets, given their signatures.
-
.jaccard_similarity(a, b) ⇒ Object
Returns the jaccard similarity between two sets of shingles / tokens.
-
.k_shingles(text, k) ⇒ Object
Generates the k-shingles for a String.
-
.tokenized_k_shingles(text, k, &hash) ⇒ Object
Generates a Set of tokenized 32-bit integer k-shingles.
Class Method Details
.approximate_similarity(a, b) ⇒ Object
Returns the approximate jaccard similarity of 2 sets, given their signatures.
56 57 58 |
# File 'lib/minhash.rb', line 56 def self.approximate_similarity(a, b) a.length.times.select {|i| a[i] == b[i] }.size / a.length.to_f end |
.jaccard_similarity(a, b) ⇒ Object
Returns the jaccard similarity between two sets of shingles / tokens.
50 51 52 |
# File 'lib/minhash.rb', line 50 def self.jaccard_similarity(a, b) (a & b).size / (a | b).size.to_f end |
.k_shingles(text, k) ⇒ Object
Generates the k-shingles for a String.
For example, with the string “12345” and k = 3 the shingles are “123”, “234” and “345”.
k_shingles("ABCDE", 3)
# => ["ABC", "BCD", "CDE"]
Normalizes all whitespace to be single character spaces.
You want to select k to be large enough to be able to distinguish chunks of text adequately - a value of around 9 is sensible for reasonable sized chunks of standard written text.
18 19 20 21 22 23 24 25 26 27 28 |
# File 'lib/minhash.rb', line 18 def self.k_shingles(text, k) if text.size < k [text] else text.tr("\t\n\r", " "). squeeze(" "). each_char. each_cons(k). map(&:join) end end |
.tokenized_k_shingles(text, k, &hash) ⇒ Object
Generates a Set of tokenized 32-bit integer k-shingles.
tokenized_k_shingles("ABCDE", 3)
# => #<Set: {1136772405, 3561005568, 944681077}>
MurmurHash3 is used by default; if you want to use a different hash function, pass a block:
tokenized_k_shingles("ABCDE", 3) do |shingle|
# obviously a terrible hash function, just an example.
shingle[0].ord
end
# => #<Set: {65, 66, 67}>
43 44 45 46 |
# File 'lib/minhash.rb', line 43 def self.tokenized_k_shingles(text, k, &hash) hash ||= lambda {|s| MurmurHash3::Native32.murmur3_32_str_hash(s) } k_shingles(text, k).map {|shingle| hash.call(shingle) }.to_set end |