Class: Simhilarity::Bulk
Overview
Match a set of needles against a haystack, in bulk. For example, this is used if you want to match 50 new addresses against your database of 1,000 known addresses.
Constant Summary
- DEFAULT_NGRAM_OVERLAPS = 3
  Default minimum number of ngram overlaps with :ngrams.
- DEFAULT_SIMHASH_MAX_HAMMING = 7
  Default maximum Hamming distance with :simhash.
Instance Attribute Summary
Attributes inherited from Matcher
#freq, #ngrammer, #normalizer, #options, #reader
Instance Method Summary
- #initialize(options = {}) ⇒ Bulk (constructor)
  Initialize a new Bulk matcher.
- #matches(needles, haystack) ⇒ Object
  Match each item in needles to an item in haystack.
Methods inherited from Matcher
#corpus, #corpus=, #inspect, #ngrams, #ngrams_sum, #normalize, #read, #simhash
Constructor Details
#initialize(options = {}) ⇒ Bulk
Initialize a new Bulk matcher. See Matcher#initialize. Bulk adds these options:
- candidates: Specifies which method to use for finding candidates. See the README for more details.
- ngrams_overlaps: Minimum number of ngram overlaps, defaults to 3.
- simhash_max_hamming: Maximum simhash Hamming distance, defaults to 7.
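To give a feel for what the two thresholds measure, here is a minimal, self-contained sketch. It does not call into Simhilarity: the helpers `char_ngrams`, `ngram_overlaps`, and `hamming` are hypothetical names written for this illustration, and the gem's own ngram and simhash code may differ (including the ngram size, which is assumed to be 3 here).

```ruby
require "set"

# Hypothetical helper: the set of character ngrams of size n in a string.
def char_ngrams(str, n = 3)
  str.downcase.each_char.each_cons(n).map(&:join).to_set
end

# Hypothetical helper: number of distinct ngrams two strings share.
def ngram_overlaps(a, b, n = 3)
  (char_ngrams(a, n) & char_ngrams(b, n)).size
end

# Hypothetical helper: number of differing bits between two integer hashes.
def hamming(a, b)
  (a ^ b).to_s(2).count("1")
end

NGRAM_OVERLAPS = 3      # mirrors DEFAULT_NGRAM_OVERLAPS
SIMHASH_MAX_HAMMING = 7 # mirrors DEFAULT_SIMHASH_MAX_HAMMING

# Under an ngram rule, a pair is a candidate if it shares enough ngrams.
ngram_candidate = ngram_overlaps("123 Main St", "123 Main Street") >= NGRAM_OVERLAPS

# Under a simhash rule, a pair is a candidate if the hashes differ in few enough bits.
# The two integers below are made-up stand-ins for real simhash values.
simhash_candidate = hamming(0b1011_0110, 0b1011_0001) <= SIMHASH_MAX_HAMMING
```

Raising the overlap minimum or lowering the Hamming maximum makes candidate selection stricter, which would be expected to trade recall for speed.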
# File 'lib/simhilarity/bulk.rb', line 23
def initialize(options = {})
  super(options)
end
Instance Method Details
#matches(needles, haystack) ⇒ Object
Match each item in needles to an item in haystack. Returns an array of tuples, [needle, haystack, score]. Scores range from 0 to 1, with 1 being a perfect match and 0 being a terrible match.
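Because each result is a plain [needle, haystack, score] tuple, ordinary array destructuring is enough to post-process it. The sketch below uses invented sample data in that shape rather than real Simhilarity output, and `MIN_SCORE` is a hypothetical cutoff that would need tuning for real data.

```ruby
# Sample data in the documented shape: [needle, haystack, score].
# The values are invented for illustration, not real Simhilarity output.
results = [
  ["123 Main St", "123 Main Street", 0.92],
  ["9 Elm Rd", "90 Elm Road", 0.61],
  ["PO Box 7", "77 Ocean Ave", 0.08],
]

MIN_SCORE = 0.5 # hypothetical cutoff; tune for your data

# Split the tuples into those at or above the cutoff and those below it.
accepted, rejected = results.partition { |_needle, _match, score| score >= MIN_SCORE }

# Print the accepted matches, best score first.
accepted.sort_by { |_needle, _match, score| -score }.each do |needle, match, score|
  puts format("%-12s => %-16s (%.2f)", needle, match, score)
end
```

Keeping the rejected tuples around, rather than discarding them, makes it easy to review borderline matches by hand.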
# File 'lib/simhilarity/bulk.rb', line 31
def matches(needles, haystack)
  # create Elements
  if needles == haystack
    needles = haystack = import_list(needles)
    # set the corpus, to generate frequency weights
    self.corpus = needles
  else
    needles = import_list(needles)
    haystack = import_list(haystack)
    # set the corpus, to generate frequency weights
    self.corpus = (needles + haystack)
  end

  # get candidate matches
  candidates = candidates(needles, haystack)
  vputs " got #{candidates.length} candidates."

  # pick winners
  winners(needles, candidates)
end
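The method above works in two stages: first narrow the needle/haystack pairs down to candidates, then pick one winner per needle. The following toy stand-in illustrates that shape only. It is not Simhilarity's implementation: `trigrams`, `toy_candidates`, and `toy_winners` are hypothetical names, Jaccard similarity over trigrams is an assumed scoring rule, and returning `[needle, nil, nil]` for a needle with no candidates is a choice made for this sketch.

```ruby
require "set"

# Hypothetical helper: the set of character trigrams in a string.
def trigrams(str)
  str.downcase.each_char.each_cons(3).map(&:join).to_set
end

# Stage 1: keep only pairs sharing at least min_overlaps trigrams.
def toy_candidates(needles, haystack, min_overlaps = 3)
  needles.product(haystack).select do |n, h|
    (trigrams(n) & trigrams(h)).size >= min_overlaps
  end
end

# Stage 2: score each candidate pair (Jaccard similarity, an assumption)
# and keep the best-scoring haystack item for each needle.
def toy_winners(needles, candidates)
  needles.map do |needle|
    scored = candidates.select { |n, _h| n == needle }.map do |n, h|
      a = trigrams(n)
      b = trigrams(h)
      [n, h, (a & b).size.to_f / (a | b).size]
    end
    # Toy choice: a needle with no candidates gets nil match and nil score.
    scored.max_by { |_n, _h, score| score } || [needle, nil, nil]
  end
end

needles  = ["123 Main St", "77 Ocean Ave"]
haystack = ["123 Main Street", "77 Ocean Avenue", "9 Elm Road"]
results  = toy_winners(needles, toy_candidates(needles, haystack))
```

The point of the candidate stage is to avoid scoring every possible pair, which matters as the haystack grows.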