Class: Simhilarity::Bulk

Inherits:
Matcher
Defined in:
lib/simhilarity/bulk.rb

Overview

Match a set of needles against a haystack, in bulk. For example, use this to match 50 new addresses against a database of 1,000 known addresses.

Constant Summary

DEFAULT_NGRAM_OVERLAPS = 3

Default minimum number of ngram overlaps with :ngrams.

DEFAULT_SIMHASH_MAX_HAMMING = 7

Default maximum Hamming distance with :simhash.
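
To make the two thresholds concrete, here is a minimal sketch of what each one measures. It is not Simhilarity's implementation: the helper names (bigrams, ngram_overlaps, hamming, ngram_candidate?, simhash_candidate?) are hypothetical, character bigrams are assumed as the ngram unit, and the hashes are plain non-negative integers.

```ruby
# Illustration only; these helpers are hypothetical, not part of Simhilarity.
# The constants mirror the documented defaults.
DEFAULT_NGRAM_OVERLAPS = 3
DEFAULT_SIMHASH_MAX_HAMMING = 7

# Distinct character bigrams of a string (assumed ngram unit).
def bigrams(str)
  str.chars.each_cons(2).map(&:join).uniq
end

# Number of distinct bigrams the two strings share.
def ngram_overlaps(a, b)
  (bigrams(a) & bigrams(b)).length
end

# Number of differing bits between two non-negative integer hashes.
def hamming(a, b)
  (a ^ b).to_s(2).count("1")
end

# A pair qualifies when it shares at least `min` bigrams.
def ngram_candidate?(a, b, min = DEFAULT_NGRAM_OVERLAPS)
  ngram_overlaps(a, b) >= min
end

# A pair qualifies when the hashes differ in at most `max` bits.
def simhash_candidate?(hash_a, hash_b, max = DEFAULT_SIMHASH_MAX_HAMMING)
  hamming(hash_a, hash_b) <= max
end
```

Under these assumptions, raising the overlap minimum or lowering the Hamming maximum both shrink the candidate set.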

Instance Attribute Summary

Attributes inherited from Matcher

#freq, #ngrammer, #normalizer, #options, #reader

Instance Method Summary

Methods inherited from Matcher

#corpus, #corpus=, #inspect, #ngrams, #ngrams_sum, #normalize, #read, #simhash

Constructor Details

#initialize(options = {}) ⇒ Bulk

Initialize a new Bulk matcher. See Matcher#initialize. Bulk adds these options:

  • candidates: specifies which method to use for finding candidates. See the README for more details.

  • ngrams_overlaps: Minimum number of ngram overlaps, defaults to 3.

  • simhash_max_hamming: Maximum simhash hamming distance, defaults to 7.



# File 'lib/simhilarity/bulk.rb', line 23

def initialize(options = {})
  super(options)
end

Instance Method Details

#matches(needles, haystack) ⇒ Object

Match each item in needles to an item in haystack. Returns an array of tuples, [needle, haystack, score]. Scores range from 0 to 1, with 1 being a perfect match and 0 being a terrible match.
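
As a sketch of how the returned tuples might be consumed, the snippet below filters a result array by score. The sample array and the MIN_SCORE cutoff are invented for illustration; only the [needle, haystack, score] shape comes from the description above.

```ruby
# Made-up sample shaped like the documented return value:
# [needle, haystack, score] with scores between 0 and 1.
results = [
  ["12 Main St", "12 Main Street", 0.91],
  ["9 Elm Rd", "90 Elm Road", 0.48],
]

# Hypothetical cutoff; choose one that suits your data.
MIN_SCORE = 0.75

# Keep only the pairs whose score meets the cutoff.
confident = results.select { |_needle, _match, score| score >= MIN_SCORE }

confident.each do |needle, match, score|
  puts format("%-12s => %-16s (%.2f)", needle, match, score)
end
```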



# File 'lib/simhilarity/bulk.rb', line 31

def matches(needles, haystack)
  # create Elements
  if needles == haystack
    needles = haystack = import_list(needles)

    # set the corpus, to generate frequency weights
    self.corpus = needles
  else
    needles = import_list(needles)
    haystack = import_list(haystack)

    # set the corpus, to generate frequency weights
    self.corpus = (needles + haystack)
  end

  # get candidate matches
  candidates = candidates(needles, haystack)
  vputs " got #{candidates.length} candidates."

  # pick winners
  winners(needles, candidates)
end
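
The method above first narrows the haystack to candidates and then picks a winner per needle. A toy version of that two-step shape might look like the following. This is a stand-in under stated assumptions, not the gem's code: toy_bigrams, dice, and toy_matches are hypothetical names, the Dice coefficient over character bigrams replaces whatever scoring Simhilarity actually uses, and returning nil for an unmatched needle is a choice made for this sketch.

```ruby
# Toy stand-in for the candidates -> winners flow; not Simhilarity's code.

# Distinct lowercase character bigrams of a string.
def toy_bigrams(str)
  str.downcase.chars.each_cons(2).map(&:join).uniq
end

# Dice coefficient over bigram sets: 0.0 (nothing shared) to 1.0 (same set).
def dice(a, b)
  x = toy_bigrams(a)
  y = toy_bigrams(b)
  return 0.0 if x.empty? || y.empty?
  2.0 * (x & y).length / (x.length + y.length)
end

def toy_matches(needles, haystack, min_overlaps = 3)
  needles.map do |needle|
    # Candidate step: keep haystack items sharing enough bigrams.
    candidates = haystack.select do |hay|
      (toy_bigrams(needle) & toy_bigrams(hay)).length >= min_overlaps
    end
    # Winner step: best-scoring candidate, or nil when none qualified.
    best = candidates.max_by { |hay| dice(needle, hay) }
    [needle, best, best ? dice(needle, best) : 0.0]
  end
end
```

Filtering before scoring keeps the expensive comparison off pairs that obviously cannot match, which is the point of a candidate step when the haystack is large.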