Class: Namor::Comparator

Inherits:
Object
Defined in:
lib/namor/comparator.rb

Overview

MULTI-MATCHING via components:
  • Go through all users.
  • Group by distinct sets of components.
  • Pick a (small) subset of component-keys, say <10. Maybe a random sample?
  • Build a set of matching rules.
  • Run the subset × the full corpus × the matching rules.
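One possible reading of these notes is a three-step pipeline: group, sample, then compare. The sketch below is only an illustration under that assumption, not code from the gem. `multi_match`, `MISSING_INITIALS`, and the `sample_size` and `rng` parameters are hypothetical names, and records are assumed to be arrays of name-component strings.

```ruby
require 'set'

# Hypothetical rule, modelled on #missing_initials: the long components are
# identical and at most one of the two names carries initials.
MISSING_INITIALS = lambda do |a, b|
  long_a  = a.select { |s| s.length > 1 }
  long_b  = b.select { |s| s.length > 1 }
  inits_a = a.select { |s| s.length == 1 }
  inits_b = b.select { |s| s.length == 1 }
  long_a.count >= 2 && long_a == long_b && (inits_a.empty? || inits_b.empty?)
end

# Hypothetical driver for the plan in the overview.
def multi_match(corpus, rules, sample_size: 10, rng: Random.new(42))
  # Step 1: group records by their distinct set of components.
  groups = corpus.group_by { |rec| rec.to_set }

  # Step 2: pick a small subset of component-keys (seeded random sample).
  keys = groups.keys.sample(sample_size, random: rng)

  # Step 3: run the subset x the full corpus x the matching rules.
  keys.each_with_object({}) do |key, result|
    probe = groups[key].first
    result[probe] = corpus.select do |candidate|
      candidate != probe && rules.any? { |rule| rule.call(probe, candidate) }
    end
  end
end

corpus  = [%w[john smith], %w[john q smith], %w[jane doe]]
matches = multi_match(corpus, [MISSING_INITIALS])
matches[%w[john smith]] # => [["john", "q", "smith"]]
matches[%w[jane doe]]   # => []
```

Seeding the sample keeps runs repeatable while the rules are being tuned. Whether the real implementation samples at all is left open by the notes.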


Constructor Details

#initialize(corpus) ⇒ Comparator

Returns a new instance of Comparator.



# File 'lib/namor/comparator.rb', line 11

def initialize(corpus)
  @corpus = corpus

  prep_missing_initials
end

Instance Attribute Details

#corpus ⇒ Object (readonly)

Returns the value of attribute corpus.



# File 'lib/namor/comparator.rb', line 9

def corpus
  @corpus
end

Instance Method Details

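#crunch(record) ⇒ Object

Compares record against every other entry in the corpus and returns the candidates that #evaluate accepts.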



# File 'lib/namor/comparator.rb', line 17

def crunch(record)
  (@corpus - [record]).each_with_object([]) do |candidate,matches|
    if evaluate(record, candidate)
      matches << candidate
    end
  end
end

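#evaluate(record, candidate) ⇒ Object

Returns true as soon as any rule in the rule list (currently only missing_initials) matches record against candidate; otherwise returns false.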



# File 'lib/namor/comparator.rb', line 25

def evaluate(record, candidate)
  [:missing_initials].each do |rule|
    return true if send(rule, record, candidate)
  end
  false
end
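To see how #crunch and #evaluate work together, here is a minimal sketch. `MiniComparator` is a hypothetical stand-in that copies only the methods shown on this page (with `select`/`any?` in place of the explicit loops), so it runs without the gem. Records are assumed to be arrays of name-component strings.

```ruby
# Hypothetical, trimmed-down stand-in for Namor::Comparator.
class MiniComparator
  # Rule names dispatched via send, mirroring the list in #evaluate.
  RULES = [:missing_initials].freeze

  def initialize(corpus)
    @corpus = corpus
  end

  # Collect every other corpus entry that some rule accepts.
  def crunch(record)
    (@corpus - [record]).select { |candidate| evaluate(record, candidate) }
  end

  # True if at least one rule matches the pair.
  def evaluate(record, candidate)
    RULES.any? { |rule| send(rule, record, candidate) }
  end

  def missing_initials(a, b)
    longnames_a = a.select { |s| s.length > 1 }
    longnames_b = b.select { |s| s.length > 1 }
    inits_a = a.select { |s| s.length == 1 }
    inits_b = b.select { |s| s.length == 1 }

    longnames_a.count >= 2 && longnames_b.count >= 2 &&
      longnames_a == longnames_b && (inits_a.empty? || inits_b.empty?)
  end
end

corpus = [%w[john smith], %w[john q smith], %w[john r smith], %w[jane doe]]
comparator = MiniComparator.new(corpus)

comparator.crunch(%w[john smith])   # => [["john", "q", "smith"], ["john", "r", "smith"]]
comparator.crunch(%w[john q smith]) # => [["john", "smith"]]
```

Note that `@corpus - [record]` removes every entry equal to record, so exact duplicates of the probe are never reported as matches.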

#matching_all_but_one(a, b) ⇒ Object

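Ignores any initials. Returns true when the long (non-initial) components of the two inputs differ by exactly one component in total, that is, when one input has a single long component that the other lacks. A substituted component (for example, smith versus smyth) counts as two differences and does not match.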



# File 'lib/namor/comparator.rb', line 78

def matching_all_but_one(a,b)
  longnames_a = a.select {|s| s.length > 1}
  longnames_b = b.select {|s| s.length > 1}

  ((longnames_a | longnames_b) - (longnames_a & longnames_b)).count == 1
end
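The symmetric-difference test is easy to misread, so here is a small standalone check. The method body is copied from the listing above so the example runs on its own. The sample names are illustrative only.

```ruby
# Standalone copy of the rule shown above.
def matching_all_but_one(a, b)
  longnames_a = a.select { |s| s.length > 1 }
  longnames_b = b.select { |s| s.length > 1 }

  # Symmetric difference: components present in exactly one of the inputs.
  ((longnames_a | longnames_b) - (longnames_a & longnames_b)).count == 1
end

# One input has a single extra long component ("adams"); the initial is ignored.
matching_all_but_one(%w[john smith], %w[john q adams smith]) # => true

# A substituted component leaves two unmatched components ("smith", "smyth").
matching_all_but_one(%w[john smith], %w[john smyth])         # => false

# Identical long components leave nothing unmatched.
matching_all_but_one(%w[john smith], %w[john q smith])       # => false
```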

#matching_initials(a, b) ⇒ Object

  • Each input must have at least 1 long (non-initial-only) component.
  • Every long component must either appear identically in the other input or be abbreviated to an initial there.
  • All initials must correspond, in order, to the unmatched long names in the other input.



# File 'lib/namor/comparator.rb', line 61

def matching_initials(a,b)
  longnames_a = a.select {|s| s.length > 1}
  longnames_b = b.select {|s| s.length > 1}
  inits_a = a.select {|s| s.length == 1}
  inits_b = b.select {|s| s.length == 1}

  return false unless longnames_a.count >= 1 && longnames_b.count >= 1

  unmatched_longnames_a = longnames_a - longnames_b
  unmatched_longnames_b = longnames_b - longnames_a
  unmatched_inits_a = unmatched_longnames_a.map {|s| s[0]}
  unmatched_inits_b = unmatched_longnames_b.map {|s| s[0]}

  inits_a == unmatched_inits_b && inits_b == unmatched_inits_a
end
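A short standalone check of this rule, with the method body copied from the listing above and made-up sample names:

```ruby
# Standalone copy of the rule shown above.
def matching_initials(a, b)
  longnames_a = a.select { |s| s.length > 1 }
  longnames_b = b.select { |s| s.length > 1 }
  inits_a = a.select { |s| s.length == 1 }
  inits_b = b.select { |s| s.length == 1 }

  return false unless longnames_a.count >= 1 && longnames_b.count >= 1

  # Long names with no exact counterpart, reduced to their first letter.
  unmatched_inits_a = (longnames_a - longnames_b).map { |s| s[0] }
  unmatched_inits_b = (longnames_b - longnames_a).map { |s| s[0] }

  inits_a == unmatched_inits_b && inits_b == unmatched_inits_a
end

# "q" abbreviates the unmatched long name "quincy".
matching_initials(%w[john q smith], %w[john quincy smith]) # => true

# "x" does not correspond to "quincy".
matching_initials(%w[john x smith], %w[john quincy smith]) # => false

# "quincy" is unmatched but the other input has no initial for it.
matching_initials(%w[john smith], %w[john quincy smith])   # => false
```

Because the comparison uses Array equality, several initials match only if they appear in the same order as the long names they abbreviate.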

#missing_initials(a, b) ⇒ Object

  • Each input must have at least 2 long (non-initial-only) components.
  • Those long components must be identical, in the same order.
  • Only one of the two names may contain initials.



# File 'lib/namor/comparator.rb', line 40

def missing_initials(a,b)
  longnames_a = a.select {|s| s.length > 1}
  longnames_b = b.select {|s| s.length > 1}
  inits_a = a.select {|s| s.length == 1}
  inits_b = b.select {|s| s.length == 1}

  longnames_a.count >= 2 && longnames_b.count >= 2 && longnames_a == longnames_b && (inits_a.empty? || inits_b.empty?)
end
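The three conditions can be exercised with a standalone copy of the method. The names are illustrative only.

```ruby
# Standalone copy of the rule shown above.
def missing_initials(a, b)
  longnames_a = a.select { |s| s.length > 1 }
  longnames_b = b.select { |s| s.length > 1 }
  inits_a = a.select { |s| s.length == 1 }
  inits_b = b.select { |s| s.length == 1 }

  longnames_a.count >= 2 && longnames_b.count >= 2 &&
    longnames_a == longnames_b && (inits_a.empty? || inits_b.empty?)
end

# Same long names, only the first input has an initial.
missing_initials(%w[john q smith], %w[john smith])   # => true

# Both inputs carry initials, so the rule declines to match.
missing_initials(%w[john q smith], %w[john r smith]) # => false

# Fewer than two long components.
missing_initials(%w[q smith], %w[smith])             # => false

# Array equality is order-sensitive.
missing_initials(%w[smith john], %w[john smith])     # => false
```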

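#prep_missing_initials ⇒ Object

Builds @corpus_missing_initials, a Set holding the long (non-initial) components of every corpus record that has at least two of them. Called from #initialize.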



# File 'lib/namor/comparator.rb', line 49

def prep_missing_initials
  @corpus_missing_initials = corpus.each_with_object(Set.new) do |rec,set|
    without_initials = rec.select {|s| s.length > 1}
    if without_initials.count >= 2
      set << without_initials
    end
  end
end
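The listing does not show how @corpus_missing_initials is consumed. One plausible use, offered purely as an assumption, is a quick membership test before running the pairwise rule. In the sketch below, `long_component_index` is a hypothetical free-standing version of the method above.

```ruby
require 'set' # needed explicitly on Ruby versions older than 3.2

# Hypothetical standalone version of #prep_missing_initials.
def long_component_index(corpus)
  corpus.each_with_object(Set.new) do |rec, set|
    without_initials = rec.select { |s| s.length > 1 }
    # Keep only records with at least two long components.
    set << without_initials if without_initials.count >= 2
  end
end

index = long_component_index([%w[john q smith], %w[john smith], %w[cher]])

index.size                     # => 1 (both "john ... smith" records collapse)
index.include?(%w[john smith]) # => true
index.include?(%w[cher])       # => false
```

Because arrays hash by content, records that differ only in their initials collapse into a single Set entry.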