Class: Simhilarity::Matcher

Inherits:
Object
Defined in:
lib/simhilarity/matcher.rb

Overview

Abstract superclass for matchers. It is mainly a container for the options, the corpus, and the ngram frequency weights derived from the corpus.

Direct Known Subclasses

Bulk, Single

Constructor Details

#initialize(options = {}) ⇒ Matcher

Create a new Matcher. Options include:

  • reader: Proc for turning opaque items into strings.

  • normalizer: Proc for normalizing strings.

  • ngrammer: Proc for generating ngrams.

  • verbose: If true, show progress bars and timing.



# File 'lib/simhilarity/matcher.rb', line 32

def initialize(options = {})
  @options = options

  # procs
  self.reader = options[:reader]
  self.normalizer = options[:normalizer]
  self.ngrammer = options[:ngrammer]

  self.freq = Hash.new(1)
end
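The three procs form a pipeline: item to string, string to normalized string, normalized string to ngrams. The sketch below only illustrates the shape of such an options hash. `Record`, its fields, and `EXAMPLE_OPTIONS` are hypothetical names invented for the example, and the procs are called directly rather than through a matcher.

```ruby
# Hypothetical record type; the field names are invented for this example.
Record = Struct.new(:id, :title)

# Each proc handles one stage: item -> string -> normalized string -> ngrams.
EXAMPLE_OPTIONS = {
  reader:     ->(record) { record.title },
  normalizer: ->(str) { str.downcase.gsub(/[^a-z0-9]+/, " ").strip },
  ngrammer:   ->(str) { str.split(" ") }, # whole words instead of bigrams
  verbose:    false
}

# Assumption: a concrete subclass (Bulk or Single) accepts this hash through
# the inherited constructor shown above.

record = Record.new(1, "Hello, World!")
text   = EXAMPLE_OPTIONS[:reader].call(record)
clean  = EXAMPLE_OPTIONS[:normalizer].call(text)
p EXAMPLE_OPTIONS[:ngrammer].call(clean) # => ["hello", "world"]
```

Any option left out of the hash stays nil, in which case the default implementations of #read, #normalize, and #ngrams apply.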

Instance Attribute Details

#freq ⇒ Object

Ngram frequency weights from the corpus, or 1 if the ngram isn’t in the corpus.



# File 'lib/simhilarity/matcher.rb', line 24

def freq
  @freq
end

#ngrammer ⇒ Object

Proc for generating ngrams from a normalized string. See Matcher#ngrams for the default implementation.



# File 'lib/simhilarity/matcher.rb', line 20

def ngrammer
  @ngrammer
end

#normalizer ⇒ Object

Proc for normalizing input strings. See Matcher#normalize for the default implementation.



# File 'lib/simhilarity/matcher.rb', line 16

def normalizer
  @normalizer
end

#options ⇒ Object

Options used to create this Matcher.



# File 'lib/simhilarity/matcher.rb', line 7

def options
  @options
end

#reader ⇒ Object

Proc for turning needle/haystack elements into strings. You can leave this nil if the elements are already strings. See Matcher#read for the default implementation.



# File 'lib/simhilarity/matcher.rb', line 12

def reader
  @reader
end

Instance Method Details

#corpus ⇒ Object

The current corpus.



# File 'lib/simhilarity/matcher.rb', line 65

def corpus
  @corpus
end

#corpus=(corpus) ⇒ Object

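Set the corpus. Calculates inverse ngram frequency weights (#freq) for future scoring: each ngram's weight is the total ngram count divided by that ngram's count, so rare ngrams weigh more.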



# File 'lib/simhilarity/matcher.rb', line 45

def corpus=(corpus)
  @corpus = corpus

  # calculate ngram counts for the corpus
  counts = Hash.new(0)
  veach("Corpus", import_list(corpus)) do |element|
    element.ngrams.each do |ngram|
      counts[ngram] += 1
    end
  end

  # turn counts into inverse frequencies
  self.freq = Hash.new(1)
  total = counts.values.inject(&:+).to_f
  counts.each do |ngram, count|
    self.freq[ngram] = total / count
  end
end
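The weighting step can be tried in isolation. The following is a standalone sketch that mirrors the counting and division above; `inverse_frequencies` is a hypothetical helper, and it takes plain arrays of ngrams in place of the corpus elements that the real method obtains from `import_list`.

```ruby
# Standalone sketch of the weighting in corpus=. Takes one ngram list per
# corpus element and returns a hash of inverse frequency weights.
def inverse_frequencies(ngram_lists)
  counts = Hash.new(0)
  ngram_lists.each do |ngrams|
    ngrams.each { |ngram| counts[ngram] += 1 }
  end

  total = counts.values.sum.to_f
  freq = Hash.new(1) # ngrams missing from the corpus weigh 1
  counts.each { |ngram, count| freq[ngram] = total / count }
  freq
end

# counts: "ab" => 2, "bc" => 1, "cd" => 1, so total is 4.0
freq = inverse_frequencies([%w[ab bc], %w[ab cd]])
p freq["ab"] # => 2.0
p freq["bc"] # => 4.0
p freq["zz"] # => 1
```

The common ngram "ab" ends up with half the weight of the rarer "bc" and "cd", and an ngram that never appeared falls back to the hash default of 1.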

#inspect ⇒ Object

Returns the fixed string "Matcher".



# File 'lib/simhilarity/matcher.rb', line 120

def inspect #:nodoc:
  "Matcher"
end

#ngrams(str) ⇒ Object

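Generate ngrams from a normalized str. Uses ngrammer if one is set; otherwise returns the unique two-character ngrams (bigrams) plus runs of digits.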



# File 'lib/simhilarity/matcher.rb', line 96

def ngrams(str)
  if ngrammer
    return ngrammer.call(str)
  end

  # two letter ngrams (bigrams)
  ngrams = str.each_char.each_cons(2).map(&:join)
  # runs of digits
  ngrams += str.scan(/\d+/)
  ngrams.uniq
end
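To see what the default produces, here is a standalone copy of the fallback branch; `default_ngrams` is a hypothetical name used only for this sketch.

```ruby
# Standalone copy of the default branch of #ngrams.
def default_ngrams(str)
  bigrams    = str.each_char.each_cons(2).map(&:join) # sliding window of 2 chars
  digit_runs = str.scan(/\d+/)                        # whole numbers as one ngram
  (bigrams + digit_runs).uniq
end

p default_ngrams("rd 100") # => ["rd", "d ", " 1", "10", "00", "100"]
p default_ngrams("7")      # => ["7"]
```

Bigrams span spaces, so word boundaries show up as ngrams such as "d ". A digit run of exactly two characters duplicates a bigram and is removed by `uniq`.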

#ngrams_sum(ngrams) ⇒ Object

Sum up the frequency weights of the ngrams. Returns 0 for an empty list.



# File 'lib/simhilarity/matcher.rb', line 109

def ngrams_sum(ngrams)
  ngrams.map { |i| @freq[i] }.inject(&:+) || 0
end
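A small worked example of the sum, as a standalone sketch. `sum_weights` is a hypothetical helper that takes the weight hash as an argument instead of reading `@freq`, and the weights are made up for illustration.

```ruby
# Standalone sketch of #ngrams_sum with the weight hash passed in.
def sum_weights(freq, ngrams)
  ngrams.map { |ngram| freq[ngram] }.inject(&:+) || 0
end

weights = Hash.new(1) # unknown ngrams weigh 1
weights["ab"] = 2.0
weights["bc"] = 4.0

p sum_weights(weights, %w[ab bc zz]) # => 7.0
p sum_weights(weights, [])           # => 0
```

The `|| 0` matters because `inject` without an initial value returns nil for an empty array.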

#normalize(incoming_str) ⇒ Object

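Normalize an incoming string from the user. Uses normalizer if one is set; otherwise downcases the string, replaces every character other than a-z and 0-9 with a space, and squishes whitespace.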



# File 'lib/simhilarity/matcher.rb', line 82

def normalize(incoming_str)
  if normalizer
    return normalizer.call(incoming_str)
  end

  str = incoming_str
  str = str.downcase
  str = str.gsub(/[^a-z0-9]/, " ")
  # squish whitespace
  str = str.gsub(/\s+/, " ").strip
  str
end
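The default steps can be checked with a standalone copy; `default_normalize` is a hypothetical name used only for this sketch.

```ruby
# Standalone copy of the default branch of #normalize.
def default_normalize(str)
  str.downcase
     .gsub(/[^a-z0-9]/, " ") # punctuation and non-ASCII letters become spaces
     .gsub(/\s+/, " ")       # squish runs of whitespace
     .strip
end

p default_normalize("  Hello,   World! 42nd St. ") # => "hello world 42nd st"
p default_normalize("Caf\u00e9 No. 7")             # => "caf no 7"
```

The second call shows that accented letters are dropped rather than transliterated, which is one reason to supply a custom normalizer for non-English input.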

#read(opaque) ⇒ Object

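Turn an opaque item from the user into a string. Uses reader if one is set; otherwise strings pass through unchanged and any other object raises an error.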



# File 'lib/simhilarity/matcher.rb', line 70

def read(opaque)
  if reader
    return reader.call(opaque)
  end

  if opaque.is_a?(String)
    return opaque
  end
  raise "can't turn #{opaque.inspect} into string"
end
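The three outcomes (custom reader, string pass-through, error) can be sketched standalone. `read_item` is a hypothetical helper that takes the reader proc as an argument instead of using the attribute.

```ruby
# Standalone sketch of #read with the reader proc passed in explicitly.
def read_item(opaque, reader = nil)
  return reader.call(opaque) if reader
  return opaque if opaque.is_a?(String)
  raise "can't turn #{opaque.inspect} into string"
end

p read_item("plain string")                            # => "plain string"
p read_item({ name: "Ada" }, ->(item) { item[:name] }) # => "Ada"

begin
  read_item(42)
rescue RuntimeError => e
  puts e.message # prints: can't turn 42 into string
end
```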

#simhash(ngrams) ⇒ Object

Calculate the frequency weighted simhash of the ngrams.



# File 'lib/simhilarity/matcher.rb', line 116

def simhash(ngrams)
  Bits.simhash32(freq, ngrams)
end
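The source of `Bits.simhash32` is not shown on this page. As a rough illustration of what a frequency weighted simhash generally does, here is a generic sketch. It is not the library's implementation: the use of CRC32 as the per-ngram hash and the names `weighted_simhash32` and `hamming_distance` are assumptions made for the example.

```ruby
require "zlib"

# Generic weighted simhash sketch. Every ngram votes on each of the 32 bit
# positions: +weight where its hash has a 1, -weight where it has a 0.
# Assumption: CRC32 stands in for whatever hash the real library uses.
def weighted_simhash32(freq, ngrams)
  tallies = Array.new(32, 0)
  ngrams.each do |ngram|
    hash   = Zlib.crc32(ngram)
    weight = freq[ngram]
    32.times do |bit|
      tallies[bit] += hash[bit] == 1 ? weight : -weight
    end
  end

  # A bit is set in the result when its weighted tally is positive.
  tallies.each_with_index.inject(0) do |result, (tally, bit)|
    tally > 0 ? result | (1 << bit) : result
  end
end

# Number of differing bits between two simhashes.
def hamming_distance(a, b)
  (a ^ b).to_s(2).count("1")
end

freq = Hash.new(1)
a = weighted_simhash32(freq, %w[he el ll lo])
b = weighted_simhash32(freq, %w[he el ll lo])
p hamming_distance(a, b) # => 0
```

Because heavily weighted ngrams dominate each bit's tally, inputs that share their rare ngrams tend to produce hashes that differ in few bits.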