Class: Simhilarity::Matcher
- Inherits:
- Object
- Defined in:
- lib/simhilarity/matcher.rb
Overview
Abstract superclass for matching. Mainly a container for options, corpus, etc.
Instance Attribute Summary
-
#freq ⇒ Object
Ngram frequency weights from the corpus, or 1 if the ngram isn’t in the corpus.
-
#ngrammer ⇒ Object
Proc for generating ngrams from a normalized string.
-
#normalizer ⇒ Object
Proc for normalizing input strings.
-
#options ⇒ Object
Options used to create this Matcher.
-
#reader ⇒ Object
Proc for turning needle/haystack elements into strings.
Instance Method Summary
-
#corpus ⇒ Object
The current corpus.
-
#corpus=(corpus) ⇒ Object
Set the corpus.
-
#initialize(options = {}) ⇒ Matcher
constructor
Create a new Matcher.
-
#inspect ⇒ Object
:nodoc:.
-
#ngrams(str) ⇒ Object
Generate ngrams from a normalized str.
-
#ngrams_sum(ngrams) ⇒ Object
Sum up the frequency weights of the ngrams.
-
#normalize(incoming_str) ⇒ Object
Normalize an incoming string from the user.
-
#read(opaque) ⇒ Object
Turn an opaque item from the user into a string.
-
#simhash(ngrams) ⇒ Object
Calculate the frequency-weighted simhash of the ngrams.
Constructor Details
#initialize(options = {}) ⇒ Matcher
Create a new Matcher. Options include:
- reader: Proc for turning opaque items into strings.
- normalizer: Proc for normalizing strings.
- ngrammer: Proc for generating ngrams.
- verbose: If true, show progress bars and timing.
# File 'lib/simhilarity/matcher.rb', line 32
def initialize(options = {})
  @options = options

  # procs
  self.reader = options[:reader]
  self.normalizer = options[:normalizer]
  self.ngrammer = options[:ngrammer]

  self.freq = Hash.new(1)
end
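As a rough illustration of the options listed above, the hash below bundles three procs that could stand in for the defaults. The item shape (a Hash with a `:name` key) and the procs themselves are assumptions made for this sketch, not part of the library; such a hash would be handed to the constructor of a concrete subclass.

```ruby
# Hypothetical item shape: each needle/haystack element is a Hash with a :name key.
reader = ->(item) { item[:name] }
# Hypothetical normalizer: lowercase and trim only.
normalizer = ->(str) { str.downcase.strip }
# Hypothetical ngrammer: whitespace-separated words instead of character bigrams.
ngrammer = ->(str) { str.split }

OPTIONS = { reader: reader, normalizer: normalizer, ngrammer: ngrammer, verbose: false }

# Chain the procs in the order the Matcher methods suggest: read, normalize, ngrams.
text = OPTIONS[:normalizer].call(OPTIONS[:reader].call({ name: "  Main Street " }))
tokens = OPTIONS[:ngrammer].call(text)
# text   => "main street"
# tokens => ["main", "street"]
```

Any option left out is nil, in which case the listing above stores nil and the corresponding method falls back to its default behavior.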
Instance Attribute Details
#freq ⇒ Object
Ngram frequency weights from the corpus, or 1 if the ngram isn’t in the corpus.
# File 'lib/simhilarity/matcher.rb', line 24
def freq
  @freq
end
#ngrammer ⇒ Object
Proc for generating ngrams from a normalized string. See Matcher#ngrams for the default implementation.
# File 'lib/simhilarity/matcher.rb', line 20
def ngrammer
  @ngrammer
end
#normalizer ⇒ Object
Proc for normalizing input strings. See Matcher#normalize for the default implementation.
# File 'lib/simhilarity/matcher.rb', line 16
def normalizer
  @normalizer
end
#options ⇒ Object
Options used to create this Matcher.
# File 'lib/simhilarity/matcher.rb', line 7
def options
  @options
end
#reader ⇒ Object
Proc for turning needle/haystack elements into strings. You can leave this nil if the elements are already strings. See Matcher#read for the default implementation.
# File 'lib/simhilarity/matcher.rb', line 12
def reader
  @reader
end
Instance Method Details
#corpus ⇒ Object
The current corpus.
# File 'lib/simhilarity/matcher.rb', line 65
def corpus
  @corpus
end
#corpus=(corpus) ⇒ Object
Set the corpus. Calculates ngram frequencies (#freq) for future scoring.
# File 'lib/simhilarity/matcher.rb', line 45
def corpus=(corpus)
  @corpus = corpus

  # calculate ngram counts for the corpus
  counts = Hash.new(0)
  veach("Corpus", import_list(corpus)) do |element|
    element.ngrams.each do |ngram|
      counts[ngram] += 1
    end
  end

  # turn counts into inverse frequencies
  self.freq = Hash.new(1)
  total = counts.values.inject(&:+).to_f
  counts.each do |ngram, count|
    self.freq[ngram] = total / count
  end
end
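The weighting step in the listing above can be tried in isolation. The following is a minimal standalone sketch, assuming the per-element ngrams are already available as plain arrays; `inverse_frequencies` and its `ngram_lists` parameter are hypothetical names used only here, and the library's `veach` and `import_list` helpers are not reproduced.

```ruby
# Standalone sketch of the inverse-frequency weighting.
# ngram_lists is a hypothetical stand-in for the ngrams of each corpus element.
def inverse_frequencies(ngram_lists)
  # count how many times each ngram occurs across all elements
  counts = Hash.new(0)
  ngram_lists.each do |ngrams|
    ngrams.each { |ngram| counts[ngram] += 1 }
  end

  # weight = total occurrences / occurrences of this ngram;
  # ngrams missing from the corpus fall back to the default weight 1
  freq = Hash.new(1)
  total = counts.values.sum.to_f
  counts.each { |ngram, count| freq[ngram] = total / count }
  freq
end

weights = inverse_frequencies([%w[ab bc], %w[ab cd]])
# weights["ab"] => 2.0  (2 of 4 occurrences)
# weights["bc"] => 4.0  (1 of 4 occurrences)
# weights["zz"] => 1    (not in the corpus)
```

Rarer ngrams receive larger weights, so they contribute more to the sums and hashes computed from #freq.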
#inspect ⇒ Object
:nodoc:
# File 'lib/simhilarity/matcher.rb', line 120
def inspect #:nodoc:
  "Matcher"
end
#ngrams(str) ⇒ Object
Generate ngrams from a normalized str.
# File 'lib/simhilarity/matcher.rb', line 96
def ngrams(str)
  if ngrammer
    return ngrammer.call(str)
  end

  # two letter ngrams (bigrams)
  ngrams = str.each_char.each_cons(2).map(&:join)
  # runs of digits
  ngrams += str.scan(/\d+/)
  ngrams.uniq
end
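To see what the default branch of the listing above produces, here is a small standalone sketch of the same bigram-plus-digit-runs logic. The name `default_ngrams` is hypothetical and exists only for this example.

```ruby
# Standalone copy of the default ngram logic (no custom ngrammer).
def default_ngrams(str)
  # every pair of adjacent characters, spaces included
  bigrams = str.each_char.each_cons(2).map(&:join)
  # each full run of digits, so a number also matches as a whole
  digit_runs = str.scan(/\d+/)
  (bigrams + digit_runs).uniq
end

default_ngrams("apt 123")
# => ["ap", "pt", "t ", " 1", "12", "23", "123"]
```

A string shorter than two characters yields no bigrams, so a single letter produces an empty list and a single digit produces only its digit run.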
#ngrams_sum(ngrams) ⇒ Object
Sum up the frequency weights of the ngrams.
# File 'lib/simhilarity/matcher.rb', line 109
def ngrams_sum(ngrams)
  ngrams.map { |i| @freq[i] }.inject(&:+) || 0
end
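The summation above relies on the frequency hash returning 1 for unknown ngrams and on `|| 0` covering the empty case. A minimal sketch, assuming a hand-built weight table; `sum_weights` and the sample weights are hypothetical:

```ruby
# Standalone sketch: the weight table is passed in instead of read from @freq.
def sum_weights(freq, ngrams)
  # inject returns nil for an empty list, hence the fallback to 0
  ngrams.map { |ngram| freq[ngram] }.inject(&:+) || 0
end

weights = Hash.new(1) # unknown ngrams weigh 1
weights["ab"] = 2.0
weights["bc"] = 4.0

sum_weights(weights, %w[ab bc zz]) # => 7.0
sum_weights(weights, [])           # => 0
```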
#normalize(incoming_str) ⇒ Object
Normalize an incoming string from the user.
# File 'lib/simhilarity/matcher.rb', line 82
def normalize(incoming_str)
  if normalizer
    return normalizer.call(incoming_str)
  end

  str = incoming_str
  str = str.downcase
  str = str.gsub(/[^a-z0-9]/, " ")
  # squish whitespace
  str = str.gsub(/\s+/, " ").strip
  str
end
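The default normalization in the listing above can be checked with a standalone sketch of the same three steps. The name `default_normalize` is hypothetical.

```ruby
# Standalone copy of the default normalization (no custom normalizer).
def default_normalize(incoming_str)
  str = incoming_str.downcase
  # anything that is not a lowercase ASCII letter or digit becomes a space
  str = str.gsub(/[^a-z0-9]/, " ")
  # collapse runs of whitespace and trim both ends
  str.gsub(/\s+/, " ").strip
end

default_normalize("  Hello,   World! 42 ") # => "hello world 42"
default_normalize("O'Brien & Sons")        # => "o brien sons"
```

Because only a-z and 0-9 survive, punctuation splits words rather than vanishing; input that depends on other characters would need a custom normalizer proc.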
#read(opaque) ⇒ Object
Turn an opaque item from the user into a string.
# File 'lib/simhilarity/matcher.rb', line 70
def read(opaque)
  if reader
    return reader.call(opaque)
  end

  if opaque.is_a?(String)
    return opaque
  end
  raise "can't turn #{opaque.inspect} into string"
end
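The three outcomes of the listing above (custom reader, plain string, error) can be sketched as follows. `read_item` and the `Person` struct are hypothetical names introduced for this example; the reader is passed as an argument rather than taken from the Matcher.

```ruby
# Hypothetical record type standing in for an opaque needle/haystack element.
Person = Struct.new(:name)

# Standalone sketch of the read logic.
def read_item(opaque, reader = nil)
  return reader.call(opaque) if reader   # a custom proc always wins
  return opaque if opaque.is_a?(String)  # strings pass through unchanged
  raise "can't turn #{opaque.inspect} into string"
end

read_item("plain string")                      # => "plain string"
read_item(Person.new("Ann"), ->(p) { p.name }) # => "Ann"

begin
  read_item(Person.new("Ann")) # no reader and not a String
rescue RuntimeError => e
  puts e.message
end
```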