Class: PEROBS::FuzzyStringMatcher
- Inherits:
- Object
- Inheritance chain:
- Object > ObjectBase > Object > PEROBS::FuzzyStringMatcher
- Defined in:
- lib/perobs/FuzzyStringMatcher.rb
Overview
The fuzzy string matcher performs fuzzy string searches against a known set of strings. The dictionary of known strings does not store the actual strings but references to String or PEROBS objects. Once the dictionary has been established, fuzzy matches can be done. Since the actual input strings are not directly stored, you cannot remove or modify strings that are already stored. To remove strings, you have to clear the matcher and add again the strings that you want to keep.
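The clear-and-relearn pattern described above can be sketched with a small in-memory stand-in. `TinyMatcher` and its `references` method are hypothetical and exist only for this illustration; the stand-in uses a plain Hash instead of a PEROBS store and assumes that n-grams are the contiguous substrings of length n.

```ruby
# Hypothetical stand-in for illustration only. It is NOT the PEROBS class
# and keeps its dictionary in a plain Hash instead of a PEROBS store.
class TinyMatcher
  def initialize(n = 4)
    @n = n
    clear
  end

  # Wipe the dictionary.
  def clear
    @dict = {}
  end

  # Index every n-gram of the framed, downcased string under the reference.
  def learn(string, reference = string)
    framed = "\002" + string.downcase + "\003"
    (0..framed.length - @n).each do |i|
      (@dict[framed[i, @n]] ||= {})[reference] = true
    end
    nil
  end

  # All references currently reachable through the dictionary.
  def references
    @dict.values.flat_map(&:keys).uniq
  end
end

words = %w[apple apricot banana]
matcher = TinyMatcher.new
words.each { |word| matcher.learn(word) }

# "Remove" banana: wipe everything, then re-add the strings to keep.
matcher.clear
(words - ['banana']).each { |word| matcher.learn(word) }
p matcher.references
```

Because only n-grams and references are stored, there is no cheap way to find and delete the entries of a single string, which is why rebuilding is the suggested route.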
Instance Method Summary
- #best_matches(string, min_score = 0.5, max_count = 100) ⇒ Array
  Find the references whose string best matches the given string.
- #clear ⇒ Object
  Wipe the dictionary.
- #initialize(p, case_sensitive = false, n = 4) ⇒ FuzzyStringMatcher (constructor)
  Create a new FuzzyStringMatcher.
- #learn(string, reference = string) ⇒ Object
  Add a string with its reference to the dictionary.
- #stats ⇒ Object
  Return some internal stats about the dictionary.
Methods inherited from Object
#_delete_reference_to_id, #_deserialize, #_referenced_object_ids, #attr_init, attr_persist, #init_attr, #inspect, #mark_as_modified
Methods inherited from ObjectBase
#==, #_check_assignment_value, _finalize, #_initialize, #_restore, #_stash, #_sync, #_transfer, read, #restore
Constructor Details
#initialize(p, case_sensitive = false, n = 4) ⇒ FuzzyStringMatcher
Create a new FuzzyStringMatcher.
    # File 'lib/perobs/FuzzyStringMatcher.rb', line 51
    def initialize(p, case_sensitive = false, n = 4)
      super(p)
      if n < 2 || n > 10
        raise ArgumentError, 'n must be between 2 and 10'
      end
      self.case_sensitive = case_sensitive
      self.n = n
      clear unless @dict
    end
Instance Method Details
#best_matches(string, min_score = 0.5, max_count = 100) ⇒ Array
Find the references whose string best matches the given string.

    # File 'lib/perobs/FuzzyStringMatcher.rb', line 102
    def best_matches(string, min_score = 0.5, max_count = 100)
      unless @case_sensitive
        string = string.downcase
      end
      # Enclose string in 'start of text' and 'end of text' ASCII values.
      string = "\002" + string + "\003"

      matches = {}

      each_n_gramm(string) do |n_gramm|
        if (ng_list = @dict[n_gramm])
          ng_list.each do |reference, dummy|
            if matches.include?(reference)
              matches[reference] += 1
            else
              matches[reference] = 1
            end
          end
        end
      end

      return [] if matches.empty?

      match_list = matches.to_a

      # Set occurrence counters to scores relative to the best possible score.
      # This will be the best possible score for a perfect match.
      best_possible_score = string.length - @n + 1
      match_list.map! { |a, b| [ a, b.to_f / best_possible_score ] }

      # Delete all matches that don't have the required minimum match score.
      match_list.delete_if { |a| a[1] < min_score }

      # Sort the list best to worst match
      match_list.sort! do |a, b|
        b[1] <=> a[1]
      end

      # Return the top max_count matches.
      match_list[0..max_count - 1]
    end
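To make the scoring concrete, here is a minimal sketch that reproduces the counting and normalization steps with plain Ruby Hashes. The helper `n_grams` is hypothetical: the private `each_n_gramm` method is not shown on this page, so the sketch assumes it yields every contiguous substring of length n. The word list is made up for the example.

```ruby
# Hypothetical helper: assumes n-grams are all contiguous substrings of
# length n of the framed, downcased string.
def n_grams(string, n)
  framed = "\002" + string.downcase + "\003"
  return [] if framed.length < n

  (0..framed.length - n).map { |i| framed[i, n] }
end

n = 4
# In-memory dictionary: n-gram => { reference => true }
dict = Hash.new { |hash, key| hash[key] = {} }
%w[hello help yellow].each do |word|
  n_grams(word, n).each { |gram| dict[gram][word] = true }
end

# Count how many n-grams of the query each reference shares.
query_grams = n_grams('hello', n)
hits = Hash.new(0)
query_grams.each do |gram|
  next unless dict.key?(gram)

  dict[gram].each_key { |reference| hits[reference] += 1 }
end

# A perfect match shares every n-gram of the query.
best_possible = query_grams.length
scores = hits.map { |ref, count| [ref, count.to_f / best_possible] }.to_h

# Apply the default min_score of 0.5 and sort best to worst.
matches = scores.select { |_, score| score >= 0.5 }
                .sort_by { |_, score| -score }
p scores
p matches
```

In this sketch the framed query has 7 characters, so the best possible score is 7 - 4 + 1 = 4 shared n-grams. "hello" shares all four and scores 1.0, while "help" and "yellow" each share one and score 0.25, below the default threshold.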
#clear ⇒ Object
Wipe the dictionary.
    # File 'lib/perobs/FuzzyStringMatcher.rb', line 63
    def clear
      self.dict = @store.new(BigHash)
    end
#learn(string, reference = string) ⇒ Object
Add a string with its reference to the dictionary.
    # File 'lib/perobs/FuzzyStringMatcher.rb', line 70
    def learn(string, reference = string)
      reference = string if reference.nil?

      unless @case_sensitive
        string = string.downcase
      end
      # Enclose string in 'start of text' and 'end of text' ASCII values.
      string = "\002" + string + "\003"

      each_n_gramm(string) do |n_gramm|
        unless (ng_list = @dict[n_gramm])
          @dict[n_gramm] = ng_list = @store.new(Hash)
        end

        # We use the Hash as a Set. The value doesn't matter.
        ng_list[reference] = true unless ng_list.include?(reference)
      end

      nil
    end
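The dictionary keys that learn produces for a single word can be illustrated as follows. The helper `framed_n_grams` is hypothetical and assumes that `each_n_gramm`, which is not shown on this page, yields every contiguous substring of length n.

```ruby
# Hypothetical helper mirroring the preprocessing in learn: optional
# downcasing, framing with "\002" and "\003", then a sliding window.
def framed_n_grams(string, n = 4, case_sensitive = false)
  string = string.downcase unless case_sensitive
  framed = "\002" + string + "\003"
  return [] if framed.length < n

  (0..framed.length - n).map { |i| framed[i, n] }
end

grams = framed_n_grams('Ruby')
# The framed string has 6 characters, so there are 6 - 4 + 1 = 3 n-grams.
grams.each { |gram| puts gram.inspect }
```

The framing characters make n-grams at the start and end of a word distinct from the same letters in the middle of another word, so matching prefixes and suffixes count as evidence in their own right.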
#stats ⇒ Object
Returns some internal stats about the dictionary.
    # File 'lib/perobs/FuzzyStringMatcher.rb', line 145
    def stats
      s = {}
      s['dictionary_size'] = @dict.size
      max = total = 0
      @dict.each do |n_gramm, ng_list|
        size = ng_list.length
        max = size if size > max
        total += size
      end
      s['max_list_size'] = max
      s['avg_list_size'] = total > 0 ? total.to_f / s['dictionary_size'] : 0

      s
    end
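As a rough illustration of what the three values mean, the same figures can be computed for a small hand-written dictionary. The Hash below is made-up sample data standing in for the persistent dictionary; it is not output from PEROBS.

```ruby
# Made-up sample dictionary: n-gram => { reference => true }
dict = {
  'appl' => { 'apple' => true, 'apply' => true },
  'pple' => { 'apple' => true },
  'pply' => { 'apply' => true }
}

# Number of references stored under each n-gram.
sizes = dict.values.map(&:length)

stats = {
  'dictionary_size' => dict.size,   # number of distinct n-grams
  'max_list_size' => sizes.max || 0, # longest reference list
  'avg_list_size' => sizes.empty? ? 0 : sizes.sum.to_f / dict.size
}
p stats
```

Here three n-grams hold four references in total, so the average list size is 4 / 3. A large maximum relative to the average would suggest that a few very common n-grams dominate the lookup cost.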