Class: PEROBS::FuzzyStringMatcher

Inherits:
Object show all
Defined in:
lib/perobs/FuzzyStringMatcher.rb

Overview

The fuzzy string matcher can be used to perform a fuzzy string search against a known set of strings. The dictionary of known strings does not store the actual strings but references to String or PEROBS objects. Once the dictionary has been established, fuzzy matches can be done. Since the actual input strings are not directly stored, you cannot remove or modified already stored strings. To remove strings, you have to clear the matcher and add the strings again that you want to keep.

Constant Summary

Constants inherited from ObjectBase

ObjectBase::NATIVE_CLASSES

Instance Attribute Summary

Attributes inherited from Object

#attributes

Attributes inherited from ObjectBase

#_id, #myself, #store

Instance Method Summary collapse

Methods inherited from Object

#_delete_reference_to_id, #_deserialize, #_referenced_object_ids, #attr_init, attr_persist, #init_attr, #inspect, #mark_as_modified

Methods inherited from ObjectBase

#==, #_check_assignment_value, _finalize, #_initialize, #_restore, #_stash, #_sync, #_transfer, read, #restore

Constructor Details

#initialize(p, case_sensitive = false, n = 4) ⇒ FuzzyStringMatcher

Create a new FuzzyStringMatcher.

Parameters:

  • p (PEROBS::Store)

    place to store the dictionary

  • case_sensitive (Boolean) (defaults to: false)

    True if case matters for matching

  • n (Integer) (defaults to: 4)

    Determines what kind of n-gramm is used to store the references in the dictionary. It also determines the minimum word length that can be used for fuzzy matches. Values between 2 and 10 are supported. The default is 4.



51
52
53
54
55
56
57
58
59
60
# File 'lib/perobs/FuzzyStringMatcher.rb', line 51

def initialize(p, case_sensitive = false, n = 4)
  super(p)
  if n < 2 || n > 10
    raise ArgumentError, 'n must be between 2 and 10'
  end
  self.case_sensitive = case_sensitive
  self.n = n

  clear unless @dict
end

Instance Method Details

#best_matches(string, min_score = 0.5, max_count = 100) ⇒ Array

Find the references who’s string best matches the given string.

Parameters:

  • string (String)

    string to search for

  • min_score (Float) (defaults to: 0.5)

    Value 0.01 and 1.0 that specifies how strict the matching should be done. The larger the value the more closer the given string needs to be.

  • max_count (Integer) (defaults to: 100)

    The maximum number of matches that should be returned.

Returns:

  • (Array)

    The result is an Array of Arrays. The nested Arrays only have 2 entries. The reference and a Float value between 0 and 1.0 that describes how good the match is. The matches are sorted in descending order by the match score.



102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
# File 'lib/perobs/FuzzyStringMatcher.rb', line 102

def best_matches(string, min_score = 0.5, max_count = 100)
  unless @case_sensitive
    string = string.downcase
  end
  # Enclose string in 'start of text' and 'end of text' ASCII values.
  string = "\002" + string + "\003"

  matches = {}

  each_n_gramm(string) do |n_gramm|
    if (ng_list = @dict[n_gramm])
      ng_list.each do |reference, dummy|
        if matches.include?(reference)
          matches[reference] += 1
        else
          matches[reference] = 1
        end
      end
    end
  end

  return [] if matches.empty?

  match_list = matches.to_a

  # Set occurance counters to scores relative to the best possible score.
  # This will be the best possible score for a perfect match.
  best_possible_score = string.length - @n + 1
  match_list.map! { |a, b| [ a, b.to_f / best_possible_score ] }

  # Delete all matches that don't have the required minimum match score.
  match_list.delete_if { |a| a[1] < min_score }

  # Sort the list best to worst match
  match_list.sort! do |a, b|
    b[1] <=> a[1]
  end

  # Return the top max_count matches.
  match_list[0..max_count - 1]
end

#clearObject

Wipe the dictionary.



63
64
65
# File 'lib/perobs/FuzzyStringMatcher.rb', line 63

def clear
  self.dict = @store.new(BigHash)
end

#learn(string, reference = string) ⇒ Object

Add a string with its reference to the dictionary.

Parameters:

  • string (String)

    The string to store

  • reference (Object) (defaults to: string)

    Any object that is associated with the string



70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
# File 'lib/perobs/FuzzyStringMatcher.rb', line 70

def learn(string, reference = string)
  reference = string if reference.nil?

  unless @case_sensitive
    string = string.downcase
  end
  # Enclose string in 'start of text' and 'end of text' ASCII values.
  string = "\002" + string + "\003"

  each_n_gramm(string) do |n_gramm|
    unless (ng_list = @dict[n_gramm])
      @dict[n_gramm] = ng_list = @store.new(Hash)
    end

    # We use the Hash as a Set. The value doesn't matter.
    ng_list[reference] = true unless ng_list.include?(reference)
  end

  nil
end

#statsObject

Returns some internal stats about the dictionary.



145
146
147
148
149
150
151
152
153
154
155
156
157
158
# File 'lib/perobs/FuzzyStringMatcher.rb', line 145

def stats
  s = {}
  s['dictionary_size'] = @dict.size
  max = total = 0
  @dict.each do |n_gramm, ng_list|
    size = ng_list.length
    max = size if size > max
    total += size
  end
  s['max_list_size'] = max
  s['avg_list_size'] = total > 0 ? total.to_f / s['dictionary_size'] : 0

  s
end