Class: Ferret::Search::FuzzyTermEnum

Inherits:
FilteredTermEnum show all
Includes:
Index, MonitorMixin
Defined in:
lib/ferret/search/fuzzy_term_enum.rb

Overview

Subclass of FilteredTermEnum for enumerating all terms that are similiar to the specified filter term.

Term enumerations are always ordered by Term.compareTo(). Each term in the enumeration is greater than all that precede it.

Constant Summary collapse

TYPICAL_LONGEST_WORD_IN_INDEX =

This should be somewhere around the average long word. If it is longer, we waste time and space. If it is shorter, we waste a little bit of time growing the array as we encounter longer words.

19

Instance Attribute Summary collapse

Attributes inherited from FilteredTermEnum

#term

Instance Method Summary collapse

Methods inherited from FilteredTermEnum

#close, #doc_freq, #enum=, #next?

Methods inherited from Index::TermEnum

#close, #doc_freq, #next?, #skip_to, #term

Constructor Details

#initialize(reader, term, minimum_similarity = FuzzyQuery.default_min_similarity, prefix_length = FuzzyQuery.default_prefix_length) ⇒ FuzzyTermEnum

Constructor for enumeration of all terms from specified reader which share a prefix of length prefix_length with term and which have a fuzzy similarity > min_similarity.

After calling the constructor the enumeration is already pointing to the first valid term if such a term exists.

reader

Delivers terms.

term

Pattern term.

min_similarity

Minimum required similarity for terms from the reader.

Default value is 0.5.

prefix_length

Length of required common prefix. Default value is 0.



32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 32

def initialize(reader, term,
               minimum_similarity = FuzzyQuery.default_min_similarity,
               prefix_length = FuzzyQuery.default_prefix_length)
  super()
  
  @reader = reader
  @end_enum = false
  @max_distances = Array.new(TYPICAL_LONGEST_WORD_IN_INDEX)

  
  if (minimum_similarity >= 1.0)
    raise ArgumentError, "minimum_similarity cannot be greater than or equal to 1"
  elsif (minimum_similarity < 0.0)
    raise ArgumentError, "minimum_similarity cannot be less than 0"
  end
  if(prefix_length < 0)
    raise ArgumentError, "prefix_length cannot be less than 0"
  end

  @minimum_similarity = minimum_similarity
  @scale_factor = 1.0 / (1.0 - @minimum_similarity)
  @search_term = term
  @field = @search_term.field

  # The prefix could be longer than the word.

  # It's kind of silly though.  It means we must match the entire word.

  term_length = @search_term.text.length
  if prefix_length > term_length 
    @prefix_length = term_length 
  else
    @prefix_length = prefix_length
  end

  @text = @search_term.text[@prefix_length..-1]
  @prefix = @search_term.text[0, @prefix_length]

  initialize_max_distances()

  # Allows us save time required to create a new array

  # everytime similarity is called.

  @d = init_distance_array()

  self.enum = reader.terms_from(Term.new(@search_term.field, @prefix))
end

Instance Attribute Details

#end_enumObject (readonly)

Returns the value of attribute end_enum.



13
14
15
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 13

def end_enum
  @end_enum
end

Instance Method Details

#calculate_max_distance(m) ⇒ Object



246
247
248
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 246

def calculate_max_distance(m) 
  return ((1-@minimum_similarity) * ([@text.length, m].min + @prefix_length))
end

#differenceObject



89
90
91
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 89

def difference() 
  return  (@scale_factor * (@similarity - @minimum_similarity))
end

#grow_distance_array(m) ⇒ Object

Grow the second dimension of the array, so that we can calculate the Levenshtein difference.



224
225
226
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 224

def grow_distance_array(m) 
  @d = @d.map {Array.new(m+1)}
end

#init_distance_arrayObject



103
104
105
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 103

def init_distance_array()
  return Array.new(@text.length() + 1) {Array.new(TYPICAL_LONGEST_WORD_IN_INDEX)}
end

#initialize_max_distancesObject



240
241
242
243
244
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 240

def initialize_max_distances() 
  @max_distances.length.times do |i|
    @max_distances[i] = calculate_max_distance(i)
  end
end

#max_distance(m) ⇒ Object

The max Distance is the maximum Levenshtein distance for the text compared to some other value that results in score that is better than the minimum similarity.

m

the length of the “other value”

returns

the maximum levenshtein distance that we care about



233
234
235
236
237
238
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 233

def max_distance(m) 
  if (m >= @max_distances.length)
    @max_distances[m] = calculate_max_distance(m)
  end
  return @max_distances[m]
end

#min(a, b, c) ⇒ Object

Finds and returns the smallest of three integers



98
99
100
101
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 98

def min(a, b, c) 
  t = (a < b) ? a : b
  return (t < c) ? t : c
end

#similarity(target) ⇒ Object

Similarity returns a number that is 1.0 or less (including negative numbers) based on how similar the Term is compared to a target term. It returns exactly 0.0 when

edit_distance < maximum_edit_distance

Otherwise it returns:

1 - (edit_distance / length)

where length is the length of the shortest term (text or target) including a prefix that are identical and edit_distance is the Levenshtein distance for the two words.

Embedded within this algorithm is a fail-fast Levenshtein distance algorithm. The fail-fast algorithm differs from the standard Levenshtein distance algorithm in that it is aborted if it is discovered that the mimimum distance between the words is greater than some threshold.

To calculate the maximum distance threshold we use the following formula:

(1 - minimum_similarity) * length

where length is the shortest term including any prefix that is not part of the similarity comparision. This formula was derived by solving for what maximum value of distance returns false for the following statements:

similarity = 1 - (distance / (prefix_length + [textlen, targetlen].min))
return (similarity > minimum_similarity)

where distance is the Levenshtein distance for the two words.

Levenshtein distance (also known as edit distance) is a measure of similiarity between two strings where the distance is measured as the number of character deletions, insertions or substitutions required to transform one string to the other string.

target

the target word or phrase

returns

the similarity, 0.0 or less indicates that it matches less

than the required threshold and 1.0 indicates that the text and
target are identical


150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 150

def similarity(target) 
  synchronize do
    m = target.length
    n = @text.length

    if (n == 0)  
      # we don't have anything to compare.  That means if we just add the

      # letters for m we get the new word

      return (@prefix_length == 0) ? 0.0 : 1.0 - (m.to_f / @prefix_length)
    end
    if (m == 0) 
      return (@prefix_length == 0) ? 0.0 : 1.0 - (n.to_f / @prefix_length)
    end

    max_distance = max_distance(m)

    if (max_distance < (m-n).abs) 
      #just adding the characters of m to n or vice-versa results in

      #too many edits

      #for example "pre" length is 3 and "prefixes" length is 8.  We can see that

      #given this optimal circumstance, the edit distance cannot be less than 5.

      #which is 8-3 or more precisesly Math.abs(3-8).

      #if our maximum edit distance is 4, then we can discard this word

      #without looking at it.

      return 0.0
    end

    #let's make sure we have enough room in our array to do the distance calculations.

    if (@d[0].length <= m) 
      grow_distance_array(m)
    end

    # init matrix d

    (n+1).times {|i| @d[i][0] = i}
    (m+1).times {|j| @d[0][j] = j}
    
    # start computing edit distance

    1.upto(n) do |i|
      best_possible_edit_distance = m
      s_i = @text[i-1]
      1.upto(m) do |j|
        if (s_i != target[j-1]) 
          @d[i][j] = min(@d[i-1][j], @d[i][j-1], @d[i-1][j-1])+1
        else 
          @d[i][j] = min(@d[i-1][j]+1, @d[i][j-1]+1, @d[i-1][j-1])
        end
        if @d[i][j] < best_possible_edit_distance
          best_possible_edit_distance = @d[i][j]
        end
      end

      # After calculating row i, the best possible edit distance can be

      # found by found by finding the smallest value in a given column.

      # If the best_possible_edit_distance is greater than the max distance,

      # abort.

      if (i > max_distance and best_possible_edit_distance > max_distance)
        # equal is okay, but not greater

        # the closest the target can be to the text is just too far away.

        # this target is leaving the party early.

        return 0.0
      end
    end

    # this will return less than 0.0 when the edit distance is

    # greater than the number of characters in the shorter word.

    # but this was the formula that was previously used in FuzzyTermEnum,

    # so it has not been changed (even though minimum_similarity must be

    # greater than 0.0)

    return 1.0 - (@d[n][m].to_f / (@prefix_length + (n < m ? n : m)))
  end
end

#term_compare(term) ⇒ Object

The term_compare method in FuzzyTermEnum uses Levenshtein distance to calculate the distance between the given term and the comparing term.



79
80
81
82
83
84
85
86
87
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 79

def term_compare(term) 
  if (@field == term.field and term.text[0, @prefix_length] == @prefix) 
    target = term.text[@prefix_length..-1]
    @similarity = similarity(target)
    return (@similarity > @minimum_similarity)
  end
  @end_enum = true
  return false
end