Class: Ferret::Search::FuzzyTermEnum
- Inherits:
-
FilteredTermEnum
- Object
- Index::TermEnum
- FilteredTermEnum
- Ferret::Search::FuzzyTermEnum
- Includes:
- Index, MonitorMixin
- Defined in:
- lib/ferret/search/fuzzy_term_enum.rb
Overview
Subclass of FilteredTermEnum for enumerating all terms that are similiar to the specified filter term.
Term enumerations are always ordered by Term.compareTo(). Each term in the enumeration is greater than all that precede it.
Constant Summary collapse
- TYPICAL_LONGEST_WORD_IN_INDEX =
This should be somewhere around the average long word. If it is longer, we waste time and space. If it is shorter, we waste a little bit of time growing the array as we encounter longer words.
19
Instance Attribute Summary collapse
-
#end_enum ⇒ Object
readonly
Returns the value of attribute end_enum.
Attributes inherited from FilteredTermEnum
Instance Method Summary collapse
- #calculate_max_distance(m) ⇒ Object
- #difference ⇒ Object
-
#grow_distance_array(m) ⇒ Object
Grow the second dimension of the array, so that we can calculate the Levenshtein difference.
- #init_distance_array ⇒ Object
-
#initialize(reader, term, minimum_similarity = FuzzyQuery.default_min_similarity, prefix_length = FuzzyQuery.default_prefix_length) ⇒ FuzzyTermEnum
constructor
Constructor for enumeration of all terms from specified
reader
which share a prefix of lengthprefix_length
withterm
and which have a fuzzy similarity >min_similarity
. - #initialize_max_distances ⇒ Object
-
#max_distance(m) ⇒ Object
The max Distance is the maximum Levenshtein distance for the text compared to some other value that results in score that is better than the minimum similarity.
-
#min(a, b, c) ⇒ Object
Finds and returns the smallest of three integers.
-
#similarity(target) ⇒ Object
Similarity returns a number that is 1.0 or less (including negative numbers) based on how similar the Term is compared to a target term.
-
#term_compare(term) ⇒ Object
The term_compare method in FuzzyTermEnum uses Levenshtein distance to calculate the distance between the given term and the comparing term.
Methods inherited from FilteredTermEnum
#close, #doc_freq, #enum=, #next?
Methods inherited from Index::TermEnum
#close, #doc_freq, #next?, #skip_to, #term
Constructor Details
#initialize(reader, term, minimum_similarity = FuzzyQuery.default_min_similarity, prefix_length = FuzzyQuery.default_prefix_length) ⇒ FuzzyTermEnum
Constructor for enumeration of all terms from specified reader
which share a prefix of length prefix_length
with term
and which have a fuzzy similarity > min_similarity
.
After calling the constructor the enumeration is already pointing to the first valid term if such a term exists.
- reader
-
Delivers terms.
- term
-
Pattern term.
- min_similarity
-
Minimum required similarity for terms from the reader.
Default value is 0.5.
- prefix_length
-
Length of required common prefix. Default value is 0.
32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 |
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 32 def initialize(reader, term, minimum_similarity = FuzzyQuery.default_min_similarity, prefix_length = FuzzyQuery.default_prefix_length) super() @reader = reader @end_enum = false @max_distances = Array.new(TYPICAL_LONGEST_WORD_IN_INDEX) if (minimum_similarity >= 1.0) raise ArgumentError, "minimum_similarity cannot be greater than or equal to 1" elsif (minimum_similarity < 0.0) raise ArgumentError, "minimum_similarity cannot be less than 0" end if(prefix_length < 0) raise ArgumentError, "prefix_length cannot be less than 0" end @minimum_similarity = minimum_similarity @scale_factor = 1.0 / (1.0 - @minimum_similarity) @search_term = term @field = @search_term.field # The prefix could be longer than the word. # It's kind of silly though. It means we must match the entire word. term_length = @search_term.text.length if prefix_length > term_length @prefix_length = term_length else @prefix_length = prefix_length end @text = @search_term.text[@prefix_length..-1] @prefix = @search_term.text[0, @prefix_length] initialize_max_distances() # Allows us save time required to create a new array # everytime similarity is called. @d = init_distance_array() self.enum = reader.terms_from(Term.new(@search_term.field, @prefix)) end |
Instance Attribute Details
#end_enum ⇒ Object (readonly)
Returns the value of attribute end_enum.
13 14 15 |
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 13 def end_enum @end_enum end |
Instance Method Details
#calculate_max_distance(m) ⇒ Object
246 247 248 |
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 246 def calculate_max_distance(m) return ((1-@minimum_similarity) * ([@text.length, m].min + @prefix_length)) end |
#difference ⇒ Object
89 90 91 |
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 89 def difference() return (@scale_factor * (@similarity - @minimum_similarity)) end |
#grow_distance_array(m) ⇒ Object
Grow the second dimension of the array, so that we can calculate the Levenshtein difference.
224 225 226 |
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 224 def grow_distance_array(m) @d = @d.map {Array.new(m+1)} end |
#init_distance_array ⇒ Object
103 104 105 |
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 103 def init_distance_array() return Array.new(@text.length() + 1) {Array.new(TYPICAL_LONGEST_WORD_IN_INDEX)} end |
#initialize_max_distances ⇒ Object
240 241 242 243 244 |
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 240 def initialize_max_distances() @max_distances.length.times do |i| @max_distances[i] = calculate_max_distance(i) end end |
#max_distance(m) ⇒ Object
The max Distance is the maximum Levenshtein distance for the text compared to some other value that results in score that is better than the minimum similarity.
- m
-
the length of the “other value”
- returns
-
the maximum levenshtein distance that we care about
233 234 235 236 237 238 |
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 233 def max_distance(m) if (m >= @max_distances.length) @max_distances[m] = calculate_max_distance(m) end return @max_distances[m] end |
#min(a, b, c) ⇒ Object
Finds and returns the smallest of three integers
98 99 100 101 |
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 98 def min(a, b, c) t = (a < b) ? a : b return (t < c) ? t : c end |
#similarity(target) ⇒ Object
Similarity returns a number that is 1.0 or less (including negative numbers) based on how similar the Term is compared to a target term. It returns exactly 0.0 when
edit_distance < maximum_edit_distance
Otherwise it returns:
1 - (edit_distance / length)
where length is the length of the shortest term (text or target) including a prefix that are identical and edit_distance is the Levenshtein distance for the two words.
Embedded within this algorithm is a fail-fast Levenshtein distance algorithm. The fail-fast algorithm differs from the standard Levenshtein distance algorithm in that it is aborted if it is discovered that the mimimum distance between the words is greater than some threshold.
To calculate the maximum distance threshold we use the following formula:
(1 - minimum_similarity) * length
where length is the shortest term including any prefix that is not part of the similarity comparision. This formula was derived by solving for what maximum value of distance returns false for the following statements:
similarity = 1 - (distance / (prefix_length + [textlen, targetlen].min))
return (similarity > minimum_similarity)
where distance is the Levenshtein distance for the two words.
Levenshtein distance (also known as edit distance) is a measure of similiarity between two strings where the distance is measured as the number of character deletions, insertions or substitutions required to transform one string to the other string.
- target
-
the target word or phrase
- returns
-
the similarity, 0.0 or less indicates that it matches less
than the required threshold and 1.0 indicates that the text and
target are identical
150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 |
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 150 def similarity(target) synchronize do m = target.length n = @text.length if (n == 0) # we don't have anything to compare. That means if we just add the # letters for m we get the new word return (@prefix_length == 0) ? 0.0 : 1.0 - (m.to_f / @prefix_length) end if (m == 0) return (@prefix_length == 0) ? 0.0 : 1.0 - (n.to_f / @prefix_length) end max_distance = max_distance(m) if (max_distance < (m-n).abs) #just adding the characters of m to n or vice-versa results in #too many edits #for example "pre" length is 3 and "prefixes" length is 8. We can see that #given this optimal circumstance, the edit distance cannot be less than 5. #which is 8-3 or more precisesly Math.abs(3-8). #if our maximum edit distance is 4, then we can discard this word #without looking at it. return 0.0 end #let's make sure we have enough room in our array to do the distance calculations. if (@d[0].length <= m) grow_distance_array(m) end # init matrix d (n+1).times {|i| @d[i][0] = i} (m+1).times {|j| @d[0][j] = j} # start computing edit distance 1.upto(n) do |i| best_possible_edit_distance = m s_i = @text[i-1] 1.upto(m) do |j| if (s_i != target[j-1]) @d[i][j] = min(@d[i-1][j], @d[i][j-1], @d[i-1][j-1])+1 else @d[i][j] = min(@d[i-1][j]+1, @d[i][j-1]+1, @d[i-1][j-1]) end if @d[i][j] < best_possible_edit_distance best_possible_edit_distance = @d[i][j] end end # After calculating row i, the best possible edit distance can be # found by found by finding the smallest value in a given column. # If the best_possible_edit_distance is greater than the max distance, # abort. if (i > max_distance and best_possible_edit_distance > max_distance) # equal is okay, but not greater # the closest the target can be to the text is just too far away. # this target is leaving the party early. return 0.0 end end # this will return less than 0.0 when the edit distance is # greater than the number of characters in the shorter word. # but this was the formula that was previously used in FuzzyTermEnum, # so it has not been changed (even though minimum_similarity must be # greater than 0.0) return 1.0 - (@d[n][m].to_f / (@prefix_length + (n < m ? n : m))) end end |
#term_compare(term) ⇒ Object
The term_compare method in FuzzyTermEnum uses Levenshtein distance to calculate the distance between the given term and the comparing term.
79 80 81 82 83 84 85 86 87 |
# File 'lib/ferret/search/fuzzy_term_enum.rb', line 79 def term_compare(term) if (@field == term.field and term.text[0, @prefix_length] == @prefix) target = term.text[@prefix_length..-1] @similarity = similarity(target) return (@similarity > @minimum_similarity) end @end_enum = true return false end |