Module: Edits::JaroWinkler
- Defined in:
- lib/edits/jaro_winkler.rb
Overview
Constant Summary
- WINKLER_PREFIX_WEIGHT = 0.1
Prefix scaling factor for the Jaro-Winkler metric. Default is 0.1. Should not exceed 0.25, or the metric will leave the range 0..1.
- WINKLER_THRESHOLD = 0.7
Threshold for boosting Jaro with the Winkler prefix multiplier. Default is 0.7.
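The 0.25 limit on the prefix weight follows from the common prefix being capped at 4 characters: the boost is l * p * (1 - Sj), so l * p has to stay at or below 1 for the result to remain within 0..1. The sketch below illustrates that arithmetic. `MAX_PREFIX` and `max_boosted` are hypothetical names used only for illustration and are not part of the library.

```ruby
# Hypothetical constant and helper, not part of Edits::JaroWinkler.
# MAX_PREFIX is the longest common prefix the metric takes into account.
MAX_PREFIX = 4

# Largest boosted score reachable from a given Jaro similarity, i.e. when
# the full 4-character prefix matches.
def max_boosted(jaro, weight)
  jaro + (MAX_PREFIX * weight * (1 - jaro))
end

# A weight of 0.1 stays inside 0..1, 0.25 reaches the top of the range,
# and 0.3 overshoots it.
[0.1, 0.25, 0.3].each do |weight|
  puts "weight=#{weight}: #{max_boosted(0.8, weight)}"
end
```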
Class Method Summary
- .distance(seq1, seq2, threshold: WINKLER_THRESHOLD, weight: WINKLER_PREFIX_WEIGHT) ⇒ Float
Calculate Jaro-Winkler distance.
- .similarity(seq1, seq2, threshold: WINKLER_THRESHOLD, weight: WINKLER_PREFIX_WEIGHT) ⇒ Float
Calculate Jaro-Winkler similarity of given strings.
Class Method Details
.distance(seq1, seq2, threshold: WINKLER_THRESHOLD, weight: WINKLER_PREFIX_WEIGHT) ⇒ Float
Note:
Not a true distance metric: it fails to satisfy the triangle inequality.
Calculate Jaro-Winkler distance, defined as 1 minus the Jaro-Winkler similarity.
# File 'lib/edits/jaro_winkler.rb', line 64
def self.distance(
  seq1, seq2,
  threshold: WINKLER_THRESHOLD,
  weight: WINKLER_PREFIX_WEIGHT
)
  1.0 - similarity(seq1, seq2, threshold: threshold, weight: weight)
end
.similarity(seq1, seq2, threshold: WINKLER_THRESHOLD, weight: WINKLER_PREFIX_WEIGHT) ⇒ Float
Calculate Jaro-Winkler similarity of given strings.
Adds weight to the Jaro similarity according to the length of a common prefix of up to 4 characters, where one exists. The additional weighting is only applied when the Jaro similarity exceeds a threshold.
Sw = Sj + (l * p * (1 - Sj))
Where Sj is the Jaro similarity, l is the common prefix length (at most 4), and p is the prefix weight.
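To make the formula concrete, here is a worked example for the commonly cited pair "MARTHA" and "MARHTA". The helper name `winkler_adjust` is an assumption made for this sketch and does not exist in the library. The Jaro similarity is computed by hand from the match and transposition counts rather than by calling the library.

```ruby
# Hypothetical helper mirroring the formula above; not part of the library.
# sj: Jaro similarity, l: common prefix length (capped at 4),
# weight: prefix weight p.
def winkler_adjust(sj, l, weight)
  sj + (l * weight * (1 - sj))
end

# "MARTHA" vs "MARHTA": 6 matching characters and 1 transposition give
# Sj = (6/6 + 6/6 + 5/6) / 3. The common prefix "MAR" gives l = 3.
sj = (1.0 + 1.0 + (5.0 / 6.0)) / 3.0
sw = winkler_adjust(sj, 3, 0.1)

puts format("Sj = %.4f, Sw = %.4f", sj, sw)
# prints "Sj = 0.9444, Sw = 0.9611"
```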
# File 'lib/edits/jaro_winkler.rb', line 33
def self.similarity(
  seq1, seq2,
  threshold: WINKLER_THRESHOLD,
  weight: WINKLER_PREFIX_WEIGHT
)
  dj = Jaro.similarity(seq1, seq2)

  if dj > threshold
    # size of common prefix, max 4
    max_bound = seq1.length > seq2.length ? seq2.length : seq1.length
    max_bound = 4 if max_bound > 4

    l = 0
    l += 1 until seq1[l] != seq2[l] || l >= max_bound

    l < 1 ? dj : dj + (l * weight * (1 - dj))
  else
    dj
  end
end
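The prefix scan and the threshold gate in the method above can be looked at in isolation. The following is a minimal sketch, not the library's code: `common_prefix_length` and `apply_winkler` are hypothetical helpers, and `apply_winkler` takes an already computed Jaro similarity as an argument instead of calling `Jaro.similarity`.

```ruby
# Hypothetical helper, not part of the library: length of the shared prefix
# of two sequences, capped at `max` and at the shorter sequence's length.
def common_prefix_length(seq1, seq2, max = 4)
  bound = [seq1.length, seq2.length, max].min
  (0...bound).take_while { |i| seq1[i] == seq2[i] }.size
end

# Hypothetical helper: applies the prefix boost to a precomputed Jaro
# similarity, but only when that similarity exceeds the threshold.
def apply_winkler(jaro, seq1, seq2, threshold: 0.7, weight: 0.1)
  return jaro unless jaro > threshold

  l = common_prefix_length(seq1, seq2)
  jaro + (l * weight * (1 - jaro))
end

puts common_prefix_length("MARTHA", "MARHTA") # shared prefix "MAR"
puts common_prefix_length("abcdef", "abcdxx") # capped at 4
puts apply_winkler(0.5, "abc", "abd")         # below threshold, unchanged
```

Separating the two steps makes it easier to see that the threshold is checked against the plain Jaro score, before any prefix is examined.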