About
It's a implementation of Jaro-Winkler distance algorithm, it uses C extension and will fallback to pure Ruby version in JRuby. Both implementation supports UTF-8 string.
Windows Issue
It will fallabck to pure Ruby implementation on Windows since it can't be compiled currently. (ref #1)
Installation
gem install jaro_winkler
Usage
require 'jaro_winkler'
JaroWinkler.distance "MARTHA", "MARHTA"
# => 0.9611
JaroWinkler.distance "MARTHA", "marhta", case_match: true
# => 0.9611
JaroWinkler.distance "MARTHA", "MARHTA", weight: 0.2
# => 0.9778
# Force the strategy
JaroWinkler.c_distance "MARTHA", "MARHTA" # C extension
JaroWinkler.r_distance "MARTHA", "MARHTA" # Pure Ruby
Both implementations support UTF-8 string.
Options
| Name | Type | Default | Note |
|---|---|---|---|
| case_match | boolean | false | All lower case characters are converted to upper case prior to the comparison. |
| weight | number | 0.1 | A constant scaling factor for how much the score is adjusted upwards for having common prefixes. |
| threshold | number | 0.7 | The prefix bonus is only added when the compared strings have a Jaro distance above the threshold. |
Why This?
There is also another gem named fuzzy-string-match, it uses the same algorithm and both provides C and Ruby implementation.
I reinvent this wheel because of the naming in fuzzy-string-match such as getDistance breaks convention, and some weird code like a1 = s1.split( // ) (s1.chars could be better), furthermore, it's bugged:
| string 1 | string 2 | origin | fuzzy-string-match | jaro_winkler |
|---|---|---|---|---|
| "henka" | "henkan" | 0.966667 | 0.9722 (wrong) | 0.9666666666666667 |
| "al" | "al" | 1.000000 | 1.0 | 1.0 |
| "martha" | "marhta" | 0.961111 | 0.9611 | 0.9611111111111111 |
| "jones" | "johnson" | 0.832381 | 0.8323 | 0.8323809523809523 |
| "abcvwxyz" | "cabvwxyz" | 0.958333 | 0.9583 | 0.9583333333333333 |
| "dwayne" | "duane" | 0.840000 | 0.8400 | 0.84 |
| "dixon" | "dicksonx" | 0.813333 | 0.8133 | 0.8133333333333332 |
| "fvie" | "ten" | 0.000000 | 0.0 | 0 |
- The origin result is from the original C implementation by the author of the algorithm.
- Test data are borrowed from fuzzy-string-match's rspec file.
Benchmark
- jaro_winkler (1.0.1)
- fuzzy-string-match (0.9.6)
require 'benchmark'
require 'jaro_winkler'
require 'fuzzystringmatch'
ary = [['al', 'al'], ['martha', 'marhta'], ['jones', 'johnson'], ['abcvwxyz', 'cabvwxyz'], ['dwayne', 'duane'], ['dixon', 'dicksonx'], ['fvie', 'ten']]
n = 100000
Benchmark.bmbm do |x|
x.report 'jaro_winkler ' do
n.times{ ary.each{ |str1, str2| JaroWinkler.distance(str1, str2) } }
end
x.report 'fuzzystringmatch' do
jarow = FuzzyStringMatch::JaroWinkler.create(:pure)
n.times{ ary.each{ |str1, str2| jarow.getDistance(str1, str2) } }
end
end
# user system total real
# jaro_winkler 12.480000 0.010000 12.490000 ( 12.497828)
# fuzzystringmatch 14.990000 0.010000 15.000000 ( 15.014898)
Todo
- Adjusting word table (Reference to original C implementation.)