loose_tight_dictionary
Find the best match for a string in a list of candidates, based on string similarity (using the Pair Distance algorithm) and regular expressions.
Quickstart
>> d = LooseTightDictionary.new %w{seamus andy ben}
=> [...]
>> d.find 'Shamus Heaney'
=> "seamus"
Try running the included example file:
$ ruby examples/first_name_matching.rb
######################################################################################################################################################
# Match "Mr. Seamus" => "seamus"
######################################################################################################################################################
Needle
(needle_reader proc not defined, so downcasing everything)
------------------------------------------------------------------------------------------------------------------------------------------------------
"mr. seamus"
Haystack
(haystack_reader proc not defined, so downcasing everything)
------------------------------------------------------------------------------------------------------------------------------------------------------
"seamus"
"andy"
"ben"
Tighteners
------------------------------------------------------------------------------------------------------------------------------------------------------
(none)
Comparisons
Score t_haystack [=> tightened/prefixed] t_needle [=> tightened/prefixed]
------------------------------------------------------------------------------------------------------------------------------------------------------
0.8333333333333334 "seamus" "mr. seamus"
0.0 "andy" "mr. seamus"
0.0 "ben" "mr. seamus"
Match
------------------------------------------------------------------------------------------------------------------------------------------------------
"seamus"
# [... there's more output ...]
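The scores in the Comparisons table above can be reproduced with a few lines of plain Ruby. This is only a minimal sketch, assuming that "Pair Distance" means the letter-pair (bigram) Dice coefficient computed per whitespace-separated word; the helper names `letter_pairs` and `pair_distance` are made up for illustration and are not the gem's API, and the gem's real implementation may differ.

```ruby
# Adjacent character pairs of each whitespace-separated word,
# e.g. "mr. ab" => ["mr", "r.", "ab"]. Hypothetical helper, not part of the gem.
def letter_pairs(str)
  str.downcase.split.flat_map { |word| word.chars.each_cons(2).map(&:join) }
end

# 2 * (shared pairs) / (total pairs), counting duplicate pairs separately.
def pair_distance(a, b)
  pairs_a = letter_pairs(a)
  remaining = letter_pairs(b)
  total = pairs_a.size + remaining.size
  return 0.0 if total.zero?

  shared = 0
  pairs_a.each do |pair|
    index = remaining.index(pair)
    next unless index
    remaining.delete_at(index) # each pair in b can be matched only once
    shared += 1
  end
  2.0 * shared / total
end

pair_distance("seamus", "mr. seamus") # => 0.8333333333333334
pair_distance("andy", "mr. seamus")   # => 0.0
pair_distance("ben", "mr. seamus")    # => 0.0
```

"seamus" has 5 letter pairs and "mr. seamus" has 7, of which 5 are shared, giving 2 * 5 / 12.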
The Boeing example
From the tests:
######################################################################################################################################################
# Match "BOEING 737100" => "BOEING BOEING 737-100/200"
######################################################################################################################################################
Needle
(needle_reader proc not defined, so downcasing everything)
------------------------------------------------------------------------------------------------------------------------------------------------------
"boeing 737100"
Haystack
(haystack_reader proc not defined, so downcasing everything)
------------------------------------------------------------------------------------------------------------------------------------------------------
"boeing boeing 737-100/200"
"boeing boeing 737-900"
Tighteners
------------------------------------------------------------------------------------------------------------------------------------------------------
/(7\d)(7|0)-?(\d{1,3})/i
Comparisons
Score t_haystack [=> tightened/prefixed] t_needle [=> tightened/prefixed]
------------------------------------------------------------------------------------------------------------------------------------------------------
1.0 "boeing boeing 737-100/200" => "737100" "boeing 737100" => "737100"
0.6666666666666666 "boeing boeing 737-100/200" => "737100" "boeing 737100"
0.6153846153846154 "boeing boeing 737-900" "boeing 737100"
0.6 "boeing boeing 737-900" => "737900" "boeing 737100" => "737100"
0.6 "boeing boeing 737-100/200" "boeing 737100"
0.4 "boeing boeing 737-900" => "737900" "boeing 737100"
0.32 "boeing boeing 737-100/200" "boeing 737100" => "737100"
0.2857142857142857 "boeing boeing 737-900" "boeing 737100" => "737100"
Match
------------------------------------------------------------------------------------------------------------------------------------------------------
"BOEING BOEING 737-100/200"
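The tightened values in the table above ("737100", "737900") look like the tightener's capture groups joined together. A minimal sketch of that step in plain Ruby, assuming this is how tightening works; `tighten` is a hypothetical helper, not a method of the gem.

```ruby
# The tightener regular expression from the output above.
TIGHTENER = /(7\d)(7|0)-?(\d{1,3})/i

# Hypothetical helper: returns the joined capture groups,
# or nil when the regexp does not match the string.
def tighten(str, regexp)
  match = regexp.match(str)
  match && match.captures.join
end

tighten("boeing boeing 737-100/200", TIGHTENER) # => "737100"
tighten("boeing boeing 737-900", TIGHTENER)     # => "737900"
tighten("boeing 737100", TIGHTENER)             # => "737100"
tighten("seamus", TIGHTENER)                    # => nil
```

Because the optional hyphen sits outside the capture groups, "737-100" and "737100" tighten to the same string.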
Improving dictionaries
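Similarity matching will only get you so far. In the Boeing example above, the plain comparison scores the wrong record ("boeing boeing 737-900", 0.615) slightly higher than the right one ("boeing boeing 737-100/200", 0.6); only the comparison of tightened values (1.0) produces the correct match.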
TODO: regex usage
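As a rough illustration of where plain similarity falls short, the sketch below matches the Boeing records from the tests with and without a tightener. It is written in plain Ruby under stated assumptions: the score is a letter-pair Dice coefficient, tightening means joining the regexp's capture groups, and the best score over all tightened and untightened combinations wins. The names `variants` and `best_match` are hypothetical and are not the gem's API.

```ruby
TIGHTENERS = [/(7\d)(7|0)-?(\d{1,3})/i]

# Adjacent character pairs of each whitespace-separated word.
def letter_pairs(str)
  str.downcase.split.flat_map { |word| word.chars.each_cons(2).map(&:join) }
end

# 2 * (shared pairs) / (total pairs), counting duplicate pairs separately.
def pair_distance(a, b)
  pairs_a = letter_pairs(a)
  remaining = letter_pairs(b)
  total = pairs_a.size + remaining.size
  return 0.0 if total.zero?

  shared = 0
  pairs_a.each do |pair|
    index = remaining.index(pair)
    next unless index
    remaining.delete_at(index)
    shared += 1
  end
  2.0 * shared / total
end

# The string itself plus every tightened form it yields.
def variants(str, tighteners)
  tightened = tighteners.map { |re| (m = re.match(str)) && m.captures.join }
  [str] + tightened.compact
end

# Picks the haystack record whose best-scoring variant pair is highest.
def best_match(needle, haystack, tighteners = [])
  needle_variants = variants(needle, tighteners)
  haystack.max_by do |record|
    variants(record, tighteners).product(needle_variants)
                                .map { |h, n| pair_distance(h, n) }
                                .max
  end
end

haystack = ["BOEING BOEING 737-100/200", "BOEING BOEING 737-900"]
best_match("BOEING 737100", haystack)             # => "BOEING BOEING 737-900"
best_match("BOEING 737100", haystack, TIGHTENERS) # => "BOEING BOEING 737-100/200"
```

Without the tightener, the shorter 737-900 record wins on raw similarity; with it, both the needle and the 737-100/200 record reduce to "737100" and score 1.0.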
Note on Patches/Pull Requests
- Fork the project.
- Make your feature addition or bug fix.
- Add tests for it. This is important so I don’t break it in a future version unintentionally.
- Commit, but do not touch the Rakefile, version, or history. (If you want to have your own version, that is fine, but bump the version in a separate commit that I can ignore when I pull.)
- Send me a pull request. Bonus points for topic branches.
Copyright
Copyright © 2011 Seamus Abshere. See LICENSE for details.