loose_tight_dictionary

Match things based on string similarity (using the Pair Distance algorithm) and regular expressions.

Quickstart

>> d = LooseTightDictionary.new %w{seamus andy ben}
=> [...]
>> d.find 'Shamus Heaney'
=> "seamus"

Try running the included example file:

$ ruby examples/first_name_matching.rb 
######################################################################################################################################################
# Match "Mr. Seamus" => "seamus"
######################################################################################################################################################

Needle
(needle_reader proc not defined, so downcasing everything)
------------------------------------------------------------------------------------------------------------------------------------------------------
"mr. seamus"

Haystack
(haystack_reader proc not defined, so downcasing everything)
------------------------------------------------------------------------------------------------------------------------------------------------------
"seamus"
"andy"
"ben"

Tighteners
------------------------------------------------------------------------------------------------------------------------------------------------------
(none)

Comparisons
Score                                             t_haystack [=> tightened/prefixed]                t_needle [=> tightened/prefixed]                  
------------------------------------------------------------------------------------------------------------------------------------------------------
0.8333333333333334                                "seamus"                                          "mr. seamus"
0.0                                               "andy"                                            "mr. seamus"
0.0                                               "ben"                                             "mr. seamus"

Match
------------------------------------------------------------------------------------------------------------------------------------------------------
"seamus"

# [... there's more output ...]
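
The scores in the Comparisons table can be reproduced by hand. The following is a minimal sketch, not the library's own code: it assumes that Pair Distance means the Dice coefficient over adjacent character pairs, taken separately within each whitespace-separated word. The helper names bigrams and pair_similarity are made up for illustration.

```ruby
# Hypothetical helper: adjacent character pairs of each whitespace-separated word.
def bigrams(string)
  string.split.flat_map { |word| word.chars.each_cons(2).map(&:join) }
end

# Assumed scoring rule: 2 * shared pairs / (pairs in a + pairs in b).
def pair_similarity(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  total = pairs_a.length + pairs_b.length
  return 0.0 if total.zero?

  shared = 0
  pairs_a.each do |pair|
    index = pairs_b.index(pair)
    next unless index

    pairs_b.delete_at(index) # a pair in b may be matched only once
    shared += 1
  end
  2.0 * shared / total
end

puts pair_similarity('seamus', 'mr. seamus').round(4) # prints 0.8333
puts pair_similarity('andy', 'mr. seamus')            # prints 0.0
```

"seamus" has 5 pairs and "mr. seamus" has 7 (2 from "mr." plus 5 from "seamus"); all 5 are shared, so the score is 2 * 5 / 12, which agrees with the first row of the table.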

The Boeing example

From the tests:

######################################################################################################################################################
# Match "BOEING 737100" => "BOEING BOEING 737-100/200"
######################################################################################################################################################

Needle
(needle_reader proc not defined, so downcasing everything)
------------------------------------------------------------------------------------------------------------------------------------------------------
"boeing 737100"

Haystack
(haystack_reader proc not defined, so downcasing everything)
------------------------------------------------------------------------------------------------------------------------------------------------------
"boeing boeing 737-100/200"
"boeing boeing 737-900"

Tighteners
------------------------------------------------------------------------------------------------------------------------------------------------------
/(7\d)(7|0)-?(\d{1,3})/i

Comparisons
Score                                             t_haystack [=> tightened/prefixed]                t_needle [=> tightened/prefixed]                  
------------------------------------------------------------------------------------------------------------------------------------------------------
1.0                                               "boeing boeing 737-100/200" => "737100"           "boeing 737100" => "737100"
0.6666666666666666                                "boeing boeing 737-100/200" => "737100"           "boeing 737100"
0.6153846153846154                                "boeing boeing 737-900"                           "boeing 737100"
0.6                                               "boeing boeing 737-900" => "737900"               "boeing 737100" => "737100"
0.6                                               "boeing boeing 737-100/200"                       "boeing 737100"
0.4                                               "boeing boeing 737-900" => "737900"               "boeing 737100"
0.32                                              "boeing boeing 737-100/200"                       "boeing 737100" => "737100"
0.2857142857142857                                "boeing boeing 737-900"                           "boeing 737100" => "737100"

Match
------------------------------------------------------------------------------------------------------------------------------------------------------
"BOEING BOEING 737-100/200"
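
Judging from the output above, a tightened string looks like the tightener's capture groups joined together. The sketch below illustrates that reading only; the tighten helper is hypothetical and the library's internals may work differently.

```ruby
# The tightener regular expression shown in the Boeing output above.
tightener = /(7\d)(7|0)-?(\d{1,3})/i

# Hypothetical helper: join the capture groups of the first match.
# Returns nil when the regular expression does not match at all.
def tighten(string, regexp)
  match = regexp.match(string)
  match && match.captures.join
end

puts tighten('boeing boeing 737-100/200', tightener) # prints 737100
puts tighten('boeing 737100', tightener)             # prints 737100
puts tighten('boeing boeing 737-900', tightener)     # prints 737900
p tighten('andy', tightener)                         # prints nil
```

Because the optional hyphen sits outside the capture groups, "737-100" and "737100" tighten to the same string, which is why that comparison scores 1.0.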

Improving dictionaries

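Similarity matching alone will only get you so far. Regular-expression tighteners, like the one in the Boeing example above, focus the comparison on the part of each string that matters.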

TODO: regex usage

Note on Patches/Pull Requests

  • Fork the project.

  • Make your feature addition or bug fix.

  • Add tests for it. This is important so I don’t break it in a future version unintentionally.

  • Commit, but do not mess with the Rakefile, version, or history. (If you want to have your own version, that is fine, but bump the version in a commit by itself that I can ignore when I pull.)

  • Send me a pull request. Bonus points for topic branches.

Copyright © 2011 Seamus Abshere. See LICENSE for details.