loose_tight_dictionary
Match things based on string similarity (using the Pair Distance algorithm) and regular expressions.
Quickstart
>> require 'loose_tight_dictionary'
=> true
>> LooseTightDictionary.new(%w{seamus andy ben}).find('Shamus')
=> "seamus"
String similarity matching
Exclusively uses Dice’s Coefficient algorithm (aka Pair Distance).
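To make the scoring concrete, here is a minimal pure-Ruby sketch of Dice's Coefficient over adjacent letter pairs. It is an illustration only, not the code the gem ships; the method names letter_pairs and dice_coefficient are hypothetical, and splitting on whitespace before pairing is an assumption.

```ruby
# Illustrative sketch only; not the gem's implementation.
# Break a string into adjacent letter pairs, word by word (downcased).
def letter_pairs(str)
  str.downcase.split(/\s+/).flat_map do |word|
    (0...(word.length - 1)).map { |i| word[i, 2] }
  end
end

# Dice's Coefficient: twice the number of shared pairs divided by the
# total number of pairs in both strings. Ranges from 0.0 to 1.0.
def dice_coefficient(a, b)
  pairs_a = letter_pairs(a)
  pairs_b = letter_pairs(b)
  total = pairs_a.length + pairs_b.length
  return 0.0 if total.zero?

  remaining = pairs_b.dup
  shared = pairs_a.count do |pair|
    idx = remaining.index(pair)
    # Consume each matched pair so duplicates are not counted twice.
    idx && remaining.delete_at(idx)
  end
  2.0 * shared / total
end

dice_coefficient('seamus', 'Shamus') # shares "am", "mu", "us"
dice_coefficient('andy', 'Shamus')   # shares nothing
```

Under this sketch, "seamus" scores highest of the three Quickstart names against "Shamus", which is consistent with the result shown there.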
Production use
Used in production for over 2 years in Brighter Planet’s environmental impact API and reference data service.
Haystacks and how to read them
The (admittedly imperfect) metaphor is “look for a needle in a haystack”:
- needle - the search term
- haystack - the records you are searching (your result will be an object from here)
So, what if your needle is a string like youruguay and your haystack is full of Country objects like <Country name:"Uruguay">?
>> LooseTightDictionary.new(countries, :read => :name).find('youruguay')
=> <Country name:"Uruguay">
Regular expressions
You can improve the default matching with regular expressions.
- Emphasize important words using blockings and tighteners
- Filter out stop words with tighteners
- Prevent impossible matches with blockings and identities
Blockings
Setting a blocking of /Airbus/ ensures that strings containing “Airbus” will only be scored against other strings containing “Airbus”. A better blocking in this case would probably be /airbus/i.
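The effect of a blocking can be sketched in plain Ruby as a filter on the haystack. This is a hedged illustration rather than the gem's internals: candidates_for is a hypothetical helper, and leaving needles that do not match the blocking unrestricted is an assumption the text above does not settle.

```ruby
# Illustrative sketch only; candidates_for is a hypothetical helper.
# If the needle matches the blocking, keep only haystack strings that
# also match it. Assumption: a non-matching needle is left unrestricted.
def candidates_for(needle, haystack, blocking)
  return haystack unless needle =~ blocking
  haystack.select { |straw| straw =~ blocking }
end

haystack = ['Airbus A320', 'Boeing 737', 'AIRBUS A380']

# Case-sensitive blocking misses the upcased record.
candidates_for('Airbus A319', haystack, /Airbus/)

# Case-insensitive blocking keeps both Airbus records.
candidates_for('Airbus A319', haystack, /airbus/i)
```

The second call shows why /airbus/i is the safer choice: it still excludes the Boeing record but no longer drops "AIRBUS A380".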
Tighteners
Adding a tightener like /(boeing).*(7\d\d)/i will cause “BOEING COMPANY 747” and “boeing747” to be scored as if they were “BOEING 747” and “boeing 747”, respectively. See also “Case sensitivity” below.
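One way to picture what a tightener does is to keep only the regular expression's capture groups and join them with a space. The sketch below assumes exactly that; tighten is a hypothetical name, and returning the string unchanged when the tightener does not match is also an assumption.

```ruby
# Illustrative sketch only; tighten is a hypothetical helper.
# Keep just the captured groups, joined by a single space. Strings that
# do not match are returned unchanged (an assumption).
def tighten(str, tightener)
  match = tightener.match(str)
  match ? match.captures.compact.join(' ') : str
end

tightener = /(boeing).*(7\d\d)/i

tighten('BOEING COMPANY 747', tightener) # => "BOEING 747"
tighten('boeing747', tightener)          # => "boeing 747"
```

Because scoring is case-insensitive, the two tightened strings would then compare as equal.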
Identities
Adding an identity like /(F)\-?(\d50)/ ensures that “Ford F-150” and “Ford F-250” never match.
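An identity can be thought of as a veto: when both strings match the pattern but their captures differ, the pair is ruled out. The following sketch works under that assumption; identical_by? is a hypothetical name, and returning nil when the identity does not apply to both strings is a design choice of this example, not something the gem documents here.

```ruby
# Illustrative sketch only; identical_by? is a hypothetical helper.
# Returns true/false when both strings match the identity, comparing
# their captures; returns nil when the identity does not apply.
def identical_by?(a, b, identity)
  match_a = identity.match(a)
  match_b = identity.match(b)
  return nil unless match_a && match_b
  match_a.captures == match_b.captures
end

identity = /(F)\-?(\d50)/

identical_by?('Ford F-150', 'Ford F-250', identity) # => false, so never match
identical_by?('Ford F-150', 'Ford F150', identity)  # => true
identical_by?('Ford F-150', 'Ford Focus', identity) # => nil, no opinion
```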
Case sensitivity
Scoring is case-insensitive. Everything is downcased before scoring. This is a change from previous versions.
Examples
Check out the tests.
Speed
If you add the amatch gem to your Gemfile, loose_tight_dictionary will use it, which is much faster (but segfaults have been seen in the wild). Thanks Flori!
Otherwise, a pure ruby version derived from the answer to a StackOverflow question is used. Thanks marzagao!
Authors
- Seamus Abshere <[email protected]>
- Ian Hough <[email protected]>
- Andy Rossmeissl <[email protected]>
Copyright
Copyright 2011 Brighter Planet, Inc.