Linkage

Linkage is a Ruby library for record linkage between one or two database tables.

What is record linkage?

In an ideal world, records that reference the same entity can be easily identified. Unfortunately, this isn't always the case. Sometimes there are no good identifiers in the datasets that you're interested in (ID, social security number, etc). In such cases, it is necessary to use other means to determine which records refer to which entity, and this process is known as record linkage.

Prerequisites

In order to use Linkage, the records you want to link must be in a database. Linkage has the ability to perform record linkage across different kinds of databases, so it's okay if your records are not all in the same place.

Since Linkage uses Sequel to communicate with databases, any database that Sequel supports will work. See Connecting to a database on the Sequel website for more information about what databases are supported.

Usage

To perform a record linkage, Linkage needs information about the following: datasets, result set, and comparators. A dataset refers to a table in a database. A result set is a place to put score and match information that Linkage generates. Comparators describe how records are compared.

A dataset is created via the Linkage::Dataset class, along with a connection URI and a table name:

ds = Linkage::Dataset.new('mysql://example.com/database_name', 'table_name')

Result sets have different options depending on what storage medium you're using (CSV or database). For CSVs, you could use:

result_set = Linkage::ResultSet['csv'].new('~/my_results')

In this case, scores and matches will be saved in CSV files in the my_results directory in your home folder.

To describe a linkage, you can use the Dataset#link_with method. This creates a linkage configuration that you can use to describe how you want the records in each dataset to be compared. For example:

demo = Linkage::Dataset.new('postgres://example.com/foo', 'demographics')
visits = Linkage::Dataset.new('mysql://some-other-host.net/bar', 'visits')
result_set = Linkage::ResultSet['csv'].new('~/my_results')
config = demo.link_with(visits, result_set) do |config|
  config.compare([:first_name, :last_name], [:first_name, :last_name], :equal)
end

This linkage would match records from a demographics table to records in a table with information about doctor visits by using first name and last name.

The compare method creates a Compare comparator. This is the simplest comparator in Linkage, and it just compares fields with the operator you specify (:equal, :less_than, :greater_than, etc). When a comparator compares two records, it gives the pair of records a score between 0 and 1. In the case of the example above, records that have the same first name and last name get a score of 1, and records that don't get a score of 0 (or sometimes, they aren't scored and assumed to have a score of 0).

Other comparators are Strcompare for approximate string matching and Within for matching numbers within a range.

To run a linkage, use a Runner with the resulting configuration from Dataset#link_with:

runner = Linkage::Runner.new(config)
runner.execute

After running a linkage, there will be a list of matches in a CSV file or database, depending on how you configured your result set.

The default way linkage determines if two records match is by comparing the average score to a threshold value (which is 0.5 by default). You can configure the threshold value like so: config.threshold = 0.9.

Other examples

Linking a dataset to itself:

births = Linkage::Dataset.new('postgres://example.com/hospital_data', 'births')
result_set = Linkage::ResultSet['csv'].new('~/my_birth_results')
config = births.link_with(births, result_set) do |config|
  config.compare([:mother_first_name, :mother_last_name], [:mother_first_name, :mother_last_name], :equal)
end
runner = Linkage::Runner.new(config)
runner.execute

The above example would find birth records that have mothers with the same name.

Contributing to linkage

  • Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet
  • Check out the issue tracker to make sure someone already hasn't requested it and/or contributed it
  • Fork the project
  • Start a feature/bugfix branch
  • Commit and push until you are happy with your contribution
  • Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.

Copyright

Copyright (c) 2011-2014 Vanderbilt University. See LICENSE.txt for further details.