Class: Linkage::Configuration

Inherits:
Object
Defined in:
lib/linkage/configuration.rb

Overview

Configuration keeps track of everything needed to run a record linkage: which datasets to link, how to link them, and where to store the results. Once created, a Configuration can be passed to Runner#initialize and run with Runner#execute.

The usual way to create a configuration is Dataset#link_with, but you can also create one directly (see #initialize):

dataset_1 = Linkage::Dataset.new('mysql://example.com/database_name', 'foo')
dataset_2 = Linkage::Dataset.new('postgres://example.com/other_name', 'bar')
result_set = Linkage::ResultSet['csv'].new('/home/foo/linkage')
config = Linkage::Configuration.new(dataset_1, dataset_2, result_set)

To add comparators to a Configuration, call methods with the same names as registered comparators. Here is the list of built-in comparators:

Name        Class
compare     Linkage::Comparators::Compare
strcompare  Linkage::Comparators::Strcompare
within      Linkage::Comparators::Within

For example, if you want to add a Linkage::Comparators::Compare comparator to your configuration, run this:

config.compare([:foo], [:bar], :equal_to)

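This works via #method_missing. First, the comparator class is fetched via Linkage::Comparator.[]; an unknown name raises an error. Then the first two arguments are looked up as fields in the FieldSet of the corresponding Dataset: the first in dataset_1, the second in dataset_2 (or in dataset_1 when no second dataset is configured). Those Fields, along with any other arguments you specify, are passed to the constructor of the comparator you chose.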
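The dispatch pattern can be illustrated without the library. The following is a minimal sketch, not the Linkage implementation: SketchCompare, SketchConfig, and REGISTRY are hypothetical stand-ins, field lookup is omitted, and unknown names fall through to super (raising NoMethodError) rather than raising a custom error.

```ruby
# Hypothetical stand-in for a comparator class; it only records its arguments.
class SketchCompare
  attr_reader :args

  def initialize(*args)
    @args = args
  end
end

# Hypothetical registry mapping comparator names to classes.
REGISTRY = { "compare" => SketchCompare }.freeze

class SketchConfig
  attr_reader :comparators

  def initialize
    @comparators = []
  end

  # Treat an otherwise undefined method name as a comparator name.
  def method_missing(name, *args, &block)
    klass = REGISTRY[name.to_s]
    return super if klass.nil?

    comparator = klass.new(*args, &block)
    @comparators << comparator
    comparator
  end

  # Keep respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    REGISTRY.key?(name.to_s) || super
  end
end

config = SketchConfig.new
config.compare([:foo], [:bar], :equal_to)
config.comparators.length # one comparator has been registered
```

Defining respond_to_missing? alongside method_missing is a common Ruby convention; whether Linkage::Configuration does so is not shown on this page.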

Configuration also contains information about how records are matched. Once scores are computed, the scores for each pair of records are averaged and compared against a threshold value. Record pairs that have an average score greater than or equal to the threshold value are considered matches.

The threshold value is 0.5 by default, but you can change it by setting #threshold like so:

config.threshold = 0.75

Since scores range between 0 and 1 (inclusive), be sure to set a threshold value within the same range. The actual matching work is done by the Matcher class.
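The matching rule described above can be sketched in plain Ruby. This is an illustration under stated assumptions, not the Matcher class itself: mean_match? is a hypothetical helper, and treating an empty score list as a non-match is an assumption.

```ruby
# Hypothetical helper, not part of the Linkage API: averages the comparator
# scores for one record pair and applies the threshold rule.
def mean_match?(scores, threshold = 0.5)
  unless (0.0..1.0).cover?(threshold)
    raise ArgumentError, "threshold must be between 0 and 1"
  end
  return false if scores.empty? # assumption: no scores means no match

  mean = scores.sum.to_f / scores.length
  mean >= threshold
end

mean_match?([1.0, 0.5], 0.75)  # mean is 0.75, so the pair counts as a match
mean_match?([1.0, 0.25], 0.75) # mean is 0.625, so it does not
```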

Constructor Details

#initialize(dataset_1, dataset_2, result_set) ⇒ Configuration
#initialize(dataset, result_set) ⇒ Configuration
#initialize(dataset_1, dataset_2, score_set, match_set) ⇒ Configuration
#initialize(dataset, score_set, match_set) ⇒ Configuration

Create a new instance of Linkage::Configuration.

# File 'lib/linkage/configuration.rb', line 93

def initialize(*args)
  if args.length < 2 || args.length > 4
    raise ArgumentError, "wrong number of arguments (#{args.length} for 2..4)"
  end

  @dataset_1 = args[0]
  case args.length
  when 2
    # dataset and result set
    @result_set = args[1]
  when 3
    # dataset 1, dataset 2, and result set
    # dataset, score set, and match set
    case args[1]
    when Dataset, nil
      @dataset_2 = args[1]
      @result_set = args[2]
    when ScoreSet
      @result_set = ResultSet.new(args[1], args[2])
    end
  when 4
    # dataset 1, dataset 2, score set, and match set
    @dataset_2 = args[1]
    @result_set = ResultSet.new(args[2], args[3])
  end

  @comparators = []
  @algorithm = :mean
  @threshold = 0.5
end
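To make the four overloads easier to follow, here is a self-contained sketch of the same argument handling. The Stub* structs and resolve_args are hypothetical names defined only so the example runs without the library; the branching mirrors the listing above.

```ruby
# Stand-ins for the Linkage classes (hypothetical, for illustration only).
StubDataset   = Struct.new(:name)
StubScoreSet  = Struct.new(:name)
StubMatchSet  = Struct.new(:name)
StubResultSet = Struct.new(:score_set, :match_set)

# Hypothetical helper that resolves the arguments the way the constructor does.
def resolve_args(*args)
  unless (2..4).cover?(args.length)
    raise ArgumentError, "wrong number of arguments (#{args.length} for 2..4)"
  end

  resolved = { dataset_1: args[0], dataset_2: nil, result_set: nil }
  case args.length
  when 2
    # dataset and result set
    resolved[:result_set] = args[1]
  when 3
    case args[1]
    when StubDataset, nil
      # dataset 1, dataset 2, and result set
      resolved[:dataset_2] = args[1]
      resolved[:result_set] = args[2]
    when StubScoreSet
      # dataset, score set, and match set
      resolved[:result_set] = StubResultSet.new(args[1], args[2])
    end
  when 4
    # dataset 1, dataset 2, score set, and match set
    resolved[:dataset_2] = args[1]
    resolved[:result_set] = StubResultSet.new(args[2], args[3])
  end
  resolved
end

people  = StubDataset.new("people")
scores  = StubScoreSet.new("scores")
matches = StubMatchSet.new("matches")

self_link = resolve_args(people, scores, matches)
self_link[:dataset_2] # nil: the single dataset is linked against itself
```

Note that in the three-argument form, a second argument that is neither a dataset, nil, nor a score set matches no branch, so the result set is left unset.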

Dynamic Method Handling

This class handles dynamic methods through the method_missing method.

#method_missing(name, *args, &block) ⇒ Object



# File 'lib/linkage/configuration.rb', line 142

def method_missing(name, *args, &block)
  klass = Comparator[name.to_s]
  if klass.nil?
    raise "unknown comparator: #{name}"
  end

  set_1 = args[0]
  if set_1.is_a?(Array)
    set_1 = fields_for(dataset_1, *set_1)
  else
    set_1 = fields_for(dataset_1, set_1).first
  end
  args[0] = set_1

  set_2 = args[1]
  if set_2.is_a?(Array)
    set_2 = fields_for(dataset_2 || dataset_1, *set_2)
  else
    set_2 = fields_for(dataset_2 || dataset_1, set_2).first
  end
  args[1] = set_2

  comparator = klass.new(*args, &block)
  @comparators << comparator
  return comparator
end

Instance Attribute Details

#algorithm ⇒ Object

Returns the value of attribute algorithm.

# File 'lib/linkage/configuration.rb', line 61

def algorithm
  @algorithm
end

#comparators ⇒ Object (readonly)

Returns the value of attribute comparators.

# File 'lib/linkage/configuration.rb', line 60

def comparators
  @comparators
end

#dataset_1 ⇒ Object (readonly)

Returns the value of attribute dataset_1.

# File 'lib/linkage/configuration.rb', line 60

def dataset_1
  @dataset_1
end

#dataset_2 ⇒ Object (readonly)

Returns the value of attribute dataset_2.

# File 'lib/linkage/configuration.rb', line 60

def dataset_2
  @dataset_2
end

#result_set ⇒ Object (readonly)

Returns the value of attribute result_set.

# File 'lib/linkage/configuration.rb', line 60

def result_set
  @result_set
end

#threshold ⇒ Object

Returns the value of attribute threshold.

# File 'lib/linkage/configuration.rb', line 60

def threshold
  @threshold
end

Instance Method Details

#match_recorder(matcher) ⇒ Object



# File 'lib/linkage/configuration.rb', line 138

def match_recorder(matcher)
  MatchRecorder.new(matcher, @result_set.match_set)
end

#matcher ⇒ Object

# File 'lib/linkage/configuration.rb', line 134

def matcher
  Matcher.new(@comparators, @result_set.score_set, @algorithm, @threshold)
end

#score_recorder ⇒ Object

# File 'lib/linkage/configuration.rb', line 124

def score_recorder
  pk_1 = @dataset_1.field_set.primary_key.name
  if @dataset_2
    pk_2 = @dataset_2.field_set.primary_key.name
  else
    pk_2 = pk_1
  end
  ScoreRecorder.new(@comparators, @result_set.score_set, [pk_1, pk_2])
end