Class: Linkage::Dataset

Inherits:
Object
Defined in:
lib/linkage/dataset.rb

Overview

Dataset is a representation of a database table. It is a thin wrapper around a Sequel::Dataset.

There are three ways to create a Dataset.

Pass in a Sequel::Dataset:

Linkage::Dataset.new(db[:foo])

Pass in a Sequel::Database and a table name:

Linkage::Dataset.new(db, :foo)

Pass in a Sequel-style connection URI, a table name, and any options you want to pass to Sequel.connect:

Linkage::Dataset.new("mysql2://example.com/foo", :bar, :user => 'viking', :password => 'secret')

Once you've made a Dataset, you can call any Sequel::Dataset method on it. For example, to limit the dataset to records for people born after January 1, 1985 (assuming date of birth is stored as a date type):

filtered_dataset = dataset.where('dob > :date', :date => Date.new(1985, 1, 1))

Note that Sequel::Dataset methods return a clone of a dataset, so you must assign the return value to a variable.
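
The clone behavior can be illustrated without a database. The following is a minimal sketch that does not use Sequel at all: ChainableQuery is a hypothetical stand-in class whose where method, like the Sequel::Dataset methods described above, returns a modified copy and leaves the receiver untouched.

```ruby
# Hypothetical stand-in for a dataset whose query methods return clones.
# This is not Sequel code; it only mimics the copy-on-call behavior.
class ChainableQuery
  attr_reader :conditions

  def initialize(conditions = [])
    @conditions = conditions.freeze
  end

  # Return a new object carrying the extra condition; self is not modified.
  def where(condition)
    self.class.new(conditions + [condition])
  end
end

query = ChainableQuery.new
query.where("dob > '1985-01-01'")             # return value discarded: no effect
filtered = query.where("dob > '1985-01-01'")  # return value kept

puts query.conditions.inspect     # the original still has no conditions
puts filtered.conditions.inspect  # only the copy carries the filter
```

Discarding the return value is the usual mistake here: the call succeeds, but the filter never reaches the object you go on to use.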

Once you have your Dataset how you want it, you can use the #link_with method to create a Configuration for record linkage. The #link_with method takes another Dataset object and a ResultSet and returns a Configuration.

config = dataset.link_with(other_dataset, result_set)
config.compare([:foo], [:bar], :equal_to)

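You can pass in a ScoreSet and MatchSet instead of a ResultSet if you wish. Note, however, that the #link_with definition listed under Instance Method Details accepts exactly two arguments (dataset and result_set), so check the version you are using before relying on the three-argument form: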

config = dataset.link_with(other_dataset, score_set, match_set)

Note that a dataset can be linked with itself the same way, like so:

config = dataset.link_with(dataset, result_set)
config.compare([:foo], [:bar], :equal_to)

If you give #link_with a block, it yields the Configuration object to the block and then returns that same object.

config = dataset.link_with(other_dataset, result_set) do |c|
  c.compare([:foo], [:bar], :equal_to)
end

Once that's done, use a Runner to run the record linkage:

runner = Linkage::Runner.new(config)
runner.execute


Constructor Details

#initialize(dataset) ⇒ Dataset
#initialize(database, table_name) ⇒ Dataset
#initialize(uri, table_name, options = {}) ⇒ Dataset

Returns a new instance of Linkage::Dataset.

Overloads:

  • #initialize(dataset) ⇒ Dataset

    Use a specific Sequel::Dataset.

    Parameters:

    • dataset (Sequel::Dataset)
  • #initialize(database, table_name) ⇒ Dataset

    Use a specific Sequel::Database.

    Parameters:

    • database (Sequel::Database)
    • table_name (Symbol, String)
  • #initialize(uri, table_name, options = {}) ⇒ Dataset

    Use Sequel.connect to connect to a database.

    Parameters:

    • uri (String, Hash)
    • table_name (Symbol, String)
    • options (Hash) (defaults to: {})


# File 'lib/linkage/dataset.rb', line 109

def initialize(*args)
  if args.length == 0 || args.length > 3
    raise ArgumentError, "wrong number of arguments (#{args.length} for 1..3)"
  end

  if args.length == 1
    unless args[0].kind_of?(Sequel::Dataset)
      raise ArgumentError, "expected Sequel::Dataset, got #{args[0].class}"
    end

    @dataset = args[0]
    @db = @dataset.db
    @table_name = @dataset.first_source_table
  elsif args.length == 2 && args[0].kind_of?(Sequel::Database)
    @db = args[0]
    @table_name = args[1].to_sym
    @dataset = @db[@table_name]
  else
    uri, table_name, options = args
    options ||= {}

    @db = Sequel.connect(uri, options)
    @table_name = table_name.to_sym
    @dataset = @db[@table_name]
  end
  @field_set = FieldSet.new(self)
end
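
To make the overload selection easier to follow, here is a minimal sketch that mirrors only the branching of the constructor. StubDataset, StubDatabase, and overload_for are hypothetical names introduced for illustration; no connection is opened and Sequel is not loaded.

```ruby
# Hypothetical stand-ins for Sequel::Dataset and Sequel::Database.
class StubDataset; end
class StubDatabase; end

# Mirrors the branching in #initialize and reports which overload applies.
def overload_for(*args)
  if args.length == 0 || args.length > 3
    raise ArgumentError, "wrong number of arguments (#{args.length} for 1..3)"
  end

  if args.length == 1
    unless args[0].kind_of?(StubDataset)
      raise ArgumentError, "expected StubDataset, got #{args[0].class}"
    end
    :dataset
  elsif args.length == 2 && args[0].kind_of?(StubDatabase)
    :database_and_table
  else
    # Everything else is treated as (uri, table_name, options).
    :uri_and_table
  end
end

puts overload_for(StubDataset.new)
puts overload_for(StubDatabase.new, :foo)
puts overload_for("mysql2://example.com/foo", :bar, :user => 'viking')

# A lone URI string is rejected, because a single argument must be a dataset.
begin
  overload_for("mysql2://example.com/foo")
rescue ArgumentError => e
  puts e.message
end
```

One consequence of this branching is that the database overload is selected only for exactly two arguments; any three-argument call is handled as a connection URI.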

Dynamic Method Handling

This class handles dynamic methods through the method_missing method.

#method_missing(name, *args, &block) ⇒ Object (protected)

Delegate methods to the underlying Sequel::Dataset.



# File 'lib/linkage/dataset.rb', line 183

def method_missing(name, *args, &block)
  result = @dataset.send(name, *args, &block)
  if result.kind_of?(Sequel::Dataset)
    new_object = clone
    new_object.send(:obj=, result)
    new_object
  else
    result
  end
end
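
The delegation pattern above can be tried in isolation. The sketch below is an assumption-laden illustration, not the library's code: Pipeline and PipelineWrapper are hypothetical classes, and the respond_to? guard and respond_to_missing? are additions that do not appear in the source listing.

```ruby
# Hypothetical target class: some methods return another Pipeline,
# others return plain values.
class Pipeline
  attr_reader :steps

  def initialize(steps = [])
    @steps = steps
  end

  def add(step)
    Pipeline.new(steps + [step])
  end

  def size
    steps.size
  end
end

# Wrapper that forwards unknown methods to the wrapped Pipeline.
class PipelineWrapper
  attr_reader :label

  def initialize(pipeline, label)
    @pipeline = pipeline
    @label = label
  end

  def obj
    @pipeline
  end

  protected

  def method_missing(name, *args, &block)
    # Addition: raise the usual NoMethodError for methods the target lacks.
    return super unless @pipeline.respond_to?(name)

    result = @pipeline.send(name, *args, &block)
    if result.kind_of?(Pipeline)
      # Keep the wrapper's own state and swap in the new target.
      new_object = clone
      new_object.send(:obj=, result)
      new_object
    else
      result
    end
  end

  private

  # Addition: keeps respond_to? consistent with the delegation above.
  def respond_to_missing?(name, include_private = false)
    @pipeline.respond_to?(name) || super
  end

  def obj=(value)
    @pipeline = value
  end
end

wrapped = PipelineWrapper.new(Pipeline.new, "demo")
longer = wrapped.add(:normalize)  # a PipelineWrapper, not a bare Pipeline
puts longer.class
puts longer.label                 # wrapper state survives the clone
puts longer.size                  # plain values pass straight through
puts wrapped.size                 # the original wrapper is unchanged
```

Re-wrapping results of the target's own type is what lets chained calls keep returning the wrapper, so wrapper-only methods remain available after a filter.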

Instance Attribute Details

#field_set ⇒ FieldSet (readonly)

Returns this dataset's FieldSet.

Returns:

  • (FieldSet)


# File 'lib/linkage/dataset.rb', line 91

def field_set
  @field_set
end

#table_name ⇒ Symbol (readonly)

Returns this dataset's table name.

Returns:

  • (Symbol)

    Returns this dataset's table name.



# File 'lib/linkage/dataset.rb', line 88

def table_name
  @table_name
end

Instance Method Details

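#link_with(dataset, result_set) ⇒ Configuration

Create a Configuration for record linkage.

Parameters:

  • dataset (Dataset)
  • result_set (ResultSet)

Returns:

  • (Configuration)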



# File 'lib/linkage/dataset.rb', line 154

def link_with(dataset, result_set)
  other = dataset.eql?(self) ? nil : dataset
  conf = Configuration.new(self, other, result_set)
  if block_given?
    yield conf
  end
  conf
end
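
Two details of this method are easy to miss: the block receives the very object that is returned, and linking a dataset with itself passes nil as the second dataset. The sketch below reproduces the same control flow with hypothetical stand-ins (StubConfiguration and StubLinkable); the real Configuration constructor may behave differently.

```ruby
# Hypothetical stand-in; the real Configuration's constructor may differ.
StubConfiguration = Struct.new(:dataset_1, :dataset_2, :result_set)

class StubLinkable
  # Same control flow as #link_with above, built on the stand-in class.
  def link_with(dataset, result_set)
    other = dataset.eql?(self) ? nil : dataset
    conf = StubConfiguration.new(self, other, result_set)
    yield conf if block_given?
    conf
  end
end

left = StubLinkable.new
right = StubLinkable.new

yielded = nil
config = left.link_with(right, :results) { |c| yielded = c }
puts yielded.equal?(config)         # the block receives the returned object

self_config = left.link_with(left, :results)
puts self_config.dataset_2.inspect  # nil marks a self-linkage
```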

#obj ⇒ Sequel::Dataset

Returns the underlying Sequel::Dataset.

Returns:

  • (Sequel::Dataset)


# File 'lib/linkage/dataset.rb', line 139

def obj
  @dataset
end

#obj=(value) ⇒ Object (private)

Set the underlying Sequel::Dataset.



# File 'lib/linkage/dataset.rb', line 144

def obj=(value)
  @dataset = value
end

#primary_key ⇒ Field

Returns:

  • (Field)

See Also:

  • FieldSet#primary_key



# File 'lib/linkage/dataset.rb', line 175

def primary_key
  @field_set.primary_key
end

#schema ⇒ Array

Return the dataset's schema.

Returns:

  • (Array)

See Also:

  • Sequel::Database#schema



# File 'lib/linkage/dataset.rb', line 167

def schema
  @db.schema(@table_name)
end