Class: Linkage::Comparators::Compare
- Inherits:
-
Linkage::Comparator
- Object
- Linkage::Comparator
- Linkage::Comparators::Compare
- Defined in:
- lib/linkage/comparators/compare.rb
Overview
Compare is the most basic comparator in Linkage, conceptually. It scores two records based on whether or not field values satisfy the specified operator. Score is either 0 or 1.
To use Compare, you must specify two sets of fields to use in the comparison, along with an operator. Valid operators are:
:equal
:not_equal
:greater_than
:greater_than_or_equal
:less_than
:less_than_or_equal
Sets of fields must be of equal length. If you specify more than one field, each field will be compared to its counterpart in the other set. All of the field values must meet the conditions in order for the score to be 1. Otherwise, the score is 0.
Consider the following example, using a Linkage::Configuration as part of Dataset#link_with:
config.compare([:foo, :bar], [:baz, :qux], :equal)
For each record, the values of foo
and baz
are compared, and the
values of bar
and qux
are compared. If both of these two comparisons
are true
, then the score of 1 is given. If foo
and baz
are equal but
bar
and qux
are not equal, or if both comparisons are false, then a
score of 0 is given.
Algorithms
The way records are chosen for comparison depends on which operator you
use. The :equal
operator is treated differently than the other
operators. When using operators other than :equal
, each record is
compared to every other record (and Linkage::Comparator#type returns :simple
). When using
:equal
, Linkage::Comparator#type is :advanced
and a different algorithm is used.
"Equal" mode uses an algorithm similar to the sorted neighborhood method. Values are sorted (via database query) and then compared. This way, only adjacent records are compared. Using the transitive property of equality, records are grouped together. All pairs of records in the group are scored as 1. Scores of 0 are not given at all (absence of score means 0).
Constant Summary collapse
- VALID_OPERATIONS =
[ :not_equal, :greater_than, :greater_than_or_equal, :less_than_or_equal, :less_than, :equal ]
Instance Attribute Summary
Attributes inherited from Linkage::Comparator
Instance Method Summary collapse
- #_score_datasets(dataset_1, dataset_2) ⇒ Object private
-
#initialize(set_1, set_2, operation) ⇒ Compare
constructor
A new instance of Compare.
- #score(record_1, record_2) ⇒ Object
- #score_dataset(dataset) ⇒ Object
- #score_datasets(dataset_1, dataset_2) ⇒ Object
- #score_group(group) ⇒ Object private
- #score_groups(group_1, group_2) ⇒ Object private
Methods inherited from Linkage::Comparator
klass_for, register, #score_and_notify, #type, #weigh
Constructor Details
#initialize(set_1, set_2, operation) ⇒ Compare
Returns a new instance of Compare.
55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 |
# File 'lib/linkage/comparators/compare.rb', line 55 def initialize(set_1, set_2, operation) if set_1.length != set_2.length raise "sets must be of equal length" end # Check value data types set_1.each_with_index do |value_1, index| value_2 = set_2[index] if value_1.ruby_type != value_2.ruby_type raise "values at index #{index} had different types" end end # Check compare operator if !VALID_OPERATIONS.include?(operation) raise "operation is not valid" end @type = operation == :equal ? :advanced : :simple @names_1 = set_1.collect(&:name) @names_2 = set_2.collect(&:name) @operation = operation end |
Instance Method Details
#_score_datasets(dataset_1, dataset_2) ⇒ Object (private)
150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 |
# File 'lib/linkage/comparators/compare.rb', line 150 def _score_datasets(dataset_1, dataset_2) enum_1 = dataset_1.order(*@names_1).to_enum enum_2 = dataset_2.order(*@names_2).to_enum begin record_1 = enum_1.next record_2 = enum_2.next rescue StopIteration # no pairs to score return end group_1 = [] group_2 = [] loop do value_1 = record_1.values_at(*@names_1) value_2 = record_2.values_at(*@names_2) result = value_1 <=> value_2 if result == 0 last_value = value_1 group_1 << record_1 group_2 << record_2 state = :right loop do begin case state when :left record_1 = enum_1.next value_1 = record_1.values_at(*@names_1) result = last_value == value_1 when :right record_2 = enum_2.next value_2 = record_2.values_at(*@names_2) result = last_value == value_2 end rescue StopIteration result = false case state when :left record_1 = :eof when :right record_2 = :eof end end if result case state when :left group_1 << record_1 when :right group_2 << record_2 end else case state when :left # done with this group score_groups(group_1, group_2) group_1.clear group_2.clear break when :right state = :left end end end if record_1 == :eof || record_2 == :eof break end else begin if result < 0 record_1 = enum_1.next else record_2 = enum_2.next end rescue StopIteration break end end end end |
#score(record_1, record_2) ⇒ Object
78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 |
# File 'lib/linkage/comparators/compare.rb', line 78 def score(record_1, record_2) values_1 = record_1.values_at(*@names_1) values_2 = record_2.values_at(*@names_2) result = case @operation when :not_equal values_1.each_with_index.all? do |value_1, i| value_1 != values_2[i] end when :greater_than values_1.each_with_index.all? do |value_1, i| value_1 > values_2[i] end when :greater_than_or_equal values_1.each_with_index.all? do |value_1, i| value_1 >= values_2[i] end when :less_than_or_equal values_1.each_with_index.all? do |value_1, i| value_1 <= values_2[i] end when :less_than values_1.each_with_index.all? do |value_1, i| value_1 < values_2[i] end end result ? 1 : 0 end |
#score_dataset(dataset) ⇒ Object
114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 |
# File 'lib/linkage/comparators/compare.rb', line 114 def score_dataset(dataset) # FIXME: nil value equality if @names_1 != @names_2 return _score_datasets(dataset, dataset) end enum = dataset.order(*@names_1).to_enum begin record = enum.next rescue StopIteration return end group = [record] last_value = record.values_at(*@names_1) loop do begin record = enum.next rescue StopIteration break end value = record.values_at(*@names_1) if value == last_value group << record else score_group(group) group.clear group << record last_value = value end end score_group(group) end |
#score_datasets(dataset_1, dataset_2) ⇒ Object
108 109 110 111 112 |
# File 'lib/linkage/comparators/compare.rb', line 108 def score_datasets(dataset_1, dataset_2) # FIXME: nil value equality _score_datasets(dataset_1, dataset_2) end |
#score_group(group) ⇒ Object (private)
241 242 243 244 245 246 247 248 |
# File 'lib/linkage/comparators/compare.rb', line 241 def score_group(group) (group.length - 1).times do |i| ((i+1)...group.length).each do |j| changed notify_observers(self, group[i], group[j], 1) end end end |
#score_groups(group_1, group_2) ⇒ Object (private)
232 233 234 235 236 237 238 239 |
# File 'lib/linkage/comparators/compare.rb', line 232 def score_groups(group_1, group_2) group_1.each do |record_1| group_2.each do |record_2| changed notify_observers(self, record_1, record_2, 1) end end end |