Class: Remi::Transform::DataFrameSieve
- Inherits: Remi::Transform
  - Object
  - Remi::Transform
  - Remi::Transform::DataFrameSieve
- Defined in: lib/remi/transform.rb
Overview
Public: Applies a DataFrame grouping sieve.
The DataFrame sieve replaces complex nested if-then logic that groups data into buckets. Given a sieve DataFrame with N columns, the first N-1 columns hold the variables used to decide the bucket, and the last column holds the resulting group. The sieve is evaluated row by row from the top: an input row matches a sieve row when every non-nil sieve value equals the corresponding input value. A nil in the sieve is a wildcard and matches anything. The first sieve row that matches wins, and evaluation stops there.
sieve_df - The sieve, defined as a DataFrame. The names of the sieve vectors must correspond to the names of the vectors in the DataFrame source-to-target map. The last vector in sieve_df is used as the result of the sieve.
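The matching rule described above can be sketched in plain Ruby, without Daru or Remi. This is a minimal illustration under simplifying assumptions, not the library's implementation: `SIEVE` and `sieve_lookup` are hypothetical names, and sieve rows are plain hashes rather than DataFrame vectors.

```ruby
# Hypothetical stand-in for a sieve: each row is a Hash whose :group key
# holds the result. A nil cell means "match anything".
SIEVE = [
  { level: 'Undergrad', program: 'NURS', contact: nil,  group: :intensive },
  { level: 'Undergrad', program: nil,    contact: true, group: :intensive },
  { level: nil,         program: nil,    contact: nil,  group: :base }
].freeze

# Returns the result of the first sieve row whose non-nil cells all equal
# the corresponding input values (first match wins), or nil if none match.
def sieve_lookup(sieve, input, result_key = :group)
  hit = sieve.find do |sieve_row|
    sieve_row.all? do |key, expected|
      key == result_key || expected.nil? || expected == input[key]
    end
  end
  hit && hit[result_key]
end

nursing = { level: 'Undergrad', program: 'NURS', contact: false }
grad    = { level: 'Grad', program: 'ENG', contact: true }

sieve_lookup(SIEVE, nursing) # => :intensive (first row, contact is a wildcard)
sieve_lookup(SIEVE, grad)    # => :base (only the all-nil catch-all row matches)
```

Because the first match wins, more specific rows belong above more general ones, with any catch-all row last.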
Examples:
# This sieve captures the following business logic
# 1 - All Undergraduate Nursing programs, regardless of contact, get assigned to the :intensive group.
# 2 - All Undergraduate programs with a contact get assigned to the :intensive group.
# 3 - All Undergraduate programs without a contact get assigned to the :base group.
# 4 - All Graduate Engineering programs with a contact get assigned to the :intensive group.
# 5 - All other programs get assigned to the :base group.
sieve_df = Daru::DataFrame.new([
  [ 'Undergrad' , 'NURS' , nil   , :intensive ],
  [ 'Undergrad' , nil    , true  , :intensive ],
  [ 'Undergrad' , nil    , false , :base      ],
  [ 'Grad'      , 'ENG'  , true  , :intensive ],
  [ nil         , nil    , nil   , :base      ],
].transpose,
  order: [:level, :program, :contact, :group]
)
test_df = Daru::DataFrame.new([
  ['Undergrad' , 'CHEM' , false],
  ['Undergrad' , 'CHEM' , true],
  ['Grad'      , 'CHEM' , true],
  ['Undergrad' , 'NURS' , false],
  ['Unknown'   , 'CHEM' , true],
].transpose,
  order: [:level, :program, :contact]
)
Remi::SourceToTargetMap.apply(test_df) do
  map source(:level, :program, :contact).target(:group)
    .transform(Remi::Transform::DataFrameSieve.new(sieve_df))
end
test_df
# => #<Daru::DataFrame:70099624408400 @name = d30888fd-6ca8-48dd-9be3-558f81ae1015 @size = 5>
level program contact group
0 Undergrad CHEM nil base
1 Undergrad CHEM true intensive
2 Grad CHEM true base
3 Undergrad NURS nil intensive
4 Unknown CHEM true base
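For comparison, the five business rules in the example could be written as conventional conditional logic. The following is a hypothetical plain-Ruby equivalent (`assign_group` is an illustrative name, not part of Remi), meant only to show the kind of branching the sieve replaces.

```ruby
# Hypothetical hand-written equivalent of the example sieve.
# Branch order mirrors the sieve's row order, since the first match wins.
def assign_group(level, program, contact)
  if level == 'Undergrad' && program == 'NURS'
    :intensive
  elsif level == 'Undergrad' && contact == true
    :intensive
  elsif level == 'Undergrad' && contact == false
    :base
  elsif level == 'Grad' && program == 'ENG' && contact == true
    :intensive
  else
    :base
  end
end

assign_group('Undergrad', 'CHEM', false) # => :base
assign_group('Grad', 'CHEM', true)       # => :base
assign_group('Undergrad', 'NURS', false) # => :intensive
```

Each new rule in this style adds another branch, whereas with the sieve it adds one row of data.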
Instance Attribute Summary
Attributes inherited from Remi::Transform
#multi_args, #source_metadata, #target_metadata
Instance Method Summary
- #initialize(sieve_df, *args, **kargs, &block) ⇒ DataFrameSieve (constructor)
  A new instance of DataFrameSieve.
- #transform(row) ⇒ Object
Methods inherited from Remi::Transform
Constructor Details
#initialize(sieve_df, *args, **kargs, &block) ⇒ DataFrameSieve
Returns a new instance of DataFrameSieve.
# File 'lib/remi/transform.rb', line 665
def initialize(sieve_df, *args, **kargs, &block)
  super
  @sieve_table = sieve_df.transpose.to_h.values
end
Instance Method Details
#transform(row) ⇒ Object
# File 'lib/remi/transform.rb', line 671
def transform(row)
  sieve_keys = @sieve_table.first.index.to_a
  sieve_result_key = sieve_keys.pop

  raise ArgumentError, "#{sieve_keys - row.source_keys} not found in row" unless (sieve_keys - row.source_keys).size == 0

  @sieve_table.each.find do |sieve_row|
    match_row = true
    sieve_keys.each do |sieve_key|
      match_value = if sieve_row[sieve_key].is_a?(Regexp)
                      !!sieve_row[sieve_key].match(row[sieve_key])
                    else
                      sieve_row[sieve_key] == row[sieve_key]
                    end
      match_row &&= sieve_row[sieve_key].nil? || match_value
    end
    match_row
  end[sieve_result_key]
end
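The `transform` source above treats a `Regexp` in a sieve cell as a pattern to match against the input value rather than a value to compare for equality. The sketch below mirrors that per-cell check in plain Ruby so it can be tried in isolation; `cell_matches?` is a hypothetical helper, not part of Remi.

```ruby
# Hypothetical helper mirroring the per-cell check in #transform:
# nil is a wildcard, a Regexp is matched against the value, and
# anything else is compared with ==.
def cell_matches?(sieve_cell, value)
  return true if sieve_cell.nil?

  if sieve_cell.is_a?(Regexp)
    !!sieve_cell.match(value)
  else
    sieve_cell == value
  end
end

cell_matches?(nil, 'CHEM')     # => true  (wildcard)
cell_matches?(/\ANUR/, 'NURS') # => true  (pattern match)
cell_matches?(/\ANUR/, 'CHEM') # => false
cell_matches?('ENG', 'ENG')    # => true  (plain equality)
```

In the `transform` source, if no sieve row matches, `find` returns nil and the trailing `[sieve_result_key]` call raises a NoMethodError, so ending the sieve with an all-nil catch-all row, as the example sieve does, avoids that case.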