Class: Remi::Transform::DataFrameSieve

Inherits:
Remi::Transform
Defined in:
lib/remi/transform.rb

Overview

Public: Applies a DataFrame grouping sieve.

The DataFrame sieve replaces deeply nested if-then logic for grouping data into buckets. Given a sieve DataFrame with N columns, the first N-1 columns hold the variables used to decide the group, and the last column holds the resulting group. For each input row, the sieve walks its own rows from top to bottom and compares the input values with the values in the sieve columns. A nil in the sieve is a wildcard and matches anything; a Regexp in the sieve is matched against the input value. The first sieve row that matches wins and the walk stops.

sieve_df - The sieve, defined as a Daru::DataFrame. The names of the sieve vectors must correspond to the names of the source vectors in the source-to-target map. The last vector in sieve_df is used as the result of the sieve.

Examples:

# This sieve captures the following business logic
# 1 - All Non-Graduate Nursing, regardless of contact, gets assigned to the :intensive group.
# 2 - All Undergraduate programs with contact get assigned to the :intensive group.
# 3 - All Undergraduate programs without a contact get assigned to the :base group.
# 4 - All Graduate engineering programs with a contact get assigned to the :intensive group.
# 5 - All other programs get assigned to the :base group
sieve_df = Daru::DataFrame.new([
  [ 'Undergrad' , 'NURS' , nil   , :intensive ],
  [ 'Undergrad' , nil    , true  , :intensive ],
  [ 'Undergrad' , nil    , false , :base      ],
  [ 'Grad'      , 'ENG'  , true  , :intensive ],
  [ nil         , nil    , nil   , :base      ],
].transpose,
  order: [:level, :program, :contact, :group]
)

test_df = Daru::DataFrame.new([
  ['Undergrad' , 'CHEM' , false],
  ['Undergrad' , 'CHEM' , true],
  ['Grad'      , 'CHEM' , true],
  ['Undergrad' , 'NURS' , false],
  ['Unknown'   , 'CHEM' , true],
].transpose,
  order: [:level, :program, :contact]
)

Remi::SourceToTargetMap.apply(test_df) do
  map source(:level, :program, :contact)
    .target(:group)
    .transform(Remi::Transform::DataFrameSieve.new(sieve_df))
end

test_df
# =>
#      level      program  contact  group
#   0  Undergrad  CHEM     nil      base
#   1  Undergrad  CHEM     true     intensive
#   2  Grad       CHEM     true     base
#   3  Undergrad  NURS     nil      intensive
#   4  Unknown    CHEM     true     base
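To see why each test row lands in its group, the sieve walk can be replayed in plain Ruby without Daru or Remi. This is only a sketch of the first-match-wins idea: the hash-based rows and the helper name `sieve_lookup` are hypothetical stand-ins, not part of the Remi API.

```ruby
# Plain-Ruby sketch of the sieve walk. Hashes stand in for DataFrame rows;
# `sieve_lookup` is a hypothetical helper, not a Remi method.
SIEVE = [
  { level: 'Undergrad', program: 'NURS', contact: nil,   group: :intensive },
  { level: 'Undergrad', program: nil,    contact: true,  group: :intensive },
  { level: 'Undergrad', program: nil,    contact: false, group: :base },
  { level: 'Grad',      program: 'ENG',  contact: true,  group: :intensive },
  { level: nil,         program: nil,    contact: nil,   group: :base },
].freeze

def sieve_lookup(sieve, row)
  # All keys but the last are match conditions; the last key is the result.
  *keys, result_key = sieve.first.keys

  # find stops at the first sieve row whose non-nil cells all equal the input.
  match = sieve.find do |sieve_row|
    keys.all? { |key| sieve_row[key].nil? || sieve_row[key] == row[key] }
  end
  match && match[result_key]
end

ROWS = [
  { level: 'Undergrad', program: 'CHEM', contact: false },
  { level: 'Undergrad', program: 'CHEM', contact: true },
  { level: 'Grad',      program: 'CHEM', contact: true },
  { level: 'Undergrad', program: 'NURS', contact: false },
  { level: 'Unknown',   program: 'CHEM', contact: true },
].freeze

GROUPS = ROWS.map { |row| sieve_lookup(SIEVE, row) }
# GROUPS => [:base, :intensive, :base, :intensive, :base]
```

The Grad CHEM row falls through every specific rule and is caught by the all-nil row at the bottom, which is why a catch-all final row is a common way to end a sieve.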

Instance Attribute Summary

Attributes inherited from Remi::Transform

#multi_args, #source_metadata, #target_metadata

Instance Method Summary

Methods inherited from Remi::Transform

#call, #to_proc

Constructor Details

#initialize(sieve_df, *args, **kargs, &block) ⇒ DataFrameSieve

Returns a new instance of DataFrameSieve.



# File 'lib/remi/transform.rb', line 674

def initialize(sieve_df, *args, **kargs, &block)
  super
  @sieve_table = sieve_df.transpose.to_h.values
end

Instance Method Details

#transform(row) ⇒ Object

Raises:

  • (ArgumentError) if any sieve key column is missing from the row's source keys


# File 'lib/remi/transform.rb', line 680

def transform(row)
  sieve_keys = @sieve_table.first.index.to_a
  sieve_result_key = sieve_keys.pop

  raise ArgumentError, "#{sieve_keys - row.source_keys} not found in row" unless (sieve_keys - row.source_keys).size == 0

  @sieve_table.each.find do |sieve_row|
    match_row = true
    sieve_keys.each do |sieve_key|
      match_value = if sieve_row[sieve_key].is_a?(Regexp)
                      !!sieve_row[sieve_key].match(row[sieve_key])
                    else
                      sieve_row[sieve_key] == row[sieve_key]
                    end

      match_row &&= sieve_row[sieve_key].nil? || match_value
    end
    match_row
  end[sieve_result_key]
end
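
The listing above does three things: it treats the last sieve key as the result key, raises ArgumentError when the row lacks a key the sieve needs, and lets a sieve cell be a Regexp. A minimal plain-Ruby sketch of the same steps follows, with hashes standing in for the Daru vectors and the Remi row object. The name `sieve_transform` and the sample data are hypothetical. One deliberate difference: when no sieve row matches, the listing calls `[]` on the nil returned by `find`, which raises NoMethodError, whereas this sketch returns nil.

```ruby
# Hypothetical stand-in for DataFrameSieve#transform, using plain hashes.
def sieve_transform(sieve, row)
  *keys, result_key = sieve.first.keys

  # Mirror the guard in the listing: every condition key must be in the row.
  missing = keys - row.keys
  raise ArgumentError, "#{missing} not found in row" unless missing.empty?

  match = sieve.find do |sieve_row|
    keys.all? do |key|
      cell = sieve_row[key]
      next true if cell.nil? # nil is a wildcard
      cell.is_a?(Regexp) ? !!cell.match(row[key]) : cell == row[key]
    end
  end
  match && match[result_key] # nil when nothing matches (a choice of this sketch)
end

REGEX_SIEVE = [
  { program: /\ANURS/, contact: nil,  group: :intensive },
  { program: nil,      contact: true, group: :standard },
].freeze

NURSING   = sieve_transform(REGEX_SIEVE, { program: 'NURS-BSN', contact: false })
CONTACTED = sieve_transform(REGEX_SIEVE, { program: 'CHEM', contact: true })
UNMATCHED = sieve_transform(REGEX_SIEVE, { program: 'CHEM', contact: false })

MISSING_KEY_ERROR =
  begin
    sieve_transform(REGEX_SIEVE, { program: 'CHEM' })
  rescue ArgumentError => e
    e.message
  end
```

Because a sieve without a catch-all row can leave some inputs unmatched, ending the sieve with an all-nil row avoids the unmatched case entirely.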