Class: Remi::Transform::DataFrameSieve

Inherits:
Remi::Transform
Defined in:
lib/remi/transform.rb

Overview

Public: Applies a DataFrame grouping sieve.

The DataFrame sieve simplifies complex nested if-then logic that groups data into buckets. Given a sieve DataFrame with N columns, the first N-1 columns hold the variables used to assign a bucket, and the last column holds the resulting group. The sieve walks its rows from top to bottom and checks whether the input data matches the values in each row. A nil in the sieve is a wildcard that matches anything, and a Regexp in the sieve is matched against the input value instead of being compared for equality. The first matching row wins and the search stops.

sieve_df - The sieve, defined as a dataframe. The names of the
           sieve vectors must correspond to the names of the
           vectors in the dataframe source to target map. The
           last vector in the sieve_df is used as the result of the sieve.
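
The matching rule described above can be sketched in plain Ruby without Daru or Remi. This is only an illustration of the first-match-wins idea, not the library's implementation. The names `sieve_rows` and `sieve_group` are hypothetical, and the sieve is held as an array of arrays instead of a DataFrame.

```ruby
# Each sieve row is [level, program, contact, group]; nil acts as a wildcard.
sieve_rows = [
  ['Undergrad', 'NURS', nil,   :intensive],
  ['Undergrad', nil,    true,  :intensive],
  ['Undergrad', nil,    false, :base],
  ['Grad',      'ENG',  true,  :intensive],
  [nil,         nil,    nil,   :base]
]

# Hypothetical helper: returns the last element (the group) of the first
# sieve row whose non-nil conditions all equal the corresponding inputs,
# or nil when no row matches.
def sieve_group(sieve_rows, inputs)
  matched = sieve_rows.find do |row|
    conditions = row[0...-1]
    conditions.zip(inputs).all? { |cond, value| cond.nil? || cond == value }
  end
  matched && matched.last
end

sieve_group(sieve_rows, ['Undergrad', 'NURS', false]) # => :intensive
sieve_group(sieve_rows, ['Grad', 'CHEM', true])       # => :base
```

Row order matters in this sketch: the catch-all row of nils must come last, otherwise it would shadow every row below it.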

Examples:

# This sieve captures the following business logic:
# 1 - All Undergraduate Nursing programs, regardless of contact, get assigned to the :intensive group.
# 2 - All Undergraduate programs with a contact get assigned to the :intensive group.
# 3 - All Undergraduate programs without a contact get assigned to the :base group.
# 4 - All Graduate Engineering programs with a contact get assigned to the :intensive group.
# 5 - All other programs get assigned to the :base group.
sieve_df = Daru::DataFrame.new([
  [ 'Undergrad' , 'NURS' , nil   , :intensive ],
  [ 'Undergrad' , nil    , true  , :intensive ],
  [ 'Undergrad' , nil    , false , :base ],
  [ 'Grad'      , 'ENG'  , true  , :intensive ],
  [ nil         , nil    , nil   , :base ],
  ].transpose,
  order: [:level, :program, :contact, :group]
  )

test_df = Daru::DataFrame.new([
  ['Undergrad' , 'CHEM' , false],
  ['Undergrad' , 'CHEM' , true],
  ['Grad'      , 'CHEM' , true],
  ['Undergrad' , 'NURS' , false],
  ['Unknown'   , 'CHEM' , true],
  ].transpose,
  order: [:level, :program, :contact]
)

Remi::SourceToTargetMap.apply(test_df) do
  map source(:level, :program, :contact).target(:group)
    .transform(Remi::Transform::DataFrameSieve.new(sieve_df))
end

test_df
# => #<Daru::DataFrame:70099624408400 @name = d30888fd-6ca8-48dd-9be3-558f81ae1015 @size = 5>
#           level    program    contact      group
#    0  Undergrad       CHEM        nil       base
#    1  Undergrad       CHEM       true  intensive
#    2       Grad       CHEM       true       base
#    3  Undergrad       NURS        nil  intensive
#    4    Unknown       CHEM       true       base

Instance Attribute Summary

Attributes inherited from Remi::Transform

#multi_args, #source_metadata, #target_metadata

Instance Method Summary

Methods inherited from Remi::Transform

#call, #to_proc

Constructor Details

#initialize(sieve_df, *args, **kargs, &block) ⇒ DataFrameSieve

Returns a new instance of DataFrameSieve.



# File 'lib/remi/transform.rb', line 665

def initialize(sieve_df, *args, **kargs, &block)
  super
  @sieve_table = sieve_df.transpose.to_h.values
end

Instance Method Details

#transform(row) ⇒ Object

Raises:

  • (ArgumentError): if any sieve column other than the result column is missing from the row's source keys


# File 'lib/remi/transform.rb', line 671

def transform(row)
  sieve_keys = @sieve_table.first.index.to_a
  sieve_result_key = sieve_keys.pop

  raise ArgumentError, "#{sieve_keys - row.source_keys} not found in row" unless (sieve_keys - row.source_keys).size == 0

  @sieve_table.each.find do |sieve_row|
    match_row = true
    sieve_keys.each do |sieve_key|
      match_value = if sieve_row[sieve_key].is_a?(Regexp)
                      !!sieve_row[sieve_key].match(row[sieve_key])
                    else
                      sieve_row[sieve_key] == row[sieve_key]
                    end

      match_row &&= sieve_row[sieve_key].nil? || match_value
    end
    match_row
  end[sieve_result_key]
end
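
The per-cell comparison in the listing above can be illustrated with a standalone sketch. It assumes sieve rows are plain hashes instead of Daru rows, and the helpers `cell_matches?` and `first_match` are hypothetical names introduced here, not part of Remi. Reading the listing, `[sieve_result_key]` is called directly on the result of `find`, so a sieve with no matching row would appear to call `[]` on nil. Ending the sieve with an all-nil catch-all row, as in the class example, avoids that case.

```ruby
# Hypothetical stand-in for the comparison inside #transform:
# nil is a wildcard, a Regexp is matched against the value,
# and anything else is compared with ==.
def cell_matches?(sieve_value, input_value)
  return true if sieve_value.nil?
  return sieve_value.match?(input_value) if sieve_value.is_a?(Regexp)

  sieve_value == input_value
end

# Hypothetical helper: the last key of each sieve row is the result column;
# all other keys are conditions. Returns the first matching row or nil.
def first_match(sieve_rows, row)
  keys = sieve_rows.first.keys[0...-1]
  sieve_rows.find { |sieve_row| keys.all? { |k| cell_matches?(sieve_row[k], row[k]) } }
end

sieve_rows = [
  { level: 'Grad', program: /\AENG/, group: :intensive },
  { level: nil,    program: nil,     group: :base } # catch-all row
]

first_match(sieve_rows, { level: 'Grad', program: 'ENGR' })[:group] # => :intensive
first_match(sieve_rows, { level: 'Grad', program: 'CHEM' })[:group] # => :base

# Without the catch-all row, nothing matches and find returns nil.
first_match(sieve_rows[0, 1], { level: 'Grad', program: 'CHEM' })   # => nil
```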