Class: DataModeler::Dataset::Generator

Inherits:
Object
  • Object
Includes:
ConvertingTimeAndIndices, IteratingBasedOnNext
Defined in:
lib/data_modeler/dataset/generator.rb,
lib/data_modeler/helper.rb

Overview

Build train and test datasets for each run of the training.

Train and test sets are treated as moving windows over the data. The windows are aligned to provide continuous testing results over (most of) the data. The following diagram illustrates this: the test sets `t1`, `t2` and `t3` are aligned such that their results can be plotted continuously against the observations. (b) is the amount of data covered by the input+look_ahead window used for the first target.

data:  ---------------------->  (time, datapoints)
run1:  (b)|train1|t1|       ->  train starts after (b), test after training
run2:        |train2|t2|    ->  train starts after (b) + 1 tset
run3:           |train3|t3| ->  train starts after (b) + 2 tset

Note how the test sets line up. This allows the plots of the testing results to be continuous, while no model is ever tested on data it was trained on. All data is used multiple times, alternately as part of train and test sets.
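The alignment in the diagram can be sketched with plain index arithmetic. The helper below is hypothetical and not part of DataModeler: `window_bounds` and its `offset` argument (standing in for (b)) are names introduced only for illustration, runs are assumed to be numbered from 1 as the diagram suggests, and the half-open ranges are an assumption about how the bounds are interpreted.

```ruby
# Hypothetical helper, not part of DataModeler: computes the target index
# windows for run `nrun`, following the first/last arithmetic in the
# #train and #test listings on this page.
# `offset` stands in for (b); half-open ranges are an assumption.
def window_bounds(nrun, offset:, train_size:, test_size:)
  train_first = offset + (nrun - 1) * test_size
  train_last  = train_first + train_size
  test_first  = train_last # the test window starts where training ends
  test_last   = test_first + test_size
  { train: train_first...train_last, test: test_first...test_last }
end

runs = (1..3).map { |n| window_bounds(n, offset: 2, train_size: 6, test_size: 3) }
runs.each_with_index do |w, i|
  puts "run#{i + 1}: train #{w[:train]}, test #{w[:test]}"
end
```

With these sample sizes, each test window begins exactly where the previous one ends, which is what lets the test results be plotted as one continuous series.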

Defined Under Namespace

Classes: NoDataLeft, NotEnoughDataError

Instance Attribute Summary

Instance Method Summary

Methods included from ConvertingTimeAndIndices

#idx, #time

Methods included from IteratingBasedOnNext

#each, #map

Constructor Details

#initialize(data, ds_args:, train_size:, test_size:, min_nruns: 1) ⇒ Generator

Returns a new instance of Generator.

Parameters:

  • data (Hash)

    the data, in an object that can be accessed by keys and returns a time series for each key. It must include (and be sorted by) a series named `:time`, and all series must have equal length.

  • ds_args (Hash)

    parameters hash for `Dataset` initialization. Keys: `%i[inputs targets first_idx end_idx ninput_points]`. See `Dataset#initialize` for details.

  • train_size (Integer)

    how many points to expose as targets in each training set

  • test_size (Integer)

    how many points to expose as targets in each test set
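The requirements on `data` stated above (a `:time` series, sorted by time, all series of equal length) can be checked up front. The following is a minimal sketch, assuming `data` is a plain Hash of Arrays; `valid_series_hash?` is a hypothetical helper, not a DataModeler method, and the series names `:price` and `:volume` are made up for the example.

```ruby
# Hypothetical pre-flight check, not part of DataModeler.
# Assumes `data` is a Hash mapping series names to Arrays.
def valid_series_hash?(data)
  time = data[:time]
  return false if time.nil? # a :time series is required

  # every series must be as long as the :time series
  same_length = data.values.all? { |series| series.size == time.size }
  # :time must be in non-decreasing order
  sorted = time.each_cons(2).all? { |a, b| a <= b }
  same_length && sorted
end

data = {
  time:   [1, 2, 3, 4, 5],
  price:  [10.0, 10.5, 10.2, 10.8, 11.0],
  volume: [100, 120, 90, 110, 130]
}

puts valid_series_hash?(data)
```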



# File 'lib/data_modeler/dataset/generator.rb', line 30

def initialize data, ds_args:, train_size:, test_size:, min_nruns: 1
  @data = data
  @ds_args = ds_args
  @first_idx = first_idx # not an argument: this calls the attribute reader, which still returns nil here
  @train_size = train_size
  @test_size = test_size
  reset_iteration

  @nrows = data[:time].size
  validate_enough_data_for min_nruns
end

Instance Attribute Details

#data ⇒ Object (readonly)

Returns the value of attribute data.



# File 'lib/data_modeler/dataset/generator.rb', line 19

def data
  @data
end

#ds_args ⇒ Object (readonly)

Returns the value of attribute ds_args.



# File 'lib/data_modeler/dataset/generator.rb', line 19

def ds_args
  @ds_args
end

#first_idx ⇒ Object (readonly)

Returns the value of attribute first_idx.



# File 'lib/data_modeler/dataset/generator.rb', line 19

def first_idx
  @first_idx
end

#nrows ⇒ Object (readonly)

Returns the value of attribute nrows.



# File 'lib/data_modeler/dataset/generator.rb', line 19

def nrows
  @nrows
end

#test_size ⇒ Object (readonly)

Returns the value of attribute test_size.



# File 'lib/data_modeler/dataset/generator.rb', line 19

def test_size
  @test_size
end

#train_size ⇒ Object (readonly)

Returns the value of attribute train_size.



# File 'lib/data_modeler/dataset/generator.rb', line 19

def train_size
  @train_size
end

Instance Method Details

#next ⇒ Array<Dataset, Dataset>

Returns the next pair `[trainset, testset]` and increments the run counter

Returns:

  • (Array<Dataset, Dataset>)



# File 'lib/data_modeler/dataset/generator.rb', line 80

def next
  peek.tap { @local_nrun += 1 }
end

#peek ⇒ Array<Dataset, Dataset>

Returns the next pair `[trainset, testset]` without incrementing the run counter

Returns:

  • (Array<Dataset, Dataset>)



# File 'lib/data_modeler/dataset/generator.rb', line 74

def peek
  [self.train(@local_nrun), self.test(@local_nrun)]
end
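The difference between `#peek` and `#next` comes down to `tap`: the pair is built first, and the counter is incremented only afterwards. A minimal sketch of that pattern follows, using a stand-in class. `RunCounter` is hypothetical, it returns symbols instead of `Dataset` objects, and starting the counter at 1 is an assumption based on the overview diagram.

```ruby
# Hypothetical stand-in, not part of DataModeler: mirrors only the
# peek/next counter logic shown in the listings above.
class RunCounter
  def initialize
    @local_nrun = 1 # assumption: runs are numbered from 1
  end

  # Builds the pair for the current run without advancing.
  def peek
    [:"train#{@local_nrun}", :"test#{@local_nrun}"]
  end

  # Returns the same pair as #peek, then advances the counter.
  def next
    peek.tap { @local_nrun += 1 }
  end
end

counter = RunCounter.new
first_peek  = counter.peek
first_next  = counter.next
second_peek = counter.peek
p first_peek, first_next, second_peek
```

Calling `peek` repeatedly keeps returning the pair for the same run, while each call to `next` moves on to the following run.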

#test(nrun) ⇒ Dataset

Note:

Train and test sets have no meaning alone, and train always comes first. Hence, `#train` checks that enough `data` is available for both `train`+`test`.

Builds test sets for model testing

Parameters:

  • nrun (Integer)

    the run number; a different test set is built for each run

Returns:

  • (Dataset)



# File 'lib/data_modeler/dataset/generator.rb', line 62

def test nrun
  first = min_eligible_trg + (nrun-1) * test_size + train_size
  last = first + test_size
  DataModeler::Dataset.new data, ds_args.merge(first_idx: first, end_idx: last)
end

#to_a ⇒ Array<Array<Array<...>>>

Returns an array of arrays (list of inputs-targets pairs)

Returns:

  • (Array<Array<Array<...>>>)



# File 'lib/data_modeler/dataset/generator.rb', line 94

def to_a
  to_ds_a.collect do |train_test_for_run|
    train_test_for_run.collect &:to_a
  end
end

#to_ds_a ⇒ Array<Array<Dataset>>

Returns an array of datasets

Returns:

  • (Array<Array<Dataset>>)



# File 'lib/data_modeler/dataset/generator.rb', line 91

alias_method :to_ds_a, :to_a

#train(nrun) ⇒ Dataset

Note:

Train and test sets have no meaning alone, and train always comes first. Hence, `#train` checks that enough `data` is available for both `train`+`test`.

Builds training sets for model training

Parameters:

  • nrun (Integer)

    the run number; a different training set is built for each run

Returns:

  • (Dataset)

Raises:

  • (NoDataLeft)

    when there’s not enough data left for a full train+test



# File 'lib/data_modeler/dataset/generator.rb', line 50

def train nrun
  first = min_eligible_trg + (nrun-1) * test_size
  last = first + train_size
  raise NoDataLeft unless last + test_size < nrows  # make sure there's enough data
  DataModeler::Dataset.new data, ds_args.merge(first_idx: first, end_idx: last)
end
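The guard in `#train` also determines how many runs a given amount of data can support. The sketch below counts them by repeating the same check until it fails. `available_runs` is a hypothetical helper, not part of DataModeler, and `offset` stands in for `min_eligible_trg`, whose definition is not shown on this page.

```ruby
# Hypothetical helper, not part of DataModeler: counts how many runs pass
# the `last + test_size < nrows` guard from the #train listing.
# `offset` stands in for min_eligible_trg (assumed known).
def available_runs(nrows:, offset:, train_size:, test_size:)
  nrun = 1
  loop do
    last = offset + (nrun - 1) * test_size + train_size
    break unless last + test_size < nrows # same condition as in #train
    nrun += 1
  end
  nrun - 1 # the last run that passed the guard
end

puts available_runs(nrows: 20, offset: 2, train_size: 6, test_size: 3)
puts available_runs(nrows: 21, offset: 2, train_size: 6, test_size: 3)
```

Because the comparison is strict, a run whose test window would end exactly at `nrows` is rejected, so in this sketch 20 rows allow one run fewer than 21 rows.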