Class: Ai4r::Data::DataSet

Inherits:
Object
Defined in:
lib/ai4r/data/data_set.rb

Overview

A data set is a collection of N data items. Each data item is described by a set of attributes, represented as an array. Optionally, you can assign a label to the attributes, using the data_labels property.


Constructor Details

#initialize(options = {}) ⇒ Object

Creates a new DataSet, empty by default. Optionally, you can provide the initial data items and data labels.

e.g. DataSet.new(:data_items => data_items, :data_labels => labels)

If you provide data items but no data labels, the data set uses the default data label values (see set_data_labels).

Parameters:

  • options (Object) (defaults to: {})


# File 'lib/ai4r/data/data_set.rb', line 49

def initialize(options = {})
  @data_labels = []
  @data_items = options[:data_items] || []
  set_data_labels(options[:data_labels]) if options[:data_labels]
  set_data_items(options[:data_items]) if options[:data_items]
end

Instance Attribute Details

#data_items ⇒ Object (readonly)

Returns the value of attribute data_items.



# File 'lib/ai4r/data/data_set.rb', line 23

def data_items
  @data_items
end

#data_labels ⇒ Object (readonly)

Returns the value of attribute data_labels.



# File 'lib/ai4r/data/data_set.rb', line 23

def data_labels
  @data_labels
end

Class Method Details

.normalized(data_set, method: :zscore) ⇒ Object

Return a new DataSet with numeric attributes normalized. Available methods are:

  • :zscore - subtract the mean and divide by the standard deviation

  • :minmax - scale values to the [0,1] range

Parameters:

  • data_set (Object)
  • method (Object) (defaults to: :zscore)

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 32

def self.normalized(data_set, method: :zscore)
  new_set = DataSet.new(
    data_items: data_set.data_items.map(&:dup),
    data_labels: data_set.data_labels.dup
  )
  new_set.normalize!(method)
end
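
The copy-then-normalize pattern above can be sketched without the library. This is a minimal stand-in, assuming plain arrays of rows in place of a DataSet; minmax_normalized is a hypothetical helper, not part of Ai4r.

```ruby
# Hypothetical stand-in for DataSet.normalized with method: :minmax.
# Rows are plain arrays; the numeric check looks at the first row only,
# as the library source does.
def minmax_normalized(rows)
  copy = rows.map(&:dup) # duplicate each row so the input stays untouched
  numeric = (0...copy.first.size).select { |i| copy.first[i].is_a?(Numeric) }
  numeric.each do |i|
    min, max = copy.map { |row| row[i] }.minmax
    range = max - min
    copy.each { |row| row[i] = range.zero? ? 0 : (row[i] - min) / range.to_f }
  end
  copy
end

original = [[5, 'M'], [45, 'F'], [85, 'M']]
scaled = minmax_normalized(original)
p scaled   # => [[0.0, "M"], [0.5, "F"], [1.0, "M"]]
p original # => [[5, "M"], [45, "F"], [85, "M"]]
```

Duplicating each row matters because normalization overwrites cells in place; copying only the outer array would still modify the caller's rows.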

Instance Method Details

#<<(data_item) ⇒ Object

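Adds a data item to the data set. The item must be a non-empty array with the same number of attributes as the existing items; otherwise an ArgumentError is raised.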

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 238

def <<(data_item)
  if data_item.nil? || !data_item.is_a?(Enumerable) || data_item.empty?
    raise ArgumentError, 'Data must not be an non empty array.'
  elsif @data_items.empty?
    set_data_items([data_item])
  elsif data_item.length != num_attributes
    raise ArgumentError, 'Number of attributes do not match. ' \
                         "#{data_item.length} attributes provided, " \
                         "#{num_attributes} attributes expected."
  else
    @data_items << data_item
  end
end
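
The checks performed by #<< can be illustrated with a small stand-in. This is a sketch only: append_item is a hypothetical helper working on a plain array, and its error messages are not the library's.

```ruby
# Hypothetical stand-in mirroring the validation order of #<<.
def append_item(items, item)
  if item.nil? || !item.is_a?(Enumerable) || item.empty?
    raise ArgumentError, 'item must be a non-empty array'
  elsif !items.empty? && item.length != items.first.length
    raise ArgumentError,
          "expected #{items.first.length} attributes, got #{item.length}"
  end
  items << item
end

items = []
append_item(items, ['Chicago', '<30', 'M', 'Y'])

begin
  append_item(items, ['New York', 'F']) # wrong number of attributes
rescue ArgumentError => e
  puts e.message # prints "expected 4 attributes, got 2"
end
```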

#[](index) ⇒ Object

Retrieve a new DataSet, with the item(s) selected by the provided index. You can specify an index range, too.

Parameters:

  • index (Object)

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 60

def [](index)
  selected_items = if index.is_a?(Integer)
                     [@data_items[index]]
                   else
                     @data_items[index]
                   end
  DataSet.new(data_items: selected_items,
              data_labels: @data_labels)
end
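
The Integer branch exists so that a single index still yields an array of rows. A minimal sketch of that selection logic on a plain array (the select lambda is hypothetical):

```ruby
items = [%w[a 1], %w[b 2], %w[c 3]]

# Wrap a single row in an array; ranges already return an array of rows.
select = ->(index) { index.is_a?(Integer) ? [items[index]] : items[index] }

p select.call(1)    # => [["b", "2"]]
p select.call(0..1) # => [["a", "1"], ["b", "2"]]
```

Note that the new DataSet in the source above receives @data_labels itself, not a copy, so the labels array is shared with the original set.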

#build_domain(attr) ⇒ Object

Returns the domain of an attribute. The parameter can be an attribute label or index (0-based).

  • Set instance containing all possible values for nominal attributes

  • Array with min and max values for numeric attributes (i.e. [min, max])

    build_domain("city")

    > #<Set: {"New York", "Chicago"}>

    build_domain("age")

    > [5, 85]

    build_domain(2) # In this example, the third attribute is gender

    > #<Set: {"M", "F"}>

Parameters:

  • attr (Object)

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 205

def build_domain(attr)
  index = get_index(attr)
  return [Statistics.min(self, index), Statistics.max(self, index)] if @data_items.first[index].is_a?(Numeric)

  @data_items.inject(Set.new) { |domain, x| domain << x[index] }
end
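
The two return shapes can be reproduced on plain arrays. This is a sketch under the assumption that Statistics.min and Statistics.max return the column minimum and maximum; domain is a hypothetical helper.

```ruby
require 'set'

# Hypothetical stand-in for build_domain on an array of rows.
# As in the source, the first row alone decides whether a column is numeric.
def domain(rows, index)
  column = rows.map { |row| row[index] }
  return column.minmax if column.first.is_a?(Numeric)

  column.to_set
end

rows = [['New York', 25, 'M'], ['Chicago', 5, 'F'], ['Chicago', 85, 'M']]
p domain(rows, 1)      # => [5, 85]
p domain(rows, 0).to_a # => ["New York", "Chicago"]
```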

#build_domains ⇒ Object

Returns an array with the domain of each attribute:

  • Set instance containing all possible values for nominal attributes

  • Array with min and max values for numeric attributes (i.e. [min, max])

Return example:

> [#<Set: {"New York", "Chicago"}>,
#<Set: {"<30", "[30-50)", "[50-80]", ">80"}>,
#<Set: {"M", "F"}>,
[5, 85],
#<Set: {"Y", "N"}>]

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 186

def build_domains
  @data_labels.collect { |attr_label| build_domain(attr_label) }
end

#category_label ⇒ Object

Returns the label of the category (class) attribute, which is the last data label.

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 339

def category_label
  data_labels.last
end

#check_not_empty ⇒ Object

Raise an exception if there is no data item.

Returns:

  • (Object)

Raises:

  • (ArgumentError)


# File 'lib/ai4r/data/data_set.rb', line 230

def check_not_empty
  return unless @data_items.empty?

  raise ArgumentError, 'Examples data set must not be empty.'
end

#get_index(attr) ⇒ Object

Returns the index of a given attribute (0-based). For example, if “gender” is the third attribute, then:

get_index("gender")
=> 2

Parameters:

  • attr (Object)

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 224

def get_index(attr)
  attr.is_a?(Integer) || attr.is_a?(Range) ? attr : @data_labels.index(attr)
end
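
The lookup rule is short enough to try in isolation. A sketch with a local labels array standing in for @data_labels:

```ruby
labels = %w[city age_range gender marketing_target]

# Integers and ranges pass through unchanged; anything else is looked up by label.
get_index = lambda do |attr|
  attr.is_a?(Integer) || attr.is_a?(Range) ? attr : labels.index(attr)
end

p get_index.call('gender')  # => 2
p get_index.call(1)         # => 1
p get_index.call('unknown') # => nil
```

An unknown label resolves to nil rather than raising, so callers that index rows with the result may want to check for it first.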

#get_mean_or_mode ⇒ Object

Returns an array with the mean value of each numeric attribute and the most frequent value of each non-numeric attribute.

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 255

def get_mean_or_mode
  mean = []
  num_attributes.times do |i|
    mean[i] =
      if @data_items.first[i].is_a?(Numeric)
        Statistics.mean(self, i)
      else
        Statistics.mode(self, i)
      end
  end
  mean
end
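
A per-column mean-or-mode pass can be sketched on plain arrays. Statistics.mean and Statistics.mode are not shown on this page, so the arithmetic mean and a tally-based mode below are assumptions, and mean_or_mode is a hypothetical helper. How the library breaks ties between equally frequent values is not covered here.

```ruby
# Hypothetical stand-in for get_mean_or_mode on an array of rows.
def mean_or_mode(rows)
  (0...rows.first.size).map do |i|
    column = rows.map { |row| row[i] }
    if column.first.is_a?(Numeric)
      column.sum / column.size.to_f                       # arithmetic mean
    else
      column.tally.max_by { |_value, count| count }.first # most frequent value
    end
  end
end

rows = [['Chicago', 10], ['Chicago', 20], ['New York', 60]]
p mean_or_mode(rows) # => ["Chicago", 30.0]
```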

#load_csv(filepath, parse_numeric: false) ⇒ Object

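Load data items from a CSV file. If parse_numeric is true, the call is delegated to parse_csv, which converts numeric cells to numbers.

Parameters:

  • filepath (Object)
  • parse_numeric (Object) (defaults to: false)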

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 73

def load_csv(filepath, parse_numeric: false)
  if parse_numeric
    parse_csv(filepath)
  else
    items = []
    open_csv_file(filepath) do |entry|
      items << entry
    end
    set_data_items(items)
  end
end

#load_csv_with_labels(filepath, parse_numeric: false) ⇒ Object

Load data items from a CSV file. The first row is used as data labels.

Parameters:

  • filepath (Object)
  • parse_numeric (Object) (defaults to: false)

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 96

def load_csv_with_labels(filepath, parse_numeric: false)
  load_csv(filepath, parse_numeric: parse_numeric)
  @data_labels = @data_items.shift
  self
end
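
The header handling relies on Array#shift, which removes and returns the first row. A sketch with already-parsed rows standing in for the CSV contents:

```ruby
# Rows as they would look after reading a CSV file with a header line.
rows = [%w[city age_range], %w[Chicago <30], ['New York', '[30-50)']]

labels = rows.shift # first row becomes the labels and leaves the data

p labels # => ["city", "age_range"]
p rows   # => [["Chicago", "<30"], ["New York", "[30-50)"]]
```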

#normalize!(method = :zscore) ⇒ Object

Normalize numeric attributes in place. Supported methods are :zscore (default) and :minmax.

Parameters:

  • method (Object) (defaults to: :zscore)

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 272

def normalize!(method = :zscore)
  numeric_indices = (0...num_attributes).select do |i|
    @data_items.first[i].is_a?(Numeric)
  end

  case method
  when :zscore
    means = numeric_indices.map { |i| Statistics.mean(self, i) }
    sds = numeric_indices.map { |i| Statistics.standard_deviation(self, i) }
    @data_items.each do |row|
      numeric_indices.each_with_index do |idx, j|
        sd = sds[j]
        row[idx] = sd.zero? ? 0 : (row[idx] - means[j]) / sd
      end
    end
  when :minmax
    mins = numeric_indices.map { |i| Statistics.min(self, i) }
    maxs = numeric_indices.map { |i| Statistics.max(self, i) }
    @data_items.each do |row|
      numeric_indices.each_with_index do |idx, j|
        range = maxs[j] - mins[j]
        row[idx] = range.zero? ? 0 : (row[idx] - mins[j]) / range.to_f
      end
    end
  else
    raise ArgumentError, "Unknown normalization method #{method}"
  end

  self
end
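
The :zscore branch can be sketched for a single column of plain rows. Statistics.standard_deviation is not shown on this page, so the sample standard deviation (n - 1 denominator) used below is an assumption, and zscore! is a hypothetical helper that expects at least two rows.

```ruby
# Hypothetical stand-in for the :zscore branch, applied to one column.
def zscore!(rows, index)
  values = rows.map { |row| row[index] }
  mean = values.sum / values.size.to_f
  # Assumption: sample standard deviation (n - 1 in the denominator).
  variance = values.sum { |v| (v - mean)**2 } / (values.size - 1)
  sd = Math.sqrt(variance)
  # A constant column has sd == 0; map it to 0 instead of dividing by zero.
  rows.each { |row| row[index] = sd.zero? ? 0 : (row[index] - mean) / sd }
  rows
end

rows = [[2, 'a'], [4, 'b'], [6, 'c']]
zscore!(rows, 0)
p rows # => [[-1.0, "a"], [0.0, "b"], [1.0, "c"]]
```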

#num_attributes ⇒ Object

Returns the number of attributes, including the class attribute.

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 214

def num_attributes
  @data_items.empty? ? 0 : @data_items.first.size
end

#open_csv_file(filepath) ⇒ Object

Open a CSV file and yield each row to the provided block.

Parameters:

  • filepath (Object)
  • block (Object)

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 89

def open_csv_file(filepath, &)
  CSV.foreach(filepath, &)
end

#parse_csv(filepath) ⇒ Object

Same as load_csv, but it tries to convert cell contents to numbers.

Parameters:

  • filepath (Object)

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 105

def parse_csv(filepath)
  items = []
  open_csv_file(filepath) do |row|
    items << row.collect do |x|
      number?(x) ? Float(x, exception: false) : x
    end
  end
  set_data_items(items)
end
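
The per-cell conversion can be tried on its own. The number? helper is not shown on this page, so this sketch assumes a cell counts as numeric whenever Float can parse it; convert is a hypothetical lambda.

```ruby
# Float(..., exception: false) returns nil instead of raising on bad input,
# so non-numeric cells fall back to the original string.
convert = ->(cell) { Float(cell, exception: false) || cell }

row = ['Chicago', '42', '3.5', '<30']
p row.map(&convert) # => ["Chicago", 42.0, 3.5, "<30"]
```

Every converted cell becomes a Float, including values written as integers in the file.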

#parse_csv_with_labels(filepath) ⇒ Object

Same as load_csv_with_labels, but it tries to convert cell contents to numbers.

Parameters:

  • filepath (Object)

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 118

def parse_csv_with_labels(filepath)
  load_csv_with_labels(filepath, parse_numeric: true)
end

#set_data_items(items) ⇒ Object

Set the data items. M data items with N attributes must have the following format:

[   [ATT1_VAL1, ATT2_VAL1, ATT3_VAL1, ... , ATTN_VAL1,  CLASS_VAL1],
    [ATT1_VAL2, ATT2_VAL2, ATT3_VAL2, ... , ATTN_VAL2,  CLASS_VAL2],
    ...
    [ATT1_VALM, ATT2_VALM, ATT3_VALM, ... , ATTN_VALM,  CLASS_VALM]
]

e.g.

[   ['New York',  '<30',      'M', 'Y'],
     ['Chicago',     '<30',      'M', 'Y'],
     ['Chicago',     '<30',      'F', 'Y'],
     ['New York',  '<30',      'M', 'Y'],
     ['New York',  '<30',      'M', 'Y'],
     ['Chicago',     '[30-50)',  'M', 'Y'],
     ['New York',  '[30-50)',  'F', 'N'],
     ['Chicago',     '[30-50)',  'F', 'Y'],
     ['New York',  '[30-50)',  'F', 'N'],
     ['Chicago',     '[50-80]', 'M', 'N'],
     ['New York',  '[50-80]', 'F', 'N'],
     ['New York',  '[50-80]', 'M', 'N'],
     ['Chicago',     '[50-80]', 'M', 'N'],
     ['New York',  '[50-80]', 'F', 'N'],
     ['Chicago',     '>80',      'F', 'Y']
   ]

This method returns the data set (self), allowing method chaining.

Parameters:

  • items (Object)

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 168

def set_data_items(items)
  check_data_items(items)
  @data_labels = default_data_labels(items) if @data_labels.empty?
  @data_items = items
  self
end

#set_data_labels(labels) ⇒ Object

Set data labels. Data labels must have the following format:

[ 'city', 'age_range', 'gender', 'marketing_target'  ]

If you do not provide labels for your data, the following labels will be created by default:

[ 'attribute_1', 'attribute_2', 'attribute_3', 'class_value'  ]

Parameters:

  • labels (Object)

Returns:

  • (Object)


# File 'lib/ai4r/data/data_set.rb', line 131

def set_data_labels(labels)
  check_data_labels(labels)
  @data_labels = labels
  self
end
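
The default label format described above can be generated as follows. The library's default_data_labels helper is not shown on this page, so default_labels here is a hypothetical reconstruction from the documented output only.

```ruby
# Hypothetical reconstruction: one "attribute_N" label per column,
# with the last column named "class_value".
def default_labels(items)
  count = items.first.size
  (1...count).map { |i| "attribute_#{i}" } + ['class_value']
end

p default_labels([['New York', '<30', 'M', 'Y']])
# => ["attribute_1", "attribute_2", "attribute_3", "class_value"]
```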

#shuffle!(seed: nil) ⇒ DataSet

Randomizes the order of data items in place. If a seed is provided, it is used to initialize the random number generator for deterministic shuffling.

data_set.shuffle!(seed: 123)

Parameters:

  • seed (Integer, nil) (defaults to: nil)

    Seed for the RNG

Returns:

  • (DataSet)

    self, to allow method chaining



# File 'lib/ai4r/data/data_set.rb', line 311

def shuffle!(seed: nil)
  rng = seed ? Random.new(seed) : Random.new
  @data_items.shuffle!(random: rng)
  self
end
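
The effect of the seed can be checked with plain arrays, using the same Random.new and shuffle(random:) calls as the source above. The non-destructive Array#shuffle is used here only so both results can be compared.

```ruby
items = (1..10).to_a

# Two generators created from the same seed produce the same permutation.
first  = items.shuffle(random: Random.new(123))
second = items.shuffle(random: Random.new(123))

p first == second      # => true
p first.sort == items  # => true (same elements, only the order changes)
```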

#split(ratio:) ⇒ Array<DataSet, DataSet>

Split the dataset into two new DataSet instances using the given ratio for the first set.

train, test = data_set.split(ratio: 0.8)

Parameters:

  • ratio (Float)

    fraction of items to place in the first set

Returns:

  • (Array<DataSet, DataSet>)

Raises:

  • (ArgumentError)


# File 'lib/ai4r/data/data_set.rb', line 324

def split(ratio:)
  raise ArgumentError, 'ratio must be between 0 and 1' unless ratio.positive? && ratio < 1

  pivot = (ratio * @data_items.length).round
  first_items = @data_items[0...pivot].map(&:dup)
  second_items = @data_items[pivot..].map(&:dup)

  [
    DataSet.new(data_items: first_items, data_labels: @data_labels.dup),
    DataSet.new(data_items: second_items, data_labels: @data_labels.dup)
  ]
end
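
The pivot arithmetic can be sketched on a plain array of rows. split_rows is a hypothetical helper that mirrors the source above without building DataSet instances.

```ruby
# Hypothetical stand-in for #split on an array of rows.
def split_rows(rows, ratio)
  raise ArgumentError, 'ratio must be between 0 and 1' unless ratio.positive? && ratio < 1

  pivot = (ratio * rows.length).round # number of rows in the first part
  [rows[0...pivot].map(&:dup), rows[pivot..].map(&:dup)]
end

rows = (1..10).map { |i| [i, i.odd? ? 'Y' : 'N'] }
train, holdout = split_rows(rows, 0.8)

p train.size   # => 8
p holdout.size # => 2
```

The split keeps the existing row order, so calling shuffle! beforehand is one way to avoid an ordered file producing biased partitions.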