Class: Ai4r::Data::DataSet

Inherits: Object

Defined in: lib/ai4r/data/data_set.rb

Overview

A data set is a collection of N data items. Each data item is described by a set of attributes, represented as an array. Optionally, you can assign a label to the attributes, using the data_labels property.

Instance Attribute Summary

#data_items ⇒ Object readonly

Returns the value of attribute data_items.
#data_labels ⇒ Object readonly

Returns the value of attribute data_labels.

Class Method Summary

.normalized(data_set, method: :zscore) ⇒ Object

Return a new DataSet with numeric attributes normalized.

Instance Method Summary

#<<(data_item) ⇒ Object

Add a data item to the data set.
#[](index) ⇒ Object

Retrieve a new DataSet, with the item(s) selected by the provided index.
#build_domain(attr) ⇒ Object

Returns a Set instance containing all possible values for an attribute. The parameter can be an attribute label or a 0-based index.
#build_domains ⇒ Object

Returns an array with the domain of each attribute: a Set instance containing all possible values for nominal attributes, or an Array with min and max values for numeric attributes (i.e. [min, max]).
#category_label ⇒ Object

Returns the label of the category attribute (the last data label).
#check_not_empty ⇒ Object

Raise an exception if there is no data item.
#get_index(attr) ⇒ Object

Returns the index of a given attribute (0-based).
#get_mean_or_mode ⇒ Object

Returns an array with the mean value of numeric attributes, and the most frequent value of non numeric attributes.
#initialize(options = {}) ⇒ Object constructor

Create a new DataSet.
#load_csv(filepath, parse_numeric: false) ⇒ Object

Load data items from csv file.
#load_csv_with_labels(filepath, parse_numeric: false) ⇒ Object

Load data items from csv file.
#normalize!(method = :zscore) ⇒ Object

Normalize numeric attributes in place.
#num_attributes ⇒ Object

Returns the number of attributes, including the class attribute.
#open_csv_file(filepath) ⇒ Object

Open a CSV file and yield each row to the provided block.
#parse_csv(filepath) ⇒ Object

Same as load_csv, but it will try to convert cell contents to numbers.
#parse_csv_with_labels(filepath) ⇒ Object

Same as load_csv_with_labels, but it will try to convert cell contents to numbers.
#set_data_items(items) ⇒ Object

Set the data items.
#set_data_labels(labels) ⇒ Object

Set data labels.
#shuffle!(seed: nil) ⇒ DataSet

Randomizes the order of data items in place.
#split(ratio:) ⇒ Array<DataSet, DataSet>

Split the dataset into two new DataSet instances using the given ratio for the first set.

Constructor Details

#initialize(options = {}) ⇒ `Object`

Create a new DataSet, empty by default. Optionally, you can provide the initial data items and data labels.

e.g. DataSet.new(:data_items => data_items, :data_labels => labels)

If you provide data items but no data labels, the data set will use the default data label values (see set_data_labels).

Parameters:

options (Object) (defaults to: {})

# File 'lib/ai4r/data/data_set.rb', line 49

def initialize(options = {})
  @data_labels = []
  @data_items = options[:data_items] || []
  set_data_labels(options[:data_labels]) if options[:data_labels]
  set_data_items(options[:data_items]) if options[:data_items]
end

Instance Attribute Details

#data_items ⇒ `Object` (readonly)

Returns the value of attribute data_items.




# File 'lib/ai4r/data/data_set.rb', line 23

def data_items
  @data_items
end

#data_labels ⇒ `Object` (readonly)

Returns the value of attribute data_labels.




# File 'lib/ai4r/data/data_set.rb', line 23

def data_labels
  @data_labels
end

Class Method Details

.normalized(data_set, method: :zscore) ⇒ `Object`

Return a new DataSet with numeric attributes normalized. Available methods are:

:zscore - subtract the mean and divide by the standard deviation
:minmax - scale values to the [0,1] range

Parameters:

data_set (Object)
method (Object) (defaults to: :zscore)

Returns:

(Object)

# File 'lib/ai4r/data/data_set.rb', line 32

def self.normalized(data_set, method: :zscore)
  new_set = DataSet.new(
    data_items: data_set.data_items.map(&:dup),
    data_labels: data_set.data_labels.dup
  )
  new_set.normalize!(method)
end
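
The `map(&:dup)` call in the listing above matters because `normalize!` rewrites each row in place. The following is a minimal sketch on plain arrays (no Ai4r required; `rows`, `shallow` and `deep` are illustrative names) showing how a per-row copy differs from duplicating only the outer array.

```ruby
# Plain-array illustration; `rows` stands in for DataSet#data_items.
rows = [[1.0, 'a'], [3.0, 'b']]

shallow = rows.dup         # new outer array, same row objects
deep    = rows.map(&:dup)  # new outer array and new row arrays

shallow[0][0] = 99.0       # also changes rows[0][0]
deep[1][0]    = -1.0       # leaves rows[1][0] untouched

p rows  # => [[99.0, "a"], [3.0, "b"]]
p deep  # => [[1.0, "a"], [-1.0, "b"]]
```

Copying each row is what lets `.normalized` leave the original data set unchanged.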

Instance Method Details

#<<(data_item) ⇒ `Object`

Add a data item to the data set.

Returns:

(Object)

# File 'lib/ai4r/data/data_set.rb', line 238

def <<(data_item)
  if data_item.nil? || !data_item.is_a?(Enumerable) || data_item.empty?
    raise ArgumentError, 'Data must not be an non empty array.'
  elsif @data_items.empty?
    set_data_items([data_item])
  elsif data_item.length != num_attributes
    raise ArgumentError, 'Number of attributes do not match. ' \
                         "#{data_item.length} attributes provided, " \
                         "#{num_attributes} attributes expected."
  else
    @data_items << data_item
  end
end

#[](index) ⇒ `Object`

Retrieve a new DataSet, with the item(s) selected by the provided index. You can specify an index range, too.

Parameters:

index (Object)

Returns:

(Object)

# File 'lib/ai4r/data/data_set.rb', line 60

def [](index)
  selected_items = if index.is_a?(Integer)
                     [@data_items[index]]
                   else
                     @data_items[index]
                   end
  DataSet.new(data_items: selected_items,
              data_labels: @data_labels)
end
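
The branch on `Integer` in the listing above exists so that a single index still produces a list of rows. A small sketch of that behaviour on plain arrays follows; `select_rows` is a hypothetical helper, not part of the library.

```ruby
items = [[1, 'x'], [2, 'y'], [3, 'z']]

# Hypothetical helper mirroring the branch in DataSet#[].
def select_rows(items, index)
  # A single Integer index is wrapped so the result is always a list of rows.
  index.is_a?(Integer) ? [items[index]] : items[index]
end

p select_rows(items, 1)     # => [[2, "y"]]
p select_rows(items, 0..1)  # => [[1, "x"], [2, "y"]]
```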

#build_domain(attr) ⇒ `Object`

Returns the domain of an attribute. The parameter can be an attribute label or a 0-based index. The result is one of:

Set instance containing all possible values for nominal attributes
Array with min and max values for numeric attributes (i.e. [min, max])

build_domain("city")
=> #<Set: {"New York", "Chicago"}>

build_domain("age")
=> [5, 85]

build_domain(2) # In this example, the third attribute is gender
=> #<Set: {"M", "F"}>

Parameters:

attr (Object)

Returns:

(Object)

# File 'lib/ai4r/data/data_set.rb', line 205

def build_domain(attr)
  index = get_index(attr)
  return [Statistics.min(self, index), Statistics.max(self, index)] if @data_items.first[index].is_a?(Numeric)

  @data_items.inject(Set.new) { |domain, x| domain << x[index] }
end
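
As a rough illustration of the two result shapes, here is a stand-alone sketch on plain arrays. `domain_of` is a hypothetical helper, and it uses `Array#minmax` in place of the library's `Statistics.min` and `Statistics.max`. Like the listing above, it decides whether a column is numeric by looking only at the first row.

```ruby
require 'set'

# Hypothetical stand-alone version of the domain logic: numeric columns
# yield [min, max], other columns yield a Set of distinct values.
def domain_of(rows, index)
  column = rows.map { |row| row[index] }
  # Only the first value is inspected, as in the listing above.
  return column.minmax if column.first.is_a?(Numeric)

  Set.new(column)
end

rows = [['New York', 25, 'M'], ['Chicago', 5, 'F'], ['Chicago', 85, 'F']]
p domain_of(rows, 1)  # => [5, 85]
city_domain = domain_of(rows, 0)  # Set containing "New York" and "Chicago"
```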

#build_domains ⇒ `Object`

Returns an array with the domain of each attribute:

Set instance containing all possible values for nominal attributes
Array with min and max values for numeric attributes (i.e. [min, max])

Return example:

[#<Set: {"New York", "Chicago"}>,
 #<Set: {"<30", "[30-50)", "[50-80]", ">80"}>,
 #<Set: {"M", "F"}>,
 [5, 85],
 #<Set: {"Y", "N"}>]

Returns:

(Object)




# File 'lib/ai4r/data/data_set.rb', line 186

def build_domains
  @data_labels.collect { |attr_label| build_domain(attr_label) }
end

#category_label ⇒ `Object`

Returns the label of the category attribute (the last data label).

Returns:

(Object)




# File 'lib/ai4r/data/data_set.rb', line 339

def category_label
  data_labels.last
end

#check_not_empty ⇒ `Object`

Raise an exception if there is no data item.

Returns:

(Object)

Raises:

(ArgumentError)

# File 'lib/ai4r/data/data_set.rb', line 230

def check_not_empty
  return unless @data_items.empty?

  raise ArgumentError, 'Examples data set must not be empty.'
end

#get_index(attr) ⇒ `Object`

Returns the index of a given attribute (0-based). For example, if “gender” is the third attribute, then:

get_index("gender")
=> 2

Parameters:

attr (Object)

Returns:

(Object)




# File 'lib/ai4r/data/data_set.rb', line 224

def get_index(attr)
  attr.is_a?(Integer) || attr.is_a?(Range) ? attr : @data_labels.index(attr)
end
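
The lookup above passes integers and ranges through unchanged and resolves anything else against the labels. A short sketch with a hypothetical `index_for` helper shows the three cases, including an unknown label, for which `Array#index` returns `nil`.

```ruby
# Hypothetical stand-alone version of the lookup in #get_index.
def index_for(labels, attr)
  attr.is_a?(Integer) || attr.is_a?(Range) ? attr : labels.index(attr)
end

labels = %w[city age_range gender]
p index_for(labels, 'gender')   # => 2
p index_for(labels, 1)          # => 1
p index_for(labels, 'unknown')  # => nil
```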

#get_mean_or_mode ⇒ `Object`

Returns an array with the mean value of numeric attributes, and the most frequent value of non-numeric attributes.

Returns:

(Object)

# File 'lib/ai4r/data/data_set.rb', line 255

def get_mean_or_mode
  mean = []
  num_attributes.times do |i|
    mean[i] =
      if @data_items.first[i].is_a?(Numeric)
        Statistics.mean(self, i)
      else
        Statistics.mode(self, i)
      end
  end
  mean
end
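
The per-column choice between mean and mode can be sketched without the library's `Statistics` module. The helper name `mean_or_mode` is hypothetical, and how ties between equally frequent values are resolved here may differ from Ai4r's own `Statistics.mode`.

```ruby
# Hypothetical stand-alone sketch; Ai4r's Statistics module is not used here.
def mean_or_mode(rows)
  rows.first.each_index.map do |i|
    column = rows.map { |row| row[i] }
    if column.first.is_a?(Numeric)
      # Arithmetic mean of a numeric column.
      column.sum.to_f / column.size
    else
      # Most frequent value of a non-numeric column.
      column.tally.max_by { |_value, count| count }.first
    end
  end
end

rows = [['Chicago', 10], ['Chicago', 20], ['New York', 60]]
p mean_or_mode(rows)  # => ["Chicago", 30.0]
```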

#load_csv(filepath, parse_numeric: false) ⇒ `Object`

Load data items from a CSV file.

Parameters:

filepath (Object)

Returns:

(Object)

# File 'lib/ai4r/data/data_set.rb', line 73

def load_csv(filepath, parse_numeric: false)
  if parse_numeric
    parse_csv(filepath)
  else
    items = []
    open_csv_file(filepath) do |entry|
      items << entry
    end
    set_data_items(items)
  end
end

#load_csv_with_labels(filepath, parse_numeric: false) ⇒ `Object`

Load data items from a CSV file. The first row is used as data labels.

Parameters:

filepath (Object)

Returns:

(Object)

# File 'lib/ai4r/data/data_set.rb', line 96

def load_csv_with_labels(filepath, parse_numeric: false)
  load_csv(filepath, parse_numeric: parse_numeric)
  @data_labels = @data_items.shift
  self
end
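
The label handling above relies on `Array#shift`, which removes the first row and returns it. A minimal sketch on an in-memory array (standing in for rows read from a file; the values are illustrative) follows.

```ruby
# Rows as they would arrive from a CSV file whose first line is a header.
rows = [%w[city age_range gender], %w[Chicago <30 M], ['New York', '>80', 'F']]

labels = rows.shift  # removes and returns the first row

p labels       # => ["city", "age_range", "gender"]
p rows.length  # => 2
```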

#normalize!(method = :zscore) ⇒ `Object`

Normalize numeric attributes in place. Supported methods are :zscore (default) and :minmax.

Parameters:

method (Object) (defaults to: :zscore)

Returns:

(Object)

# File 'lib/ai4r/data/data_set.rb', line 272

def normalize!(method = :zscore)
  numeric_indices = (0...num_attributes).select do |i|
    @data_items.first[i].is_a?(Numeric)
  end

  case method
  when :zscore
    means = numeric_indices.map { |i| Statistics.mean(self, i) }
    sds = numeric_indices.map { |i| Statistics.standard_deviation(self, i) }
    @data_items.each do |row|
      numeric_indices.each_with_index do |idx, j|
        sd = sds[j]
        row[idx] = sd.zero? ? 0 : (row[idx] - means[j]) / sd
      end
    end
  when :minmax
    mins = numeric_indices.map { |i| Statistics.min(self, i) }
    maxs = numeric_indices.map { |i| Statistics.max(self, i) }
    @data_items.each do |row|
      numeric_indices.each_with_index do |idx, j|
        range = maxs[j] - mins[j]
        row[idx] = range.zero? ? 0 : (row[idx] - mins[j]) / range.to_f
      end
    end
  else
    raise ArgumentError, "Unknown normalization method #{method}"
  end

  self
end
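
To make the two formulas concrete, here is a stand-alone sketch applied to a single numeric column. The helper names are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption: the listing delegates to `Statistics.standard_deviation`, whose definition is not shown here.

```ruby
# Stand-alone sketch on one numeric column. The sample standard deviation
# (n - 1 denominator) is an assumption, not taken from Ai4r's Statistics.
def zscore(column)
  mean = column.sum.to_f / column.size
  variance = column.sum { |v| (v - mean)**2 } / (column.size - 1)
  sd = Math.sqrt(variance)
  column.map { |v| sd.zero? ? 0 : (v - mean) / sd }
end

# Rescales a column to [0, 1]; a constant column maps to all zeros,
# matching the zero-range guard in the listing above.
def minmax(column)
  min, max = column.minmax
  range = max - min
  column.map { |v| range.zero? ? 0 : (v - min) / range.to_f }
end

p zscore([1, 2, 3])  # => [-1.0, 0.0, 1.0]
p minmax([1, 2, 3])  # => [0.0, 0.5, 1.0]
p minmax([4, 4, 4])  # => [0, 0, 0]
```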

#num_attributes ⇒ `Object`

Returns the number of attributes, including the class attribute.

Returns:

(Object)




# File 'lib/ai4r/data/data_set.rb', line 214

def num_attributes
  @data_items.empty? ? 0 : @data_items.first.size
end

#open_csv_file(filepath) ⇒ `Object`

Open a CSV file and yield each row to the provided block.

Parameters:

filepath (Object)
block (Object)

Returns:

(Object)




# File 'lib/ai4r/data/data_set.rb', line 89

def open_csv_file(filepath, &)
  CSV.foreach(filepath, &)
end

#parse_csv(filepath) ⇒ `Object`

Same as load_csv, but it will try to convert cell contents to numbers.

Parameters:

filepath (Object)

Returns:

(Object)

# File 'lib/ai4r/data/data_set.rb', line 105

def parse_csv(filepath)
  items = []
  open_csv_file(filepath) do |row|
    items << row.collect do |x|
      number?(x) ? Float(x, exception: false) : x
    end
  end
  set_data_items(items)
end
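
The conversion step can be sketched in isolation. The helper `to_number_or_self` is hypothetical: the library's private `number?` check is not shown in the listing, so this sketch simply relies on `Float(..., exception: false)` returning `nil` for non-numeric strings.

```ruby
# Hypothetical stand-in for the conversion step; the library's private
# `number?` helper is not shown in the listing, so it is not reproduced here.
def to_number_or_self(cell)
  Float(cell, exception: false) || cell
end

row = ['Chicago', '42', '3.5', '<30']
p row.map { |cell| to_number_or_self(cell) }
# => ["Chicago", 42.0, 3.5, "<30"]
```

Note that numeric cells come back as Float values, so an integer-looking cell such as '42' becomes 42.0.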

#parse_csv_with_labels(filepath) ⇒ `Object`

Same as load_csv_with_labels, but it will try to convert cell contents to numbers.

Parameters:

filepath (Object)

Returns:

(Object)




# File 'lib/ai4r/data/data_set.rb', line 118

def parse_csv_with_labels(filepath)
  load_csv_with_labels(filepath, parse_numeric: true)
end

#set_data_items(items) ⇒ `Object`

Set the data items. M data items with N attributes must have the following format:

[   [ATT1_VAL1, ATT2_VAL1, ATT3_VAL1, ... , ATTN_VAL1,  CLASS_VAL1],
    [ATT1_VAL2, ATT2_VAL2, ATT3_VAL2, ... , ATTN_VAL2,  CLASS_VAL2],
    ...
    [ATT1_VALM, ATT2_VALM, ATT3_VALM, ... , ATTN_VALM,  CLASS_VALM],
]

e.g.

[   ['New York', '<30',     'M', 'Y'],
    ['Chicago',  '<30',     'M', 'Y'],
    ['Chicago',  '<30',     'F', 'Y'],
    ['New York', '<30',     'M', 'Y'],
    ['New York', '<30',     'M', 'Y'],
    ['Chicago',  '[30-50)', 'M', 'Y'],
    ['New York', '[30-50)', 'F', 'N'],
    ['Chicago',  '[30-50)', 'F', 'Y'],
    ['New York', '[30-50)', 'F', 'N'],
    ['Chicago',  '[50-80]', 'M', 'N'],
    ['New York', '[50-80]', 'F', 'N'],
    ['New York', '[50-80]', 'M', 'N'],
    ['Chicago',  '[50-80]', 'M', 'N'],
    ['New York', '[50-80]', 'F', 'N'],
    ['Chicago',  '>80',     'F', 'Y']
]

This method returns the data set (self), allowing method chaining.

Parameters:

items (Object)

Returns:

(Object)

# File 'lib/ai4r/data/data_set.rb', line 168

def set_data_items(items)
  check_data_items(items)
  @data_labels = default_data_labels(items) if @data_labels.empty?
  @data_items = items
  self
end

#set_data_labels(labels) ⇒ `Object`

Set data labels. Data labels must have the following format:

[ 'city', 'age_range', 'gender', 'marketing_target'  ]

If you do not provide labels for your data, the following labels will be created by default:

[ 'attribute_1', 'attribute_2', 'attribute_3', 'class_value'  ]

Parameters:

labels (Object)

Returns:

(Object)

# File 'lib/ai4r/data/data_set.rb', line 131

def set_data_labels(labels)
  check_data_labels(labels)
  @data_labels = labels
  self
end
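
The default label format described above can be reproduced with a few lines. This is only a sketch of the documented naming pattern: `default_labels` is a hypothetical helper, and the library's own `default_data_labels` is not shown in the listing.

```ruby
# Hypothetical helper: the library's own default_data_labels is not shown
# in the listing, so this only reproduces the documented label format.
def default_labels(items)
  count = items.first.size
  # Every column except the last is numbered from 1; the last is the class.
  labels = (1...count).map { |i| "attribute_#{i}" }
  labels << 'class_value'
end

p default_labels([['New York', '<30', 'M', 'Y']])
# => ["attribute_1", "attribute_2", "attribute_3", "class_value"]
```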

#shuffle!(seed: nil) ⇒ `DataSet`

Randomizes the order of data items in place. If a seed is provided, it is used to initialize the random number generator for deterministic shuffling.

data_set.shuffle!(seed: 123)

Parameters:

seed (Integer, nil) (defaults to: nil) —

Seed for the RNG

Returns:

(DataSet) —

self

# File 'lib/ai4r/data/data_set.rb', line 311

def shuffle!(seed: nil)
  rng = seed ? Random.new(seed) : Random.new
  @data_items.shuffle!(random: rng)
  self
end
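
The effect of the seed can be checked on a plain array, since `shuffle!` in the listing above delegates to Ruby's `Array#shuffle!` with a `random:` generator. A small sketch, using the non-mutating `Array#shuffle` so the input stays intact:

```ruby
items = (1..10).to_a

# Two generators created from the same seed produce the same permutation.
a = items.shuffle(random: Random.new(123))
b = items.shuffle(random: Random.new(123))

p a == b           # => true
p a.sort == items  # => true (same elements, only the order changes)
```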

#split(ratio:) ⇒ `Array<DataSet, DataSet>`

Split the dataset into two new DataSet instances using the given ratio for the first set.

train, test = data_set.split(ratio: 0.8)

Parameters:

ratio (Float) —

fraction of items to place in the first set

Returns:

(Array<DataSet, DataSet>) —

the two resulting datasets

Raises:

(ArgumentError)

# File 'lib/ai4r/data/data_set.rb', line 324

def split(ratio:)
  raise ArgumentError, 'ratio must be between 0 and 1' unless ratio.positive? && ratio < 1

  pivot = (ratio * @data_items.length).round
  first_items = @data_items[0...pivot].map(&:dup)
  second_items = @data_items[pivot..].map(&:dup)

  [
    DataSet.new(data_items: first_items, data_labels: @data_labels.dup),
    DataSet.new(data_items: second_items, data_labels: @data_labels.dup)
  ]
end
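
The size of each part follows from the rounded pivot in the listing above. The sketch below isolates that arithmetic; `split_sizes` is a hypothetical helper. It also shows that, for small data sets, a valid ratio can still round to an empty first set.

```ruby
# Hypothetical helper reproducing only the pivot arithmetic from #split.
def split_sizes(total, ratio)
  pivot = (ratio * total).round
  [pivot, total - pivot]
end

p split_sizes(10, 0.8)  # => [8, 2]
p split_sizes(3, 0.5)   # => [2, 1]
p split_sizes(3, 0.1)   # => [0, 3]
```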
