Class: Ai4r::Data::DataSet

Inherits:
Object
Defined in:
lib/ai4r/data/data_set.rb

Overview

A data set is a collection of N data items. Each data item is described by a set of attributes, represented as an array. Optionally, you can assign a label to the attributes, using the data_labels property.

Constructor Details

#initialize(options = {}) ⇒ DataSet

Creates a new DataSet, empty by default. Optionally, you can provide the initial data items and data labels.

e.g. DataSet.new(:data_items => data_items, :data_labels => labels)

If you provide data items but no data labels, the data set uses the default data label values (see set_data_labels).



# File 'lib/ai4r/data/data_set.rb', line 32

def initialize(options = {})
  @data_labels = []
  @data_items = options[:data_items] || []
  set_data_labels(options[:data_labels]) if options[:data_labels]
  set_data_items(options[:data_items]) if options[:data_items]
end

Instance Attribute Details

#data_items ⇒ Object (readonly)

Returns the value of attribute data_items.



# File 'lib/ai4r/data/data_set.rb', line 23

def data_items
  @data_items
end

#data_labels ⇒ Object (readonly)

Returns the value of attribute data_labels.



# File 'lib/ai4r/data/data_set.rb', line 23

def data_labels
  @data_labels
end

Instance Method Details

#<<(data_item) ⇒ Object

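Adds a data item to the data set. Raises an ArgumentError if the item is nil, not Enumerable, or empty, or if its length differs from the number of attributes of the items already in the set.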



# File 'lib/ai4r/data/data_set.rb', line 201

def << data_item
  if data_item.nil? || !data_item.is_a?(Enumerable) || data_item.empty?
    raise ArgumentError, "Data must not be an non empty array."
  elsif @data_items.empty?
    set_data_items([data_item])
  elsif data_item.length != num_attributes
    raise ArgumentError, "Number of attributes do not match. " +
            "#{data_item.length} attributes provided, " +
            "#{num_attributes} attributes expected."
  else
    @data_items << data_item
  end
end
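
The validation order above can be illustrated without the gem. The following is a minimal sketch, not the library's code: `append_item` is a hypothetical helper that works on a plain array of rows, and it accepts only Array items (the real method checks for Enumerable).

```ruby
# Hypothetical stand-in for DataSet#<<, operating on a plain array of rows.
# Simplification (assumption): only Array items are accepted here.
def append_item(items, item)
  unless item.is_a?(Array) && !item.empty?
    raise ArgumentError, 'Data item must be a non-empty array.'
  end

  # The first stored item fixes the expected number of attributes.
  if !items.empty? && item.length != items.first.length
    raise ArgumentError,
          "#{item.length} attributes provided, #{items.first.length} expected."
  end

  items << item
end

rows = []
append_item(rows, ['New York', '<30', 'M', 'Y'])
append_item(rows, ['Chicago', '<30', 'F', 'Y'])

begin
  append_item(rows, ['Chicago', '<30']) # wrong number of attributes
rescue ArgumentError => e
  puts "Rejected: #{e.message}"
end
```

The first item appended to an empty collection is accepted as long as it is a non-empty array; every later item must match its length.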

#[](index) ⇒ Object

Retrieve a new DataSet, with the item(s) selected by the provided index. You can specify an index range, too.



# File 'lib/ai4r/data/data_set.rb', line 41

def [](index)
  selected_items = (index.is_a?(Fixnum)) ?
          [@data_items[index]] : @data_items[index]
  return DataSet.new(:data_items => selected_items,
                     :data_labels =>@data_labels)
end
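
The listing above wraps a single indexed item in an array so that the result is always a list of items. A minimal sketch of that selection rule follows; `select_items` is a hypothetical helper on plain arrays. It tests against `Integer` rather than `Fixnum`, because `Fixnum` was merged into `Integer` in Ruby 2.4 and the constant was removed in Ruby 3.2.

```ruby
# Hypothetical helper mirroring the selection rule of DataSet#[].
# A single index yields a one-element list; a range yields a slice.
def select_items(items, index)
  index.is_a?(Integer) ? [items[index]] : items[index]
end

rows = [
  ['New York', '<30', 'M', 'Y'],
  ['Chicago',  '<30', 'F', 'Y'],
  ['Chicago',  '>80', 'F', 'Y']
]

p select_items(rows, 1)    # one row, still wrapped in an array
p select_items(rows, 0..1) # the first two rows
```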

#build_domain(attr) ⇒ Object

Returns the domain of an attribute. The parameter can be an attribute label or index (0-based).

  • Set instance containing all possible values for nominal attributes

  • Array with min and max values for numeric attributes (i.e. [min, max])

    build_domain("city")

    => #<Set: {"New York", "Chicago"}>

    build_domain("age")

    => [5, 85]

    build_domain(2) # In this example, the third attribute is gender

    => #<Set: {"M", "F"}>



# File 'lib/ai4r/data/data_set.rb', line 171

def build_domain(attr)
  index = get_index(attr)
  if @data_items.first[index].is_a?(Numeric)
    return [Statistics.min(self, index), Statistics.max(self, index)]
  else
    return @data_items.inject(Set.new){|domain, x| domain << x[index]}
  end
end
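
The nominal/numeric split above can be reproduced on plain arrays. This is a sketch under assumptions: `domain_of` is a hypothetical helper, and it uses `Array#minmax` where the library calls its own `Statistics.min` and `Statistics.max`. As in the listing, the type of the first item's value decides which branch is taken.

```ruby
require 'set'

# Hypothetical helper: computes the domain of the column at `index`.
# Numeric columns give [min, max]; other columns give a Set of values.
def domain_of(items, index)
  column = items.map { |row| row[index] }
  if column.first.is_a?(Numeric)
    column.minmax
  else
    Set.new(column)
  end
end

rows = [
  ['New York', 5,  'M'],
  ['Chicago',  85, 'F'],
  ['Chicago',  30, 'F']
]

p domain_of(rows, 0) # set of the distinct city names
p domain_of(rows, 1) # [min, max] of the numeric column
```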

#build_domains ⇒ Object

Returns an array with the domain of each attribute:

  • Set instance containing all possible values for nominal attributes

  • Array with min and max values for numeric attributes (i.e. [min, max])

Return example:

=> [#<Set: {"New York", "Chicago"}>,
#<Set: {"<30", "[30-50)", "[50-80]", ">80"}>,
#<Set: {"M", "F"}>,
[5, 85],
#<Set: {"Y", "N"}>]


# File 'lib/ai4r/data/data_set.rb', line 154

def build_domains
  @data_labels.collect {|attr_label| build_domain(attr_label) }
end

#check_not_empty ⇒ Object

Raises an ArgumentError if the data set contains no data items.



# File 'lib/ai4r/data/data_set.rb', line 194

def check_not_empty
  if @data_items.empty?
    raise ArgumentError, "Examples data set must not be empty."
  end
end

#get_index(attr) ⇒ Object

Returns the index of a given attribute (0-based). For example, if “gender” is the third attribute, then:

get_index("gender") 
=> 2


# File 'lib/ai4r/data/data_set.rb', line 189

def get_index(attr)
  return (attr.is_a?(Fixnum) || attr.is_a?(Range)) ? attr : @data_labels.index(attr)
end
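
As the listing shows, numeric indexes and ranges pass through unchanged, and anything else is looked up in the labels. A minimal sketch of that rule, with `index_of` as a hypothetical helper and `Integer` used in place of `Fixnum` (removed in Ruby 3.2):

```ruby
# Hypothetical helper mirroring get_index on a plain array of labels.
def index_of(labels, attr)
  (attr.is_a?(Integer) || attr.is_a?(Range)) ? attr : labels.index(attr)
end

labels = %w[city age_range gender marketing_target]

p index_of(labels, 'gender')  # position of the label
p index_of(labels, 2)         # integers are returned as given
p index_of(labels, 'unknown') # Array#index returns nil for a missing label
```

Because an unknown label yields nil, callers that index into a row with the result may want to check for nil first.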

#get_mean_or_mode ⇒ Object

Returns an array with the mean value of each numeric attribute and the most frequent value of each non-numeric attribute.



# File 'lib/ai4r/data/data_set.rb', line 217

def get_mean_or_mode
  mean = []
  num_attributes.times do |i|
    mean[i] =
            if @data_items.first[i].is_a?(Numeric)
              Statistics.mean(self, i)
            else
              Statistics.mode(self, i)
            end
  end
  return mean
end
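
The per-column choice between mean and mode can be sketched on plain arrays. This is an illustration under assumptions, not the library's implementation: `mean_or_mode` is a hypothetical helper, it uses `Enumerable#tally` (Ruby 2.7 or later) instead of the library's `Statistics` module, and its behaviour on ties (first value encountered wins) may differ from the library's.

```ruby
# Hypothetical helper: mean for numeric columns, mode for the others.
# The type of the first row's value decides how a column is treated.
def mean_or_mode(items)
  return [] if items.empty?

  items.first.each_index.map do |i|
    column = items.map { |row| row[i] }
    if column.first.is_a?(Numeric)
      column.sum.to_f / column.size
    else
      # tally counts occurrences; max_by picks the most frequent pair.
      column.tally.max_by { |_value, count| count }.first
    end
  end
end

rows = [
  ['M', 10],
  ['F', 20],
  ['F', 30]
]

p mean_or_mode(rows)
```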

#load_csv(filepath) ⇒ Object

Loads data items from a CSV file.



# File 'lib/ai4r/data/data_set.rb', line 49

def load_csv(filepath)
  items = []
  open_csv_file(filepath) do |entry|
    items << entry
  end
  set_data_items(items)
end

#load_csv_with_labels(filepath) ⇒ Object

Loads data items from a CSV file. The first row is used as the data labels.



# File 'lib/ai4r/data/data_set.rb', line 73

def load_csv_with_labels(filepath)
  load_csv(filepath)
  @data_labels = @data_items.shift
  return self
end

#num_attributes ⇒ Object

Returns the number of attributes, including the class attribute. Returns 0 if the data set is empty.



# File 'lib/ai4r/data/data_set.rb', line 181

def num_attributes
  return (@data_items.empty?) ? 0 : @data_items.first.size
end

#open_csv_file(filepath, &block) ⇒ Object

Opens a CSV file and reads it line by line. For each line, the block is called with the parsed row. Safe for both Ruby 1.8 and 1.9.



# File 'lib/ai4r/data/data_set.rb', line 60

def open_csv_file(filepath, &block)
  if CSV.const_defined? :Reader
    CSV::Reader.parse(File.open(filepath, 'r')) do |row|
      block.call row
    end
  else
    CSV.parse(File.open(filepath, 'r')) do |row|
      block.call row
    end
  end
end

#parse_csv(filepath) ⇒ Object

Same as load_csv, but it tries to convert cell contents to numbers.



# File 'lib/ai4r/data/data_set.rb', line 80

def parse_csv(filepath)
  items = []
  open_csv_file(filepath) do |row|
    items << row.collect{|x| is_number?(x) ? Float(x) : x }
  end
  set_data_items(items)
end

#parse_csv_with_labels(filepath) ⇒ Object

Same as load_csv_with_labels, but it tries to convert cell contents to numbers.



# File 'lib/ai4r/data/data_set.rb', line 89

def parse_csv_with_labels(filepath)
  parse_csv(filepath)
  @data_labels = @data_items.shift
  return self
end
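
The two steps used by parse_csv and parse_csv_with_labels (convert numeric-looking cells to Float, then take the first row as labels) can be sketched with the standard `csv` library on an in-memory string. The helpers `numeric_string?` and `parse_rows` are hypothetical; in particular, `is_number?` is not shown on this page, so the `Float()`-based check below is an assumption about what counts as a number.

```ruby
require 'csv'

# Hypothetical check (assumption): a cell is numeric if Float() accepts it.
def numeric_string?(value)
  Float(value)
  true
rescue ArgumentError, TypeError
  false
end

# Hypothetical helper: parses CSV text and converts numeric cells to Float.
def parse_rows(csv_text)
  CSV.parse(csv_text).map do |row|
    row.map { |cell| numeric_string?(cell) ? Float(cell) : cell }
  end
end

text = "city,age\nChicago,30\nNew York,42.5\n"
rows = parse_rows(text)
labels = rows.shift # the first row becomes the labels, as in the listing above

p labels
p rows
```

Note that a header cell that happens to look numeric would also be converted, since conversion runs before the label row is removed.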

#set_data_items(items) ⇒ Object

Sets the data items. M data items with N attributes must have the following format:

[   [ATT1_VAL1, ATT2_VAL1, ATT3_VAL1, ... , ATTN_VAL1,  CLASS_VAL1], 
    [ATT1_VAL2, ATT2_VAL2, ATT3_VAL2, ... , ATTN_VAL2,  CLASS_VAL2], 
    ...
    [ATT1_VALM, ATT2_VALM, ATT3_VALM, ... , ATTN_VALM,  CLASS_VALM], 
]

e.g.

[   ['New York', '<30',     'M', 'Y'],
    ['Chicago',  '<30',     'M', 'Y'],
    ['Chicago',  '<30',     'F', 'Y'],
    ['New York', '<30',     'M', 'Y'],
    ['New York', '<30',     'M', 'Y'],
    ['Chicago',  '[30-50)', 'M', 'Y'],
    ['New York', '[30-50)', 'F', 'N'],
    ['Chicago',  '[30-50)', 'F', 'Y'],
    ['New York', '[30-50)', 'F', 'N'],
    ['Chicago',  '[50-80]', 'M', 'N'],
    ['New York', '[50-80]', 'F', 'N'],
    ['New York', '[50-80]', 'M', 'N'],
    ['Chicago',  '[50-80]', 'M', 'N'],
    ['New York', '[50-80]', 'F', 'N'],
    ['Chicago',  '>80',     'F', 'Y']
]

This method returns the data set (self), allowing method chaining.



# File 'lib/ai4r/data/data_set.rb', line 137

def set_data_items(items)
  check_data_items(items)
  @data_labels = default_data_labels(items) if @data_labels.empty?
  @data_items = items
  return self
end

#set_data_labels(labels) ⇒ Object

Sets the data labels. Data labels must have the following format:

[ 'city', 'age_range', 'gender', 'marketing_target'  ]

If you do not provide labels for your data, the following labels will be created by default:

[ 'attribute_1', 'attribute_2', 'attribute_3', 'class_value'  ]


# File 'lib/ai4r/data/data_set.rb', line 102

def set_data_labels(labels)
  check_data_labels(labels)
  @data_labels = labels
  return self
end
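
The default label pattern described above (numbered attribute labels followed by a final class label) can be sketched as follows. `default_labels_sketch` is a hypothetical helper written to reproduce the documented output; the library's own default_data_labels method is not shown on this page and may differ in detail.

```ruby
# Hypothetical helper: builds labels of the documented default form,
# 'attribute_1' .. 'attribute_N' followed by 'class_value'.
def default_labels_sketch(items)
  return [] if items.empty?

  width = items.first.size
  # Every column except the last is a numbered attribute.
  labels = (1...width).map { |i| "attribute_#{i}" }
  labels << 'class_value'
end

rows = [
  ['New York', '<30', 'M', 'Y'],
  ['Chicago',  '<30', 'F', 'Y']
]

p default_labels_sketch(rows)
```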