Class: Ai4r::Data::DataSet
- Inherits:
- Object
- Defined in:
- lib/ai4r/data/data_set.rb
Overview
A data set is a collection of N data items. Each data item is described by a set of attributes, represented as an array. Optionally, you can assign a label to the attributes, using the data_labels property.
Instance Attribute Summary
-
#data_items ⇒ Object
readonly
Returns the value of attribute data_items.
-
#data_labels ⇒ Object
readonly
Returns the value of attribute data_labels.
Class Method Summary
-
.normalized(data_set, method: :zscore) ⇒ Object
Return a new DataSet with numeric attributes normalized.
Instance Method Summary
-
#<<(data_item) ⇒ Object
Add a data item to the data set.
-
#[](index) ⇒ Object
Retrieve a new DataSet, with the item(s) selected by the provided index.
-
#build_domain(attr) ⇒ Object
Returns the domain of an attribute: a Set of all observed values for a nominal attribute, or a [min, max] array for a numeric one. The parameter can be an attribute label or a 0-based index.
-
#build_domains ⇒ Object
Returns an array with the domain of each attribute: a Set instance containing all possible values for nominal attributes, or an array with the min and max values (i.e. [min, max]) for numeric attributes.
-
#category_label ⇒ Object
Returns the label of the category (class) attribute, i.e. the last data label.
-
#check_not_empty ⇒ Object
Raise an exception if there is no data item.
-
#get_index(attr) ⇒ Object
Returns the index of a given attribute (0-based).
-
#get_mean_or_mode ⇒ Object
Returns an array with the mean value of each numeric attribute and the most frequent value of each non-numeric attribute.
-
#initialize(options = {}) ⇒ Object
constructor
Create a new DataSet.
-
#load_csv(filepath, parse_numeric: false) ⇒ Object
Load data items from a CSV file.
-
#load_csv_with_labels(filepath, parse_numeric: false) ⇒ Object
Load data items from a CSV file, using the first row as data labels.
-
#normalize!(method = :zscore) ⇒ Object
Normalize numeric attributes in place.
-
#num_attributes ⇒ Object
Returns the number of attributes, including the class attribute.
-
#open_csv_file(filepath) ⇒ Object
Open a CSV file and yield each row to the provided block.
-
#parse_csv(filepath) ⇒ Object
Same as load_csv, but it will try to convert cell contents to numbers.
-
#parse_csv_with_labels(filepath) ⇒ Object
Same as load_csv_with_labels, but it will try to convert cell contents to numbers.
-
#set_data_items(items) ⇒ Object
Set the data items.
-
#set_data_labels(labels) ⇒ Object
Set data labels.
-
#shuffle!(seed: nil) ⇒ DataSet
Randomizes the order of data items in place.
-
#split(ratio:) ⇒ Array<DataSet, DataSet>
Split the dataset into two new DataSet instances using the given ratio for the first set.
Constructor Details
#initialize(options = {}) ⇒ Object
Create a new DataSet. By default, it is empty. Optionally, you can provide the initial data items and data labels.
e.g. DataSet.new(:data_items => data_items, :data_labels => labels)
If you provide data items but no data labels, the data set will use the default data label values (see set_data_labels).
# File 'lib/ai4r/data/data_set.rb', line 49
def initialize(options = {})
  @data_labels = []
  @data_items = options[:data_items] || []
  set_data_labels(options[:data_labels]) if options[:data_labels]
  set_data_items(options[:data_items]) if options[:data_items]
end
Instance Attribute Details
#data_items ⇒ Object (readonly)
Returns the value of attribute data_items.
# File 'lib/ai4r/data/data_set.rb', line 23
def data_items
  @data_items
end
#data_labels ⇒ Object (readonly)
Returns the value of attribute data_labels.
# File 'lib/ai4r/data/data_set.rb', line 23
def data_labels
  @data_labels
end
Class Method Details
.normalized(data_set, method: :zscore) ⇒ Object
Return a new DataSet with numeric attributes normalized. Available methods are:
- :zscore - subtract the mean and divide by the standard deviation
- :minmax - scale values to the [0,1] range
# File 'lib/ai4r/data/data_set.rb', line 32
def self.normalized(data_set, method: :zscore)
  new_set = DataSet.new(
    data_items: data_set.data_items.map(&:dup),
    data_labels: data_set.data_labels.dup
  )
  new_set.normalize!(method)
end
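The `map(&:dup)` call in `.normalized` matters because duplicating only the outer array would leave both data sets sharing the same row objects. The following is a minimal plain-Ruby sketch of that difference; the arrays are illustrative and the library itself is not used.

```ruby
# Illustrative rows only; this does not call the Ai4r library.
original = [[1, 'a'], [3, 'b']]

shallow = original.dup        # new outer array, same row objects
deep    = original.map(&:dup) # new outer array, new row arrays

deep[0][0] = 100    # leaves `original` untouched
shallow[1][0] = 200 # also changes original[1], because the row is shared

p original # => [[1, "a"], [200, "b"]]
```

This is why an in-place `normalize!` on the copy returned by `.normalized` should not alter the rows of the source data set.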
Instance Method Details
#<<(data_item) ⇒ Object
Add a data item to the data set. The item must be a non-empty array with the same number of attributes as the existing items.
# File 'lib/ai4r/data/data_set.rb', line 238
def <<(data_item)
  if data_item.nil? || !data_item.is_a?(Enumerable) || data_item.empty?
    raise ArgumentError, 'Data must not be an non empty array.'
  elsif @data_items.empty?
    set_data_items([data_item])
  elsif data_item.length != num_attributes
    raise ArgumentError, 'Number of attributes do not match. ' \
                         "#{data_item.length} attributes provided, " \
                         "#{num_attributes} attributes expected."
  else
    @data_items << data_item
  end
end
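The checks performed by `<<` can be sketched on a plain array of rows. The helper name `append_row` and its error messages are hypothetical and exist only for this illustration; the sketch mirrors the listing above but does not use the library.

```ruby
# Hypothetical helper mirroring the validation steps of DataSet#<<.
def append_row(items, row)
  unless row.is_a?(Array) && !row.empty?
    raise ArgumentError, 'row must be a non-empty array'
  end
  # Every row must have as many values as the first stored row.
  if !items.empty? && row.length != items.first.length
    raise ArgumentError,
          "expected #{items.first.length} attributes, got #{row.length}"
  end
  items << row
end

items = []
append_row(items, ['Chicago', 'M', 'Y'])
begin
  append_row(items, ['New York', 'F'])
rescue ArgumentError => e
  puts e.message # expected 3 attributes, got 2
end
```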
#[](index) ⇒ Object
Retrieve a new DataSet, with the item(s) selected by the provided index. You can specify an index range, too.
# File 'lib/ai4r/data/data_set.rb', line 60
def [](index)
  selected_items = if index.is_a?(Integer)
                     [@data_items[index]]
                   else
                     @data_items[index]
                   end

  DataSet.new(data_items: selected_items, data_labels: @data_labels)
end
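The integer case is wrapped in an array so that the result is always a list of rows, whether one item or a range was requested. A small plain-Ruby sketch of that selection step follows; `select_items` is a hypothetical name used only here.

```ruby
# Hypothetical helper: always returns an array of rows.
def select_items(items, index)
  index.is_a?(Integer) ? [items[index]] : items[index]
end

items = [['New York', 'M'], ['Chicago', 'F'], ['Chicago', 'M']]

p select_items(items, 1)    # => [["Chicago", "F"]]
p select_items(items, 0..1) # => [["New York", "M"], ["Chicago", "F"]]
```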
#build_domain(attr) ⇒ Object
Returns the domain of an attribute. The parameter can be an attribute label or a 0-based index.
-
Set instance containing all possible values for nominal attributes
-
Array with min and max values for numeric attributes (i.e. [min, max])
build_domain("city")
=> #<Set: {"New York", "Chicago"}>
build_domain("age")
=> [5, 85]
build_domain(2) # In this example, the third attribute is gender
=> #<Set: {"M", "F"}>
# File 'lib/ai4r/data/data_set.rb', line 205
def build_domain(attr)
  index = get_index(attr)
  return [Statistics.min(self, index), Statistics.max(self, index)] if @data_items.first[index].is_a?(Numeric)

  @data_items.inject(Set.new) { |domain, x| domain << x[index] }
end
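As a rough illustration of the two kinds of domain, the sketch below computes them for one column of a plain array of rows. `domain_for` is a hypothetical helper; like the listing above, it decides whether a column is numeric by looking only at the first row.

```ruby
require 'set'

# Hypothetical helper: [min, max] for numeric columns, a Set otherwise.
def domain_for(items, index)
  column = items.map { |row| row[index] }
  return column.minmax if column.first.is_a?(Numeric)

  column.to_set
end

items = [['New York', 25, 'M'], ['Chicago', 85, 'F'], ['New York', 5, 'F']]

p domain_for(items, 1)      # => [5, 85]
p domain_for(items, 0).size # => 2
```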
#build_domains ⇒ Object
Returns an array with the domain of each attribute:
-
Set instance containing all possible values for nominal attributes
-
Array with min and max values for numeric attributes (i.e. [min, max])
Return example:
=> [#<Set: {"New York", "Chicago"}>,
#<Set: {"<30", "[30-50)", "[50-80]", ">80"}>,
#<Set: {"M", "F"}>,
[5, 85],
#<Set: {"Y", "N"}>]
# File 'lib/ai4r/data/data_set.rb', line 186
def build_domains
  @data_labels.collect { |attr_label| build_domain(attr_label) }
end
#category_label ⇒ Object
Returns the label of the category (class) attribute, i.e. the last data label.
# File 'lib/ai4r/data/data_set.rb', line 339
def category_label
  data_labels.last
end
#check_not_empty ⇒ Object
Raise an exception if there is no data item.
# File 'lib/ai4r/data/data_set.rb', line 230
def check_not_empty
  return unless @data_items.empty?

  raise ArgumentError, 'Examples data set must not be empty.'
end
#get_index(attr) ⇒ Object
Returns the index of a given attribute (0-based). For example, if “gender” is the third attribute, then:
get_index("gender")
=> 2
# File 'lib/ai4r/data/data_set.rb', line 224
def get_index(attr)
  attr.is_a?(Integer) || attr.is_a?(Range) ? attr : @data_labels.index(attr)
end
#get_mean_or_mode ⇒ Object
Returns an array with the mean value of each numeric attribute and the most frequent value of each non-numeric attribute.
# File 'lib/ai4r/data/data_set.rb', line 255
def get_mean_or_mode
  mean = []
  num_attributes.times do |i|
    mean[i] =
      if @data_items.first[i].is_a?(Numeric)
        Statistics.mean(self, i)
      else
        Statistics.mode(self, i)
      end
  end
  mean
end
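The per-attribute summary can be sketched without the library's Statistics module. In the hypothetical helper below, the mean is a plain arithmetic mean and ties for the mode go to the value seen first; how `Statistics.mode` breaks ties is not stated on this page, so that part is an assumption.

```ruby
# Hypothetical helper: mean for numeric columns, mode for the rest.
def mean_or_mode(items)
  items.transpose.map do |column|
    if column.first.is_a?(Numeric)
      column.sum.to_f / column.size
    else
      # Most frequent value; on a tie, the first one encountered wins.
      column.tally.max_by { |_value, count| count }.first
    end
  end
end

items = [['Chicago', 20], ['New York', 30], ['Chicago', 40]]
p mean_or_mode(items) # => ["Chicago", 30.0]
```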
#load_csv(filepath, parse_numeric: false) ⇒ Object
Load data items from a CSV file. When parse_numeric is true, the work is delegated to parse_csv.
# File 'lib/ai4r/data/data_set.rb', line 73
def load_csv(filepath, parse_numeric: false)
  if parse_numeric
    parse_csv(filepath)
  else
    items = []
    open_csv_file(filepath) do |entry|
      items << entry
    end
    set_data_items(items)
  end
end
#load_csv_with_labels(filepath, parse_numeric: false) ⇒ Object
Load data items from csv file. The first row is used as data labels.
# File 'lib/ai4r/data/data_set.rb', line 96
def load_csv_with_labels(filepath, parse_numeric: false)
  load_csv(filepath, parse_numeric: parse_numeric)
  @data_labels = @data_items.shift
  self
end
#normalize!(method = :zscore) ⇒ Object
Normalize numeric attributes in place. Supported methods are :zscore (default) and :minmax.
# File 'lib/ai4r/data/data_set.rb', line 272
def normalize!(method = :zscore)
  numeric_indices = (0...num_attributes).select do |i|
    @data_items.first[i].is_a?(Numeric)
  end

  case method
  when :zscore
    means = numeric_indices.map { |i| Statistics.mean(self, i) }
    sds = numeric_indices.map { |i| Statistics.standard_deviation(self, i) }
    @data_items.each do |row|
      numeric_indices.each_with_index do |idx, j|
        sd = sds[j]
        row[idx] = sd.zero? ? 0 : (row[idx] - means[j]) / sd
      end
    end
  when :minmax
    mins = numeric_indices.map { |i| Statistics.min(self, i) }
    maxs = numeric_indices.map { |i| Statistics.max(self, i) }
    @data_items.each do |row|
      numeric_indices.each_with_index do |idx, j|
        range = maxs[j] - mins[j]
        row[idx] = range.zero? ? 0 : (row[idx] - mins[j]) / range.to_f
      end
    end
  else
    raise ArgumentError, "Unknown normalization method #{method}"
  end
  self
end
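To make the two schemes concrete, here is a minimal sketch that applies them to a single numeric column held in a plain array. The helper names are hypothetical. The sketch assumes a sample standard deviation (n - 1 denominator); the definition used by the library's `Statistics.standard_deviation` is not shown on this page and may differ.

```ruby
# Hypothetical helpers operating on one column; expects at least two values.
def zscore(values)
  mean = values.sum.to_f / values.size
  # Assumption: sample variance with an n - 1 denominator.
  variance = values.sum { |v| (v - mean)**2 } / (values.size - 1)
  sd = Math.sqrt(variance)
  values.map { |v| sd.zero? ? 0 : (v - mean) / sd }
end

def minmax(values)
  min, max = values.minmax
  range = max - min
  # A constant column has zero range and maps to 0, as in the listing above.
  values.map { |v| range.zero? ? 0 : (v - min) / range.to_f }
end

ages = [20, 30, 40]
p zscore(ages) # => [-1.0, 0.0, 1.0]
p minmax(ages) # => [0.0, 0.5, 1.0]
```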
#num_attributes ⇒ Object
Returns the number of attributes, including the class attribute.
# File 'lib/ai4r/data/data_set.rb', line 214
def num_attributes
  @data_items.empty? ? 0 : @data_items.first.size
end
#open_csv_file(filepath) ⇒ Object
Open a CSV file and yield each row to the provided block.
# File 'lib/ai4r/data/data_set.rb', line 89
def open_csv_file(filepath, &)
  CSV.foreach(filepath, &)
end
#parse_csv(filepath) ⇒ Object
Same as load_csv, but it will try to convert cell contents to numbers.
# File 'lib/ai4r/data/data_set.rb', line 105
def parse_csv(filepath)
  items = []
  open_csv_file(filepath) do |row|
    items << row.collect do |x|
      number?(x) ? Float(x, exception: false) : x
    end
  end
  set_data_items(items)
end
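The cell conversion can be illustrated on an already-parsed row of strings. The `number?` helper is not shown on this page, so the hypothetical `parse_row` below skips it and relies only on `Float(x, exception: false)`, which returns nil when the string is not a valid number.

```ruby
# Hypothetical helper: numeric-looking cells become Floats, others stay as-is.
def parse_row(row)
  row.map { |cell| Float(cell, exception: false) || cell }
end

p parse_row(['Chicago', '42', '3.5', ''])
# => ["Chicago", 42.0, 3.5, ""]
```

Note that every converted cell becomes a Float, so an integer-looking value such as '42' is stored as 42.0.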
#parse_csv_with_labels(filepath) ⇒ Object
Same as load_csv_with_labels, but it will try to convert cell contents to numbers.
# File 'lib/ai4r/data/data_set.rb', line 118
def parse_csv_with_labels(filepath)
  load_csv_with_labels(filepath, parse_numeric: true)
end
#set_data_items(items) ⇒ Object
Set the data items. M data items with N attributes must have the following format:
[ [ATT1_VAL1, ATT2_VAL1, ATT3_VAL1, ... , ATTN_VAL1, CLASS_VAL1],
[ATT1_VAL2, ATT2_VAL2, ATT3_VAL2, ... , ATTN_VAL2, CLASS_VAL2],
...
[ATT1_VALM, ATT2_VALM, ATT3_VALM, ... , ATTN_VALM, CLASS_VALM],
]
e.g.
[ ['New York', '<30', 'M', 'Y'],
['Chicago', '<30', 'M', 'Y'],
['Chicago', '<30', 'F', 'Y'],
['New York', '<30', 'M', 'Y'],
['New York', '<30', 'M', 'Y'],
['Chicago', '[30-50)', 'M', 'Y'],
['New York', '[30-50)', 'F', 'N'],
['Chicago', '[30-50)', 'F', 'Y'],
['New York', '[30-50)', 'F', 'N'],
['Chicago', '[50-80]', 'M', 'N'],
['New York', '[50-80]', 'F', 'N'],
['New York', '[50-80]', 'M', 'N'],
['Chicago', '[50-80]', 'M', 'N'],
['New York', '[50-80]', 'F', 'N'],
['Chicago', '>80', 'F', 'Y']
]
This method returns the data set (self), allowing method chaining.
# File 'lib/ai4r/data/data_set.rb', line 168
def set_data_items(items)
  check_data_items(items)
  @data_labels = default_data_labels(items) if @data_labels.empty?
  @data_items = items
  self
end
#set_data_labels(labels) ⇒ Object
Set data labels. Data labels must have the following format:
[ 'city', 'age_range', 'gender', 'marketing_target' ]
If you do not provide labels for your data, the following labels will be created by default:
[ 'attribute_1', 'attribute_2', 'attribute_3', 'class_value' ]
# File 'lib/ai4r/data/data_set.rb', line 131
def set_data_labels(labels)
  check_data_labels(labels)
  @data_labels = labels
  self
end
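The default labels quoted above follow a simple pattern: numbered attribute names, with the last column treated as the class value. The library's own helper is not listed on this page, so the following is only a guess at equivalent logic, with the hypothetical name `default_labels`.

```ruby
# Hypothetical reconstruction of the default label pattern.
def default_labels(items)
  count = items.first.length
  labels = (1...count).map { |i| "attribute_#{i}" }
  labels << 'class_value' # the last column is the class attribute
end

p default_labels([['New York', '<30', 'M', 'Y']])
# => ["attribute_1", "attribute_2", "attribute_3", "class_value"]
```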
#shuffle!(seed: nil) ⇒ DataSet
Randomizes the order of data items in place. If a seed is provided, it is used to initialize the random number generator for deterministic shuffling.
data_set.shuffle!(seed: 123)
# File 'lib/ai4r/data/data_set.rb', line 311
def shuffle!(seed: nil)
  rng = seed ? Random.new(seed) : Random.new
  @data_items.shuffle!(random: rng)
  self
end
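The effect of the seed can be checked with Ruby's own `Array#shuffle` and `Random`, which is what the listing above builds on. In this small sketch on a plain array, two generators created from the same seed produce the same order.

```ruby
rows = (1..10).to_a

# Two independent generators with the same seed yield the same permutation.
first  = rows.shuffle(random: Random.new(123))
second = rows.shuffle(random: Random.new(123))

p first == second    # => true
p first.sort == rows # => true (same elements, only the order changes)
```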
#split(ratio:) ⇒ Array<DataSet, DataSet>
Split the dataset into two new DataSet instances using the given ratio for the first set.
train, test = data_set.split(ratio: 0.8)
# File 'lib/ai4r/data/data_set.rb', line 324
def split(ratio:)
  raise ArgumentError, 'ratio must be between 0 and 1' unless ratio.positive? && ratio < 1

  pivot = (ratio * @data_items.length).round
  first_items = @data_items[0...pivot].map(&:dup)
  second_items = @data_items[pivot..].map(&:dup)

  [
    DataSet.new(data_items: first_items, data_labels: @data_labels.dup),
    DataSet.new(data_items: second_items, data_labels: @data_labels.dup)
  ]
end
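The pivot arithmetic is easy to check on a plain array. The sketch below reproduces only the slicing step, with the hypothetical name `split_rows`; it does not build DataSet instances.

```ruby
# Hypothetical helper mirroring the pivot computation of DataSet#split.
def split_rows(rows, ratio)
  raise ArgumentError, 'ratio must be between 0 and 1' unless ratio.positive? && ratio < 1

  pivot = (ratio * rows.length).round
  [rows[0...pivot], rows[pivot..]]
end

train, holdout = split_rows((1..10).to_a, 0.8)
p train.size   # => 8
p holdout.size # => 2
```

Because the first part is simply the leading rows, the split follows the current item order; calling shuffle! beforehand is one way to avoid an ordered file producing a biased split. With very few rows, rounding can leave one side empty.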