Class: Statsample::Dataset

Inherits:
Object
Includes:
Writable
Defined in:
lib/statsample/dataset.rb

Overview

A set of cases with values for one or more variables, analogous to a data frame in R or a standard SPSS data file. Every vector has a field name that identifies it. By default, vectors are ordered by field name, but you can change the field order manually. A Dataset works like a Hash whose keys are field names and whose values are Statsample::Vector objects.

Usage

Create an empty dataset

Dataset.new()

Create a dataset with three empty vectors, called v1, v2 and v3

Dataset.new(%w{v1 v2 v3})

Create a dataset with two vectors

Dataset.new({'v1'=>%w{1 2 3}.to_vector, 'v2'=>%w{4 5 6}.to_vector})

Create a dataset with two given vectors (v1 and v2), with the vectors in inverted order

Dataset.new({'v2'=>v2,'v1'=>v1},['v2','v1'])

The fastest way to create a dataset is Hash#to_dataset, with the field order as argument

v1 = [1,2,3].to_scale
v2 = [1,2,3].to_scale
ds = {'v1'=>v1, 'v2'=>v2}.to_dataset(%w{v2 v1})

Methods included from Writable

#save

Constructor Details

#initialize(vectors = {}, fields = [], labels = {}) ⇒ Dataset

Returns a new instance of Dataset.



# File 'lib/statsample/dataset.rb', line 128

def initialize(vectors={}, fields=[], labels={})
  if vectors.instance_of? Array
    @fields=vectors.dup
    @vectors=vectors.inject({}){|a,x| a[x]=Statsample::Vector.new(); a}
  else
    # Check vectors
    @vectors=vectors
    @fields=fields
    check_order
    check_length
  end
  @i=nil
  @labels=labels
end

Instance Attribute Details

#cases ⇒ Object (readonly)

Number of cases



# File 'lib/statsample/dataset.rb', line 64

def cases
  @cases
end

#fields ⇒ Object

Ordered names of vectors



# File 'lib/statsample/dataset.rb', line 62

def fields
  @fields
end

#i ⇒ Object (readonly)

Position of the pointer in enumeration methods (such as #each)



# File 'lib/statsample/dataset.rb', line 66

def i
  @i
end

#labels ⇒ Object

Deprecated: Label of vectors



# File 'lib/statsample/dataset.rb', line 68

def labels
  @labels
end

#vectors ⇒ Object (readonly)

Hash of Statsample::Vector



# File 'lib/statsample/dataset.rb', line 60

def vectors
  @vectors
end

Class Method Details

.crosstab_by_asignation(rows, columns, values) ⇒ Object

Generates a new dataset from three vectors:

  • Rows

  • Columns

  • Values

For example, given these values

x   y   v
a   a   0
a   b   1
b   a   1
b   b   0

you obtain

_id a   b
 a  0   1
 b  1   0

Useful to process outputs from databases
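
The pivot described above can be sketched in plain Ruby, without calling Statsample at all. The helper name `pivot` and the use of hashes for the output records are assumptions made for illustration; only the "_id" column name comes from the method source.

```ruby
# Plain-Ruby sketch of the pivot; `pivot` is a hypothetical helper,
# not part of Statsample.
def pivot(rows, columns, values)
  unless rows.size == columns.size && rows.size == values.size
    raise ArgumentError, "Three arrays should be equal size"
  end
  col_names = columns.uniq
  # One hash per distinct row value, pre-filled with nil for every column.
  table = {}
  rows.uniq.each { |r| table[r] = col_names.map { |c| [c, nil] }.to_h }
  # Place each value at the intersection of its row and column.
  rows.each_index { |i| table[rows[i]][columns[i]] = values[i] }
  # Emit one record per row value: the "_id" plus one entry per column.
  table.map { |id, cells| { "_id" => id }.merge(cells) }
end

result = pivot(%w[a a b b], %w[a b a b], [0, 1, 1, 0])
```

Cells with no matching (row, column) pair stay nil, which is roughly how the original leaves missing combinations.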



# File 'lib/statsample/dataset.rb', line 90

def self.crosstab_by_asignation(rows,columns,values)
  raise "Three vectors should be equal size" if rows.size!=columns.size or rows.size!=values.size
  cols_values=columns.factors
  cols_n=cols_values.size
  h_rows=rows.factors.inject({}){|a,v|
    a[v]=cols_values.inject({}){|a1,v1| a1[v1]=nil; a1}
    a
  }
  values.each_index{|i|
    h_rows[rows[i]][columns[i]]=values[i]
  }      
  ds=Dataset.new(["_id"]+cols_values)
  cols_values.each{|c|
    ds[c].type=values.type
  }
  rows.factors.each {|row|
    n_row=Array.new(cols_n+1)
    n_row[0]=row
    cols_values.each_index {|i|
      n_row[i+1]=h_rows[row][cols_values[i]]
    }
    ds.add_case_array(n_row)
  }
  ds.update_valid_data
  ds
end

Instance Method Details

#==(d2) ⇒ Object

Two datasets are equal if their vectors and fields are the same



# File 'lib/statsample/dataset.rb', line 237

def ==(d2)
  @vectors==d2.vectors and @fields==d2.fields
end

#[](i) ⇒ Object

Returns the vector named i if i is a String, or a new Dataset with the fields from i.begin to i.end if i is a Range



# File 'lib/statsample/dataset.rb', line 507

def[](i)
  if i.is_a? String
    raise Exception,"Vector '#{i}' doesn't exists on dataset" unless @vectors.has_key?(i)
    @vectors[i]
  elsif i.is_a? Range
    fields=from_to(i.begin,i.end)
    vectors=fields.inject({}) {|a,v| a[v]=@vectors[v];a}
    ds=Dataset.new(vectors,fields)
  else
    raise ArgumentError, "You need a String or a Range"
  end
end

#[]=(i, v) ⇒ Object



# File 'lib/statsample/dataset.rb', line 548

def[]=(i,v)
  if v.instance_of? Statsample::Vector
    @vectors[i]=v
    check_order
  else
    raise ArgumentError,"Should pass a Statsample::Vector"
  end
end

#_case_as_array(c) ⇒ Object

:nodoc:



# File 'lib/statsample/dataset.rb', line 440

def _case_as_array(c) # :nodoc:
  @fields.collect {|x| @vectors[x][c]}
end

#_case_as_hash(c) ⇒ Object

:nodoc:



# File 'lib/statsample/dataset.rb', line 437

def _case_as_hash(c) # :nodoc:
  @fields.inject({}) {|a,x| a[x]=@vectors[x][c];a }
end

#add_case(v, uvd = true) ⇒ Object

Insert a case, using:

  • Array: size equal to number of vectors and values in the same order as fields

  • Hash: keys equal to fields

If uvd is false, #update_valid_data is not executed after inserting the case. This is useful for speeding up the insertion of many cases, because #update_valid_data performs checks on every vector and on the dataset.



# File 'lib/statsample/dataset.rb', line 277

def add_case(v,uvd=true)
  case v
  when Array
    if (v[0].is_a? Array)
      v.each{|subv| add_case(subv,false)}
    else
      raise ArgumentError, "Input array size (#{v.size}) should be equal to fields number (#{@fields.size})" if @fields.size!=v.size
      v.each_index {|i| @vectors[@fields[i]].add(v[i],false)}
    end
  when Hash
    raise ArgumentError, "Hash keys should be equal to fields #{(v.keys - @fields).join(",")}" if @fields.sort!=v.keys.sort
    @fields.each{|f| @vectors[f].add(v[f],false)}
  else
    raise TypeError, 'Value must be a Array or a Hash'
  end
  if uvd
    update_valid_data
  end
end
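
The Array/Hash dispatch and the deferred validation of #add_case can be illustrated with a miniature column store in plain Ruby. `MiniDataset` is a hypothetical stand-in written for this sketch; it is not the Statsample class and its consistency check is far simpler than the real one.

```ruby
# Hypothetical miniature column store; it mirrors only the Array/Hash
# dispatch and the idea of postponing validation.
class MiniDataset
  attr_reader :fields, :columns, :cases

  def initialize(fields)
    @fields = fields.dup
    @columns = fields.map { |f| [f, []] }.to_h
    @cases = 0
  end

  def add_case(v, validate = true)
    case v
    when Array
      raise ArgumentError, "expected #{@fields.size} values" if v.size != @fields.size
      @fields.each_with_index { |f, i| @columns[f] << v[i] }
    when Hash
      raise ArgumentError, "keys should equal fields" if v.keys.sort != @fields.sort
      @fields.each { |f| @columns[f] << v[f] }
    else
      raise TypeError, "Value must be an Array or a Hash"
    end
    update_valid_data if validate
  end

  # Stand-in for the expensive consistency check.
  def update_valid_data
    sizes = @columns.values.map(&:size).uniq
    raise "columns differ in size" if sizes.size > 1
    @cases = sizes.first || 0
  end
end

ds = MiniDataset.new(%w[id name])
ds.add_case(["1", "george"], false)                    # Array, in field order
ds.add_case({ "name" => "fred", "id" => "2" }, false)  # Hash, any key order
ds.update_valid_data                                   # validate once at the end
```

Running the check once after the loop, instead of after every insertion, is the pattern the uvd parameter is meant to enable.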

#add_case_array(v) ⇒ Object

Fast version of #add_case. It can only add one case, and no error checking is performed. You SHOULD call #update_valid_data at the end of the insertion cycle.



# File 'lib/statsample/dataset.rb', line 266

def add_case_array(v)
  v.each_index {|i| d=@vectors[@fields[i]].data; d.push(v[i])}
end

#add_vector(name, vector) ⇒ Object

Raises:

  • (ArgumentError)


# File 'lib/statsample/dataset.rb', line 244

def add_vector(name,vector)
  raise ArgumentError, "Vector have different size" if vector.size!=@cases
  @vectors[name]=vector
  check_order
end

#add_vectors_by_split(name, join = '-', sep = Statsample::SPLIT_TOKEN) ⇒ Object



# File 'lib/statsample/dataset.rb', line 318

def add_vectors_by_split(name,join='-',sep=Statsample::SPLIT_TOKEN)
  split=@vectors[name].split_by_separator(sep)
  split.each{|k,v|
    add_vector(name+join+k,v)
  }
end

#add_vectors_by_split_recode(name, join = '-', sep = Statsample::SPLIT_TOKEN) ⇒ Object



# File 'lib/statsample/dataset.rb', line 308

def add_vectors_by_split_recode(name,join='-',sep=Statsample::SPLIT_TOKEN)
  split=@vectors[name].split_by_separator(sep)
  i=1
  split.each{|k,v|
    new_field=name+join+i.to_s
    @labels[new_field]=name+":"+k
    add_vector(new_field,v)
    i+=1
  }
end

#as_r ⇒ Object



# File 'lib/statsample/dataset.rb', line 794

def as_r
  require 'rsruby/dataframe'
  r=RSRuby.instance

end

#bootstrap(n = nil) ⇒ Object

Creates a dataset of size n by sampling cases at random, with replacement, from this dataset. If n is not given, the original number of cases is used.



# File 'lib/statsample/dataset.rb', line 254

def bootstrap(n=nil)
  n||=@cases
  ds_boot=dup_empty
  for i in 1..n
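    # Draw the index from the original cases, not from n, so that
    # n larger or smaller than @cases still samples the whole dataset
    ds_boot.add_case_array(case_as_array(rand(@cases)))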
  end
  ds_boot.update_valid_data
  ds_boot
end
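
Bootstrap resampling, as done by the method above, amounts to drawing n row indices at random with replacement. A minimal plain-Ruby sketch follows; `bootstrap_rows` and its optional `rng` parameter are hypothetical names introduced here so that the draw can be seeded.

```ruby
# Hypothetical helper: resample rows with replacement.
def bootstrap_rows(rows, n = nil, rng = Random.new)
  return [] if rows.empty?
  n ||= rows.size
  # Each draw picks an index over the ORIGINAL number of rows.
  Array.new(n) { rows[rng.rand(rows.size)] }
end

rows = [[1, "a"], [2, "b"], [3, "c"]]
sample = bootstrap_rows(rows, 5, Random.new(42))
```

The sample can be larger or smaller than the source, and the same row may appear several times.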

#case_as_array(i) ⇒ Object

Retrieves case i as an array, ordered according to #fields



# File 'lib/statsample/dataset.rb', line 428

def case_as_array(c) # :nodoc:
  Statsample::STATSAMPLE__.case_as_array(self,c)
end

#case_as_hash(i) ⇒ Object

Retrieves case i as a hash



# File 'lib/statsample/dataset.rb', line 417

def case_as_hash(c) # :nodoc:
  Statsample::STATSAMPLE__.case_as_hash(self,c)
end

#check_fields(fields) ⇒ Object

Check if #fields attribute is correct, after inserting or deleting vectors



# File 'lib/statsample/dataset.rb', line 345

def check_fields(fields)
  fields||=@fields
  raise "Fields #{(fields-@fields).join(", ")} doesn't exists on dataset" if (fields-@fields).size>0
  fields
end

#check_length ⇒ Object

Check vectors for type and size.



# File 'lib/statsample/dataset.rb', line 396

def check_length # :nodoc:
  size=nil
  @vectors.each do |k,v|
    raise Exception, "Data #{v.class} is not a vector on key #{k}" if !v.is_a? Statsample::Vector
    if size.nil?
      size=v.size
    else
      if v.size!=size
        p v.to_a.size
        raise Exception, "Vector #{k} have size #{v.size} and dataset have size #{size}"
      end
    end
  end
  @cases=size
end

#check_order ⇒ Object



# File 'lib/statsample/dataset.rb', line 500

def check_order
  if(@vectors.keys.sort!=@fields.sort)
    @fields=@fields&@vectors.keys
    @fields+=@vectors.keys.sort-@fields
  end
end
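
Reading the source above, #check_order keeps the existing field order for names that are still present, drops names that no longer have a vector, and appends new names in sorted order. The same reconciliation can be sketched on plain arrays; `reconcile_order` is a hypothetical helper name.

```ruby
# Hypothetical helper mirroring the reconciliation on plain arrays.
def reconcile_order(fields, keys)
  return fields if keys.sort == fields.sort
  kept = fields & keys          # existing order, minus vanished names
  kept + (keys.sort - kept)     # new names go last, sorted
end

added   = reconcile_order(%w[v2 v1], %w[v1 v2 v3])
removed = reconcile_order(%w[v2 gone], %w[v1 v2])
```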

#col(c) ⇒ Object Also known as: vector



# File 'lib/statsample/dataset.rb', line 240

def col(c)
  @vectors[c]
end

#collect(type = :scale) ⇒ Object

Retrieves a Statsample::Vector, based on the result of calculation performed on each case.



# File 'lib/statsample/dataset.rb', line 521

def collect(type=:scale)
  data=[]
  each {|row|
    data.push yield(row)
  }
  Statsample::Vector.new(data,type)
end

#collect_matrix ⇒ Object

Generate a matrix, based on fields of dataset



# File 'lib/statsample/dataset.rb', line 228

def collect_matrix
  rows=@fields.collect{|row|
    @fields.collect{|col|
      yield row,col
    }
  }
  Matrix.rows(rows)
end

#collect_with_index(type = :scale) ⇒ Object

Same as #collect, but yields the case index as second parameter.



# File 'lib/statsample/dataset.rb', line 529

def collect_with_index(type=:scale)
  data=[]
  each_with_index {|row, i|
    data.push(yield(row, i))
  }
  Statsample::Vector.new(data,type)
end

#compute(text) ⇒ Object

Returns a vector based on a string containing a calculation over vectors. The calculation is eval’ed, so it can contain any variable or expression valid in Ruby. For example:

a=[1,2].to_vector(:scale)
b=[3,4].to_vector(:scale)
ds={'a'=>a,'b'=>b}.to_dataset
ds.compute("a+b")
=> Vector [4,6]


# File 'lib/statsample/dataset.rb', line 655

def compute(text)
  @fields.each{|f|
    if @vectors[f].type==:scale
      text.gsub!(f,"row['#{f}'].to_f")
    else
      text.gsub!(f,"row['#{f}']")
    end
  }
  collect_with_index {|row, i|
    invalid=false
    @fields.each{|f|
      if @vectors[f].data_with_nils[i].nil?
        invalid=true
      end
    }
    if invalid
      nil
    else
      eval(text)
    end
  }
end
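
The row-wise evaluation with missing-value propagation can also be expressed with a block instead of an eval'ed string. The sketch below is plain Ruby on an array of hashes; `compute_rows` is a hypothetical helper and does not exist in Statsample.

```ruby
# Hypothetical helper: evaluate a block per row, returning nil for any
# row that contains a missing (nil) value.
def compute_rows(rows)
  rows.map do |row|
    row.values.any?(&:nil?) ? nil : yield(row)
  end
end

rows = [
  { "a" => 1, "b" => 3 },
  { "a" => 2, "b" => nil },
  { "a" => 2, "b" => 4 }
]
sums = compute_rows(rows) { |row| row["a"] + row["b"] }
```

A block avoids string substitution on field names, at the cost of not accepting expressions supplied as text.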

#crosstab(v1, v2, opts = {}) ⇒ Object



# File 'lib/statsample/dataset.rb', line 545

def crosstab(v1,v2,opts={})
  Statsample::Crosstab.new(@vectors[v1], @vectors[v2],opts)
end

#delete_vector(name) ⇒ Object

Delete a vector



# File 'lib/statsample/dataset.rb', line 303

def delete_vector(name)
  @fields.delete(name)
  @vectors.delete(name)
end

#dup(*fields_to_include) ⇒ Object

Returns a duplicate of the Dataset. If fields are given, only those vectors are included.



# File 'lib/statsample/dataset.rb', line 176

def dup(*fields_to_include)
  if fields_to_include.size==1 and fields_to_include[0].is_a? Array
    fields_to_include=fields_to_include[0]
  end
  fields_to_include=@fields if fields_to_include.size==0
  vectors={}
  fields=[]
  new_labels={}
  fields_to_include.each{|f|
    raise "Vector #{f} doesn't exists" unless @vectors.has_key? f
    vectors[f]=@vectors[f].dup
    new_labels[f]=@labels[f]
    fields.push(f)
  }
  Dataset.new(vectors,fields,new_labels)
end

#dup_empty ⇒ Object

Creates a copy of the given dataset, without data on vectors



# File 'lib/statsample/dataset.rb', line 193

def dup_empty
  vectors=@vectors.inject({}) {|a,v|
    a[v[0]]=v[1].dup_empty
    a
  }
  Dataset.new(vectors,@fields.dup,@labels.dup)
end

#dup_only_valid ⇒ Object

Creates a copy of the given dataset, deleting all the cases with missing data on one of the vectors



# File 'lib/statsample/dataset.rb', line 156

def dup_only_valid
  if @vectors.any?{|field,vector| vector.has_missing_data?}
    ds=dup_empty
    each_array { |c|
      ds.add_case_array(c) unless @fields.find{|f| @vectors[f].data_with_nils[@i].nil? }
    }
    ds.update_valid_data
  else
    ds=dup()
  end
  ds
end
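
Dropping every case that has a missing value in any vector is often called listwise deletion. On plain arrays of rows, a minimal sketch (not using Statsample, with nil standing for missing data) looks like this:

```ruby
# Keep only the rows without any nil value.
rows = [[1, "a"], [nil, "b"], [3, nil], [4, "d"]]
complete = rows.reject { |row| row.any?(&:nil?) }
```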

#each ⇒ Object

Returns each case as a hash



# File 'lib/statsample/dataset.rb', line 445

def each
  begin
    @i=0
    @cases.times {|i|
      @i=i
      row=case_as_hash(i)
      yield row
    }
    @i=nil
  rescue =>e
    raise DatasetException.new(self, e)
  end
end

#each_array ⇒ Object

Returns each case as an array



# File 'lib/statsample/dataset.rb', line 487

def each_array
  @cases.times {|i|
    @i=i
    row=case_as_array(i)
    yield row
  }
  @i=nil
end

#each_array_with_nils ⇒ Object

Returns each case as an array, coding missing values as nils



# File 'lib/statsample/dataset.rb', line 473

def each_array_with_nils
  m=fields.size
  @cases.times {|i|
    @i=i
    row=Array.new(m)
    fields.each_index{|j|
      f=fields[j]
      row[j]=@vectors[f].data_with_nils[i]
    }
    yield row
  }
  @i=nil
end

#each_vector ⇒ Object

Retrieves each vector as [key, vector]



# File 'lib/statsample/dataset.rb', line 412

def each_vector # :yield: |key, vector|
  @fields.each{|k| yield k, @vectors[k]}
end

#each_with_index ⇒ Object

Returns each case as hash and index



# File 'lib/statsample/dataset.rb', line 459

def each_with_index # :yield: |case, i|
  begin
    @i=0
    @cases.times{|i|
      @i=i
      row=case_as_hash(i)
      yield row, i
    }
    @i=nil
  rescue =>e
    raise DatasetException.new(self, e)
  end
end

#filter ⇒ Object

Creates a new dataset with all the cases for which the block returns true



# File 'lib/statsample/dataset.rb', line 586

def filter
  ds=self.dup_empty
  each {|c|
    ds.add_case(c,false) if yield c
  }
  ds.update_valid_data
  ds
end

#filter_field(field) ⇒ Object

Creates a new vector with the data of a given field, for the cases in which the block returns true



# File 'lib/statsample/dataset.rb', line 596

def filter_field(field)
  a=[]
  each {|c|
    a.push(c[field]) if yield c
  }
  a.to_vector(@vectors[field].type)
end
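
The difference between #filter and #filter_field is only in what is kept for each matching case: the whole case, or the value of a single field. A plain-Ruby sketch on an array of hashes (the sample data is invented for illustration):

```ruby
rows = [
  { "name" => "ana",  "age" => 31 },
  { "name" => "luis", "age" => 17 },
  { "name" => "eva",  "age" => 45 }
]

# Like #filter: keep whole cases for which the block is true.
adults = rows.select { |row| row["age"] >= 18 }

# Like #filter_field("name"): keep only one field of the matching cases.
adult_names = rows.select { |row| row["age"] >= 18 }.map { |row| row["name"] }
```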

#from_to(from, to) ⇒ Object

Returns an array with the fields from the first argument to the last argument

Raises:

  • (ArgumentError)


# File 'lib/statsample/dataset.rb', line 169

def from_to(from,to)
  raise ArgumentError, "Field #{from} should be on dataset" if !@fields.include? from
  raise ArgumentError, "Field #{to} should be on dataset" if !@fields.include? to
  @fields.slice(@fields.index(from)..@fields.index(to))
end

#has_vector?(v) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/statsample/dataset.rb', line 249

def has_vector? (v)
  return @vectors.has_key?(v)
end

#inspect ⇒ Object



# File 'lib/statsample/dataset.rb', line 707

def inspect
  self.to_s
end

#label(v_id) ⇒ Object

Retrieves label for a vector, giving a field name.



# File 'lib/statsample/dataset.rb', line 150

def label(v_id) 
  raise "Vector #{v_id} doesn't exists" unless @fields.include? v_id
  @labels[v_id].nil? ? v_id : @labels[v_id]
end

#merge(other_ds) ⇒ Object

Merges the vectors of two datasets. In case of a name collision, the vector names are changed to x_1, x_2, ...



# File 'lib/statsample/dataset.rb', line 203

def merge(other_ds)
  raise "Cases should be equal (this:#{@cases}; other:#{other_ds.cases}" unless @cases==other_ds.cases
  types = @fields.collect{|f| @vectors[f].type} + other_ds.fields.collect{|f| other_ds[f].type}
  new_fields = (@fields+other_ds.fields).recode_repeated
  ds_new=Statsample::Dataset.new(new_fields)
  new_fields.each_index{|i|
    field=new_fields[i]
    ds_new[field].type=types[i]
  }
  @cases.times {|i|
    row=case_as_array(i)+other_ds.case_as_array(i)
    ds_new.add_case_array(row)
  }
  ds_new.update_valid_data
  ds_new
end
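
The renaming of colliding field names can be sketched as follows. This is an assumption about the naming scheme, based only on the "x_1, x_2" convention stated in the description; `rename_repeated` is a hypothetical helper and may not match what the library's recode_repeated does in every case.

```ruby
# Hypothetical helper: suffix repeated names with _1, _2, ... and leave
# unique names untouched.
def rename_repeated(names)
  totals = Hash.new(0)
  names.each { |n| totals[n] += 1 }
  seen = Hash.new(0)
  names.map do |n|
    next n if totals[n] == 1
    seen[n] += 1
    "#{n}_#{seen[n]}"
  end
end

merged_fields = rename_repeated(%w[id age id])
```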

#one_to_many(parent_fields, pattern) ⇒ Object

Creates a new dataset for one-to-many relations in a dataset, based on a pattern of field names.

For example, you have a survey on number of children with this structure:

id, name, child_name_1, child_age_1, child_name_2, child_age_2

With

ds.one_to_many(%w{id}, "child_%v_%n")

the fields in the first parameter are copied verbatim to the new dataset, and for the fields that match the pattern in the second parameter, one case is added for each different %n. For example

cases=[
  ['1','george','red',10,'blue',20,nil,nil],
  ['2','fred','green',15,'orange',30,'white',20],
  ['3','alfred',nil,nil,nil,nil,nil,nil]
]
ds=Statsample::Dataset.new(%w{id name car_color1 car_value1 car_color2 car_value2 car_color3 car_value3})
cases.each {|c| ds.add_case_array c }
ds.one_to_many(['id'],'car_%v%n').to_matrix
=> Matrix[
   ["red", "1", 10], 
   ["blue", "1", 20],
   ["green", "2", 15],
   ["orange", "2", 30],
   ["white", "2", 20]
   ]


# File 'lib/statsample/dataset.rb', line 738

def one_to_many(parent_fields, pattern)
  base_pattern=pattern.gsub(/%v|%n/,"")
  re=Regexp.new pattern.gsub("%v","(.+?)").gsub("%n","(\\d+?)")
  ds_vars=parent_fields
  vars=[]
  max_n=0
  h=parent_fields.inject({}) {|a,v| a[v]=Statsample::Vector.new([], @vectors[v].type);a }
  # Adding _col_id
  h['_col_id']=[].to_scale
  ds_vars.push("_col_id")
  @fields.each do |f|
    if f=~re
      if !vars.include? $1
        vars.push($1) 
        h[$1]=Statsample::Vector.new([], @vectors[f].type)
      end
      max_n=$2.to_i if max_n < $2.to_i
    end
  end
  ds=Dataset.new(h,ds_vars+vars)
  each do |row|
    row_out={}
    parent_fields.each do |f|
      row_out[f]=row[f]
    end
    max_n.times do |n1|
      n=n1+1
      any_data=false
      vars.each do |v|
        data=row[pattern.gsub("%v",v.to_s).gsub("%n",n.to_s)]
        row_out[v]=data
        any_data=true if !data.nil?
      end
      if any_data
        row_out["_col_id"]=n
        ds.add_case(row_out,false)
      end
      
    end
  end
  ds.update_valid_data
  ds
end
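
The core of the method above is the translation of the "%v"/"%n" pattern into a regular expression, which is then matched against every field name. That step can be tried in isolation in plain Ruby; the field list below is invented for illustration.

```ruby
# Same substitution as in the method source: %v captures the variable
# name, %n captures the numeric suffix.
pattern = "car_%v%n"
re = Regexp.new(pattern.gsub("%v", "(.+?)").gsub("%n", "(\\d+?)"))

fields = %w[id name car_color1 car_value1 car_color2 car_value2]
parsed = fields.map { |f| (m = re.match(f)) ? [m[1], m[2].to_i] : nil }.compact
```

Because the expression is unanchored and lazy, a suffix such as 12 would be captured as 1. Anchoring it (for example with \z) is one possible way around that, though this is a suggestion and not something the original does.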

#recode!(vector_name) ⇒ Object

Recode a vector based on a block



# File 'lib/statsample/dataset.rb', line 537

def recode!(vector_name)
  
  0.upto(@cases-1) {|i|
    @vectors[vector_name].data[i]=yield case_as_hash(i)
  }
  @vectors[vector_name].set_valid_data
end

#standarize ⇒ Object

Returns a dataset with standardized data



# File 'lib/statsample/dataset.rb', line 220

def standarize
  ds=dup()
  ds.fields.each {|f|
    ds[f]=ds[f].vector_standarized
  }
  ds
end
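
Standardizing a vector usually means converting each value to a z-score, (x - mean) / sd. The plain-Ruby sketch below assumes the sample standard deviation (n - 1 denominator) and at least two distinct values; whether vector_standarized uses exactly this definition is not stated on this page.

```ruby
# Hypothetical helper: z-scores with the sample standard deviation.
def standardize(values)
  mean = values.sum(0.0) / values.size
  variance = values.sum(0.0) { |v| (v - mean)**2 } / (values.size - 1)
  sd = Math.sqrt(variance)
  values.map { |v| (v - mean) / sd }
end

z = standardize([1, 2, 3])
```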

#summary ⇒ Object



# File 'lib/statsample/dataset.rb', line 782

def summary
  out=""
  out << "Summary for dataset\n"
  @vectors.each{|k,v|
    out << "###############\n"
    out << "Vector #{k}:\n"
    out << v.summary
    out << "###############\n"
    
  }
  out 
end

#to_gsl_matrix ⇒ Object



# File 'lib/statsample/dataset.rb', line 142

def to_gsl_matrix
  matrix=GSL::Matrix.alloc(cases,@vectors.size)
  each_array do |row|
    row.each_index{|y| matrix.set(@i,y,row[y]) }
  end
  matrix
end

#to_matrix ⇒ Object

Returns the data as a matrix. Columns are ordered by #fields and rows by order of insertion



# File 'lib/statsample/dataset.rb', line 558

def to_matrix
  rows=[]
  self.each_array{|c|
    rows.push(c)
  }
  Matrix.rows(rows)
end

#to_matrix_gsl ⇒ Object



# File 'lib/statsample/dataset.rb', line 567

def to_matrix_gsl
  rows=[]
  self.each_array{|c|
    rows.push(c)
  }
  GSL::Matrix.alloc(*rows)
end

#to_multiset_by_split(*fields) ⇒ Object



# File 'lib/statsample/dataset.rb', line 576

def to_multiset_by_split(*fields)
  require 'statsample/multiset'
  if fields.size==1
    to_multiset_by_split_one_field(fields[0])
  else
    to_multiset_by_split_multiple_fields(*fields)
  end
end

#to_multiset_by_split_multiple_fields(*fields) ⇒ Object



# File 'lib/statsample/dataset.rb', line 621

def to_multiset_by_split_multiple_fields(*fields)
  factors_total=nil
  fields.each do |f|
    if factors_total.nil?
      factors_total=@vectors[f].factors.collect{|c|
        [c]
      }
    else
      suma=[]
      factors=@vectors[f].factors
      factors_total.each{|f1| factors.each{|f2| suma.push(f1+[f2]) } }
      factors_total=suma
    end
  end
  ms=Multiset.new_empty_vectors(@fields,factors_total)
  p1=eval "Proc.new {|c| ms[["+fields.collect{|f| "c['#{f}']"}.join(",")+"]].add_case(c,false) }"
  each{|c| p1.call(c)}
  ms.datasets.each do |k,ds|
    ds.update_valid_data
    ds.vectors.each{|k1,v1| v1.type=@vectors[k1].type }
  end
  ms
  
end
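
The first half of the method above builds every combination of the factors of the split fields, which is a Cartesian product. A plain-Ruby sketch of that accumulation, with the hypothetical helper `factor_combinations` taking one list of factors per field:

```ruby
# Hypothetical helper: accumulate combinations field by field, as the
# method does with factors_total.
def factor_combinations(factor_lists)
  factor_lists.inject(nil) do |total, factors|
    if total.nil?
      factors.map { |f| [f] }
    else
      total.flat_map { |combo| factors.map { |f| combo + [f] } }
    end
  end
end

combos = factor_combinations([%w[m f], %w[young old]])
```

For two or more lists, Array#product gives the same combinations in the same order, so `%w[m f].product(%w[young old])` is an equivalent shortcut.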

#to_multiset_by_split_one_field(field) ⇒ Object

Raises:

  • (ArgumentError)


# File 'lib/statsample/dataset.rb', line 604

def to_multiset_by_split_one_field(field)
  raise ArgumentError,"Should use a correct field name" if !@fields.include? field
  factors=@vectors[field].factors
  ms=Multiset.new_empty_vectors(@fields,factors)
  each {|c|
    ms[c[field]].add_case(c,false)
  }
  #puts "Entering the datasets"
  ms.datasets.each {|k,ds|
    ds.update_valid_data
    ds.vectors.each{|k1,v1|
      #        puts "Vector #{k1}:"+v1.to_s
      v1.type=@vectors[k1].type
    }
  }
  ms
end

#to_s ⇒ Object



# File 'lib/statsample/dataset.rb', line 704

def to_s
  "#<"+self.class.to_s+":"+self.object_id.to_s+" @fields=["+@fields.join(",")+"] labels="+@labels.inspect+" cases="+@vectors[@fields[0]].size.to_s
end

#update_valid_data ⇒ Object

Checks vectors and fields after inserting data. Use it only after #add_case_array, or after #add_case with the second parameter set to false



# File 'lib/statsample/dataset.rb', line 298

def update_valid_data
  @fields.each{|f| @vectors[f].set_valid_data}
  check_length
end

#vector_by_calculation(type = :scale) ⇒ Object



# File 'lib/statsample/dataset.rb', line 324

def vector_by_calculation(type=:scale)
  a=[]
  each {|row|
    a.push(yield(row))
  }
  a.to_vector(type)
end

#vector_count_characters(fields = nil) ⇒ Object



# File 'lib/statsample/dataset.rb', line 360

def vector_count_characters(fields=nil)
  fields=check_fields(fields)
  collect_with_index do |row, i|
    fields.inject(0){|a,v|
      a+((@vectors[v].data_with_nils[i].nil?) ? 0: row[v].to_s.size)
    }
  end
end

#vector_mean(fields = nil, max_invalid = 0) ⇒ Object

Returns a vector with the mean for a set of fields. If the fields parameter is empty, returns the mean over all fields. If max_invalid > 0, returns the mean for all cases with 0 to max_invalid invalid fields.



# File 'lib/statsample/dataset.rb', line 372

def vector_mean(fields=nil,max_invalid=0)
  a=[]
  fields=check_fields(fields)
  size=fields.size
  each_with_index do |row, i |
    # number of invalid values
    sum=0
    invalids=0
    fields.each{|f|
      if !@vectors[f].data_with_nils[i].nil?
        sum+=row[f].to_f
      else
        invalids+=1
      end
    }
    if(invalids>max_invalid)
      a.push(nil)
    else
      a.push(sum.quo(size-invalids))
    end
  end
  a.to_vector(:scale)
end
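
The effect of max_invalid is easiest to see on a small example. The sketch below works on an array of hashes in plain Ruby, with nil standing for a missing value; `row_means` is a hypothetical helper, and unlike the method above it explicitly returns nil when a row has no valid value at all.

```ruby
# Hypothetical helper: per-row mean over the given fields, tolerating up
# to max_invalid missing values.
def row_means(rows, fields, max_invalid = 0)
  rows.map do |row|
    valid = fields.map { |f| row[f] }.compact
    invalids = fields.size - valid.size
    next nil if invalids > max_invalid || valid.empty?
    valid.sum(0.0) / valid.size
  end
end

rows = [
  { "a" => 1,   "b" => 3 },
  { "a" => 2,   "b" => nil },
  { "a" => nil, "b" => nil }
]
strict  = row_means(rows, %w[a b])     # no missing values tolerated
lenient = row_means(rows, %w[a b], 1)  # one missing value tolerated
```

With the default, any row with a missing value yields nil; with max_invalid set to 1, the second row is averaged over its single valid value.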

#vector_missing_values(fields = nil) ⇒ Object

Returns a vector with the number of missing values for each case



# File 'lib/statsample/dataset.rb', line 352

def vector_missing_values(fields=nil)
  fields=check_fields(fields)
  collect_with_index do |row, i|
    fields.inject(0) {|a,v|
      a+ ((@vectors[v].data_with_nils[i].nil?) ? 1: 0)
    }
  end
end

#vector_sum(fields = nil) ⇒ Object

Returns a vector with the sum of the fields. If the fields parameter is empty, sums all fields.



# File 'lib/statsample/dataset.rb', line 333

def vector_sum(fields=nil)
  a=[]
  fields||=@fields
  collect_with_index do |row, i|
    if(fields.find{|f| !@vectors[f].data_with_nils[i]})
      nil
    else
      fields.inject(0) {|ac,v| ac + row[v].to_f}
    end
  end
end
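
A row-wise sum that yields nil as soon as one of the fields is missing can be sketched in plain Ruby as follows. `row_sums` is a hypothetical helper operating on an array of hashes, not a Statsample method.

```ruby
# Hypothetical helper: per-row sum, nil when any field is missing.
def row_sums(rows, fields)
  rows.map do |row|
    fields.any? { |f| row[f].nil? } ? nil : fields.sum { |f| row[f].to_f }
  end
end

totals = row_sums([{ "a" => 1, "b" => 2 }, { "a" => nil, "b" => 2 }], %w[a b])
```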

#verify(*tests) ⇒ Object

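Tests each row with one or more tests. Each test is an array of the form [message, fields, proc], where message describes the error, fields lists the fields whose values are reported, and proc is a Proc of the form

Proc.new {|row| row['age']>0}

An optional String as first argument names the field used to identify rows in the messages; by default the first field is used. The function returns an array with all errors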



# File 'lib/statsample/dataset.rb', line 681

def verify(*tests)
  if(tests[0].is_a? String)
    id=tests[0]
    tests.shift
  else
    id=@fields[0]
  end
  vr=[]
  i=0
  each do |row|
    i+=1
    tests.each{|test|
      if ! test[2].call(row)
        values=""
        if test[1].size>0
          values=" ("+test[1].collect{|k| "#{k}=#{row[k]}"}.join(", ")+")"
        end
        vr.push("#{i} [#{row[id]}]: #{test[0]}#{values}")
      end
    }
  end
  vr
end
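
Judging from the source above, each test is indexed as test[0] (message), test[1] (fields to report) and test[2] (the callable). The plain-Ruby sketch below reproduces that shape on an array of hashes; `verify_rows` and the sample data are hypothetical and only illustrate the message format.

```ruby
# Hypothetical helper mirroring the message format of the method above.
def verify_rows(rows, id_field, *tests)
  errors = []
  rows.each_with_index do |row, index|
    tests.each do |message, fields, check|
      next if check.call(row)
      values = fields.empty? ? "" : " (" + fields.map { |k| "#{k}=#{row[k]}" }.join(", ") + ")"
      errors << "#{index + 1} [#{row[id_field]}]: #{message}#{values}"
    end
  end
  errors
end

rows = [{ "id" => "p1", "age" => 30 }, { "id" => "p2", "age" => -4 }]
age_test = ["age should be positive", ["age"], Proc.new { |row| row["age"] > 0 }]
errors = verify_rows(rows, "id", age_test)
```

Each error string carries the 1-based row number, the row identifier, the message and the offending values.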