Class: Statsample::Dataset

Inherits:
Object
Includes:
Summarizable, Writable
Defined in:
lib/statsample/dataset.rb,
lib/statsample/rserve_extension.rb

Overview

Set of cases with values for one or more variables, analogous to a data frame in R or a standard SPSS data file. Every vector has a field name that identifies it. By default, the vectors are ordered by field name, but you can change the field order manually. A Dataset works like a Hash whose keys are field names and whose values are Statsample::Vector objects.

Usage

Create an empty dataset:

Dataset.new()

Create a dataset with three empty vectors, called v1, v2 and v3:

Dataset.new(%w{v1 v2 v3})

Create a dataset with two vectors, called v1 and v2:

Dataset.new({'v1'=>%w{1 2 3}.to_vector, 'v2'=>%w{4 5 6}.to_vector})

Create a dataset with two given vectors (v1 and v2), with the vectors in inverted order:

Dataset.new({'v2'=>v2,'v1'=>v1},['v2','v1'])

The fast way to create a dataset uses Hash#to_dataset, with the field order as argument:

v1 = [1,2,3].to_numeric
v2 = [1,2,3].to_numeric
ds = {'v1'=>v1, 'v2'=>v2}.to_dataset(%w{v2 v1})

Methods included from Summarizable

#summary

Methods included from Writable

#save

Constructor Details

#initialize(vectors = {}, fields = []) ⇒ Dataset

Creates a new dataset. A dataset is a set of ordered named vectors of the same size.

vectors

With an array, creates a set of empty vectors named after the values in the array. With a hash, each Vector is assigned as a variable of the Dataset, named after its key.

fields

Array of names for vectors. Used only to set the order of variables. If empty, the vectors' keys in alphabetical order are used as fields.



# File 'lib/statsample/dataset.rb', line 158

def initialize(vectors={}, fields=[])
  @@n_dataset||=0
  @@n_dataset+=1
  @name=_("Dataset %d") % @@n_dataset
  @cases=0
  @gsl=nil
  @i=nil

  if vectors.instance_of? Array
    @fields=vectors.dup
    @vectors=vectors.inject({}){|a,x| a[x]=Statsample::Vector.new(); a}
  else
    # Check vectors
    @vectors=vectors
    @fields=fields
    check_order
    check_length
  end
end

Instance Attribute Details

#cases ⇒ Object (readonly)

Number of cases

# File 'lib/statsample/dataset.rb', line 69

def cases
  @cases
end

#fields ⇒ Object

Ordered ids of vectors

# File 'lib/statsample/dataset.rb', line 65

def fields
  @fields
end

#i ⇒ Object (readonly)

Position of the pointer during enumeration methods (like #each)

# File 'lib/statsample/dataset.rb', line 71

def i
  @i
end

#name ⇒ Object

Name of dataset

# File 'lib/statsample/dataset.rb', line 67

def name
  @name
end

#vectors ⇒ Object (readonly)

Hash of Statsample::Vector

# File 'lib/statsample/dataset.rb', line 63

def vectors
  @vectors
end

Class Method Details

.crosstab_by_asignation(rows, columns, values) ⇒ Object

Generates a new dataset, using three vectors

  • Rows

  • Columns

  • Values

For example, given these values:

x   y   v
a   a   0
a   b   1
b   a   1
b   b   0

you obtain:

_id a   b
 a  0   1
 b  1   0

Useful for processing output from databases.



# File 'lib/statsample/dataset.rb', line 92

def self.crosstab_by_asignation(rows,columns,values)
  raise "Three vectors should be equal size" if rows.size!=columns.size or rows.size!=values.size
  cols_values=columns.factors
  cols_n=cols_values.size
  h_rows=rows.factors.inject({}){|a,v|
    a[v]=cols_values.inject({}){|a1,v1| a1[v1]=nil; a1}
    a
  }
  values.each_index{|i|
    h_rows[rows[i]][columns[i]]=values[i]
  }
  ds=Dataset.new(["_id"]+cols_values)
  cols_values.each{|c|
    ds[c].type=values.type
  }
  rows.factors.each {|row|
    n_row=Array.new(cols_n+1)
    n_row[0]=row
    cols_values.each_index {|i|
      n_row[i+1]=h_rows[row][cols_values[i]]
    }
    ds.add_case_array(n_row)
  }
  ds.update_valid_data
  ds
end
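
The pivot can be illustrated without the library. What follows is a minimal plain-Ruby sketch under stated assumptions, not Statsample's implementation: `crosstab_rows` is a hypothetical helper, plain arrays stand in for vectors, and distinct values are taken in order of first appearance rather than through `factors`.

```ruby
# Plain-Ruby sketch of the row/column/value pivot described above.
# `crosstab_rows` is a hypothetical helper, not part of Statsample.
def crosstab_rows(rows, columns, values)
  unless rows.size == columns.size && rows.size == values.size
    raise ArgumentError, "Three arrays should be of equal size"
  end
  col_keys = columns.uniq
  # One hash per distinct row key, pre-filled with nil for every column.
  table = rows.uniq.each_with_object({}) do |r, acc|
    acc[r] = col_keys.each_with_object({}) { |c, h| h[c] = nil }
  end
  values.each_index { |i| table[rows[i]][columns[i]] = values[i] }
  # Header first, then one array per row key.
  [["_id"] + col_keys] + table.map { |r, cells| [r] + col_keys.map { |c| cells[c] } }
end

crosstab_rows(%w[a a b b], %w[a b a b], [0, 1, 1, 0]).each { |line| puts line.inspect }
```

The first returned array plays the role of the field list (`_id` plus one field per distinct column value); each following array is one case.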

Instance Method Details

#==(d2) ⇒ Boolean

Two datasets are equal if their vectors and fields are the same

Returns:

  • (Boolean)


# File 'lib/statsample/dataset.rb', line 370

def ==(d2)
  @vectors==d2.vectors and @fields==d2.fields
end

#[](i) ⇒ Object

Returns the vector named i. If i is a Range or an Array of field names, returns a clone of the dataset restricted to those fields.

# File 'lib/statsample/dataset.rb', line 670

def[](i)
  if i.is_a? Range
    fields=from_to(i.begin,i.end)
    clone(*fields)
  elsif i.is_a? Array
    clone(i)
  else
    raise Exception,"Vector '#{i}' doesn't exists on dataset" unless @vectors.has_key?(i)
    @vectors[i]
  end
end

#[]=(i, v) ⇒ Object



# File 'lib/statsample/dataset.rb', line 709

def[]=(i,v)
  if v.instance_of? Statsample::Vector
    @vectors[i]=v
    check_order
  else
    raise ArgumentError,"Should pass a Statsample::Vector"
  end
end

#_case_as_array(c) ⇒ Object

:nodoc:



# File 'lib/statsample/dataset.rb', line 598

def _case_as_array(c) # :nodoc:
  @fields.collect {|x| @vectors[x][c]}
end

#_case_as_hash(c) ⇒ Object

:nodoc:



# File 'lib/statsample/dataset.rb', line 595

def _case_as_hash(c) # :nodoc:
  @fields.inject({}) {|a,x| a[x]=@vectors[x][c];a }
end

#add_case(v, uvd = true) ⇒ Object

Insert a case, using:

  • Array: size equal to number of vectors and values in the same order as fields

  • Hash: keys equal to fields

If uvd is false, #update_valid_data is not executed after inserting the case. This is useful for improving performance when inserting many cases, because #update_valid_data performs checks on the vectors and on the dataset.

# File 'lib/statsample/dataset.rb', line 424

def add_case(v,uvd=true)
  case v
  when Array
    if (v[0].is_a? Array)
      v.each{|subv| add_case(subv,false)}
    else
      raise ArgumentError, "Input array size (#{v.size}) should be equal to fields number (#{@fields.size})" if @fields.size!=v.size
      v.each_index {|i| @vectors[@fields[i]].add(v[i],false)}
    end
  when Hash
    raise ArgumentError, "Hash keys should be equal to fields #{(v.keys - @fields).join(",")}" if @fields.sort!=v.keys.sort
    @fields.each{|f| @vectors[f].add(v[f],false)}
  else
    raise TypeError, 'Value must be a Array or a Hash'
  end
  if uvd
    update_valid_data
  end
end
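
The two accepted input shapes can be sketched in plain Ruby. This is an illustration of the validation rules above, not library code; `normalize_case` is a hypothetical helper that turns either shape into an array in field order.

```ruby
# Plain-Ruby sketch of the Array/Hash handling in add_case.
# `normalize_case` is a hypothetical helper, not part of Statsample.
def normalize_case(fields, c)
  case c
  when Array
    unless c.size == fields.size
      raise ArgumentError, "Input array size (#{c.size}) should be equal to fields number (#{fields.size})"
    end
    c.dup
  when Hash
    unless c.keys.sort == fields.sort
      raise ArgumentError, "Hash keys should be equal to fields"
    end
    # Reorder the hash values so they follow the field order.
    c.values_at(*fields)
  else
    raise TypeError, "Value must be an Array or a Hash"
  end
end

fields = %w[v2 v1]
p normalize_case(fields, [4, 1])
p normalize_case(fields, { "v1" => 1, "v2" => 4 })
```

Both calls describe the same case: the array is already in field order, while the hash is reordered to match it.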

#add_case_array(v) ⇒ Object

Fast version of #add_case. Adds only one case and performs no error checking. You SHOULD call #update_valid_data at the end of the insertion cycle.

# File 'lib/statsample/dataset.rb', line 413

def add_case_array(v)
  v.each_index {|i| d=@vectors[@fields[i]].data; d.push(v[i])}
end

#add_vector(name, vector) ⇒ Object

Equivalent to dataset[name]=vector, except that it raises ArgumentError if the vector size differs from the number of cases.

Returns:

  • self

Raises:

  • (ArgumentError)


# File 'lib/statsample/dataset.rb', line 383

def add_vector(name, vector)
  raise ArgumentError, "Vector have different size" if vector.size!=@cases
  @vectors[name]=vector
  check_order
  self
end

#add_vectors_by_split(name, join = '-', sep = Statsample::SPLIT_TOKEN) ⇒ Object



# File 'lib/statsample/dataset.rb', line 473

def add_vectors_by_split(name,join='-',sep=Statsample::SPLIT_TOKEN)
  split=@vectors[name].split_by_separator(sep)
  split.each{|k,v|
    add_vector(name+join+k,v)
  }
end

#add_vectors_by_split_recode(name_, join = '-', sep = Statsample::SPLIT_TOKEN) ⇒ Object



# File 'lib/statsample/dataset.rb', line 463

def add_vectors_by_split_recode(name_,join='-',sep=Statsample::SPLIT_TOKEN)
  split=@vectors[name_].split_by_separator(sep)
  i=1
  split.each{|k,v|
    new_field=name_+join+i.to_s
    v.name=name_+":"+k
    add_vector(new_field,v)
    i+=1
  }
end

#bootstrap(n = nil) ⇒ Statsample::Dataset

Creates a dataset of n cases drawn at random, with replacement, from the original cases. If n is not given, uses the original number of cases.

Returns:

  • (Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 399

def bootstrap(n=nil)
  n||=@cases
  ds_boot=dup_empty
  n.times do
    # Draw from the original cases; rand(n) would go out of range when n > @cases
    ds_boot.add_case_array(case_as_array(rand(@cases)))
  end
  ds_boot.update_valid_data
  ds_boot
end
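
Resampling with replacement can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the library's code: `bootstrap_cases` is a hypothetical helper, cases are plain arrays, and the optional `rng` argument exists only to make runs repeatable.

```ruby
# Plain-Ruby sketch of bootstrap resampling: draw n cases with replacement.
# `bootstrap_cases` is a hypothetical helper, not part of Statsample.
def bootstrap_cases(cases, n = nil, rng = Random.new)
  n ||= cases.size
  # Each draw picks an index into the ORIGINAL cases, so any n is valid.
  Array.new(n) { cases[rng.rand(cases.size)] }
end

cases = [[1, "a"], [2, "b"], [3, "c"]]
sample = bootstrap_cases(cases, 5, Random.new(42))
puts sample.size
```

Indexing by the number of original cases rather than by n keeps every draw in range even when the requested sample is larger than the source.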

#case_as_array(i) ⇒ Object

Retrieves case i as an array, in #fields order

# File 'lib/statsample/dataset.rb', line 586

def case_as_array(c) # :nodoc:
  Statsample::STATSAMPLE__.case_as_array(self,c)
end

#case_as_hash(i) ⇒ Object

Retrieves case i as a hash



# File 'lib/statsample/dataset.rb', line 575

def case_as_hash(c) # :nodoc:
  Statsample::STATSAMPLE__.case_as_hash(self,c)
end

#check_fields(fields) ⇒ Object

Check if #fields attribute is correct, after inserting or deleting vectors



# File 'lib/statsample/dataset.rb', line 502

def check_fields(fields)
  fields||=@fields
  raise "Fields #{(fields-@fields).join(", ")} doesn't exists on dataset" if (fields-@fields).size>0
  fields
end

#check_length ⇒ Object

Check vectors for type and size.

# File 'lib/statsample/dataset.rb', line 555

def check_length # :nodoc:
  size=nil
  @vectors.each do |k,v|
    raise Exception, "Data #{v.class} is not a vector on key #{k}" if !v.is_a? Statsample::Vector
    if size.nil?
      size=v.size
    else
      if v.size!=size
        raise Exception, "Vector #{k} have size #{v.size} and dataset have size #{size}"
      end
    end
  end
  @cases=size
end

#check_order ⇒ Object

Checks congruence between the fields attribute and the keys of vectors

# File 'lib/statsample/dataset.rb', line 663

def check_order #:nodoc:
  if(@vectors.keys.sort!=@fields.sort)
    @fields=@fields&@vectors.keys
    @fields+=@vectors.keys.sort-@fields
  end
end
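
The reconciliation rule is easier to see on plain arrays. The sketch below mirrors the logic of the method on its inputs only; `reconcile_fields` is a hypothetical name introduced for illustration.

```ruby
# Plain-Ruby sketch of the field reconciliation rule used by check_order.
# `reconcile_fields` is a hypothetical helper, not part of Statsample.
def reconcile_fields(fields, vector_keys)
  return fields if vector_keys.sort == fields.sort
  kept = fields & vector_keys          # drop fields that have no vector
  kept + (vector_keys.sort - kept)     # append unlisted keys alphabetically
end

p reconcile_fields(%w[v2 v1], %w[v1 v2 v3])
p reconcile_fields([], %w[b a])
p reconcile_fields(%w[x v1], %w[v1])
```

An explicit field order is preserved, stale names are dropped, and any vector not mentioned in the field list is appended in alphabetical order.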

#clear_gsl ⇒ Object

# File 'lib/statsample/dataset.rb', line 728

def clear_gsl
  @gsl=nil
end

#clone(*fields_to_include) ⇒ Statsample::Dataset

Returns a shallow copy of Dataset. Object id will be distinct, but @vectors will be the same.

Parameters:

  • fields_to_include: array of fields to include. If omitted, all fields are included.

Returns:

  • (Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 257

def clone(*fields_to_include)
  if fields_to_include.size==1 and fields_to_include[0].is_a? Array
    fields_to_include=fields_to_include[0]
  end
  fields_to_include=@fields.dup if fields_to_include.size==0
  ds=Dataset.new
  fields_to_include.each{|f|
    raise "Vector #{f} doesn't exists" unless @vectors.has_key? f
    ds[f]=@vectors[f]
  }
  ds.fields=fields_to_include
  ds.name=@name
  ds.update_valid_data
  ds
end

#clone_only_valid(*fields_to_include) ⇒ Statsample::Dataset

Returns (when possible) a cheap copy of the dataset. If no vector has missing values, returns the original vectors. If missing values are present, uses #dup_only_valid.

Parameters:

  • fields_to_include: array of fields to include. If omitted, all fields are included.

Returns:

  • (Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 242

def clone_only_valid(*fields_to_include)
  if fields_to_include.size==1 and fields_to_include[0].is_a? Array
    fields_to_include=fields_to_include[0]
  end
  fields_to_include=@fields.dup if fields_to_include.size==0
  if fields_to_include.any? {|v| @vectors[v].has_missing_data?}
    dup_only_valid(fields_to_include)
  else
    clone(fields_to_include)
  end
end

#col(c) ⇒ Statsample::Vector Also known as: vector

Returns vector c

Returns:

  • (Statsample::Vector)

# File 'lib/statsample/dataset.rb', line 376

def col(c)
  @vectors[c]
end

#collect(type = :numeric) ⇒ Object

Returns a Statsample::Vector built from the result of the block evaluated on each case.

# File 'lib/statsample/dataset.rb', line 683

def collect(type=:numeric)
  data=[]
  each {|row|
    data.push yield(row)
  }
  Statsample::Vector.new(data,type)
end

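#collect_matrix ⇒ ::Matrix

Generates a matrix based on the fields of the dataset. The block receives each pair of field names (row, col), and its result becomes the corresponding matrix entry.

Returns:

  • (::Matrix)
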
# File 'lib/statsample/dataset.rb', line 358

def collect_matrix
  rows=@fields.collect{|row|
    @fields.collect{|col|
      yield row,col
    }
  }
  Matrix.rows(rows)
end

#collect_with_index(type = :numeric) ⇒ Object

Same as #collect, but yields the case index as a second block parameter.

# File 'lib/statsample/dataset.rb', line 691

def collect_with_index(type=:numeric)
  data=[]
  each_with_index {|row, i|
    data.push(yield(row, i))
  }
  Statsample::Vector.new(data,type)
end

#compute(text) ⇒ Object

Returns a vector based on a string containing a calculation over the dataset's vectors. The calculation is evaluated with eval, so you can use any variable or expression valid in Ruby. For example:

a=[1,2].to_numeric
b=[3,4].to_numeric
ds={'a'=>a,'b'=>b}.to_dataset
ds.compute("a+b")
=> Vector [4,6]

# File 'lib/statsample/dataset.rb', line 870

def compute(text)
  @fields.each{|f|
    if @vectors[f].type==:numeric
      text.gsub!(f,"row['#{f}'].to_f")
    else
      text.gsub!(f,"row['#{f}']")
    end
  }
  collect_with_index {|row, i|
    invalid=false
    @fields.each{|f|
      if @vectors[f].data_with_nils[i].nil?
        invalid=true
      end
    }
    if invalid
      nil
    else
      eval(text)
    end
  }
end
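
Because the expression string is passed to eval, it should come only from trusted input. The same per-row calculation, including the rule that any missing value yields nil, can be sketched with a block instead of a string. This is a plain-Ruby illustration, not a Statsample API; `compute_rows` is a hypothetical helper and rows are plain hashes.

```ruby
# Plain-Ruby sketch: per-row calculation that yields nil for incomplete rows.
# `compute_rows` is a hypothetical helper, not part of Statsample.
def compute_rows(rows)
  rows.map do |row|
    # Mirror the nil handling above: any missing value invalidates the row.
    row.values.any?(&:nil?) ? nil : yield(row)
  end
end

rows = [{ "a" => 1, "b" => 3 }, { "a" => 2, "b" => nil }, { "a" => 2, "b" => 4 }]
p compute_rows(rows) { |r| r["a"].to_f + r["b"].to_f }
```

A block keeps the calculation in ordinary Ruby code, so no string substitution of field names is needed.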

#correlation_matrix(fields = nil) ⇒ Object

Return a correlation matrix for fields included as parameters. By default, uses all fields of dataset



# File 'lib/statsample/dataset.rb', line 749

def correlation_matrix(fields = nil)
  if fields
    ds = clone(fields)
  else
    ds = self
  end
  Statsample::Bivariate.correlation_matrix(ds)
end

#covariance_matrix(fields = nil) ⇒ Object

Return a covariance matrix for fields included as parameters. By default, uses all fields of dataset

# File 'lib/statsample/dataset.rb', line 760

def covariance_matrix(fields = nil)
  if fields
    ds = clone(fields)
  else
    ds = self
  end
  Statsample::Bivariate.covariance_matrix(ds)
end

#crosstab(v1, v2, opts = {}) ⇒ Object



# File 'lib/statsample/dataset.rb', line 706

def crosstab(v1,v2,opts={})
  Statsample::Crosstab.new(@vectors[v1], @vectors[v2],opts)
end

#delete_vector(*args) ⇒ Object

Delete vector named name. Multiple fields accepted.



# File 'lib/statsample/dataset.rb', line 451

def delete_vector(*args)
  if args.size==1 and args[0].is_a? Array
    names=args[0]
  else
    names=args
  end
  names.each do |name|
    @fields.delete(name)
    @vectors.delete(name)
  end
end

#dup(*fields_to_include) ⇒ Statsample::Dataset

Returns a duplicate of the Dataset. All vectors are copied, so any modification on new dataset doesn’t affect original dataset’s vectors. If fields given as parameter, only include those vectors.

Parameters:

  • fields_to_include: array of fields to include. If omitted, all fields are included.

Returns:

  • (Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 211

def dup(*fields_to_include)
  if fields_to_include.size==1 and fields_to_include[0].is_a? Array
    fields_to_include=fields_to_include[0]
  end
  fields_to_include=@fields if fields_to_include.size==0
  vectors={}
  fields=[]
  fields_to_include.each{|f|
    raise "Vector #{f} doesn't exists" unless @vectors.has_key? f
    vectors[f]=@vectors[f].dup
    fields.push(f)
  }
  ds=Dataset.new(vectors,fields)
  ds.name= self.name
  ds
end

#dup_empty ⇒ Statsample::Dataset

Creates a copy of the given dataset, without data on vectors

Returns:

  • (Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 275

def dup_empty
  vectors=@vectors.inject({}) {|a,v|
    a[v[0]]=v[1].dup_empty
    a
  }
  Dataset.new(vectors,@fields.dup)
end

#dup_only_valid(*fields_to_include) ⇒ Object

Creates a copy of the given dataset, deleting all the cases with missing data on one of the vectors.

Parameters:

  • fields_to_include: array of fields to include. If omitted, all fields are included.

# File 'lib/statsample/dataset.rb', line 183

def dup_only_valid(*fields_to_include)
  if fields_to_include.size==1 and fields_to_include[0].is_a? Array
    fields_to_include=fields_to_include[0]
  end
  fields_to_include=@fields if fields_to_include.size==0
  if fields_to_include.any? {|f| @vectors[f].has_missing_data?}
    ds=Dataset.new(fields_to_include)
    fields_to_include.each {|f| ds[f].type=@vectors[f].type}
    each {|row|
      unless fields_to_include.any? {|f| @vectors[f].has_missing_data? and !@vectors[f].is_valid? row[f]}
        row_2=fields_to_include.inject({}) {|ac,v| ac[v]=row[v]; ac}
        ds.add_case(row_2)
      end
    }
  else
    ds=dup fields_to_include
  end
  ds.name= self.name
  ds
end
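
Dropping every case that has a missing value in any selected field is often called listwise deletion. A minimal plain-Ruby sketch follows; `only_valid_rows` is a hypothetical helper, rows are plain hashes, and nil is assumed to be the only marker of missing data (the library's own validity check may be broader).

```ruby
# Plain-Ruby sketch of listwise deletion over selected fields.
# `only_valid_rows` is a hypothetical helper; nil stands in for a missing value.
def only_valid_rows(rows, fields = nil)
  fields ||= rows.first ? rows.first.keys : []
  rows.reject { |row| fields.any? { |f| row[f].nil? } }
      .map { |row| fields.each_with_object({}) { |f, out| out[f] = row[f] } }
end

rows = [{ "a" => 1, "b" => nil }, { "a" => 2, "b" => 3 }]
p only_valid_rows(rows)
p only_valid_rows(rows, ["a"])
```

Restricting the field list changes which cases survive: a case is dropped only for missing values in the fields actually requested.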

#each ⇒ Object

Yields each case as a hash

# File 'lib/statsample/dataset.rb', line 603

def each
  begin
    @i=0
    @cases.times {|i|
      @i=i
      row=case_as_hash(i)
      yield row
    }
    @i=nil
  rescue =>e
    raise DatasetException.new(self, e)
  end
end

#each_array ⇒ Object

Yields each case as an array

# File 'lib/statsample/dataset.rb', line 647

def each_array
  @cases.times {|i|
    @i=i
    row=case_as_array(i)
    yield row
  }
  @i=nil
end

#each_array_with_nils ⇒ Object

Yields each case as an array, coding missing values as nils

# File 'lib/statsample/dataset.rb', line 633

def each_array_with_nils
  m=fields.size
  @cases.times {|i|
    @i=i
    row=Array.new(m)
    fields.each_index{|j|
      f=fields[j]
      row[j]=@vectors[f].data_with_nils[i]
    }
    yield row
  }
  @i=nil
end

#each_vector ⇒ Object

Yields each vector as key, vector

# File 'lib/statsample/dataset.rb', line 570

def each_vector # :yield: |key, vector|
  @fields.each{|k| yield k, @vectors[k]}
end

#each_with_index ⇒ Object

Yields each case as a hash, together with its index

# File 'lib/statsample/dataset.rb', line 618

def each_with_index # :yield: |case, i|
  begin
    @i=0
    @cases.times{|i|
      @i=i
      row=case_as_hash(i)
      yield row, i
    }
    @i=nil
  rescue =>e
    raise DatasetException.new(self, e)
  end
end

#filter ⇒ Object

Creates a new dataset with all cases for which the block returns true

# File 'lib/statsample/dataset.rb', line 770

def filter
  ds=self.dup_empty
  each {|c|
    ds.add_case(c, false) if yield c
  }
  ds.update_valid_data
  ds.name=_("%s(filtered)") % @name
  ds
end

#filter_field(field) ⇒ Object

Creates a new vector with the data of a given field, for the cases in which the block returns true

# File 'lib/statsample/dataset.rb', line 781

def filter_field(field)
  a=[]
  each do |c|
    a.push(c[field]) if yield c
  end
  a.to_vector(@vectors[field].type)
end

#from_to(from, to) ⇒ Object

Returns an array with the fields from the first argument to the last argument

Raises:

  • (ArgumentError)


# File 'lib/statsample/dataset.rb', line 230

def from_to(from,to)
  raise ArgumentError, "Field #{from} should be on dataset" if !@fields.include? from
  raise ArgumentError, "Field #{to} should be on dataset" if !@fields.include? to
  @fields.slice(@fields.index(from)..@fields.index(to))
end

#has_missing_data? ⇒ Boolean

Return true if any vector has missing data

Returns:

  • (Boolean)

# File 'lib/statsample/dataset.rb', line 119

def has_missing_data?
  @vectors.any? {|k,v| v.has_missing_data?}
end

#has_vector?(v) ⇒ Boolean

Returns true if dataset has vector v.

Returns:

  • (Boolean)


# File 'lib/statsample/dataset.rb', line 392

def has_vector? (v)
  return @vectors.has_key?(v)
end

#inspect ⇒ Object

# File 'lib/statsample/dataset.rb', line 922

def inspect
  self.to_s
end

#join(other_ds, fields_1 = [], fields_2 = [], type = :left) ⇒ Statsample::Dataset

Joins two datasets by the given fields. type is one of :left or :inner; the default is :left.

Returns:

  • (Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 308

def join(other_ds,fields_1=[],fields_2=[],type=:left)
  fields_new = other_ds.fields - fields_2
  fields = self.fields + fields_new

  other_ds_hash = {}
  other_ds.each do |row|
    key = row.select{|k,v| fields_2.include?(k)}.values
    value = row.select{|k,v| fields_new.include?(k)}
    if other_ds_hash[key].nil?
      other_ds_hash[key] = [value]
    else
      other_ds_hash[key] << value
    end
  end

  new_ds = Dataset.new(fields)

  self.each do |row|
    key = row.select{|k,v| fields_1.include?(k)}.values

    new_case = row.dup

    if other_ds_hash[key].nil?
      if type == :left
        fields_new.each{|field| new_case[field] = nil}
        new_ds.add_case(new_case)
      end
    else
      other_ds_hash[key].each do |new_values|
        new_ds.add_case new_case.merge(new_values)
      end
    end

  end
  new_ds
end
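
The strategy above (index one side by its key values, then scan the other side) can be sketched on plain data. This is an illustration under stated assumptions, not the library's code: `join_rows` is a hypothetical helper, rows are arrays of hashes, and key values are read in the order of the key lists.

```ruby
# Plain-Ruby sketch of the hash-join strategy used above.
# `join_rows` is a hypothetical helper, not part of Statsample.
def join_rows(left, right, keys_left, keys_right, type = :left)
  right_fields = right.flat_map(&:keys).uniq - keys_right
  # Index the right-hand rows by their key values.
  index = Hash.new { |h, k| h[k] = [] }
  right.each do |row|
    index[row.values_at(*keys_right)] << row.reject { |k, _| keys_right.include?(k) }
  end
  left.each_with_object([]) do |row, out|
    matches = index.fetch(row.values_at(*keys_left), nil)
    if matches
      matches.each { |extra| out << row.merge(extra) }
    elsif type == :left
      # Unmatched left rows are kept, padded with nil.
      out << row.merge(right_fields.to_h { |f| [f, nil] })
    end
  end
end

people = [{ "id" => 1, "name" => "ann" }, { "id" => 2, "name" => "bob" }]
cities = [{ "id" => 1, "city" => "Lima" }, { "id" => 1, "city" => "Quito" }]
join_rows(people, cities, ["id"], ["id"]).each { |r| p r }
```

With :left, the unmatched row for "bob" is kept with a nil city; with :inner it would be dropped. A key that matches several right-hand rows produces one output row per match.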

#merge(other_ds) ⇒ Statsample::Dataset

Merges the vectors of two datasets. In case of a name collision, the vector names are changed to x_1, x_2, and so on.

Returns:

  • (Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 287

def merge(other_ds)
  raise "Cases should be equal (this:#{@cases}; other:#{other_ds.cases}" unless @cases==other_ds.cases
  types = @fields.collect{|f| @vectors[f].type} + other_ds.fields.collect{|f| other_ds[f].type}
  new_fields = (@fields+other_ds.fields).recode_repeated
  ds_new=Statsample::Dataset.new(new_fields)
  new_fields.each_index{|i|
    field=new_fields[i]
    ds_new[field].type=types[i]
  }
  @cases.times {|i|
    row=case_as_array(i)+other_ds.case_as_array(i)
    ds_new.add_case_array(row)
  }
  ds_new.update_valid_data
  ds_new
end
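
The renaming step relies on `recode_repeated`, whose source is not shown here. A plausible plain-Ruby sketch of the x_1, x_2 scheme described above follows; `recode_repeated_names` is a hypothetical name, and the exact behavior of the library method is an assumption.

```ruby
# Plain-Ruby sketch of renaming colliding field names as x_1, x_2, ...
# `recode_repeated_names` is a hypothetical stand-in for the renaming step.
def recode_repeated_names(names)
  counts = names.tally
  seen = Hash.new(0)
  names.map do |name|
    next name if counts[name] == 1   # unique names are left untouched
    seen[name] += 1
    "#{name}_#{seen[name]}"
  end
end

p recode_repeated_names(%w[id age id score])
```

Only names that occur more than once receive a suffix, so fields unique to either dataset keep their original names.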

#nest(*tree_keys, &block) ⇒ Object

Returns a nested hash that uses the given fields as keys and, at the innermost level, an array of hashes with the remaining values. If a block is provided, it is used to produce the innermost values; it receives the dataset row, the current innermost hash in the hierarchy, and the name of the key to include.

# File 'lib/statsample/dataset.rb', line 128

def nest(*tree_keys,&block)
  tree_keys=tree_keys[0] if tree_keys[0].is_a? Array
  out=Hash.new
  each do |row|
    current=out
    # Create tree
    tree_keys[0,tree_keys.size-1].each do |f|
      root=row[f]
      current[root]||=Hash.new
      current=current[root]
    end
    name=row[tree_keys.last]
    if !block
      current[name]||=Array.new
      current[name].push(row.delete_if{|key,value| tree_keys.include? key})
    else
      current[name]=block.call(row, current,name)
    end
  end
  out
end
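
The block-less behavior can be sketched on plain data. `nest_rows` below is a hypothetical helper written for illustration, with rows as plain hashes; it is not the library's method.

```ruby
# Plain-Ruby sketch of nesting rows by a list of key fields (no block form).
# `nest_rows` is a hypothetical helper, not part of Statsample.
def nest_rows(rows, *tree_keys)
  out = {}
  rows.each do |row|
    current = out
    # Walk (and create) one hash level per key except the last.
    tree_keys[0...-1].each { |f| current = (current[row[f]] ||= {}) }
    leaf = row[tree_keys.last]
    # The innermost level collects the remaining values of each row.
    (current[leaf] ||= []) << row.reject { |k, _| tree_keys.include?(k) }
  end
  out
end

rows = [
  { "country" => "cl", "city" => "stgo", "n" => 1 },
  { "country" => "cl", "city" => "stgo", "n" => 2 },
  { "country" => "pe", "city" => "lima", "n" => 3 }
]
p nest_rows(rows, "country", "city")
```

Each key field adds one level to the hash, and the key fields themselves are removed from the innermost hashes.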

#one_to_many(parent_fields, pattern) ⇒ Object

Creates a new dataset for one to many relations on a dataset, based on pattern of field names.

For example, you have a survey on number of children with this structure:

id, name, child_name_1, child_age_1, child_name_2, child_age_2

With

ds.one_to_many(%w{id}, "child_%v_%n")

the fields in the first parameter are copied verbatim to the new dataset, and for the fields matching the pattern in the second parameter, one case is added for each distinct %n. For example:

cases=[
  ['1','george','red',10,'blue',20,nil,nil],
  ['2','fred','green',15,'orange',30,'white',20],
  ['3','alfred',nil,nil,nil,nil,nil,nil]
]
ds=Statsample::Dataset.new(%w{id name car_color1 car_value1 car_color2 car_value2 car_color3 car_value3})
cases.each {|c| ds.add_case_array c }
ds.one_to_many(['id'],'car_%v%n').to_matrix
=> Matrix[
   ["red", "1", 10],
   ["blue", "1", 20],
   ["green", "2", 15],
   ["orange", "2", 30],
   ["white", "2", 20]
   ]


# File 'lib/statsample/dataset.rb', line 953

def one_to_many(parent_fields, pattern)
  #base_pattern=pattern.gsub(/%v|%n/,"")
  re=Regexp.new pattern.gsub("%v","(.+?)").gsub("%n","(\\d+?)")
  # Copy so that the caller's parent_fields array is not modified
  ds_vars=parent_fields.dup
  vars=[]
  max_n=0
  h=parent_fields.inject({}) {|a,v| a[v]=Statsample::Vector.new([], @vectors[v].type);a }
  # Adding _col_id
  h['_col_id']=[].to_numeric
  ds_vars.push("_col_id")
  @fields.each do |f|
    if f=~re
      if !vars.include? $1
        vars.push($1)
        h[$1]=Statsample::Vector.new([], @vectors[f].type)
      end
      max_n=$2.to_i if max_n < $2.to_i
    end
  end
  ds=Dataset.new(h,ds_vars+vars)
  each do |row|
    row_out={}
    parent_fields.each do |f|
      row_out[f]=row[f]
    end
    max_n.times do |n1|
      n=n1+1
      any_data=false
      vars.each do |v|
        data=row[pattern.gsub("%v",v.to_s).gsub("%n",n.to_s)]
        row_out[v]=data
        any_data=true if !data.nil?
      end
      if any_data
        row_out["_col_id"]=n
        ds.add_case(row_out,false)
      end

    end
  end
  ds.update_valid_data
  ds
end
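
The way a pattern such as "car_%v%n" selects fields can be tried in isolation. The sketch below reuses the same two substitutions as the method; `pattern_to_regexp` is a hypothetical helper name introduced for illustration.

```ruby
# Sketch of how a field pattern is turned into a regular expression:
# %v becomes a lazy "any characters" group, %n a lazy "digits" group.
# `pattern_to_regexp` is a hypothetical helper, not part of Statsample.
def pattern_to_regexp(pattern)
  Regexp.new(pattern.gsub("%v", "(.+?)").gsub("%n", "(\\d+?)"))
end

re = pattern_to_regexp("car_%v%n")
%w[id name car_color1 car_value2].each do |field|
  m = re.match(field)
  puts m ? "#{field}: variable=#{m[1]} n=#{m[2]}" : "#{field}: no match"
end
```

Because the expression is unanchored and the digit group is lazy, a multi-digit suffix such as "car_color12" matches only its first digit in this sketch, which is worth keeping in mind for more than nine repetitions.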

#recode!(vector_name) ⇒ Object

Recode a vector based on a block



# File 'lib/statsample/dataset.rb', line 699

def recode!(vector_name)
  0.upto(@cases-1) {|i|
    @vectors[vector_name].data[i]=yield case_as_hash(i)
  }
  @vectors[vector_name].set_valid_data
end

#report_building(b) ⇒ Object



# File 'lib/statsample/dataset.rb', line 996

def report_building(b)
  b.section(:name=>@name) do |g|
    g.text _"Cases: %d"  % cases
    @fields.each do |f|
      g.text "Element:[#{f}]"
      g.parse_element(@vectors[f])
    end
  end
end

#standarize ⇒ Statsample::Dataset

Returns a dataset with standardized data.

Returns:

  • (Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 347

def standarize
  ds=dup()
  ds.fields.each do |f|
    ds[f]=ds[f].vector_standarized
  end
  ds
end

#to_gsl ⇒ Object

# File 'lib/statsample/dataset.rb', line 732

def to_gsl
  if @gsl.nil?
    if cases.nil?
      update_valid_data
    end
    @gsl=GSL::Matrix.alloc(cases,fields.size)
    self.each_array{|c|
      @gsl.set_row(@i,c)
    }
  end
  @gsl
end

#to_matrix ⇒ Object

Returns data as a matrix. Columns are ordered by #fields and rows by order of insertion.

# File 'lib/statsample/dataset.rb', line 719

def to_matrix
  rows=[]
  self.each_array{|c|
    rows.push(c)
  }
  Matrix.rows(rows)
end

#to_multiset_by_split(*fields) ⇒ Object



# File 'lib/statsample/dataset.rb', line 793

def to_multiset_by_split(*fields)
  require 'statsample/multiset'
  if fields.size==1
    to_multiset_by_split_one_field(fields[0])
  else
    to_multiset_by_split_multiple_fields(*fields)
  end
end

#to_multiset_by_split_multiple_fields(*fields) ⇒ Object



# File 'lib/statsample/dataset.rb', line 824

def to_multiset_by_split_multiple_fields(*fields)
  factors_total=nil
  fields.each do |f|
    if factors_total.nil?
      factors_total=@vectors[f].factors.collect{|c|
        [c]
      }
    else
      suma=[]
      factors=@vectors[f].factors
      factors_total.each{|f1| factors.each{|f2| suma.push(f1+[f2]) } }
      factors_total=suma
    end
  end
  ms=Multiset.new_empty_vectors(@fields,factors_total)

  p1=eval "Proc.new {|c| ms[["+fields.collect{|f| "c['#{f}']"}.join(",")+"]].add_case(c,false) }"
  each{|c| p1.call(c)}

  ms.datasets.each do |k,ds|
    ds.update_valid_data
    ds.name=fields.size.times.map {|i|
      f=fields[i]
      sk=k[i]
      @vectors[f].labeling(sk)
    }.join("-")
    ds.vectors.each{|k1,v1|
      v1.type=@vectors[k1].type
      v1.name=@vectors[k1].name
      v1.labels=@vectors[k1].labels

    }
  end
  ms

end
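
The loop that builds factors_total computes the Cartesian product of the distinct values of each split field. A plain-Ruby sketch of that step follows; `factor_combinations` is a hypothetical helper, rows are plain hashes, and sorting the distinct values is an assumption about how `factors` orders them.

```ruby
# Plain-Ruby sketch: every combination of distinct values of the split fields.
# `factor_combinations` is a hypothetical helper, not part of Statsample.
def factor_combinations(rows, fields)
  levels = fields.map { |f| rows.map { |r| r[f] }.uniq.sort }
  # Array#product yields one array per combination, one entry per field.
  levels.first.product(*levels.drop(1))
end

rows = [
  { "sex" => "f", "group" => 1 },
  { "sex" => "m", "group" => 2 },
  { "sex" => "f", "group" => 2 }
]
p factor_combinations(rows, %w[sex group])
```

Every combination gets a key, including combinations that never occur in the data, so some of the resulting datasets can be empty.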

#to_multiset_by_split_one_field(field) ⇒ Object

Creates a Statsample::Multiset, using one field

Raises:

  • (ArgumentError)


# File 'lib/statsample/dataset.rb', line 803

def to_multiset_by_split_one_field(field)
  raise ArgumentError,"Should use a correct field name" if !@fields.include? field
  factors=@vectors[field].factors
  ms=Multiset.new_empty_vectors(@fields, factors)
  each {|c|
    ms[c[field]].add_case(c,false)
  }
  #puts "Entering the datasets"
  ms.datasets.each {|k,ds|
    ds.update_valid_data
    ds.name=@vectors[field].labeling(k)
    ds.vectors.each{|k1,v1|
      #        puts "Vector #{k1}:"+v1.to_s
      v1.type=@vectors[k1].type
      v1.name=@vectors[k1].name
      v1.labels=@vectors[k1].labels

    }
  }
  ms
end

#to_REXP ⇒ Object

# File 'lib/statsample/rserve_extension.rb', line 11

def to_REXP
  names=@fields
  data=@fields.map {|f|
    Rserve::REXP::Wrapper.wrap(@vectors[f].data_with_nils)
  }
  l=Rserve::Rlist.new(data,names)
  Rserve::REXP.create_data_frame(l)
end

#to_s ⇒ Object

# File 'lib/statsample/dataset.rb', line 919

def to_s
  "#<"+self.class.to_s+":"+self.object_id.to_s+" @name=#{@name} @fields=["+@fields.join(",")+"] cases="+@vectors[@fields[0]].size.to_s
end

#update_valid_data ⇒ Object

Checks vectors and fields after inserting data. Use it only after #add_case_array, or after #add_case with the second parameter set to false.

# File 'lib/statsample/dataset.rb', line 445

def update_valid_data
  @gsl=nil
  @fields.each{|f| @vectors[f].set_valid_data}
  check_length
end

#vector_by_calculation(type = :numeric) ⇒ Object



# File 'lib/statsample/dataset.rb', line 480

def vector_by_calculation(type=:numeric)
  a=[]
  each do |row|
    a.push(yield(row))
  end
  a.to_vector(type)
end

#vector_count_characters(fields = nil) ⇒ Object



# File 'lib/statsample/dataset.rb', line 517

def vector_count_characters(fields=nil)
  fields=check_fields(fields)
  collect_with_index do |row, i|
    fields.inject(0){|a,v|
      a+((@vectors[v].data_with_nils[i].nil?) ? 0: row[v].to_s.size)
    }
  end
end

#vector_mean(fields = nil, max_invalid = 0) ⇒ Object

Returns a vector with the mean for a set of fields. If the fields parameter is empty, returns the mean for all fields. If max_invalid > 0, returns the mean for all cases with 0 to max_invalid invalid fields.

# File 'lib/statsample/dataset.rb', line 529

def vector_mean(fields=nil, max_invalid=0)
  a=[]
  fields=check_fields(fields)
  size=fields.size
  each_with_index do |row, i |
    # number of invalid values
    sum=0
    invalids=0
    fields.each{|f|
      if !@vectors[f].data_with_nils[i].nil?
        sum+=row[f].to_f
      else
        invalids+=1
      end
    }
    if(invalids>max_invalid)
      a.push(nil)
    else
      a.push(sum.quo(size-invalids))
    end
  end
  a=a.to_vector(:numeric)
  a.name=_("Means from %s") % @name
  a
end
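
The effect of max_invalid is easiest to see on a small example. The following plain-Ruby sketch is an illustration, not the library's code: `row_means` is a hypothetical helper, nil marks an invalid value, and, as a deliberate difference, a row with no valid values yields nil instead of attempting a division by zero.

```ruby
# Plain-Ruby sketch of per-case means with a tolerance for invalid values.
# `row_means` is a hypothetical helper, not part of Statsample.
def row_means(rows, fields, max_invalid = 0)
  rows.map do |row|
    valid = fields.map { |f| row[f] }.compact
    invalids = fields.size - valid.size
    # Too many invalid values (or none valid at all) gives a missing mean.
    next nil if invalids > max_invalid || valid.empty?
    valid.sum(0.0) / valid.size
  end
end

rows = [{ "a" => 1, "b" => 3 }, { "a" => nil, "b" => 4 }, { "a" => nil, "b" => nil }]
p row_means(rows, %w[a b])
p row_means(rows, %w[a b], 1)
```

With the default tolerance only complete cases receive a mean; raising max_invalid to 1 lets the second case use its single valid value.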

#vector_missing_values(fields = nil) ⇒ Object

Returns a vector with the number of missing values for each case

# File 'lib/statsample/dataset.rb', line 509

def vector_missing_values(fields=nil)
  fields=check_fields(fields)
  collect_with_index do |row, i|
    fields.inject(0) {|a,v|
      a+ ((@vectors[v].data_with_nils[i].nil?) ? 1: 0)
    }
  end
end

#vector_sum(fields = nil) ⇒ Object

Returns a vector with the sum of the given fields. If the fields parameter is empty, sums all fields.

# File 'lib/statsample/dataset.rb', line 489

def vector_sum(fields=nil)
  fields||=@fields
  vector=collect_with_index do |row, i|
    if(fields.find{|f| !@vectors[f].data_with_nils[i]})
      nil
    else
      fields.inject(0) {|ac,v| ac + row[v].to_f}
    end
  end
  vector.name=_("Sum from %s") % @name
  vector
end

#verify(*tests) ⇒ Object

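Tests each row with one or more tests. Each test is an array of the form [message, fields, proc], where fields lists the field names whose values are reported with the error and proc has the form

Proc.new {|row| row['age']>0}

If the first argument is a String, it names the field used to identify rows in the error messages; otherwise the first field is used. The function returns an array with all errors.
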
# File 'lib/statsample/dataset.rb', line 896

def verify(*tests)
  if(tests[0].is_a? String)
    id=tests[0]
    tests.shift
  else
    id=@fields[0]
  end
  vr=[]
  i=0
  each do |row|
    i+=1
    tests.each{|test|
      if ! test[2].call(row)
        values=""
        if test[1].size>0
          values=" ("+test[1].collect{|k| "#{k}=#{row[k]}"}.join(", ")+")"
        end
        vr.push("#{i} [#{row[id]}]: #{test[0]}#{values}")
      end
    }
  end
  vr
end
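
The shape of a test and of the resulting messages can be sketched on plain data. `verify_rows` below is a hypothetical helper that follows the message format of the method above over an array of hashes; it is an illustration, not the library's API.

```ruby
# Plain-Ruby sketch of row verification with [message, fields, proc] tests.
# `verify_rows` is a hypothetical helper, not part of Statsample.
def verify_rows(rows, id_field, *tests)
  errors = []
  rows.each_with_index do |row, i|
    tests.each do |message, fields, check|
      next if check.call(row)
      # Report the offending values of the listed fields, if any.
      values = fields.empty? ? "" : " (" + fields.map { |k| "#{k}=#{row[k]}" }.join(", ") + ")"
      errors << "#{i + 1} [#{row[id_field]}]: #{message}#{values}"
    end
  end
  errors
end

rows = [{ "id" => "p1", "age" => 30 }, { "id" => "p2", "age" => -5 }]
age_test = ["age must be positive", ["age"], ->(row) { row["age"] > 0 }]
puts verify_rows(rows, "id", age_test)
```

Each failing test contributes one message that carries the 1-based case number, the identifying field value, and the values named in the test.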