Class: Statsample::Dataset

Inherits: Object

Includes: Writable

Defined in: lib/statsample/dataset.rb

Overview

A set of cases with values for one or more variables, analogous to a data frame in R or a standard SPSS data file. Every vector is identified by a field name. By default, the vectors are ordered by field name, but you can change the field order manually. The Dataset works like a Hash whose keys are field names and whose values are Statsample::Vector objects.

Usage

Create an empty dataset

Dataset.new()

Create a dataset with three empty vectors, called v1, v2 and v3

Dataset.new(%w{v1 v2 v3})

Create a dataset with two vectors

Dataset.new({'v1'=>%w{1 2 3}.to_vector, 'v2'=>%w{4 5 6}.to_vector})

Create a dataset with two given vectors (v1 and v2), with the vectors in inverted order

Dataset.new({'v2'=>v2,'v1'=>v1},['v2','v1'])

The fast way to create a dataset uses Hash#to_dataset, with the field order as argument

v1 = [1,2,3].to_scale
v2 = [1,2,3].to_scale
ds = {'v1'=>v1, 'v2'=>v2}.to_dataset(%w{v2 v1})

Instance Attribute Summary

#cases ⇒ Object readonly

Number of cases.
#fields ⇒ Object

Ordered names of vectors.
#i ⇒ Object readonly

Position of the pointer in enumeration methods (such as #each).
#labels ⇒ Object

Deprecated: Label of vectors.
#vectors ⇒ Object readonly

Hash of Statsample::Vector.

Class Method Summary

.crosstab_by_asignation(rows, columns, values) ⇒ Object

Generates a new dataset using three vectors: rows, columns and values.

Instance Method Summary

#==(d2) ⇒ Object

Two datasets are equal if their vectors and fields are the same.
#[](i) ⇒ Object

Returns the vector named i.
#[]=(i, v) ⇒ Object
#_case_as_array(c) ⇒ Object

#_case_as_hash(c) ⇒ Object

#add_case(v, uvd = true) ⇒ Object

Insert a case, given either as an Array (size equal to the number of vectors, values in the same order as fields) or as a Hash (keys equal to fields). If uvd is false, #update_valid_data is not executed after inserting the case.
#add_case_array(v) ⇒ Object

Fast version of #add_case.
#add_vector(name, vector) ⇒ Object
#add_vectors_by_split(name, join = '-', sep = Statsample::SPLIT_TOKEN) ⇒ Object
#add_vectors_by_split_recode(name, join = '-', sep = Statsample::SPLIT_TOKEN) ⇒ Object
#as_r ⇒ Object
#bootstrap(n = nil) ⇒ Object

Creates a dataset of n cases drawn at random from this one. If n is not given, uses the original number of cases.
#case_as_array(i) ⇒ Object

Retrieves case i as an array, ordered by #fields.
#case_as_hash(i) ⇒ Object

Retrieves case i as a hash.
#check_fields(fields) ⇒ Object

Check if #fields attribute is correct, after inserting or deleting vectors.
#check_length ⇒ Object

Check vectors for type and size.
#check_order ⇒ Object
#col(c) ⇒ Object (also: #vector)
#collect(type = :scale) ⇒ Object

Retrieves a Statsample::Vector, based on the result of calculation performed on each case.
#collect_matrix ⇒ Object

Generate a matrix, based on fields of dataset.
#collect_with_index(type = :scale) ⇒ Object

Same as #collect, but passes the case index as the second parameter to the block.
#compute(text) ⇒ Object

Returns a vector based on a string containing a calculation over the dataset's vectors. The calculation is eval'ed, so it may contain any variable or expression valid in Ruby. For example, with a=[1,2].to_vector(:scale), b=[3,4].to_vector(:scale) and ds={'a'=>a,'b'=>b}.to_dataset, ds.compute("a+b") returns Vector [4,6].
#crosstab(v1, v2, opts = {}) ⇒ Object
#delete_vector(name) ⇒ Object

Delete a vector.
#dup(*fields_to_include) ⇒ Object

Returns a duplicate of the dataset. If fields are given, only those vectors are included.
#dup_empty ⇒ Object

Creates a copy of the given dataset, without data on vectors.
#dup_only_valid ⇒ Object

Creates a copy of the given dataset, deleting all the cases with missing data in any of the vectors.
#each ⇒ Object

Returns each case as a hash.
#each_array ⇒ Object

Returns each case as an array.
#each_array_with_nils ⇒ Object

Returns each case as an array, coding missing values as nils.
#each_vector ⇒ Object

Retrieves each vector as [key, vector].
#each_with_index ⇒ Object

Returns each case as hash and index.
#filter ⇒ Object

Creates a new dataset with all cases for which the block returns true.
#filter_field(field) ⇒ Object

Creates a new vector with the data of a given field, for the cases where the block returns true.
#from_to(from, to) ⇒ Object

Returns an array with the fields from the first argument to the last argument.
#has_vector?(v) ⇒ Boolean
#initialize(vectors = {}, fields = [], labels = {}) ⇒ Dataset constructor

A new instance of Dataset.
#inspect ⇒ Object
#label(v_id) ⇒ Object

Retrieves label for a vector, giving a field name.
#merge(other_ds) ⇒ Object

Merges vectors from two datasets. In case of a name collision, the vector names are changed to x_1, x_2, and so on.
#one_to_many(parent_fields, pattern) ⇒ Object

Creates a new dataset for one-to-many relations on a dataset, based on a pattern of field names.
#recode!(vector_name) ⇒ Object

Recode a vector based on a block.
#standarize ⇒ Object

Returns a dataset with standardized data.
#summary ⇒ Object
#to_gsl_matrix ⇒ Object
#to_matrix ⇒ Object

Return data as a matrix.
#to_matrix_gsl ⇒ Object
#to_multiset_by_split(*fields) ⇒ Object
#to_multiset_by_split_multiple_fields(*fields) ⇒ Object
#to_multiset_by_split_one_field(field) ⇒ Object
#to_s ⇒ Object
#update_valid_data ⇒ Object

Check vectors and fields after inserting data.
#vector_by_calculation(type = :scale) ⇒ Object
#vector_count_characters(fields = nil) ⇒ Object
#vector_mean(fields = nil, max_invalid = 0) ⇒ Object

Returns a vector with the mean for a set of fields. If the fields parameter is empty, returns the mean over all fields. If max_invalid > 0, returns the mean for all cases with 0 to max_invalid invalid fields.
#vector_missing_values(fields = nil) ⇒ Object

Returns a vector with the number of missing values for each case.
#vector_sum(fields = nil) ⇒ Object

Returns a vector with the sum of the fields. If the fields parameter is empty, sums all fields.
#verify(*tests) ⇒ Object

Tests each row with one or more tests. Each test is an array of the form [message, fields, proc], where the proc receives the row, for example Proc.new {|row| row['age']>0}. Returns an array with all errors.

Methods included from Writable

#save

Constructor Details

#initialize(vectors = {}, fields = [], labels = {}) ⇒ `Dataset`

Returns a new instance of Dataset.

# File 'lib/statsample/dataset.rb', line 128

def initialize(vectors={}, fields=[], labels={})
  if vectors.instance_of? Array
    @fields=vectors.dup
    @vectors=vectors.inject({}){|a,x| a[x]=Statsample::Vector.new(); a}
  else
    # Check vectors
    @vectors=vectors
    @fields=fields
    check_order
    check_length
  end
  @i=nil
  @labels=labels
end

Instance Attribute Details

#cases ⇒ `Object` (readonly)

Number of cases




# File 'lib/statsample/dataset.rb', line 64

def cases
  @cases
end

#fields ⇒ `Object`

Ordered names of vectors




# File 'lib/statsample/dataset.rb', line 62

def fields
  @fields
end

#i ⇒ `Object` (readonly)

Position of the pointer in enumeration methods (such as #each)




# File 'lib/statsample/dataset.rb', line 66

def i
  @i
end

#labels ⇒ `Object`

Deprecated: Label of vectors




# File 'lib/statsample/dataset.rb', line 68

def labels
  @labels
end

#vectors ⇒ `Object` (readonly)

Hash of Statsample::Vector




# File 'lib/statsample/dataset.rb', line 60

def vectors
  @vectors
end

Class Method Details

.crosstab_by_asignation(rows, columns, values) ⇒ `Object`

Generates a new dataset using three vectors: rows, columns and values.

For example, you have these values

x   y   v
a   a   0
a   b   1
b   a   1
b   b   0

You obtain

id  a   b
 a  0   1
 b  1   0

Useful to process outputs from databases

# File 'lib/statsample/dataset.rb', line 90

def self.crosstab_by_asignation(rows,columns,values)
  raise "Three vectors should be equal size" if rows.size!=columns.size or rows.size!=values.size
  cols_values=columns.factors
  cols_n=cols_values.size
  h_rows=rows.factors.inject({}){|a,v|
    a[v]=cols_values.inject({}){|a1,v1| a1[v1]=nil; a1}
    a
  }
  values.each_index{|i|
    h_rows[rows[i]][columns[i]]=values[i]
  }
  ds=Dataset.new(["_id"]+cols_values)
  cols_values.each{|c|
    ds[c].type=values.type
  }
  rows.factors.each {|row|
    n_row=Array.new(cols_n+1)
    n_row[0]=row
    cols_values.each_index {|i|
      n_row[i+1]=h_rows[row][cols_values[i]]
    }
    ds.add_case_array(n_row)
  }
  ds.update_valid_data
  ds
end
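
The long-to-wide reshaping described above can be sketched in plain Ruby, without Statsample. This is only an illustration under stated assumptions: `pivot_by_assignment` is a hypothetical helper, plain arrays stand in for vectors, and distinct labels are taken in first-seen order with `uniq`, which may differ from the order returned by the library's `factors`.

```ruby
# Plain-Ruby sketch of the rows/columns/values pivot (not the library code).
def pivot_by_assignment(rows, columns, values)
  unless rows.size == columns.size && rows.size == values.size
    raise ArgumentError, "Three arrays should be of equal size"
  end
  col_names = columns.uniq
  # One hash per distinct row label, with every column pre-filled with nil.
  table = rows.uniq.each_with_object({}) do |r, acc|
    acc[r] = col_names.each_with_object({}) { |c, h| h[c] = nil }
  end
  rows.each_index { |i| table[rows[i]][columns[i]] = values[i] }
  # One output case per row label: the label first, then one cell per column.
  table.map { |label, cells| [label] + col_names.map { |c| cells[c] } }
end

x = %w[a a b b]
y = %w[a b a b]
v = [0, 1, 1, 0]
p pivot_by_assignment(x, y, v)  # => [["a", 0, 1], ["b", 1, 0]]
```

Combinations that never appear in the input stay nil, which corresponds to a missing value in the resulting dataset.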

Instance Method Details

#==(d2) ⇒ `Object`

Two datasets are equal if their vectors and fields are the same




# File 'lib/statsample/dataset.rb', line 237

def ==(d2)
  @vectors==d2.vectors and @fields==d2.fields
end

#[](i) ⇒ `Object`

Returns the vector named i. If i is a Range of field names, returns a new dataset with the fields in that range.

# File 'lib/statsample/dataset.rb', line 507

def[](i)
  if i.is_a? String
    raise Exception,"Vector '#{i}' doesn't exists on dataset" unless @vectors.has_key?(i)
    @vectors[i]
  elsif i.is_a? Range
    fields=from_to(i.begin,i.end)
    vectors=fields.inject({}) {|a,v| a[v]=@vectors[v];a}
    ds=Dataset.new(vectors,fields)
  else
    raise ArgumentError, "You need a String or a Range"
  end
end

#[]=(i, v) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 548

def[]=(i,v)
  if v.instance_of? Statsample::Vector
    @vectors[i]=v
    check_order
  else
    raise ArgumentError,"Should pass a Statsample::Vector"
  end
end

#_case_as_array(c) ⇒ `Object`


# File 'lib/statsample/dataset.rb', line 440

def _case_as_array(c) # :nodoc:
  @fields.collect {|x| @vectors[x][c]}
end

#_case_as_hash(c) ⇒ `Object`


# File 'lib/statsample/dataset.rb', line 437

def _case_as_hash(c) # :nodoc:
  @fields.inject({}) {|a,x| a[x]=@vectors[x][c];a }
end

#add_case(v, uvd = true) ⇒ `Object`

Insert a case, using:

* Array: size equal to the number of vectors, with values in the same order as fields
* Hash: keys equal to fields

If uvd is false, #update_valid_data is not executed after inserting a case. This is very useful to speed up the insertion of many cases, because #update_valid_data performs checks on the vectors and on the dataset.

# File 'lib/statsample/dataset.rb', line 277

def add_case(v,uvd=true)
  case v
  when Array
    if (v[0].is_a? Array)
      v.each{|subv| add_case(subv,false)}
    else
      raise ArgumentError, "Input array size (#{v.size}) should be equal to fields number (#{@fields.size})" if @fields.size!=v.size
      v.each_index {|i| @vectors[@fields[i]].add(v[i],false)}
    end
  when Hash
    raise ArgumentError, "Hash keys should be equal to fields #{(v.keys - @fields).join(",")}" if @fields.sort!=v.keys.sort
    @fields.each{|f| @vectors[f].add(v[f],false)}
  else
    raise TypeError, 'Value must be a Array or a Hash'
  end
  if uvd
    update_valid_data
  end
end
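
The two accepted input shapes can be illustrated with a plain-Ruby column store. This is a sketch, not the library code: `append_case` is a hypothetical helper, and a hash from field name to array stands in for the dataset's vectors.

```ruby
# columns: field name => array of values; fields: ordered field names.
def append_case(columns, fields, value)
  case value
  when Array
    if fields.size != value.size
      raise ArgumentError, "Input array size (#{value.size}) should be equal to fields number (#{fields.size})"
    end
    # Positional: the i-th value goes to the i-th field.
    fields.each_with_index { |f, i| columns[f] << value[i] }
  when Hash
    raise ArgumentError, "Hash keys should be equal to fields" unless fields.sort == value.keys.sort
    # Keyed: key order in the hash does not matter.
    fields.each { |f| columns[f] << value[f] }
  else
    raise TypeError, "Value must be an Array or a Hash"
  end
  columns
end

fields = %w[id age]
columns = { "id" => [], "age" => [] }
append_case(columns, fields, ["p1", 30])
append_case(columns, fields, { "age" => 41, "id" => "p2" })
p columns["id"]   # => ["p1", "p2"]
p columns["age"]  # => [30, 41]
```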

#add_case_array(v) ⇒ `Object`

Fast version of #add_case. Can only add one case, and no error checking is performed. You SHOULD call #update_valid_data at the end of the insertion cycle




# File 'lib/statsample/dataset.rb', line 266

def add_case_array(v)
  v.each_index {|i| d=@vectors[@fields[i]].data; d.push(v[i])}
end

#add_vector(name, vector) ⇒ `Object`

Raises:

(ArgumentError)

# File 'lib/statsample/dataset.rb', line 244

def add_vector(name,vector)
  raise ArgumentError, "Vector have different size" if vector.size!=@cases
  @vectors[name]=vector
  check_order
end

#add_vectors_by_split(name, join = '-', sep = Statsample::SPLIT_TOKEN) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 318

def add_vectors_by_split(name,join='-',sep=Statsample::SPLIT_TOKEN)
  split=@vectors[name].split_by_separator(sep)
  split.each{|k,v|
    add_vector(name+join+k,v)
  }
end

#add_vectors_by_split_recode(name, join = '-', sep = Statsample::SPLIT_TOKEN) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 308

def add_vectors_by_split_recode(name,join='-',sep=Statsample::SPLIT_TOKEN)
  split=@vectors[name].split_by_separator(sep)
  i=1
  split.each{|k,v|
    new_field=name+join+i.to_s
    @labels[new_field]=name+":"+k
    add_vector(new_field,v)
    i+=1
  }
end

#as_r ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 794

def as_r
  require 'rsruby/dataframe'
  r=RSRuby.instance

end

#bootstrap(n = nil) ⇒ `Object`

Creates a dataset of n cases drawn at random from this one. If n is not given, uses the original number of cases

# File 'lib/statsample/dataset.rb', line 254

def bootstrap(n=nil)
  n||=@cases
  ds_boot=dup_empty
  for i in 1..n
    # Draw the index from the existing cases, so n may differ from @cases
    ds_boot.add_case_array(case_as_array(rand(@cases)))
  end
  ds_boot.update_valid_data
  ds_boot
end
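
Resampling with replacement can be sketched on a plain array of cases. The helper name `bootstrap_cases` and the injectable `rng` parameter are assumptions made for this illustration; the sketch also assumes that indices should be drawn from the number of existing cases, so that a sample larger than the original still refers to valid cases.

```ruby
# Draw n cases with replacement from an array of cases (plain Ruby).
def bootstrap_cases(cases, n = nil, rng = Random.new)
  n ||= cases.size
  # rng.rand(k) returns an integer in 0...k, so every index is valid.
  Array.new(n) { cases[rng.rand(cases.size)] }
end

cases = [[1, "a"], [2, "b"], [3, "c"]]
sample = bootstrap_cases(cases, 5, Random.new(42))
puts sample.size  # 5
```

Passing a seeded `Random` makes a resample reproducible, which helps when testing code built on top of it.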

#case_as_array(i) ⇒ `Object`

Retrieves case i as an array, ordered by #fields




# File 'lib/statsample/dataset.rb', line 428

def case_as_array(c) # :nodoc:
  Statsample::STATSAMPLE__.case_as_array(self,c)
end

#case_as_hash(i) ⇒ `Object`

Retrieves case i as a hash




# File 'lib/statsample/dataset.rb', line 417

def case_as_hash(c) # :nodoc:
  Statsample::STATSAMPLE__.case_as_hash(self,c)
end

#check_fields(fields) ⇒ `Object`

Check if #fields attribute is correct, after inserting or deleting vectors

# File 'lib/statsample/dataset.rb', line 345

def check_fields(fields)
  fields||=@fields
  raise "Fields #{(fields-@fields).join(", ")} doesn't exists on dataset" if (fields-@fields).size>0
  fields
end

#check_length ⇒ `Object`

Check vectors for type and size.

# File 'lib/statsample/dataset.rb', line 396

def check_length # :nodoc:
  size=nil
  @vectors.each do |k,v|
    raise Exception, "Data #{v.class} is not a vector on key #{k}" if !v.is_a? Statsample::Vector
    if size.nil?
      size=v.size
    else
      if v.size!=size
        p v.to_a.size
        raise Exception, "Vector #{k} have size #{v.size} and dataset have size #{size}"
      end
    end
  end
  @cases=size
end

#check_order ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 500

def check_order
  if(@vectors.keys.sort!=@fields.sort)
    @fields=@fields&@vectors.keys
    @fields+=@vectors.keys.sort-@fields
  end
end
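
The reconciliation performed here (keep the listed order, drop names without a vector, append unlisted vectors sorted by name) can be sketched as a pure function. `reconcile_fields` is a hypothetical name used only for this illustration.

```ruby
# Reconcile an ordered field list with the vector names actually present.
def reconcile_fields(fields, keys)
  return fields if keys.sort == fields.sort
  kept = fields & keys        # drop names with no vector, keep the given order
  kept + (keys.sort - kept)   # append unlisted vectors, sorted by name
end

p reconcile_fields(%w[v2 v1], %w[v1 v2 v3])  # => ["v2", "v1", "v3"]
p reconcile_fields([], %w[b c a])            # => ["a", "b", "c"]
p reconcile_fields(%w[x v1], %w[v1])         # => ["v1"]
```

The second call shows why a dataset built without an explicit field list ends up ordered by field name.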

#col(c) ⇒ `Object` Also known as: vector




# File 'lib/statsample/dataset.rb', line 240

def col(c)
  @vectors[c]
end

#collect(type = :scale) ⇒ `Object`

Retrieves a Statsample::Vector, based on the result of calculation performed on each case.

# File 'lib/statsample/dataset.rb', line 521

def collect(type=:scale)
  data=[]
  each {|row|
    data.push yield(row)
  }
  Statsample::Vector.new(data,type)
end

#collect_matrix ⇒ `Object`

Generate a matrix, based on fields of dataset

# File 'lib/statsample/dataset.rb', line 228

def collect_matrix
  rows=@fields.collect{|row|
    @fields.collect{|col|
      yield row,col
    }
  }
  Matrix.rows(rows)
end

#collect_with_index(type = :scale) ⇒ `Object`

Same as #collect, but giving case index as second parameter on yield.

# File 'lib/statsample/dataset.rb', line 529

def collect_with_index(type=:scale)
  data=[]
  each_with_index {|row, i|
    data.push(yield(row, i))
  }
  Statsample::Vector.new(data,type)
end

#compute(text) ⇒ `Object`

Returns a vector based on a string containing a calculation over the dataset's vectors. The calculation is eval'ed, so it may contain any variable or expression valid in Ruby. For example:

a=[1,2].to_vector(:scale)
b=[3,4].to_vector(:scale)
ds={'a'=>a,'b'=>b}.to_dataset
ds.compute("a+b")
=> Vector [4,6]

# File 'lib/statsample/dataset.rb', line 655

def compute(text)
  @fields.each{|f|
    # Compare the type; a single = would assign :scale to every vector
    if @vectors[f].type==:scale
      text.gsub!(f,"row['#{f}'].to_f")
    else
      text.gsub!(f,"row['#{f}']")
    end
  }
  collect_with_index {|row, i|
    invalid=false
    @fields.each{|f|
      if @vectors[f].data_with_nils[i].nil?
        invalid=true
      end
    }
    if invalid
      nil
    else
      eval(text)
    end
  }
end
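
The substitute-then-eval approach can be sketched on plain hashes. This is an illustration, not the library code: `compute_rows` is a hypothetical helper, rows are hashes with nil marking a missing value, and every field is treated as numeric. Unlike the listing above, the sketch replaces all field names in a single pass with word boundaries and leaves the caller's string unmodified.

```ruby
# Evaluate an expression written in terms of field names, once per row.
def compute_rows(rows, fields, expression)
  names = Regexp.union(fields).source
  # Replace each whole-word field name with a lookup on the current row.
  code = expression.gsub(/\b(?:#{names})\b/) { |name| "row['#{name}'].to_f" }
  rows.map do |row|
    next nil if fields.any? { |f| row[f].nil? }  # missing input => missing output
    eval(code, binding)  # runs arbitrary Ruby: never pass untrusted input
  end
end

rows = [{ "a" => 1, "b" => 3 }, { "a" => 2, "b" => 4 }, { "a" => 5, "b" => nil }]
p compute_rows(rows, %w[a b], "a+b")  # => [4.0, 6.0, nil]
```

Because the expression is evaluated as code, it should only ever come from a trusted source.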

#crosstab(v1, v2, opts = {}) ⇒ `Object`




# File 'lib/statsample/dataset.rb', line 545

def crosstab(v1,v2,opts={})
  Statsample::Crosstab.new(@vectors[v1], @vectors[v2],opts)
end

#delete_vector(name) ⇒ `Object`

Delete a vector

# File 'lib/statsample/dataset.rb', line 303

def delete_vector(name)
  @fields.delete(name)
  @vectors.delete(name)
end

#dup(*fields_to_include) ⇒ `Object`

Returns a duplicate of the dataset. If fields are given, only those vectors are included

# File 'lib/statsample/dataset.rb', line 176

def dup(*fields_to_include)
  if fields_to_include.size==1 and fields_to_include[0].is_a? Array
    fields_to_include=fields_to_include[0]
  end
  fields_to_include=@fields if fields_to_include.size==0
  vectors={}
  fields=[]
  new_labels={}
  fields_to_include.each{|f|
    raise "Vector #{f} doesn't exists" unless @vectors.has_key? f
    vectors[f]=@vectors[f].dup
    new_labels[f]=@labels[f]
    fields.push(f)
  }
  Dataset.new(vectors,fields,new_labels)
end

#dup_empty ⇒ `Object`

Creates a copy of the given dataset, without data on vectors

# File 'lib/statsample/dataset.rb', line 193

def dup_empty
  vectors=@vectors.inject({}) {|a,v|
    a[v[0]]=v[1].dup_empty
    a
  }
  Dataset.new(vectors,@fields.dup,@labels.dup)
end

#dup_only_valid ⇒ `Object`

Creates a copy of the given dataset, deleting all the cases with missing data in any of the vectors

# File 'lib/statsample/dataset.rb', line 156

def dup_only_valid
  if @vectors.any?{|field,vector| vector.has_missing_data?}
    ds=dup_empty
    each_array { |c|
      ds.add_case_array(c) unless @fields.find{|f| @vectors[f].data_with_nils[@i].nil? }
    }
    ds.update_valid_data
  else
    ds=dup()
  end
  ds
end
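
The effect of this method is listwise deletion: a case is kept only if none of its values is missing. A minimal plain-Ruby sketch, assuming cases are arrays and nil marks a missing value (`only_complete_cases` is a hypothetical name):

```ruby
# Keep only the cases that have no missing (nil) value.
def only_complete_cases(rows)
  rows.reject { |row| row.any?(&:nil?) }
end

rows = [[1, "a"], [2, nil], [nil, "c"], [4, "d"]]
p only_complete_cases(rows)  # => [[1, "a"], [4, "d"]]
```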

#each ⇒ `Object`

Returns each case as a hash

# File 'lib/statsample/dataset.rb', line 445

def each
  begin
    @i=0
    @cases.times {|i|
      @i=i
      row=case_as_hash(i)
      yield row
    }
    @i=nil
  rescue =>e
    raise DatasetException.new(self, e)
  end
end

#each_array ⇒ `Object`

Returns each case as an array

# File 'lib/statsample/dataset.rb', line 487

def each_array
  @cases.times {|i|
    @i=i
    row=case_as_array(i)
    yield row
  }
  @i=nil
end

#each_array_with_nils ⇒ `Object`

Returns each case as an array, coding missing values as nils

# File 'lib/statsample/dataset.rb', line 473

def each_array_with_nils
  m=fields.size
  @cases.times {|i|
    @i=i
    row=Array.new(m)
    fields.each_index{|j|
      f=fields[j]
      row[j]=@vectors[f].data_with_nils[i]
    }
    yield row
  }
  @i=nil
end

#each_vector ⇒ `Object`

Retrieves each vector as [key, vector]




# File 'lib/statsample/dataset.rb', line 412

def each_vector # :yield: |key, vector|
  @fields.each{|k| yield k, @vectors[k]}
end

#each_with_index ⇒ `Object`

Returns each case as hash and index

# File 'lib/statsample/dataset.rb', line 459

def each_with_index # :yield: |case, i|
  begin
    @i=0
    @cases.times{|i|
      @i=i
      row=case_as_hash(i)
      yield row, i
    }
    @i=nil
  rescue =>e
    raise DatasetException.new(self, e)
  end
end

#filter ⇒ `Object`

Creates a new dataset with all cases for which the block returns true

# File 'lib/statsample/dataset.rb', line 586

def filter
  ds=self.dup_empty
  each {|c|
    ds.add_case(c,false) if yield c
  }
  ds.update_valid_data
  ds
end

#filter_field(field) ⇒ `Object`

Creates a new vector with the data of a given field, for the cases where the block returns true

# File 'lib/statsample/dataset.rb', line 596

def filter_field(field)
  a=[]
  each {|c|
    a.push(c[field]) if yield c
  }
  a.to_vector(@vectors[field].type)
end

#from_to(from, to) ⇒ `Object`

Returns an array with the fields from the first argument to the last argument

Raises:

(ArgumentError)

# File 'lib/statsample/dataset.rb', line 169

def from_to(from,to)
  raise ArgumentError, "Field #{from} should be on dataset" if !@fields.include? from
  raise ArgumentError, "Field #{to} should be on dataset" if !@fields.include? to
  @fields.slice(@fields.index(from)..@fields.index(to))
end
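
The slice is positional, so the order of the two arguments matters. A small plain-Ruby sketch (with `fields_between` as a hypothetical stand-in operating on an array of names) shows the reversed case, which yields an empty array rather than an error:

```ruby
# Slice an ordered list of field names between two named endpoints, inclusive.
def fields_between(fields, from, to)
  raise ArgumentError, "Field #{from} should be on dataset" unless fields.include?(from)
  raise ArgumentError, "Field #{to} should be on dataset" unless fields.include?(to)
  fields[fields.index(from)..fields.index(to)]
end

fields = %w[id name age city]
p fields_between(fields, "name", "city")  # => ["name", "age", "city"]
p fields_between(fields, "city", "name")  # => []
```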

#has_vector?(v) ⇒ `Boolean`

Returns:

(Boolean)




# File 'lib/statsample/dataset.rb', line 249

def has_vector? (v)
  return @vectors.has_key?(v)
end

#inspect ⇒ `Object`




# File 'lib/statsample/dataset.rb', line 707

def inspect
  self.to_s
end

#label(v_id) ⇒ `Object`

Retrieves label for a vector, giving a field name.

# File 'lib/statsample/dataset.rb', line 150

def label(v_id) 
  raise "Vector #{v_id} doesn't exists" unless @fields.include? v_id
  @labels[v_id].nil? ? v_id : @labels[v_id]
end

#merge(other_ds) ⇒ `Object`

Merges vectors from two datasets. In case of a name collision, the vector names are changed to x_1, x_2, and so on.

# File 'lib/statsample/dataset.rb', line 203

def merge(other_ds)
  raise "Cases should be equal (this:#{@cases}; other:#{other_ds.cases})" unless @cases==other_ds.cases
  types = @fields.collect{|f| @vectors[f].type} + other_ds.fields.collect{|f| other_ds[f].type}
  new_fields = (@fields+other_ds.fields).recode_repeated
  ds_new=Statsample::Dataset.new(new_fields)
  new_fields.each_index{|i|
    field=new_fields[i]
    ds_new[field].type=types[i]
  }
  @cases.times {|i|
    row=case_as_array(i)+other_ds.case_as_array(i)
    ds_new.add_case_array(row)
  }
  ds_new.update_valid_data
  ds_new
end
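
The renaming on collision can be sketched as follows. `rename_repeated` is a hypothetical helper written for this illustration; it assumes the x_1, x_2 scheme described above and is not taken from the library's `recode_repeated`.

```ruby
# Append _1, _2, ... to every name that occurs more than once.
def rename_repeated(names)
  counts = names.each_with_object(Hash.new(0)) { |n, h| h[n] += 1 }
  seen = Hash.new(0)
  names.map do |name|
    next name if counts[name] == 1  # unique names are kept as they are
    seen[name] += 1
    "#{name}_#{seen[name]}"
  end
end

p rename_repeated(%w[id x y] + %w[x z])  # => ["id", "x_1", "y", "x_2", "z"]
```

Under this scheme both colliding vectors are renamed, so neither dataset's column silently takes precedence.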

#one_to_many(parent_fields, pattern) ⇒ `Object`

Creates a new dataset for one-to-many relations on a dataset, based on a pattern of field names.

For example, you have a survey on number of children with this structure:

id, name, child_name_1, child_age_1, child_name_2, child_age_2

With

ds.one_to_many(%w{id}, "child_%v_%n")

the fields given in the first parameter are copied verbatim to the new dataset, and the fields matching the pattern in the second parameter add one case for each different %n. For example:

cases=[
  ['1','george','red',10,'blue',20,nil,nil],
  ['2','fred','green',15,'orange',30,'white',20],
  ['3','alfred',nil,nil,nil,nil,nil,nil]
]
ds=Statsample::Dataset.new(%w{id name car_color1 car_value1 car_color2 car_value2 car_color3 car_value3})
cases.each {|c| ds.add_case_array c }
ds.one_to_many(['id'],'car_%v%n').to_matrix
=> Matrix[
   ["red", "1", 10], 
   ["blue", "1", 20],
   ["green", "2", 15],
   ["orange", "2", 30],
   ["white", "2", 20]
   ]

# File 'lib/statsample/dataset.rb', line 738

def one_to_many(parent_fields, pattern)
  base_pattern=pattern.gsub(/%v|%n/,"")
  re=Regexp.new pattern.gsub("%v","(.+?)").gsub("%n","(\\d+?)")
  ds_vars=parent_fields
  vars=[]
  max_n=0
  h=parent_fields.inject({}) {|a,v| a[v]=Statsample::Vector.new([], @vectors[v].type);a }
  # Adding _col_id
  h['_col_id']=[].to_scale
  ds_vars.push("_col_id")
  @fields.each do |f|
    if f=~re
      if !vars.include? $1
        vars.push($1) 
        h[$1]=Statsample::Vector.new([], @vectors[f].type)
      end
      max_n=$2.to_i if max_n < $2.to_i
    end
  end
  ds=Dataset.new(h,ds_vars+vars)
  each do |row|
    row_out={}
    parent_fields.each do |f|
      row_out[f]=row[f]
    end
    max_n.times do |n1|
      n=n1+1
      any_data=false
      vars.each do |v|
        data=row[pattern.gsub("%v",v.to_s).gsub("%n",n.to_s)]
        row_out[v]=data
        any_data=true if !data.nil?
      end
      if any_data
        row_out["_col_id"]=n
        ds.add_case(row_out,false)
      end
      
    end
  end
  ds.update_valid_data
  ds
end
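
How the %v/%n pattern picks out variable names and the largest index can be sketched in isolation. `pattern_variables` is a hypothetical helper; unlike the listing above, the sketch anchors the regular expression at both ends (an assumption of this example) so that multi-digit indexes are captured whole. As in the listing, %v is expected to appear before %n.

```ruby
# Extract the variable names and the highest index matched by a %v/%n pattern.
def pattern_variables(fields, pattern)
  source = pattern.gsub("%v") { "(.+?)" }.gsub("%n") { "(\\d+?)" }
  re = Regexp.new("\\A" + source + "\\z")
  vars = []
  max_n = 0
  fields.each do |f|
    m = re.match(f)
    next unless m
    vars << m[1] unless vars.include?(m[1])
    max_n = [max_n, m[2].to_i].max
  end
  [vars, max_n]
end

fields = %w[id name car_color1 car_value1 car_color2 car_value2 car_color3 car_value3]
vars, max_n = pattern_variables(fields, "car_%v%n")
p vars   # => ["color", "value"]
p max_n  # => 3
```

Each parent case then yields up to `max_n` child cases, one per index for which at least one variable has data.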

#recode!(vector_name) ⇒ `Object`

Recode a vector based on a block

# File 'lib/statsample/dataset.rb', line 537

def recode!(vector_name)
  
  0.upto(@cases-1) {|i|
    @vectors[vector_name].data[i]=yield case_as_hash(i)
  }
  @vectors[vector_name].set_valid_data
end

#standarize ⇒ `Object`

Returns a dataset with standardized data

# File 'lib/statsample/dataset.rb', line 220

def standarize
  ds=dup()
  ds.fields.each {|f|
    ds[f]=ds[f].vector_standarized
  }
  ds
end

#summary ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 782

def summary
  out=""
  out << "Summary for dataset\n"
  @vectors.each{|k,v|
    out << "###############\n"
    out << "Vector #{k}:\n"
    out << v.summary
    out << "###############\n"
    
  }
  out 
end

#to_gsl_matrix ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 142

def to_gsl_matrix
  matrix=GSL::Matrix.alloc(cases,@vectors.size)
  each_array do |row|
    row.each_index{|y| matrix.set(@i,y,row[y]) }
  end
  matrix
end

#to_matrix ⇒ `Object`

Return data as a matrix. Columns are ordered by #fields and rows by order of insertion

# File 'lib/statsample/dataset.rb', line 558

def to_matrix
  rows=[]
  self.each_array{|c|
    rows.push(c)
  }
  Matrix.rows(rows)
end

#to_matrix_gsl ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 567

def to_matrix_gsl
  rows=[]
  self.each_array{|c|
    rows.push(c)
  }
  GSL::Matrix.alloc(*rows)
end

#to_multiset_by_split(*fields) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 576

def to_multiset_by_split(*fields)
  require 'statsample/multiset'
  if fields.size==1
    to_multiset_by_split_one_field(fields[0])
  else
    to_multiset_by_split_multiple_fields(*fields)
  end
end

#to_multiset_by_split_multiple_fields(*fields) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 621

def to_multiset_by_split_multiple_fields(*fields)
  factors_total=nil
  fields.each do |f|
    if factors_total.nil?
      factors_total=@vectors[f].factors.collect{|c|
        [c]
      }
    else
      suma=[]
      factors=@vectors[f].factors
      factors_total.each{|f1| factors.each{|f2| suma.push(f1+[f2]) } }
      factors_total=suma
    end
  end
  ms=Multiset.new_empty_vectors(@fields,factors_total)
  p1=eval "Proc.new {|c| ms[["+fields.collect{|f| "c['#{f}']"}.join(",")+"]].add_case(c,false) }"
  each{|c| p1.call(c)}
  ms.datasets.each do |k,ds|
    ds.update_valid_data
    ds.vectors.each{|k1,v1| v1.type=@vectors[k1].type }
  end
  ms
  
end
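
Splitting by several fields amounts to building every combination of their factors and assigning each case to the combination it matches. A plain-Ruby sketch under stated assumptions: `factor_combinations` is a hypothetical helper, rows are hashes, and `group_by` is used in place of the eval-built Proc of the listing above.

```ruby
# Cartesian product of the factor lists, first field varying slowest.
def factor_combinations(factor_lists)
  first, *rest = factor_lists
  return [] if first.nil?
  first.product(*rest)
end

p factor_combinations([%w[m f], [1, 2, 3]])
# => [["m", 1], ["m", 2], ["m", 3], ["f", 1], ["f", 2], ["f", 3]]

rows = [
  { "sex" => "m", "grp" => 1, "y" => 10 },
  { "sex" => "f", "grp" => 1, "y" => 12 },
  { "sex" => "m", "grp" => 1, "y" => 14 }
]
split_fields = %w[sex grp]
# Key each case by its tuple of values on the split fields.
groups = rows.group_by { |r| split_fields.map { |f| r[f] } }
p groups[["m", 1]].size  # => 2
```

Note that `group_by` only produces the combinations that actually occur, whereas the product also lists empty ones.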

#to_multiset_by_split_one_field(field) ⇒ `Object`

Raises:

(ArgumentError)

# File 'lib/statsample/dataset.rb', line 604

def to_multiset_by_split_one_field(field)
  raise ArgumentError,"Should use a correct field name" if !@fields.include? field
  factors=@vectors[field].factors
  ms=Multiset.new_empty_vectors(@fields,factors)
  each {|c|
    ms[c[field]].add_case(c,false)
  }
  #puts "Entering the datasets"
  ms.datasets.each {|k,ds|
    ds.update_valid_data
    ds.vectors.each{|k1,v1|
      #        puts "Vector #{k1}:"+v1.to_s
      v1.type=@vectors[k1].type
    }
  }
  ms
end

#to_s ⇒ `Object`




# File 'lib/statsample/dataset.rb', line 704

def to_s
  "#<"+self.class.to_s+":"+self.object_id.to_s+" @fields=["+@fields.join(",")+"] labels="+@labels.inspect+" cases="+@vectors[@fields[0]].size.to_s
end

#update_valid_data ⇒ `Object`

Check vectors and fields after inserting data. Use only after #add_case_array, or after #add_case with the second parameter set to false

# File 'lib/statsample/dataset.rb', line 298

def update_valid_data
  @fields.each{|f| @vectors[f].set_valid_data}
  check_length
end

#vector_by_calculation(type = :scale) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 324

def vector_by_calculation(type=:scale)
  a=[]
  each {|row|
    a.push(yield(row))
  }
  a.to_vector(type)
end

#vector_count_characters(fields = nil) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 360

def vector_count_characters(fields=nil)
  fields=check_fields(fields)
  collect_with_index do |row, i|
    fields.inject(0){|a,v|
      a+((@vectors[v].data_with_nils[i].nil?) ? 0: row[v].to_s.size)
    }
  end
end

#vector_mean(fields = nil, max_invalid = 0) ⇒ `Object`

Returns a vector with the mean for a set of fields. If the fields parameter is empty, returns the mean over all fields. If max_invalid > 0, returns the mean for all cases with 0 to max_invalid invalid fields

# File 'lib/statsample/dataset.rb', line 372

def vector_mean(fields=nil,max_invalid=0)
  a=[]
  fields=check_fields(fields)
  size=fields.size
  each_with_index do |row, i |
    # number of invalid values
    sum=0
    invalids=0
    fields.each{|f|
      if !@vectors[f].data_with_nils[i].nil?
        sum+=row[f].to_f
      else
        invalids+=1
      end
    }
    if(invalids>max_invalid)
      a.push(nil)
    else
      a.push(sum.quo(size-invalids))
    end
  end
  a.to_vector(:scale)
end
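
The role of max_invalid is easiest to see on a small example. The following plain-Ruby sketch assumes rows are hashes with nil marking an invalid value; `row_means` is a hypothetical helper, and the guard against rows with no valid value at all is an addition of this sketch.

```ruby
# Mean of the valid values per row; nil when too many values are invalid.
def row_means(rows, fields, max_invalid = 0)
  rows.map do |row|
    valid = fields.map { |f| row[f] }.compact
    invalids = fields.size - valid.size
    next nil if invalids > max_invalid || valid.empty?
    valid.sum(0.0) / valid.size  # divide by the number of valid values only
  end
end

rows = [
  { "a" => 1,   "b" => 2,   "c" => 3 },
  { "a" => 4,   "b" => nil, "c" => 8 },
  { "a" => nil, "b" => nil, "c" => 9 }
]
p row_means(rows, %w[a b c])     # => [2.0, nil, nil]
p row_means(rows, %w[a b c], 1)  # => [2.0, 6.0, nil]
```

With max_invalid set to 1, the second row is averaged over its two valid values instead of being discarded.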

#vector_missing_values(fields = nil) ⇒ `Object`

Returns a vector with the number of missing values for each case

# File 'lib/statsample/dataset.rb', line 352

def vector_missing_values(fields=nil)
  fields=check_fields(fields)
  collect_with_index do |row, i|
    fields.inject(0) {|a,v|
      a+ ((@vectors[v].data_with_nils[i].nil?) ? 1: 0)
    }
  end
end

#vector_sum(fields = nil) ⇒ `Object`

Returns a vector with the sum of the fields. If the fields parameter is empty, sums all fields

# File 'lib/statsample/dataset.rb', line 333

def vector_sum(fields=nil)
  a=[]
  fields||=@fields
  collect_with_index do |row, i|
    if(fields.find{|f| !@vectors[f].data_with_nils[i]})
      nil
    else
      fields.inject(0) {|ac,v| ac + row[v].to_f}
    end
  end
end

#verify(*tests) ⇒ `Object`

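Tests each row with one or more tests. Each test is an array of the form [message, fields, proc]: a message for the error report, the fields whose values are shown in the report, and a Proc that receives the row, such as

Proc.new {|row| row['age']>0}

If the first argument is a String, it names the field used to identify rows in the report; otherwise the first field is used. The function returns an array with all errors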

# File 'lib/statsample/dataset.rb', line 681

def verify(*tests)
  if(tests[0].is_a? String)
    id=tests[0]
    tests.shift
  else
    id=@fields[0]
  end
  vr=[]
  i=0
  each do |row|
    i+=1
    tests.each{|test|
      if ! test[2].call(row)
        values=""
        if test[1].size>0
          values=" ("+test[1].collect{|k| "#{k}=#{row[k]}"}.join(", ")+")"
        end
        vr.push("#{i} [#{row[id]}]: #{test[0]}#{values}")
      end
    }
  end
  vr
end
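
The shape of a test and of the resulting report can be sketched on plain hashes. `verify_rows` is a hypothetical helper that mirrors the [message, fields, proc] structure read from the listing above; it takes the identifying field as an explicit argument rather than as an optional leading String.

```ruby
# Run each [message, fields, check] test against every row; collect failures.
def verify_rows(rows, id_field, *tests)
  errors = []
  rows.each_with_index do |row, i|
    tests.each do |message, fields, check|
      next if check.call(row)
      values = fields.empty? ? "" : " (" + fields.map { |k| "#{k}=#{row[k]}" }.join(", ") + ")"
      # Report format: 1-based row number, row identifier, message, field values.
      errors << "#{i + 1} [#{row[id_field]}]: #{message}#{values}"
    end
  end
  errors
end

rows = [{ "id" => "p1", "age" => 30 }, { "id" => "p2", "age" => -4 }]
age_test = ["Age should be positive", ["age"], Proc.new { |row| row["age"] > 0 }]
p verify_rows(rows, "id", age_test)
# => ["2 [p2]: Age should be positive (age=-4)"]
```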
