Class: Statsample::Dataset

Inherits: Object

Includes: Summarizable, Writable

Defined in: lib/statsample/dataset.rb, lib/statsample/rserve_extension.rb

Overview

Set of cases with values for one or more variables, analogous to a data frame in R or a standard SPSS data file. Every vector has a field name that identifies it. By default, the vectors are ordered by field name, but you can change the field order manually. The Dataset works like a Hash whose keys are field names and whose values are Statsample::Vector objects.

Usage

Create an empty dataset:

Dataset.new()

Create a dataset with three empty vectors, called v1, v2 and v3:

Dataset.new(%w{v1 v2 v3})

Create a dataset with two vectors, called v1 and v2:

Dataset.new({'v1'=>%w{1 2 3}.to_vector, 'v2'=>%w{4 5 6}.to_vector})

Create a dataset with two given vectors (v1 and v2), with the vectors in inverted order:

Dataset.new({'v2'=>v2,'v1'=>v1},['v2','v1'])

The fast way to create a dataset is Hash#to_dataset, with the field order as its argument:

v1 = [1,2,3].to_numeric
v2 = [1,2,3].to_numeric
ds = {'v1'=>v1, 'v2'=>v2}.to_dataset(%w{v2 v1})

Instance Attribute Summary

#cases ⇒ Object readonly

Number of cases.
#fields ⇒ Object

Ordered ids of vectors.
#i ⇒ Object readonly

Location of pointer on enumerations methods (like #each).
#name ⇒ Object

Name of dataset.
#vectors ⇒ Object readonly

Hash of Statsample::Vector.

Class Method Summary

.crosstab_by_asignation(rows, columns, values) ⇒ Object

Generates a new dataset, using three vectors - Rows - Columns - Values.

Instance Method Summary

#==(d2) ⇒ Boolean

We have the same datasets if vectors and fields are the same.
#[](i) ⇒ Object

Returns the vector named i.
#[]=(i, v) ⇒ Object
#_case_as_array(c) ⇒ Object

#_case_as_hash(c) ⇒ Object

#add_case(v, uvd = true) ⇒ Object

Insert a case, given either as an Array (size equal to the number of vectors, with values in the same order as fields) or as a Hash (keys equal to fields). If uvd is false, #update_valid_data is not executed after inserting a case.
#add_case_array(v) ⇒ Object

Fast version of #add_case.
#add_vector(name, vector) ⇒ Object

Equal to Dataset[name]=vector.
#add_vectors_by_split(name, join = '-', sep = Statsample::SPLIT_TOKEN) ⇒ Object
#add_vectors_by_split_recode(name_, join = '-', sep = Statsample::SPLIT_TOKEN) ⇒ Object
#bootstrap(n = nil) ⇒ Statsample::Dataset

Creates a dataset of n randomly sampled cases. If n is not given, uses the original number of cases.
#case_as_array(i) ⇒ Object

Retrieves case i as an array, ordered by #fields.
#case_as_hash(i) ⇒ Object

Retrieves case i as a hash.
#check_fields(fields) ⇒ Object

Check if #fields attribute is correct, after inserting or deleting vectors.
#check_length ⇒ Object

Check vectors for type and size.
#check_order ⇒ Object

Check congruence between the fields attribute and the keys of vectors.
#clear_gsl ⇒ Object
#clone(*fields_to_include) ⇒ Statsample::Dataset

Returns a shallow copy of Dataset.
#clone_only_valid(*fields_to_include) ⇒ Statsample::Dataset

Returns (when possible) a cheap copy of dataset.
#col(c) ⇒ Statsample::Vector (also: #vector)

Returns vector c.
#collect(type = :numeric) ⇒ Object

Retrieves a Statsample::Vector, based on the result of calculation performed on each case.
#collect_matrix ⇒ ::Matrix

Generate a matrix, based on fields of dataset.
#collect_with_index(type = :numeric) ⇒ Object

Same as Statsample::Vector.collect, but yields the case index as the second block parameter.
#compute(text) ⇒ Object

Returns a vector based on a string containing a calculation on the dataset's vectors. The calculation is eval'ed, so you can use any variable or expression valid in Ruby. For example, with a=[1,2] and b=[3,4] as numeric vectors, ds.compute("a+b") returns Vector [4,6].
#correlation_matrix(fields = nil) ⇒ Object

Return a correlation matrix for fields included as parameters.
#covariance_matrix(fields = nil) ⇒ Object

Return a covariance matrix for fields included as parameters.
#crosstab(v1, v2, opts = {}) ⇒ Object
#delete_vector(*args) ⇒ Object

Delete vector named name.
#dup(*fields_to_include) ⇒ Statsample::Dataset

Returns a duplicate of the Dataset.
#dup_empty ⇒ Statsample::Dataset

Creates a copy of the given dataset, without data on vectors.
#dup_only_valid(*fields_to_include) ⇒ Object

Creates a copy of the given dataset, deleting all the cases with missing data on one of the vectors.
#each ⇒ Object

Returns each case as a hash.
#each_array ⇒ Object

Returns each case as an array.
#each_array_with_nils ⇒ Object

Returns each case as an array, coding missing values as nils.
#each_vector ⇒ Object

Retrieves each vector as [key, vector].
#each_with_index ⇒ Object

Returns each case as hash and index.
#filter ⇒ Object

Creates a new dataset with all cases for which the block returns true.
#filter_field(field) ⇒ Object

Creates a new vector with the data of a given field, for the cases where the block returns true.
#from_to(from, to) ⇒ Object

Returns an array with the fields from the first argument to the last argument.
#has_missing_data? ⇒ Boolean

Return true if any vector has missing data.
#has_vector?(v) ⇒ Boolean

Returns true if the dataset has vector v.
#initialize(vectors = {}, fields = []) ⇒ Dataset constructor

Creates a new dataset.
#inspect ⇒ Object
#join(other_ds, fields_1 = [], fields_2 = [], type = :left) ⇒ Statsample::Dataset

Joins two datasets by the given fields. type is one of :left or :inner; the default is :left.
#merge(other_ds) ⇒ Statsample::Dataset

Merges the vectors from two datasets. In case of a name collision, the vector names are changed to x_1, x_2, and so on.
#nest(*tree_keys, &block) ⇒ Object

Return a nested hash using fields as keys and an array constructed of hashes with other values.
#one_to_many(parent_fields, pattern) ⇒ Object

Creates a new dataset for one to many relations on a dataset, based on pattern of field names.
#recode!(vector_name) ⇒ Object

Recode a vector based on a block.
#report_building(b) ⇒ Object
#standarize ⇒ Statsample::Dataset

Returns a dataset with standardized data.
#to_gsl ⇒ Object
#to_matrix ⇒ Object

Return data as a matrix.
#to_multiset_by_split(*fields) ⇒ Object
#to_multiset_by_split_multiple_fields(*fields) ⇒ Object
#to_multiset_by_split_one_field(field) ⇒ Object

Creates a Statsample::Multiset, using one field.
#to_REXP ⇒ Object
#to_s ⇒ Object
#update_valid_data ⇒ Object

Check vectors and fields after inserting data.
#vector_by_calculation(type = :numeric) ⇒ Object
#vector_count_characters(fields = nil) ⇒ Object
#vector_mean(fields = nil, max_invalid = 0) ⇒ Object

Returns a vector with the mean for a set of fields. If the fields parameter is empty, returns the mean for all fields. If max_invalid > 0, returns the mean for all tuples with 0 to max_invalid invalid fields.
#vector_missing_values(fields = nil) ⇒ Object

Returns a vector with the numbers of missing values for a case.
#vector_sum(fields = nil) ⇒ Object

Returns a vector with the sum of the fields. If the fields parameter is empty, sums all fields.
#verify(*tests) ⇒ Object

Tests each row with one or more tests. Each test is a Proc of the form Proc.new {|row| row>0}. The function returns an array with all errors.

Methods included from Summarizable

#summary

Methods included from Writable

#save

Constructor Details

#initialize(vectors = {}, fields = []) ⇒ `Dataset`

Creates a new dataset. A dataset is a set of ordered named vectors of the same size.

vectors: With an array, creates a set of empty vectors named after the values in the array. With a hash, each Vector is assigned as a variable of the Dataset, named by its key.

fields: Array of names for vectors. Only used to set the order of variables. If empty, the vector keys in alphabetical order are used as fields.

# File 'lib/statsample/dataset.rb', line 158

def initialize(vectors={}, fields=[])
  @@n_dataset||=0
  @@n_dataset+=1
  @name=_("Dataset %d") % @@n_dataset
  @cases=0
  @gsl=nil
  @i=nil

  if vectors.instance_of? Array
    @fields=vectors.dup
    @vectors=vectors.inject({}){|a,x| a[x]=Statsample::Vector.new(); a}
  else
    # Check vectors
    @vectors=vectors
    @fields=fields
    check_order
    check_length
  end
end

Instance Attribute Details

#cases ⇒ `Object` (readonly)

Number of cases



# File 'lib/statsample/dataset.rb', line 69

def cases
  @cases
end

#fields ⇒ `Object`

Ordered ids of vectors



# File 'lib/statsample/dataset.rb', line 65

def fields
  @fields
end

#i ⇒ `Object` (readonly)

Location of pointer on enumerations methods (like #each)



# File 'lib/statsample/dataset.rb', line 71

def i
  @i
end

#name ⇒ `Object`

Name of dataset



# File 'lib/statsample/dataset.rb', line 67

def name
  @name
end

#vectors ⇒ `Object` (readonly)

Hash of Statsample::Vector



# File 'lib/statsample/dataset.rb', line 63

def vectors
  @vectors
end

Class Method Details

.crosstab_by_asignation(rows, columns, values) ⇒ `Object`

Generates a new dataset, using three vectors

Rows
Columns
Values

For example, you have these values

x   y   v
a   a   0
a   b   1
b   a   1
b   b   0

You obtain

id  a   b
 a  0   1
 b  1   0

Useful to process outputs from databases

# File 'lib/statsample/dataset.rb', line 92

def self.crosstab_by_asignation(rows,columns,values)
  raise "Three vectors should be equal size" if rows.size!=columns.size or rows.size!=values.size
  cols_values=columns.factors
  cols_n=cols_values.size
  h_rows=rows.factors.inject({}){|a,v| a[v]=cols_values.inject({}){
    |a1,v1| a1[v1]=nil; a1
    }
    ;a}
  values.each_index{|i|
    h_rows[rows[i]][columns[i]]=values[i]
  }
  ds=Dataset.new(["_id"]+cols_values)
  cols_values.each{|c|
    ds[c].type=values.type
  }
  rows.factors.each {|row|
    n_row=Array.new(cols_n+1)
    n_row[0]=row
      cols_values.each_index {|i|
        n_row[i+1]=h_rows[row][cols_values[i]]
    }
    ds.add_case_array(n_row)
  }
  ds.update_valid_data
  ds
end
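
The pivot described above can be sketched in plain Ruby, without Statsample. This is only an illustration under stated assumptions: the helper name pivot_by_assignment is hypothetical, and Arrays and Hashes stand in for Statsample::Vector and Dataset. It also keeps distinct values in order of first appearance, which may differ from how the library orders factors.

```ruby
# Hypothetical plain-Ruby sketch of the rows/columns/values pivot.
# Arrays stand in for Statsample::Vector; nothing here is library API.
def pivot_by_assignment(rows, columns, values)
  unless rows.size == columns.size && rows.size == values.size
    raise ArgumentError, "The three arrays should have equal size"
  end
  col_names = columns.uniq
  # One hash per distinct row id, pre-filled with nil for every column.
  table = rows.uniq.each_with_object({}) do |r, acc|
    acc[r] = col_names.each_with_object({}) { |c, h| h[c] = nil }
  end
  values.each_index { |i| table[rows[i]][columns[i]] = values[i] }
  # Emit one record per row id: the id first, then the cells in column order.
  table.map { |id, cells| [id] + col_names.map { |c| cells[c] } }
end

x = %w[a a b b]
y = %w[a b a b]
v = [0, 1, 1, 0]
pivoted = pivot_by_assignment(x, y, v)
# pivoted is [["a", 0, 1], ["b", 1, 0]]
```

Combinations that never occur in the input stay nil, which matches the pre-filled hash in the source listing.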

Instance Method Details

#==(d2) ⇒ `Boolean`

We have the same datasets if vectors and fields are the same

Returns:

(Boolean)



# File 'lib/statsample/dataset.rb', line 370

def ==(d2)
  @vectors==d2.vectors and @fields==d2.fields
end

#[](i) ⇒ `Object`

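Returns the vector named i. If i is a Range or an Array of field names, returns a clone of the dataset restricted to those fields.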

# File 'lib/statsample/dataset.rb', line 670

def[](i)
  if i.is_a? Range
    fields=from_to(i.begin,i.end)
    clone(*fields)
  elsif i.is_a? Array
    clone(i)
  else
    raise Exception,"Vector '#{i}' doesn't exists on dataset" unless @vectors.has_key?(i)
    @vectors[i]
  end
end

#[]=(i, v) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 709

def[]=(i,v)
  if v.instance_of? Statsample::Vector
    @vectors[i]=v
    check_order
  else
    raise ArgumentError,"Should pass a Statsample::Vector"
  end
end

#_case_as_array(c) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 598

def _case_as_array(c) # :nodoc:
  @fields.collect {|x| @vectors[x][c]}
end

#_case_as_hash(c) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 595

def _case_as_hash(c) # :nodoc:
  @fields.inject({}) {|a,x| a[x]=@vectors[x][c];a }
end

#add_case(v, uvd = true) ⇒ `Object`

Insert a case, using:

Array: size equal to number of vectors and values in the same order as fields
Hash: keys equal to fields

If uvd is false, #update_valid_data is not executed after inserting a case. This is useful for improving performance when inserting many cases, because #update_valid_data performs checks on the vectors and on the dataset.

# File 'lib/statsample/dataset.rb', line 424

def add_case(v,uvd=true)
  case v
  when Array
    if (v[0].is_a? Array)
      v.each{|subv| add_case(subv,false)}
    else
      raise ArgumentError, "Input array size (#{v.size}) should be equal to fields number (#{@fields.size})" if @fields.size!=v.size
      v.each_index {|i| @vectors[@fields[i]].add(v[i],false)}
    end
  when Hash
    raise ArgumentError, "Hash keys should be equal to fields #{(v.keys - @fields).join(",")}" if @fields.sort!=v.keys.sort
    @fields.each{|f| @vectors[f].add(v[f],false)}
  else
    raise TypeError, 'Value must be a Array or a Hash'
  end
  if uvd
    update_valid_data
  end
end

#add_case_array(v) ⇒ `Object`

Fast version of #add_case. Can only add one case, and no error check is performed. You SHOULD call #update_valid_data at the end of the insertion cycle.



# File 'lib/statsample/dataset.rb', line 413

def add_case_array(v)
  v.each_index {|i| d=@vectors[@fields[i]].data; d.push(v[i])}
end

#add_vector(name, vector) ⇒ `Object`

Equal to Dataset[name]=vector

Returns:

self

Raises:

(ArgumentError)

# File 'lib/statsample/dataset.rb', line 383

def add_vector(name, vector)
  raise ArgumentError, "Vector have different size" if vector.size!=@cases
  @vectors[name]=vector
  check_order
  self
end

#add_vectors_by_split(name, join = '-', sep = Statsample::SPLIT_TOKEN) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 473

def add_vectors_by_split(name,join='-',sep=Statsample::SPLIT_TOKEN)
  split=@vectors[name].split_by_separator(sep)
  split.each{|k,v|
    add_vector(name+join+k,v)
  }
end

#add_vectors_by_split_recode(name_, join = '-', sep = Statsample::SPLIT_TOKEN) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 463

def add_vectors_by_split_recode(name_,join='-',sep=Statsample::SPLIT_TOKEN)
  split=@vectors[name_].split_by_separator(sep)
  i=1
  split.each{|k,v|
    new_field=name_+join+i.to_s
    v.name=name_+":"+k
    add_vector(new_field,v)
    i+=1
  }
end

#bootstrap(n = nil) ⇒ `Statsample::Dataset`

Creates a dataset of n randomly sampled cases. If n is not given, uses the original number of cases.

Returns:

(Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 399

def bootstrap(n=nil)
  n||=@cases
  ds_boot=dup_empty
  n.times do
    ds_boot.add_case_array(case_as_array(rand(@cases))) # draw the index from all cases, not only the first n
  end
  ds_boot.update_valid_data
  ds_boot
end
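
Resampling with replacement can be sketched in plain Ruby as follows. This is a minimal illustration, not the library's code: bootstrap_cases and its rng keyword are hypothetical names, and an Array of rows stands in for the dataset. The sketch draws indices from the full range of cases regardless of n.

```ruby
# Hypothetical sketch: resample cases with replacement.
# `cases` is an Array of rows; `rng` is injectable for reproducibility.
def bootstrap_cases(cases, n = nil, rng: Random.new)
  n ||= cases.size
  # Each draw picks an index in 0...cases.size, so rows may repeat.
  Array.new(n) { cases[rng.rand(cases.size)] }
end

original = [[1, "a"], [2, "b"], [3, "c"]]
sample = bootstrap_cases(original, 5, rng: Random.new(42))
```

Passing a seeded Random makes a particular resample repeatable, which helps when comparing runs.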

#case_as_array(i) ⇒ `Object`

Retrieves case i as an array, ordered by #fields



# File 'lib/statsample/dataset.rb', line 586

def case_as_array(c) # :nodoc:
  Statsample::STATSAMPLE__.case_as_array(self,c)
end

#case_as_hash(i) ⇒ `Object`

Retrieves case i as a hash



# File 'lib/statsample/dataset.rb', line 575

def case_as_hash(c) # :nodoc:
  Statsample::STATSAMPLE__.case_as_hash(self,c)
end

#check_fields(fields) ⇒ `Object`

Check if #fields attribute is correct, after inserting or deleting vectors

# File 'lib/statsample/dataset.rb', line 502

def check_fields(fields)
  fields||=@fields
  raise "Fields #{(fields-@fields).join(", ")} doesn't exists on dataset" if (fields-@fields).size>0
  fields
end

#check_length ⇒ `Object`

Check vectors for type and size.

# File 'lib/statsample/dataset.rb', line 555

def check_length # :nodoc:
  size=nil
  @vectors.each do |k,v|
    raise Exception, "Data #{v.class} is not a vector on key #{k}" if !v.is_a? Statsample::Vector
    if size.nil?
      size=v.size
    else
      if v.size!=size
        raise Exception, "Vector #{k} have size #{v.size} and dataset have size #{size}"
      end
    end
  end
  @cases=size
end

#check_order ⇒ `Object`

Check congruence between the fields attribute and the keys of vectors

# File 'lib/statsample/dataset.rb', line 663

def check_order #:nodoc:
  if(@vectors.keys.sort!=@fields.sort)
    @fields=@fields&@vectors.keys
    @fields+=@vectors.keys.sort-@fields
  end
end
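
The reconciliation in the source above can be illustrated on plain Arrays. This is a sketch under assumptions: reconcile_fields is a hypothetical name, and the two arguments stand in for @fields and @vectors.keys.

```ruby
# Hypothetical sketch of the reconciliation done by check_order,
# using plain Arrays in place of @fields and @vectors.keys.
def reconcile_fields(fields, vector_keys)
  return fields if vector_keys.sort == fields.sort
  kept = fields & vector_keys        # drop fields with no vector, keep their order
  kept + (vector_keys.sort - kept)   # append the remaining keys alphabetically
end

reconciled = reconcile_fields(%w[v2 gone v1], %w[v1 v2 zeta alpha])
# reconciled is ["v2", "v1", "alpha", "zeta"]
```

The existing manual order survives, and only newly added vectors fall back to alphabetical order.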

#clear_gsl ⇒ `Object`



# File 'lib/statsample/dataset.rb', line 728

def clear_gsl
  @gsl=nil
end

#clone(*fields_to_include) ⇒ `Statsample::Dataset`

Returns a shallow copy of Dataset. Object id will be distinct, but @vectors will be the same.

Parameters:

array: fields to include. If none are given, all fields are included.

Returns:

(Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 257

def clone(*fields_to_include)
  if fields_to_include.size==1 and fields_to_include[0].is_a? Array
    fields_to_include=fields_to_include[0]
  end
  fields_to_include=@fields.dup if fields_to_include.size==0
  ds=Dataset.new
  fields_to_include.each{|f|
    raise "Vector #{f} doesn't exists" unless @vectors.has_key? f
    ds[f]=@vectors[f]
  }
  ds.fields=fields_to_include
  ds.name=@name
  ds.update_valid_data
  ds
end

#clone_only_valid(*fields_to_include) ⇒ `Statsample::Dataset`

Returns (when possible) a cheap copy of the dataset. If no vector has missing values, returns the original vectors. If missing values are present, uses Dataset#dup_only_valid.

Parameters:

array: fields to include. If none are given, all fields are included.

Returns:

(Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 242

def clone_only_valid(*fields_to_include)
  if fields_to_include.size==1 and fields_to_include[0].is_a? Array
    fields_to_include=fields_to_include[0]
  end
  fields_to_include=@fields.dup if fields_to_include.size==0
  if fields_to_include.any? {|v| @vectors[v].has_missing_data?}
    dup_only_valid(fields_to_include)
  else
    clone(fields_to_include)
  end
end

#col(c) ⇒ `Statsample::Vector` Also known as: vector

Returns vector c

Returns:

(Statsample::Vector)



# File 'lib/statsample/dataset.rb', line 376

def col(c)
  @vectors[c]
end

#collect(type = :numeric) ⇒ `Object`

Retrieves a Statsample::Vector, based on the result of calculation performed on each case.

# File 'lib/statsample/dataset.rb', line 683

def collect(type=:numeric)
  data=[]
  each {|row|
    data.push yield(row)
  }
  Statsample::Vector.new(data,type)
end

#collect_matrix ⇒ `::Matrix`

Generate a matrix, based on fields of dataset

Returns:

(::Matrix)

# File 'lib/statsample/dataset.rb', line 358

def collect_matrix
  rows=@fields.collect{|row|
    @fields.collect{|col|
      yield row,col
    }
  }
  Matrix.rows(rows)
end

#collect_with_index(type = :numeric) ⇒ `Object`

Same as Statsample::Vector.collect, but yields the case index as the second block parameter.

# File 'lib/statsample/dataset.rb', line 691

def collect_with_index(type=:numeric)
  data=[]
  each_with_index {|row, i|
    data.push(yield(row, i))
  }
  Statsample::Vector.new(data,type)
end

#compute(text) ⇒ `Object`

Returns a vector based on a string containing a calculation on the dataset's vectors. The calculation is eval'ed, so you can use any variable or expression valid in Ruby. For example:

a=[1,2].to_vector(:scale)
b=[3,4].to_vector(:scale)
ds={'a'=>a,'b'=>b}.to_dataset
ds.compute("a+b")
=> Vector [4,6]

# File 'lib/statsample/dataset.rb', line 870

def compute(text)
  @fields.each{|f|
    if @vectors[f].type==:numeric # compare the type; a single = would assign it
      text.gsub!(f,"row['#{f}'].to_f")
    else
      text.gsub!(f,"row['#{f}']")
    end
  }
  collect_with_index {|row, i|
    invalid=false
    @fields.each{|f|
      if @vectors[f].data_with_nils[i].nil?
        invalid=true
      end
    }
    if invalid
      nil
    else
      eval(text)
    end
  }
end
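
The rewrite-then-eval approach can be sketched on plain Hash rows. This is an illustration under assumptions, not the library's implementation: compute_rows is a hypothetical name, nil stands in for a missing value, and the word-boundary match is an addition of this sketch to avoid rewriting a field name that is a substring of another.

```ruby
# Hypothetical sketch of the rewrite-then-eval approach on Hash rows.
def compute_rows(rows, fields, expression)
  # Replace each bare field name with a lookup into the current row.
  code = fields.reduce(expression.dup) do |text, f|
    text.gsub(/\b#{Regexp.escape(f)}\b/, "row['#{f}'].to_f")
  end
  rows.map do |row|
    # A case with any missing (nil) value yields nil, as in the source.
    next nil if fields.any? { |f| row[f].nil? }
    eval(code) # evaluated with `row` in scope; never pass untrusted text
  end
end

rows = [{ "a" => 1, "b" => 3 }, { "a" => 2, "b" => 4 }, { "a" => nil, "b" => 5 }]
sums = compute_rows(rows, %w[a b], "a+b")
# sums is [4.0, 6.0, nil]
```

Because the text is passed to eval, the expression should only ever come from a trusted source.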

#correlation_matrix(fields = nil) ⇒ `Object`

Return a correlation matrix for fields included as parameters. By default, uses all fields of dataset

# File 'lib/statsample/dataset.rb', line 749

def correlation_matrix(fields = nil)
  if fields
    ds = clone(fields)
  else
    ds = self
  end
  Statsample::Bivariate.correlation_matrix(ds)
end

#covariance_matrix(fields = nil) ⇒ `Object`

Return a covariance matrix for fields included as parameters. By default, uses all fields of dataset

# File 'lib/statsample/dataset.rb', line 760

def covariance_matrix(fields = nil)
  if fields
    ds = clone(fields)
  else
    ds = self
  end
  Statsample::Bivariate.covariance_matrix(ds)
end

#crosstab(v1, v2, opts = {}) ⇒ `Object`



# File 'lib/statsample/dataset.rb', line 706

def crosstab(v1,v2,opts={})
  Statsample::Crosstab.new(@vectors[v1], @vectors[v2],opts)
end

#delete_vector(*args) ⇒ `Object`

Delete vector named name. Multiple fields accepted.

# File 'lib/statsample/dataset.rb', line 451

def delete_vector(*args)
  if args.size==1 and args[0].is_a? Array
    names=args[0]
  else
    names=args
  end
  names.each do |name|
    @fields.delete(name)
    @vectors.delete(name)
  end
end

#dup(*fields_to_include) ⇒ `Statsample::Dataset`

Returns a duplicate of the Dataset. All vectors are copied, so any modification of the new dataset doesn't affect the original dataset's vectors. If fields are given as parameters, only those vectors are included.

Parameters:

array: fields to include. If none are given, all fields are included.

Returns:

(Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 211

def dup(*fields_to_include)
  if fields_to_include.size==1 and fields_to_include[0].is_a? Array
    fields_to_include=fields_to_include[0]
  end
  fields_to_include=@fields if fields_to_include.size==0
  vectors={}
  fields=[]
  fields_to_include.each{|f|
    raise "Vector #{f} doesn't exists" unless @vectors.has_key? f
    vectors[f]=@vectors[f].dup
    fields.push(f)
  }
  ds=Dataset.new(vectors,fields)
  ds.name= self.name
  ds
end

#dup_empty ⇒ `Statsample::Dataset`

Creates a copy of the given dataset, without data on vectors

Returns:

(Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 275

def dup_empty
  vectors=@vectors.inject({}) {|a,v|
    a[v[0]]=v[1].dup_empty
    a
  }
  Dataset.new(vectors,@fields.dup)
end

#dup_only_valid(*fields_to_include) ⇒ `Object`

Creates a copy of the given dataset, deleting all the cases with missing data on one of the vectors.

Parameters:

array: fields to include. If none are given, all fields are included.

# File 'lib/statsample/dataset.rb', line 183

def dup_only_valid(*fields_to_include)
  if fields_to_include.size==1 and fields_to_include[0].is_a? Array
    fields_to_include=fields_to_include[0]
  end
  fields_to_include=@fields if fields_to_include.size==0
  if fields_to_include.any? {|f| @vectors[f].has_missing_data?}
    ds=Dataset.new(fields_to_include)
    fields_to_include.each {|f| ds[f].type=@vectors[f].type}
    each {|row|
      unless fields_to_include.any? {|f| @vectors[f].has_missing_data? and !@vectors[f].is_valid? row[f]}
        row_2=fields_to_include.inject({}) {|ac,v| ac[v]=row[v]; ac}
        ds.add_case(row_2)
      end
    }
  else
    ds=dup fields_to_include
  end
  ds.name= self.name
  ds
end
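
The effect of dropping every case with missing data (listwise deletion) can be sketched on Hash rows. This is a simplified illustration: only_valid_rows is a hypothetical name, and nil is assumed to be the only missing value, whereas a Statsample::Vector decides validity itself.

```ruby
# Hypothetical sketch of listwise deletion: keep only the rows whose
# selected fields are all non-nil. nil stands in for "missing" here.
def only_valid_rows(rows, fields)
  rows.reject { |row| fields.any? { |f| row[f].nil? } }
      .map { |row| row.slice(*fields) }
end

rows = [
  { "a" => 1,   "b" => 2, "c" => nil },
  { "a" => nil, "b" => 5, "c" => 6 },
  { "a" => 7,   "b" => 8, "c" => 9 }
]
valid_ab = only_valid_rows(rows, %w[a b])
# valid_ab is [{"a"=>1, "b"=>2}, {"a"=>7, "b"=>8}]
```

Note that the first row survives because its missing value is in c, which is not among the requested fields.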

#each ⇒ `Object`

Returns each case as a hash

# File 'lib/statsample/dataset.rb', line 603

def each
  begin
    @i=0
    @cases.times {|i|
      @i=i
      row=case_as_hash(i)
      yield row
    }
    @i=nil
  rescue =>e
    raise DatasetException.new(self, e)
  end
end

#each_array ⇒ `Object`

Returns each case as an array

# File 'lib/statsample/dataset.rb', line 647

def each_array
  @cases.times {|i|
    @i=i
    row=case_as_array(i)
    yield row
  }
  @i=nil
end

#each_array_with_nils ⇒ `Object`

Returns each case as an array, coding missing values as nils

# File 'lib/statsample/dataset.rb', line 633

def each_array_with_nils
  m=fields.size
  @cases.times {|i|
    @i=i
    row=Array.new(m)
    fields.each_index{|j|
      f=fields[j]
      row[j]=@vectors[f].data_with_nils[i]
    }
    yield row
  }
  @i=nil
end

#each_vector ⇒ `Object`

Retrieves each vector as [key, vector]



# File 'lib/statsample/dataset.rb', line 570

def each_vector # :yield: |key, vector|
  @fields.each{|k| yield k, @vectors[k]}
end

#each_with_index ⇒ `Object`

Returns each case as hash and index

# File 'lib/statsample/dataset.rb', line 618

def each_with_index # :yield: |case, i|
  begin
    @i=0
    @cases.times{|i|
      @i=i
      row=case_as_hash(i)
      yield row, i
    }
    @i=nil
  rescue =>e
    raise DatasetException.new(self, e)
  end
end

#filter ⇒ `Object`

Creates a new dataset with all cases for which the block returns true

# File 'lib/statsample/dataset.rb', line 770

def filter
  ds=self.dup_empty
  each {|c|
    ds.add_case(c, false) if yield c
  }
  ds.update_valid_data
  ds.name=_("%s(filtered)") % @name
  ds
end

#filter_field(field) ⇒ `Object`

Creates a new vector with the data of a given field, for the cases where the block returns true

# File 'lib/statsample/dataset.rb', line 781

def filter_field(field)
  a=[]
  each do |c|
    a.push(c[field]) if yield c
  end
  a.to_vector(@vectors[field].type)
end

#from_to(from, to) ⇒ `Object`

Returns an array with the fields from the first argument to the last argument

Raises:

(ArgumentError)

# File 'lib/statsample/dataset.rb', line 230

def from_to(from,to)
  raise ArgumentError, "Field #{from} should be on dataset" if !@fields.include? from
  raise ArgumentError, "Field #{to} should be on dataset" if !@fields.include? to
  @fields.slice(@fields.index(from)..@fields.index(to))
end
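
The slice by field names can be illustrated on a plain Array. This is a sketch only; fields_between is a hypothetical stand-in that mirrors the source above.

```ruby
# Hypothetical stand-in for from_to on a plain Array of field names.
def fields_between(fields, from, to)
  [from, to].each do |f|
    raise ArgumentError, "Field #{f} should be on dataset" unless fields.include?(f)
  end
  # Inclusive slice from the position of `from` to the position of `to`.
  fields[fields.index(from)..fields.index(to)]
end

selected = fields_between(%w[id name age city], "name", "age")
# selected is ["name", "age"]
```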

#has_missing_data? ⇒ `Boolean`

Return true if any vector has missing data

Returns:

(Boolean)



# File 'lib/statsample/dataset.rb', line 119

def has_missing_data?
  @vectors.any? {|k,v| v.has_missing_data?}
end

#has_vector?(v) ⇒ `Boolean`

Returns true if the dataset has vector v.

Returns:

(Boolean)



# File 'lib/statsample/dataset.rb', line 392

def has_vector? (v)
  return @vectors.has_key?(v)
end

#inspect ⇒ `Object`



# File 'lib/statsample/dataset.rb', line 922

def inspect
  self.to_s
end

#join(other_ds, fields_1 = [], fields_2 = [], type = :left) ⇒ `Statsample::Dataset`

Joins two datasets by the given fields. type is one of :left or :inner; the default is :left

Returns:

(Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 308

def join(other_ds,fields_1=[],fields_2=[],type=:left)
  fields_new = other_ds.fields - fields_2
  fields = self.fields + fields_new

  other_ds_hash = {}
  other_ds.each do |row|
    key = row.select{|k,v| fields_2.include?(k)}.values
    value = row.select{|k,v| fields_new.include?(k)}
    if other_ds_hash[key].nil?
      other_ds_hash[key] = [value]
    else
      other_ds_hash[key] << value
    end
  end

  new_ds = Dataset.new(fields)

  self.each do |row|
    key = row.select{|k,v| fields_1.include?(k)}.values

    new_case = row.dup

    if other_ds_hash[key].nil?
      if type == :left
        fields_new.each{|field| new_case[field] = nil}
        new_ds.add_case(new_case)
      end
    else
      other_ds_hash[key].each do |new_values|
        new_ds.add_case new_case.merge(new_values)
      end
    end

  end
  new_ds
end
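
The difference between :left and :inner can be sketched on Arrays of Hashes. This is an illustration under assumptions: join_rows is a hypothetical name, and it builds the lookup key with values_at, so the key follows the order of the key lists rather than the order of fields within each row.

```ruby
# Hypothetical sketch of the :left / :inner join over Arrays of Hashes.
def join_rows(left, right, keys_left, keys_right, type = :left)
  new_fields = right.flat_map(&:keys).uniq - keys_right
  # Index the right-hand rows by their key values.
  index = Hash.new { |h, k| h[k] = [] }
  right.each do |row|
    index[row.values_at(*keys_right)] << row.reject { |k, _| keys_right.include?(k) }
  end
  left.each_with_object([]) do |row, out|
    key = row.values_at(*keys_left)
    if index.key?(key)
      index[key].each { |extra| out << row.merge(extra) }
    elsif type == :left
      # Unmatched rows survive a left join with nil in the new fields.
      out << row.merge(new_fields.to_h { |f| [f, nil] })
    end
  end
end

people = [{ "id" => 1, "name" => "ana" }, { "id" => 2, "name" => "bob" }]
cities = [{ "id" => 1, "city" => "Lima" }]
left_join  = join_rows(people, cities, ["id"], ["id"])
inner_join = join_rows(people, cities, ["id"], ["id"], :inner)
```

Here left_join keeps both people, with a nil city for the unmatched one, while inner_join keeps only the matched row.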

#merge(other_ds) ⇒ `Statsample::Dataset`

Merges the vectors from two datasets. In case of a name collision, the vector names are changed to x_1, x_2, and so on.

Returns:

(Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 287

def merge(other_ds)
  raise "Cases should be equal (this:#{@cases}; other:#{other_ds.cases}" unless @cases==other_ds.cases
  types = @fields.collect{|f| @vectors[f].type} + other_ds.fields.collect{|f| other_ds[f].type}
  new_fields = (@fields+other_ds.fields).recode_repeated
  ds_new=Statsample::Dataset.new(new_fields)
  new_fields.each_index{|i|
    field=new_fields[i]
    ds_new[field].type=types[i]
  }
  @cases.times {|i|
    row=case_as_array(i)+other_ds.case_as_array(i)
    ds_new.add_case_array(row)
  }
  ds_new.update_valid_data
  ds_new
end
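
The renaming on name collision can be sketched as below. rename_repeated is a hypothetical helper written for this illustration; it is an assumption about the behavior described above (repeated names get _1, _2, ... suffixes, unique names are left alone), not the library's recode_repeated itself.

```ruby
# Hypothetical helper: suffix repeated names with _1, _2, ... and leave
# unique names untouched. An assumption about the renaming, not library code.
def rename_repeated(names)
  counts = names.tally
  seen = Hash.new(0)
  names.map do |name|
    next name if counts[name] == 1
    seen[name] += 1
    "#{name}_#{seen[name]}"
  end
end

merged_fields = rename_repeated(%w[id age] + %w[id score])
# merged_fields is ["id_1", "age", "id_2", "score"]
```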

#nest(*tree_keys, &block) ⇒ `Object`

Returns a nested hash using the given fields as keys, with an array of hashes holding the other values at the deepest level. If a block is provided, it is used to compute the stored values; it receives the dataset row, the current innermost hash in the hierarchy, and the name of the key to include.

# File 'lib/statsample/dataset.rb', line 128

def nest(*tree_keys,&block)
  tree_keys=tree_keys[0] if tree_keys[0].is_a? Array
  out=Hash.new
  each do |row|
    current=out
    # Create tree
    tree_keys[0,tree_keys.size-1].each do |f|
      root=row[f]
      current[root]||=Hash.new
      current=current[root]
    end
    name=row[tree_keys.last]
    if !block
      current[name]||=Array.new
      current[name].push(row.delete_if{|key,value| tree_keys.include? key})
    else
      current[name]=block.call(row, current,name)
    end
  end
  out
end
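
The nesting without a block can be sketched on Hash rows. This is a plain-Ruby illustration under assumptions: nest_rows is a hypothetical name and the sample data is invented for the example.

```ruby
# Hypothetical sketch of nesting rows (Hashes) by a list of keys.
def nest_rows(rows, *tree_keys)
  rows.each_with_object({}) do |row, out|
    *branches, leaf = tree_keys
    # Walk (and create) one hash level per branch key.
    current = branches.reduce(out) { |node, key| node[row[key]] ||= {} }
    # Store the remaining values under the last key.
    rest = row.reject { |k, _| tree_keys.include?(k) }
    (current[row[leaf]] ||= []) << rest
  end
end

rows = [
  { "country" => "cl", "city" => "scl", "n" => 1 },
  { "country" => "cl", "city" => "scl", "n" => 2 },
  { "country" => "pe", "city" => "lim", "n" => 3 }
]
nested = nest_rows(rows, "country", "city")
# nested is {"cl"=>{"scl"=>[{"n"=>1}, {"n"=>2}]}, "pe"=>{"lim"=>[{"n"=>3}]}}
```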

#one_to_many(parent_fields, pattern) ⇒ `Object`

Creates a new dataset for one to many relations on a dataset, based on pattern of field names.

For example, suppose you have a survey on number of children with this structure:

id, name, child_name_1, child_age_1, child_name_2, child_age_2

With

ds.one_to_many(%w{id}, "child_%v_%n")

the fields in the first parameter are copied verbatim to the new dataset, and for the fields that match the pattern in the second parameter, one case is added for each different %n. For example:

cases=[
  ['1','george','red',10,'blue',20,nil,nil],
  ['2','fred','green',15,'orange',30,'white',20],
  ['3','alfred',nil,nil,nil,nil,nil,nil]
]
ds=Statsample::Dataset.new(%w{id name car_color1 car_value1 car_color2 car_value2 car_color3 car_value3})
cases.each {|c| ds.add_case_array c }
ds.one_to_many(['id'],'car_%v%n').to_matrix
=> Matrix[
   ["red", "1", 10],
   ["blue", "1", 20],
   ["green", "2", 15],
   ["orange", "2", 30],
   ["white", "2", 20]
   ]

# File 'lib/statsample/dataset.rb', line 953

def one_to_many(parent_fields, pattern)
  #base_pattern=pattern.gsub(/%v|%n/,"")
  re=Regexp.new pattern.gsub("%v","(.+?)").gsub("%n","(\\d+?)")
  ds_vars=parent_fields
  vars=[]
  max_n=0
  h=parent_fields.inject({}) {|a,v| a[v]=Statsample::Vector.new([], @vectors[v].type);a }
  # Adding _row_id
  h['_col_id']=[].to_numeric
  ds_vars.push("_col_id")
  @fields.each do |f|
    if f=~re
      if !vars.include? $1
        vars.push($1)
        h[$1]=Statsample::Vector.new([], @vectors[f].type)
      end
      max_n=$2.to_i if max_n < $2.to_i
    end
  end
  ds=Dataset.new(h,ds_vars+vars)
  each do |row|
    row_out={}
    parent_fields.each do |f|
      row_out[f]=row[f]
    end
    max_n.times do |n1|
      n=n1+1
      any_data=false
      vars.each do |v|
        data=row[pattern.gsub("%v",v.to_s).gsub("%n",n.to_s)]
        row_out[v]=data
        any_data=true if !data.nil?
      end
      if any_data
        row_out["_col_id"]=n
        ds.add_case(row_out,false)
      end

    end
  end
  ds.update_valid_data
  ds
end
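
How a pattern such as "car_%v%n" selects fields can be seen in isolation. The sketch below reuses the same gsub substitutions as the source above on an illustrative list of field names; it does not build a dataset.

```ruby
# Sketch of how a pattern such as "car_%v%n" is turned into a Regexp,
# mirroring the gsub calls in the source above. Field names are illustrative.
pattern = "car_%v%n"
re = Regexp.new(pattern.gsub("%v", "(.+?)").gsub("%n", "(\\d+?)"))

fields = %w[id name car_color1 car_value1 car_color2 car_value2]
vars = []
max_n = 0
fields.each do |f|
  m = re.match(f)
  next unless m
  vars << m[1] unless vars.include?(m[1])   # variable name captured by %v
  max_n = [max_n, m[2].to_i].max            # highest index captured by %n
end
# vars is ["color", "value"], max_n is 2
```

Because the expression is unanchored and both groups are lazy, a two-digit suffix such as 10 would be captured as 1 in this sketch, so it is worth checking field names with more than nine repetitions.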

#recode!(vector_name) ⇒ `Object`

Recode a vector based on a block

# File 'lib/statsample/dataset.rb', line 699

def recode!(vector_name)
  0.upto(@cases-1) {|i|
    @vectors[vector_name].data[i]=yield case_as_hash(i)
  }
  @vectors[vector_name].set_valid_data
end

#report_building(b) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 996

def report_building(b)
  b.section(:name=>@name) do |g|
    g.text _"Cases: %d"  % cases
    @fields.each do |f|
      g.text "Element:[#{f}]"
      g.parse_element(@vectors[f])
    end
  end
end

#standarize ⇒ `Statsample::Dataset`

Returns a dataset with standardized data.

Returns:

(Statsample::Dataset)

# File 'lib/statsample/dataset.rb', line 347

def standarize
  ds=dup()
  ds.fields.each do |f|
    ds[f]=ds[f].vector_standarized
  end
  ds
end

#to_gsl ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 732

def to_gsl
  if @gsl.nil?
    if cases.nil?
      update_valid_data
    end
    @gsl=GSL::Matrix.alloc(cases,fields.size)
    self.each_array{|c|
      @gsl.set_row(@i,c)
    }
  end
  @gsl
end

#to_matrix ⇒ `Object`

Return data as a matrix. Columns are ordered by #fields and rows by order of insertion

# File 'lib/statsample/dataset.rb', line 719

def to_matrix
  rows=[]
  self.each_array{|c|
    rows.push(c)
  }
  Matrix.rows(rows)
end

#to_multiset_by_split(*fields) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 793

def to_multiset_by_split(*fields)
  require 'statsample/multiset'
  if fields.size==1
    to_multiset_by_split_one_field(fields[0])
  else
    to_multiset_by_split_multiple_fields(*fields)
  end
end

#to_multiset_by_split_multiple_fields(*fields) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 824

def to_multiset_by_split_multiple_fields(*fields)
  factors_total=nil
  fields.each do |f|
    if factors_total.nil?
      factors_total=@vectors[f].factors.collect{|c|
        [c]
      }
    else
      suma=[]
      factors=@vectors[f].factors
      factors_total.each{|f1| factors.each{|f2| suma.push(f1+[f2]) } }
      factors_total=suma
    end
  end
  ms=Multiset.new_empty_vectors(@fields,factors_total)

  p1=eval "Proc.new {|c| ms[["+fields.collect{|f| "c['#{f}']"}.join(",")+"]].add_case(c,false) }"
  each{|c| p1.call(c)}

  ms.datasets.each do |k,ds|
    ds.update_valid_data
    ds.name=fields.size.times.map {|i|
      f=fields[i]
      sk=k[i]
      @vectors[f].labeling(sk)
    }.join("-")
    ds.vectors.each{|k1,v1|
      v1.type=@vectors[k1].type
      v1.name=@vectors[k1].name
      v1.labels=@vectors[k1].labels

    }
  end
  ms

end
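
The first loop in the listing above builds every combination of the factors of the split fields, and each case is then routed to the dataset keyed by its own combination. A minimal sketch of the same idea in plain Ruby follows. It is not the library's implementation: `factor_combinations` and `split_by_fields` are hypothetical helpers, a dataset is modeled as a Hash of Arrays, the factors are taken to be the distinct non-nil values in order of appearance, and the key is built with `map` instead of an `eval`-generated Proc.

```ruby
# Every combination of the distinct values of the split fields.
# The first field varies slowest, as in the nested loops of the listing.
def factor_combinations(columns, fields)
  first, *rest = fields.map { |f| columns[f].compact.uniq }
  first.product(*rest)
end

# Route each case (a Hash of field => value) to the group of its combination.
def split_by_fields(columns, fields)
  groups = factor_combinations(columns, fields)
           .each_with_object({}) { |key, h| h[key] = [] }
  columns.values.first.size.times do |i|
    row = columns.transform_values { |v| v[i] }
    group = groups[fields.map { |f| row[f] }]
    # Rows whose key is not a known combination are skipped in this sketch.
    group << row if group
  end
  groups
end

columns = {
  'sex'   => %w[m f m f],
  'group' => %w[a a b a],
  'score' => [1, 2, 3, 4]
}
groups = split_by_fields(columns, %w[sex group])
# groups.keys => [["m", "a"], ["m", "b"], ["f", "a"], ["f", "b"]]
# groups[["f", "a"]].map { |r| r['score'] } => [2, 4]
# groups[["f", "b"]] => []
```

Note that a combination with no cases still gets an (empty) group, which mirrors how the listing creates one dataset per combination before any case is added.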

#to_multiset_by_split_one_field(field) ⇒ `Object`

Creates a Statsample::Multiset, using one field

Raises:

(ArgumentError)

# File 'lib/statsample/dataset.rb', line 803

def to_multiset_by_split_one_field(field)
  raise ArgumentError,"Should use a correct field name" if !@fields.include? field
  factors=@vectors[field].factors
  ms=Multiset.new_empty_vectors(@fields, factors)
  each {|c|
    ms[c[field]].add_case(c,false)
  }
  #puts "Ingreso a los dataset"
  ms.datasets.each {|k,ds|
    ds.update_valid_data
    ds.name=@vectors[field].labeling(k)
    ds.vectors.each{|k1,v1|
      #        puts "Vector #{k1}:"+v1.to_s
      v1.type=@vectors[k1].type
      v1.name=@vectors[k1].name
      v1.labels=@vectors[k1].labels

    }
  }
  ms
end

#to_REXP ⇒ `Object`

# File 'lib/statsample/rserve_extension.rb', line 11

def to_REXP
  names=@fields
  data=@fields.map {|f|
    Rserve::REXP::Wrapper.wrap(@vectors[f].data_with_nils)
  }
  l=Rserve::Rlist.new(data,names)
  Rserve::REXP.create_data_frame(l)
end

#to_s ⇒ `Object`



# File 'lib/statsample/dataset.rb', line 919

def to_s
  "#<"+self.class.to_s+":"+self.object_id.to_s+" @name=#{@name} @fields=["+@fields.join(",")+"] cases="+@vectors[@fields[0]].size.to_s
end

#update_valid_data ⇒ `Object`

Checks vectors and fields after inserting data. Use only after #add_case_array, or after #add_case with its second parameter set to false.

# File 'lib/statsample/dataset.rb', line 445

def update_valid_data
  @gsl=nil
  @fields.each{|f| @vectors[f].set_valid_data}
  check_length
end

#vector_by_calculation(type = :numeric) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 480

def vector_by_calculation(type=:numeric)
  a=[]
  each do |row|
    a.push(yield(row))
  end
  a.to_vector(type)
end

#vector_count_characters(fields = nil) ⇒ `Object`

# File 'lib/statsample/dataset.rb', line 517

def vector_count_characters(fields=nil)
  fields=check_fields(fields)
  collect_with_index do |row, i|
    fields.inject(0){|a,v|
      a+((@vectors[v].data_with_nils[i].nil?) ? 0: row[v].to_s.size)
    }
  end
end

#vector_mean(fields = nil, max_invalid = 0) ⇒ `Object`

Returns a vector with the mean of a set of fields for each case. If the fields parameter is empty, returns the mean over all fields. If max_invalid is greater than 0, the mean is computed for every case with between 0 and max_invalid invalid fields; cases with more invalid fields yield nil.

# File 'lib/statsample/dataset.rb', line 529

def vector_mean(fields=nil, max_invalid=0)
  a=[]
  fields=check_fields(fields)
  size=fields.size
  each_with_index do |row, i |
    # count of invalid values
    sum=0
    invalids=0
    fields.each{|f|
      if !@vectors[f].data_with_nils[i].nil?
        sum+=row[f].to_f
      else
        invalids+=1
      end
    }
    if(invalids>max_invalid)
      a.push(nil)
    else
      a.push(sum.quo(size-invalids))
    end
  end
  a=a.to_vector(:numeric)
  a.name=_("Means from %s") % @name
  a
end
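
The effect of max_invalid is easiest to see on a small example. The following is a minimal sketch in plain Ruby, not the library's code: `row_means` is a hypothetical helper and a dataset is modeled as a Hash of field name to Array, with nil marking a missing value.

```ruby
# Mean of the valid values in each case; nil when a case has more
# than max_invalid missing values.
def row_means(columns, fields, max_invalid = 0)
  Array.new(columns[fields.first].size) do |i|
    values   = fields.map { |f| columns[f][i] }
    valid    = values.compact
    invalids = values.size - valid.size
    invalids > max_invalid ? nil : valid.sum.to_f / valid.size
  end
end

columns = { 'a' => [1, 2, nil], 'b' => [3, nil, nil], 'c' => [5, 6, 9] }
strict   = row_means(columns, %w[a b c])     # => [3.0, nil, nil]
tolerant = row_means(columns, %w[a b c], 1)  # => [3.0, 4.0, nil]
```

As in the listing, the divisor is the number of valid values in the case, not the number of fields.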

#vector_missing_values(fields = nil) ⇒ `Object`

Returns a vector with the number of missing values for each case.

# File 'lib/statsample/dataset.rb', line 509

def vector_missing_values(fields=nil)
  fields=check_fields(fields)
  collect_with_index do |row, i|
    fields.inject(0) {|a,v|
      a+ ((@vectors[v].data_with_nils[i].nil?) ? 1: 0)
    }
  end
end
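
A minimal sketch of the same per-case count in plain Ruby, assuming a dataset modeled as a Hash of Arrays with nil for missing values. `missing_counts` is a hypothetical helper, not part of Statsample.

```ruby
# Number of missing (nil) values per case, over the given fields.
def missing_counts(columns, fields = columns.keys)
  Array.new(columns[fields.first].size) do |i|
    fields.count { |f| columns[f][i].nil? }
  end
end

columns = { 'a' => [1, 2, nil], 'b' => [3, nil, nil], 'c' => [5, 6, 9] }
all_fields = missing_counts(columns)           # => [0, 1, 2]
two_fields = missing_counts(columns, %w[a c])  # => [0, 0, 1]
```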

#vector_sum(fields = nil) ⇒ `Object`

Returns a vector with the sum of the given fields for each case. If the fields parameter is empty, sums all fields. A case with a missing value in any of the fields yields nil.

# File 'lib/statsample/dataset.rb', line 489

def vector_sum(fields=nil)
  fields||=@fields
  vector=collect_with_index do |row, i|
    if(fields.find{|f| !@vectors[f].data_with_nils[i]})
      nil
    else
      fields.inject(0) {|ac,v| ac + row[v].to_f}
    end
  end
  vector.name=_("Sum from %s") % @name
  vector
end
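
Unlike the mean, the sum in the listing above tolerates no missing values. A minimal sketch of that rule in plain Ruby follows; `row_sums` is a hypothetical helper and the Hash-of-Arrays dataset is an assumption of the sketch, which also treats only nil as missing.

```ruby
# Sum of the given fields per case; nil as soon as one value is missing.
def row_sums(columns, fields = columns.keys)
  Array.new(columns[fields.first].size) do |i|
    values = fields.map { |f| columns[f][i] }
    values.any?(&:nil?) ? nil : values.sum(0.0)
  end
end

sums = row_sums({ 'a' => [1, 2, nil], 'b' => [3, 4, 5] })
# sums => [4.0, 6.0, nil]
```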

#verify(*tests) ⇒ `Object`

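Tests each row against one or more tests. As the source below shows, each test is a three-element array: an error message, an array of field names whose values are appended to the message, and a Proc of the form

Proc.new {|row| row['age']>0}

If the first argument is a String, it names the field used to identify rows in the messages; otherwise the first field is used. Returns an array with all the errors found.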

# File 'lib/statsample/dataset.rb', line 896

def verify(*tests)
  if(tests[0].is_a? String)
    id=tests[0]
    tests.shift
  else
    id=@fields[0]
  end
  vr=[]
  i=0
  each do |row|
    i+=1
    tests.each{|test|
      if ! test[2].call(row)
        values=""
        if test[1].size>0
          values=" ("+test[1].collect{|k| "#{k}=#{row[k]}"}.join(", ")+")"
        end
        vr.push("#{i} [#{row[id]}]: #{test[0]}#{values}")
      end
    }
  end
  vr
end
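
To make the shape of a test and of the resulting messages concrete, here is a minimal sketch in plain Ruby that reproduces the message format of the listing above. It is not the library's method: `verify_rows` is a hypothetical helper, rows are plain Hashes, and the id field is passed explicitly.

```ruby
# Each test is [message, fields_to_report, check], where check is a Proc
# that receives the row and returns a truthy value when the row is valid.
def verify_rows(rows, id_field, *tests)
  errors = []
  rows.each_with_index do |row, index|
    tests.each do |message, fields, check|
      next if check.call(row)
      values = fields.empty? ? '' : " (#{fields.map { |k| "#{k}=#{row[k]}" }.join(', ')})"
      errors << "#{index + 1} [#{row[id_field]}]: #{message}#{values}"
    end
  end
  errors
end

rows = [
  { 'id' => 'p1', 'age' => 30 },
  { 'id' => 'p2', 'age' => -4 }
]
positive_age = ['age must be positive', ['age'], proc { |row| row['age'] > 0 }]
errors = verify_rows(rows, 'id', positive_age)
# errors => ["2 [p2]: age must be positive (age=-4)"]
```

Row numbers in the messages are 1-based, matching the counter in the listing.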
