Class: Statsample::Vector

Inherits:
Object
Includes:
Enumerable, Summarizable, VectorShorthands, Writable
Defined in:
lib/statsample/vector.rb,
lib/statsample/vector/gsl.rb,
lib/statsample/rserve_extension.rb

Overview

A one-dimensional collection of values. It works like a column in a spreadsheet.

Usage

The quickest way to create a vector is with Array#to_vector or Array#to_numeric.

v=[1,2,3,4].to_vector(:numeric)
v=[1,2,3,4].to_numeric

Defined Under Namespace

Modules: GSL_

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from VectorShorthands

#to_numeric, #to_scale, #to_vector

Methods included from Summarizable

#summary

Methods included from Writable

#save

Constructor Details

#initialize(data = [], type = :object, opts = Hash.new) ⇒ Vector

Creates a new Vector object.

  • data Any data that can be converted to an Array

  • type Level of measurement. See Vector#type

  • opts Hash of options

    • :missing_values Array of missing values. See Vector#missing_values

    • :today_values Array of ‘today’ values. See Vector#today_values

    • :labels Labels for data values

    • :name Name of vector



# File 'lib/statsample/vector.rb', line 80

def initialize(data=[], type=:object, opts=Hash.new)
  if type == :ordinal or type == :scale
    $stderr.puts "WARNING: #{type} has been deprecated. Use :numeric instead."
    type = :numeric
  end

  if type == :nominal
    $stderr.puts "WARNING: nominal has been deprecated. Use :object instead."
    type = :object
  end

  @data=data.is_a?(Array) ? data : data.to_a
  @type=type
  opts_default={
    :missing_values=>[],
    :today_values=>['NOW','TODAY', :NOW, :TODAY],
    :labels=>{},
    :name=>nil
  }
  @opts=opts_default.merge(opts)
  if  @opts[:name].nil?
    @@n_table||=0
    @@n_table+=1
    @opts[:name]="Vector #{@@n_table}"
  end
  @missing_values=@opts[:missing_values]
  @labels=@opts[:labels]
  @today_values=@opts[:today_values]
  @name=@opts[:name]
  @valid_data=[]
  @data_with_nils=[]
  @date_data_with_nils=[]
  @missing_data=[]
  @has_missing_data=nil
  @numeric_data=nil
  set_valid_data
  self.type=type
end

Instance Attribute Details

#data ⇒ Object (readonly)

Original data.



# File 'lib/statsample/vector.rb', line 54

def data
  @data
end

#data_with_nils ⇒ Object (readonly)

Original data, with all missing values replaced by nils



# File 'lib/statsample/vector.rb', line 64

def data_with_nils
  @data_with_nils
end

#date_data_with_nils ⇒ Object (readonly)

Date data, with all missing values replaced by nils



# File 'lib/statsample/vector.rb', line 66

def date_data_with_nils
  @date_data_with_nils
end

#labels ⇒ Object

Labels assigned to specific values



# File 'lib/statsample/vector.rb', line 68

def labels
  @labels
end

#missing_data ⇒ Object (readonly)

Missing values array



# File 'lib/statsample/vector.rb', line 62

def missing_data
  @missing_data
end

#missing_values ⇒ Object

Array of values considered missing. nil is always treated as a missing value.



# File 'lib/statsample/vector.rb', line 58

def missing_values
  @missing_values
end

#name ⇒ Object

Name of the vector. Many classes use it when producing output.



# File 'lib/statsample/vector.rb', line 70

def name
  @name
end

#today_values ⇒ Object

Array of values treated as “today” in date-type vectors. By default, “NOW”, “TODAY”, :NOW and :TODAY are ‘today’ values.



# File 'lib/statsample/vector.rb', line 60

def today_values
  @today_values
end

#type ⇒ Object

Level of measurement. Can be :object or :numeric.



# File 'lib/statsample/vector.rb', line 52

def type
  @type
end

#valid_data ⇒ Object (readonly)

Valid data. Equal to data, minus the values designated as missing.



# File 'lib/statsample/vector.rb', line 56

def valid_data
  @valid_data
end

Class Method Details

.[](*args) ⇒ Object

Create a vector using (almost) any object

  • Array: flattened

  • Range: transformed using to_a

  • Statsample::Vector

  • Numeric and string values



# File 'lib/statsample/vector.rb', line 123

def self.[](*args)
  values=[]
  args.each do |a|
    case a
    when Array
      values.concat a.flatten
    when Statsample::Vector
      values.concat a.to_a
    when Range
      values.concat  a.to_a
    else
      values << a
    end
  end
  vector=new(values)
  vector.type=:numeric if vector.can_be_numeric?
  vector
end

._load(data) ⇒ Object

:nodoc:



# File 'lib/statsample/vector.rb', line 256

def self._load(data) # :nodoc:
  h=Marshal.load(data)
  Vector.new(h['data'], h['type'], :missing_values=> h['missing_values'], :labels=>h['labels'], :name=>h['name'])
end

.new_numeric(n, val = nil, &block) ⇒ Object

Creates a new numeric vector.

Parameters:

  • n Size of the vector

  • val Value assigned to every element

  • &block If a block is given, it is called with each index and its result sets the value at that position



# File 'lib/statsample/vector.rb', line 146

def self.new_numeric(n,val=nil, &block)
  if block
    vector=n.times.map {|i| block.call(i)}.to_numeric
  else
    vector=n.times.map { val}.to_numeric
  end
  vector.type=:numeric
  vector
end

.new_scale(n, val = nil, &block) ⇒ Object

Deprecated. Use new_numeric instead.



# File 'lib/statsample/vector.rb', line 157

def self.new_scale(n, val=nil,&block)
  $stderr.puts "WARNING: .new_scale has been deprecated. Use .new_numeric instead."
  new_numeric n, val, &block
end

Instance Method Details

#*(v) ⇒ Object



# File 'lib/statsample/vector.rb', line 451

def *(v)
  _vector_ari("*",v)
end

#+(v) ⇒ Object

Vector sum.

  • If v is a scalar, adds this value to every element

  • If v is an Array or a Vector, it must be the same size as this vector; each item of this vector is added to the item at the same position in the other



# File 'lib/statsample/vector.rb', line 437

def +(v)
  _vector_ari("+",v)
end

#-(v) ⇒ Object

Vector subtraction.

  • If v is a scalar, subtracts this value from every element

  • If v is an Array or a Vector, it must be the same size as this vector; the item at the same position in the other is subtracted from each item of this vector



# File 'lib/statsample/vector.rb', line 447

def -(v)
  _vector_ari("-",v)
end

#==(v2) ⇒ Object

Vector equality. Two vectors are equal if their data, missing values, type and labels are equal.



# File 'lib/statsample/vector.rb', line 247

def ==(v2)
  return false unless v2.instance_of? Statsample::Vector
  @data==v2.data and @missing_values==v2.missing_values and @type==v2.type and @labels==v2.labels
end

#[](i) ⇒ Object

Retrieves the element of data at index i



# File 'lib/statsample/vector.rb', line 394

def [](i)
  @data[i]
end

#[]=(i, v) ⇒ Object

Sets the element of data at index i. Note: call set_valid_data afterwards if you assign missing values



# File 'lib/statsample/vector.rb', line 399

def []=(i,v)
  @data[i]=v
end

#_check_type(t) ⇒ Object

:nodoc:

Raises:

  • (NoMethodError)


# File 'lib/statsample/vector.rb', line 185

def _check_type(t) #:nodoc:
  raise NoMethodError if (t == :numeric and @type == :object) or 
                         (t == :date)   or (:date == @type)
end

#_dump(i) ⇒ Object

:nodoc:



# File 'lib/statsample/vector.rb', line 252

def _dump(i) # :nodoc:
  Marshal.dump({'data'=>@data,'missing_values'=>@missing_values, 'labels'=>@labels, 'type'=>@type,'name'=>@name})
end

#_frequencies ⇒ Object

:nodoc:



# File 'lib/statsample/vector.rb', line 775

def _frequencies #:nodoc:
  @valid_data.inject(Hash.new) {|a,x|
    a[x]||=0
    a[x]=a[x]+1
    a
  }
end

#_set_valid_data_intern ⇒ Object

:nodoc:



# File 'lib/statsample/vector.rb', line 351

def _set_valid_data_intern #:nodoc:
  @data.each do |n|
    if is_valid? n
      @valid_data.push(n)
      @data_with_nils.push(n)
    else
      @data_with_nils.push(nil)
      @missing_data.push(n)
    end
  end
  @has_missing_data=@missing_data.size>0
end
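
The partition performed by this loop can be tried outside the library. The following is a minimal sketch in plain Ruby, not Statsample's own code: `partition_valid` is a hypothetical helper introduced only for illustration, and it assumes a value is missing when it is nil or listed in the missing-values array, as #is_valid? describes.

```ruby
# Hypothetical helper (not part of Statsample): splits raw data into the
# three arrays the loop above fills, treating nil and any listed
# missing value as missing.
def partition_valid(data, missing_values = [])
  valid = []
  with_nils = []
  missing = []
  data.each do |x|
    if x.nil? || missing_values.include?(x)
      with_nils.push(nil)   # keep the position, drop the value
      missing.push(x)
    else
      valid.push(x)
      with_nils.push(x)
    end
  end
  { valid: valid, with_nils: with_nils, missing: missing }
end

parts = partition_valid([1, 2, nil, 99, 4], [99])
p parts[:valid]      # [1, 2, 4]
p parts[:with_nils]  # [1, 2, nil, nil, 4]
p parts[:missing]    # [nil, 99]
```

Keeping a nil-padded copy preserves positions, which is what lets methods such as #dichotomize return a result aligned with the original data.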

#_vector_ari(method, v) ⇒ Object

:nodoc:



# File 'lib/statsample/vector.rb', line 465

def _vector_ari(method,v) # :nodoc:
if(v.is_a? Vector or v.is_a? Array)
  raise ArgumentError, "The array/vector parameter (#{v.size}) should be of the same size of the original vector (#{@data.size})" unless v.size==@data.size
  sum=[]
  v.size.times {|i|
      if((v.is_a? Vector and v.is_valid?(v[i]) and is_valid?(@data[i])) or (v.is_a? Array and !v[i].nil? and !data[i].nil?))
          sum.push(@data[i].send(method,v[i]))
      else
          sum.push(nil)
      end
  }
  Statsample::Vector.new(sum, :numeric)
elsif(v.respond_to? method )
  Statsample::Vector.new(
    @data.collect  {|x|
      if(!x.nil?)
        x.send(method,v)
      else
        nil
      end
    } , :numeric)
else
    raise TypeError,"You should pass a scalar or a array/vector"
end

end
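
The two branches above (element-wise against an Array or Vector, or broadcast of a scalar) can be illustrated with plain Arrays. This is only a sketch under simplifying assumptions: `elementwise` is a hypothetical name, and only nil is treated as missing, whereas the library also consults the vector's missing values.

```ruby
# Hypothetical stand-in for the array branch and the scalar branch,
# using plain Arrays and treating only nil as missing.
def elementwise(op, left, right)
  if right.is_a?(Array)
    unless right.size == left.size
      raise ArgumentError, "size mismatch (#{right.size} vs #{left.size})"
    end
    # A missing value on either side yields nil at that position.
    left.zip(right).map { |a, b| (a.nil? || b.nil?) ? nil : a.send(op, b) }
  else
    # Scalar: apply the operator to every non-missing element.
    left.map { |a| a.nil? ? nil : a.send(op, right) }
  end
end

p elementwise(:+, [1, 2, nil], [10, 20, 30])  # [11, 22, nil]
p elementwise(:*, [1, 2, nil], 3)             # [3, 6, nil]
```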

#add(v, update_valid = true) ⇒ Object

Adds a value at the end of the vector. If the second argument is false, you should update the Vector using Vector#set_valid_data at the end of your insertion cycle



# File 'lib/statsample/vector.rb', line 314

def add(v,update_valid=true)
  @data.push(v)
  set_valid_data if update_valid
end

#average_deviation_population(m = nil) ⇒ Object Also known as: adp

Population average deviation (denominator N). Author: Al Chou



# File 'lib/statsample/vector.rb', line 1003

def average_deviation_population( m = nil )
  check_type :numeric
  m ||= mean
  ( @numeric_data.inject( 0 ) { |a, x| ( x - m ).abs + a } ).quo( n_valid )
end

#bootstrap(estimators, nr, s = nil) ⇒ Object

Bootstrap

Generates nr resamples (with replacement) of size s from the vector, computing each estimate from estimators over each resample. estimators could be:

a) A Hash with variable names as keys and lambdas as values

a.bootstrap({:log_s2=>lambda {|v| Math.log(v.variance)}}, 1000)

b) An Array with the names of the methods to bootstrap

a.bootstrap([:mean, :sd],1000)

c) A single method to bootstrap

a.bootstrap(:mean, 1000)

If s is nil, it defaults to the vector size.

Returns a dataset where each vector is a vector of length nr containing the computed resample estimates.



# File 'lib/statsample/vector.rb', line 565

def bootstrap(estimators, nr, s=nil)
  s||=n

  h_est, es, bss= prepare_bootstrap(estimators)


  nr.times do |i|
    bs=sample_with_replacement(s)
    es.each do |estimator|
      # Add bootstrap
      bss[estimator].push(h_est[estimator].call(bs))
    end
  end

  es.each do |est|
    bss[est]=bss[est].to_numeric
    bss[est].type=:numeric
  end
  bss.to_dataset

end
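
The resampling loop can be sketched without the library. The code below is an illustration only: `bootstrap_sketch` and its `rng` parameter are hypothetical, estimators are passed as a Hash of lambdas, and the result is a plain Hash of Arrays rather than a dataset.

```ruby
# Hypothetical plain-Ruby bootstrap: draws nr resamples of size s with
# replacement and applies each named estimator (a lambda) to every resample.
def bootstrap_sketch(data, estimators, nr, s = nil, rng = Random.new)
  s ||= data.size
  results = Hash.new { |h, k| h[k] = [] }
  nr.times do
    # Each draw picks any position with equal probability.
    resample = Array.new(s) { data[rng.rand(data.size)] }
    estimators.each { |name, fn| results[name].push(fn.call(resample)) }
  end
  results
end

mean = lambda { |xs| xs.sum.to_f / xs.size }
out = bootstrap_sketch([2, 4, 4, 5, 7, 9], { mean: mean }, 200, nil, Random.new(42))
puts out[:mean].size  # 200
```

Passing a seeded `Random` makes a run reproducible, which is convenient when comparing estimators.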

#box_cox_transformation(lambda) ⇒ Object

:nodoc:



# File 'lib/statsample/vector.rb', line 230

def box_cox_transformation(lambda) # :nodoc:
  raise "Should be a numeric" unless @type==:numeric
  @data_with_nils.collect{|x|
  if !x.nil?
    if(lambda==0)
      Math.log(x)
    else
      (x**lambda-1).quo(lambda)
    end
  else
    nil
  end
  }.to_vector(:numeric)
end

#can_be_date? ⇒ Boolean

Return true if all data is Date, “today” values or nil

Returns:

  • (Boolean)


# File 'lib/statsample/vector.rb', line 719

def can_be_date?
  if @data.find {|v|
    !v.nil? and !v.is_a? Date and !v.is_a? Time and (v.is_a? String and !@today_values.include? v) and (v.is_a? String and !(v=~/\d{4,4}[-\/]\d{1,2}[-\/]\d{1,2}/))}
    false
  else
    true
  end
end

#can_be_numeric? ⇒ Boolean

Return true if all data is Numeric or nil

Returns:

  • (Boolean)


# File 'lib/statsample/vector.rb', line 728

def can_be_numeric?
  if @data.find {|v| !v.nil? and !v.is_a? Numeric and !@missing_values.include? v}
    false
  else
    true
  end
end

#check_type(t) ⇒ Object

:nodoc:



# File 'lib/statsample/vector.rb', line 175

def check_type(t)
  Statsample::STATSAMPLE__.check_type(self,t)
end

#coefficient_of_variation ⇒ Object Also known as: cov

Coefficient of variation. Calculated with the sample standard deviation



# File 'lib/statsample/vector.rb', line 1075

def coefficient_of_variation
    check_type :numeric
    standard_deviation_sample.quo(mean)
end

#count(x = false) ⇒ Object

Counts the cases that meet a condition. If a block is given, returns the number of elements for which the block returns true. Otherwise, returns the frequency of the given value.



# File 'lib/statsample/vector.rb', line 692

def count(x=false)
  if block_given?
    r=@data.inject(0) {|s, i|
      r=yield i
      s+(r ? 1 : 0)
    }
    r.nil? ? 0 : r
  else
    frequencies[x].nil? ? 0 : frequencies[x]
  end
end

#db_type(dbs = 'mysql') ⇒ Object

Returns the database type for the vector, according to its content



# File 'lib/statsample/vector.rb', line 706

def db_type(dbs='mysql')
# first, detect any character not number
if @data.find {|v|  v.to_s=~/\d{2,2}-\d{2,2}-\d{4,4}/} or @data.find {|v|  v.to_s=~/\d{4,4}-\d{2,2}-\d{2,2}/}
  return "DATE"
elsif @data.find {|v|  v.to_s=~/[^0-9e.-]/ }
  return "VARCHAR (255)"
elsif @data.find {|v| v.to_s=~/\./}
  return "DOUBLE"
else
  return "INTEGER"
end
end

#dichotomize(low = nil) ⇒ Object

Dichotomizes the vector into 0 and 1. By default the lowest value becomes 0 and every higher value becomes 1. If the parameter is given, that value and anything lower become 0, and anything higher becomes 1



# File 'lib/statsample/vector.rb', line 284

def dichotomize(low = nil)
  low ||= factors.min

  @data_with_nils.collect do |x|
    if x.nil?
      nil
    elsif x > low
      1
    else
      0
    end
  end.to_numeric
end
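
The threshold rule can be checked on a plain Array. This is a sketch only: `dichotomize_sketch` is a hypothetical name, and it assumes nil is the only missing value, so the default threshold is the smallest non-nil element.

```ruby
# Hypothetical illustration of the rule above: values greater than `low`
# map to 1, values equal to or below it map to 0, and nil stays nil.
def dichotomize_sketch(data, low = nil)
  low ||= data.compact.min
  data.map { |x| x.nil? ? nil : (x > low ? 1 : 0) }
end

p dichotomize_sketch([1, 2, nil, 3])     # [0, 1, nil, 1]
p dichotomize_sketch([1, 2, nil, 3], 2)  # [0, 0, nil, 1]
```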

#dup ⇒ Object

Creates a duplicate of the Vector. Note: data, missing_values and labels are duplicated, so changes to the original vector don’t propagate to copies.



# File 'lib/statsample/vector.rb', line 164

def dup
  Vector.new(@data.dup,@type, :missing_values => @missing_values.dup, :labels => @labels.dup, :name=>@name)
end

#dup_empty ⇒ Object

Returns an empty duplicate of the vector. Maintains the type, missing values and labels.



# File 'lib/statsample/vector.rb', line 169

def dup_empty
  Vector.new([],@type, :missing_values => @missing_values.dup, :labels => @labels.dup, :name=> @name)
end

#each ⇒ Object

Iterate on each item. Equivalent to

@data.each{|x| yield x}


# File 'lib/statsample/vector.rb', line 300

def each
  @data.each{|x| yield(x) }
end

#each_index ⇒ Object

Iterate on each item, retrieving index



# File 'lib/statsample/vector.rb', line 305

def each_index
  (0...@data.size).each {|i|
    yield(i)
  }
end

#factors ⇒ Object

Retrieves the unique values of the data.



# File 'lib/statsample/vector.rb', line 753

def factors
  if @type==:numeric
    @numeric_data.uniq.sort
  elsif @type==:date
    @date_data_with_nils.uniq.sort
  else
    @valid_data.uniq.sort
  end
end

#frequencies ⇒ Object

:nodoc:



# File 'lib/statsample/vector.rb', line 765

def frequencies
  Statsample::STATSAMPLE__.frequencies(@valid_data)
end

#has_missing_data? ⇒ Boolean Also known as: flawed?

Returns true if the data has one or more missing values

Returns:

  • (Boolean)


# File 'lib/statsample/vector.rb', line 365

def has_missing_data?
  @has_missing_data
end

#histogram(bins = 10) ⇒ Object

With a Fixnum, creates that many bins within the range of the data. With an Array, each value is a cut point



# File 'lib/statsample/vector.rb', line 1050

def histogram(bins=10)
  check_type :numeric

  if bins.is_a? Array
    #h=Statsample::Histogram.new(self, bins)
    h=Statsample::Histogram.alloc(bins)
  else
    # ugly patch. The upper limit for a bin has the form
    # x < range
    #h=Statsample::Histogram.new(self, bins)
    min,max=Statsample::Util.nice(@valid_data.min,@valid_data.max)
    # fix last data
    if max==@valid_data.max
      max+=1e-10
    end
    h=Statsample::Histogram.alloc(bins,[min,max])
    # Fix last bin

  end
  h.increment(@valid_data)
  h
end

#inspect ⇒ Object



# File 'lib/statsample/vector.rb', line 749

def inspect
  self.to_s
end

#is_valid?(x) ⇒ Boolean

Return true if a value is valid (not nil and not included on missing values)

Returns:

  • (Boolean)


# File 'lib/statsample/vector.rb', line 403

def is_valid?(x)
  !(x.nil? or @missing_values.include? x)
end

#jacknife(estimators, k = 1) ⇒ Object

Jackknife

Returns a dataset with jackknife delete-k estimators. estimators could be:

a) A Hash with variable names as keys and lambdas as values

a.jacknife(:log_s2=>lambda {|v| Math.log(v.variance)})

b) An Array with the names of the methods to jackknife

a.jacknife([:mean, :sd])

c) A single method to jackknife

a.jacknife(:mean)

k represents the block size for the block jackknife. By default it is 1, the classic delete-one jackknife.

Returns a dataset where each vector is a vector of length cases/k containing the computed jackknife estimates.

Reference:

  • Sawyer, S. (2005). Resampling Data: Using a Statistical Jacknife.



# File 'lib/statsample/vector.rb', line 604

def jacknife(estimators, k=1)
  raise "n should be divisible by k:#{k}" unless n%k==0

  nb=(n / k).to_i


  h_est, es, ps= prepare_bootstrap(estimators)

  est_n=es.inject({}) {|h,v|
    h[v]=h_est[v].call(self)
    h
  }


  nb.times do |i|
    other=@data_with_nils.dup
    other.slice!(i*k,k)
    other=other.to_numeric
    es.each do |estimator|
      # Add pseudovalue
      ps[estimator].push( nb * est_n[estimator] - (nb-1) * h_est[estimator].call(other))
    end
  end


  es.each do |est|
    ps[est]=ps[est].to_numeric
    ps[est].type=:numeric
  end
  ps.to_dataset
end
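
The pseudovalue expression in the loop above can be tried on a plain Array for the delete-one case (k = 1). This is a sketch, not the library's code: `jackknife_pseudovalues` is a hypothetical name, the estimator is a lambda, and the data is assumed to contain no missing values.

```ruby
# Hypothetical delete-one jackknife on a plain Array.
# pseudovalue_i = n * theta(all) - (n - 1) * theta(all without item i)
def jackknife_pseudovalues(data, estimator)
  n = data.size
  full = estimator.call(data)
  (0...n).map do |i|
    rest = data[0...i] + data[(i + 1)..-1]  # the data with item i removed
    n * full - (n - 1) * estimator.call(rest)
  end
end

mean = lambda { |xs| xs.sum.to_f / xs.size }
p jackknife_pseudovalues([1.0, 2.0, 6.0], mean)  # [1.0, 2.0, 6.0]
```

For the mean, each pseudovalue reduces to the observation that was left out, which makes this a convenient sanity check.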

#kurtosis(m = nil) ⇒ Object

Kurtosis of the sample



# File 'lib/statsample/vector.rb', line 1034

def kurtosis(m=nil)
    check_type :numeric
    m||=mean
    fo=@numeric_data.inject(0){|a,x| a+((x-m)**4)}
    fo.quo((@numeric_data.size)*sd(m)**4)-3

end
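
As the listing shows, the fourth-moment sum is divided by n times the fourth power of the sample standard deviation (denominator n-1), and 3 is subtracted; #skew follows the same pattern with third powers. The sketch below restates both on plain Arrays. The helper names are hypothetical and the input is assumed to hold only numeric values.

```ruby
# Hypothetical helpers restating the skew and kurtosis expressions.
def sample_sd_sketch(xs, m)
  Math.sqrt(xs.inject(0.0) { |a, x| a + (x - m)**2 } / (xs.size - 1))
end

def skew_sketch(xs)
  m = xs.sum.to_f / xs.size
  third = xs.inject(0.0) { |a, x| a + (x - m)**3 }
  third / (xs.size * sample_sd_sketch(xs, m)**3)
end

def kurtosis_sketch(xs)
  m = xs.sum.to_f / xs.size
  fourth = xs.inject(0.0) { |a, x| a + (x - m)**4 }
  fourth / (xs.size * sample_sd_sketch(xs, m)**4) - 3
end

puts skew_sketch([1, 2, 3, 4, 5])               # 0.0
puts kurtosis_sketch([1, 2, 3, 4, 5]).round(3)  # -1.912
```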

#labeling(x) ⇒ Object Also known as: label

Retrieves label for value x. Retrieves x if no label defined.



# File 'lib/statsample/vector.rb', line 372

def labeling(x)
  @labels.has_key?(x) ? @labels[x].to_s : x.to_s
end

#max ⇒ Object

Maximum value



# File 'lib/statsample/vector.rb', line 920

def max
  check_type :numeric
  @valid_data.max
end

#mean ⇒ Object

The arithmetic mean of the data



# File 'lib/statsample/vector.rb', line 966

def mean
  check_type :numeric
  sum.to_f.quo(n_valid)
end

#median ⇒ Object

Returns the median (50th percentile)



# File 'lib/statsample/vector.rb', line 910

def median
  check_type :numeric
  percentil(50)
end

#median_absolute_deviation ⇒ Object Also known as: mad



# File 'lib/statsample/vector.rb', line 1008

def median_absolute_deviation
  med=median
  recode {|x| (x-med).abs}.median
end
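
The method takes the median of the absolute deviations from the median. A minimal sketch on plain Arrays follows. `median_of` and `mad_sketch` are hypothetical helpers, and `median_of` uses the usual middle-value definition rather than calling the library's percentil.

```ruby
# Hypothetical helpers: median of a plain Array, then the median of the
# absolute deviations from that median.
def median_of(xs)
  s = xs.sort
  mid = s.size / 2
  s.size.odd? ? s[mid] : (s[mid - 1] + s[mid]) / 2.0
end

def mad_sketch(xs)
  med = median_of(xs)
  median_of(xs.map { |x| (x - med).abs })
end

p mad_sketch([1, 1, 2, 2, 4, 6, 9])  # 1
```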

#min ⇒ Object

Minimum value



# File 'lib/statsample/vector.rb', line 915

def min
  check_type :numeric
  @valid_data.min
end

#mode ⇒ Object

Returns the most frequent item.



# File 'lib/statsample/vector.rb', line 784

def mode
  frequencies.max{|a,b| a[1]<=>b[1]}.first
end

#n_valid ⇒ Object

The number of items with valid data.



# File 'lib/statsample/vector.rb', line 788

def n_valid
  @valid_data.size
end

#percentil(q, strategy = :midpoint) ⇒ Object

Percentile

Returns the value of the percentile q.

Accepts an optional second argument specifying the strategy to interpolate when the requested percentile lies between two data points a and b. Valid strategies are:

  • :midpoint (Default): (a + b) / 2

  • :linear : a + (b - a) * d where d is the decimal part of the index between a and b. This is the NIST recommended method (en.wikipedia.org/wiki/Percentile#NIST_method)



# File 'lib/statsample/vector.rb', line 868

def percentil(q, strategy = :midpoint)
  check_type :numeric
  sorted=@valid_data.sort

  case strategy
  when :midpoint
    v = (n_valid * q).quo(100)
    if(v.to_i!=v)
      sorted[v.to_i]
    else
      (sorted[(v-0.5).to_i].to_f + sorted[(v+0.5).to_i]).quo(2)
    end
  when :linear
    index = (q / 100.0) * (n_valid + 1)

    k = index.truncate
    d = index % 1

    if k == 0
      sorted[0]
    elsif k >= sorted.size
      sorted[-1]
    else
      sorted[k - 1] + d * (sorted[k] - sorted[k - 1])
    end
  else
    raise NotImplementedError.new "Unknown strategy #{strategy.to_s}"
  end
end
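
To compare the two strategies side by side, the listing above can be restated for plain Arrays. This is a sketch for experimentation only: `percentile_sketch` is a hypothetical name, the input is assumed to contain no missing values, and float division replaces `quo`.

```ruby
# Hypothetical restatement of the :midpoint and :linear strategies.
def percentile_sketch(data, q, strategy = :midpoint)
  sorted = data.sort
  n = sorted.size
  case strategy
  when :midpoint
    v = n * q / 100.0
    if v.to_i != v
      sorted[v.to_i]                 # fractional position: take that element
    else
      (sorted[(v - 0.5).to_i] + sorted[(v + 0.5).to_i]) / 2.0
    end
  when :linear
    index = (q / 100.0) * (n + 1)
    k = index.truncate               # integer part of the position
    d = index % 1                    # decimal part, used for interpolation
    if k == 0
      sorted[0]
    elsif k >= n
      sorted[-1]
    else
      sorted[k - 1] + d * (sorted[k] - sorted[k - 1])
    end
  else
    raise NotImplementedError, "Unknown strategy #{strategy}"
  end
end

p percentile_sketch([1, 2, 3, 4], 50)              # 2.5
p percentile_sketch([1, 2, 3, 4, 5], 25)           # 2
p percentile_sketch([1, 2, 3, 4, 5], 25, :linear)  # 1.5
```

The last two calls show that the strategies can disagree on small samples.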

#product ⇒ Object

Product of all values on the sample



# File 'lib/statsample/vector.rb', line 1043

def product
    check_type :numeric
    @numeric_data.inject(1){|a,x| a*x }
end

#proportion(v = 1) ⇒ Object

Proportion of a given value.



# File 'lib/statsample/vector.rb', line 800

def proportion(v=1)
    frequencies[v].quo(@valid_data.size)
end

#proportion_confidence_interval_t(n_poblation, margin = 0.95, v = 1) ⇒ Object



# File 'lib/statsample/vector.rb', line 840

def proportion_confidence_interval_t(n_poblation,margin=0.95,v=1)
  Statsample::proportion_confidence_interval_t(proportion(v), @valid_data.size, n_poblation, margin)
end

#proportion_confidence_interval_z(n_poblation, margin = 0.95, v = 1) ⇒ Object



# File 'lib/statsample/vector.rb', line 843

def proportion_confidence_interval_z(n_poblation,margin=0.95,v=1)
  Statsample::proportion_confidence_interval_z(proportion(v), @valid_data.size, n_poblation, margin)
end

#proportions ⇒ Object

Returns a hash with the distribution of proportions of the sample.



# File 'lib/statsample/vector.rb', line 793

def proportions
    frequencies.inject({}){|a,v|
        a[v[0]] = v[1].quo(n_valid)
        a
    }
end

#push(v) ⇒ Object



# File 'lib/statsample/vector.rb', line 276

def push(v)
  @data.push(v)
  set_valid_data
end

#range ⇒ Object

The range of the data (max - min)



# File 'lib/statsample/vector.rb', line 956

def range;
  check_type :numeric
  @numeric_data.max - @numeric_data.min
end

#ranked(type = :numeric) ⇒ Object

Returns a ranked vector.



# File 'lib/statsample/vector.rb', line 899

def ranked(type=:numeric)
  check_type :numeric
  i=0
  r=frequencies.sort.inject({}){|a,v|
    a[v[0]]=(i+1 + i+v[1]).quo(2)
    i+=v[1]
    a
  }
  @data.collect {|c| r[c] }.to_vector(type)
end
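
The expression `(i+1 + i+v[1]).quo(2)` gives tied values the average of the rank positions they occupy. The following sketch shows the effect on a plain Array. `average_ranks` is a hypothetical helper and assumes the data has no missing values.

```ruby
# Hypothetical average-rank helper: tied values share the mean of the
# ranks they occupy.
def average_ranks(data)
  counts = Hash.new(0)
  data.each { |x| counts[x] += 1 }
  seen = 0            # how many items rank below the current value
  rank_of = {}
  counts.sort.each do |value, count|
    rank_of[value] = (seen + 1 + seen + count) / 2.0
    seen += count
  end
  data.map { |x| rank_of[x] }
end

p average_ranks([10, 20, 20, 30])  # [1.0, 2.5, 2.5, 4.0]
```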

#recode(type = nil) ⇒ Object

Returns a new vector, with data modified by the block. Equivalent to creating a Vector from the result of #collect on data



# File 'lib/statsample/vector.rb', line 262

def recode(type=nil)
  type||=@type
  @data.collect{|x|
    yield x
  }.to_vector(type)
end

#recode! ⇒ Object

Modifies current vector, with data modified by block. Equivalent to #collect! on @data



# File 'lib/statsample/vector.rb', line 270

def recode!
  @data.collect!{|x|
    yield x
  }
  set_valid_data
end

#report_building(b) ⇒ Object



# File 'lib/statsample/vector.rb', line 803

def report_building(b)
  b.section(:name=>name) do |s|
    s.text _("n :%d") % n
    s.text _("n valid:%d") % n_valid
    if @type==:object
      s.text  _("factors:%s") % factors.join(",")
      s.text   _("mode: %s") % mode

      s.table(:name=>_("Distribution")) do |t|
        frequencies.sort.each do |k,v|
          key=labels.has_key?(k) ? labels[k]:k
          t.row [key, v , ("%0.2f%%" % (v.quo(n_valid)*100))]
        end
      end
    end

    s.text _("median: %s") % median.to_s if(@type==:numeric or @type==:numeric)
    if(@type==:numeric)
      s.text _("mean: %0.4f") % mean
      if sd
        s.text _("std.dev.: %0.4f") % sd
        s.text _("std.err.: %0.4f") % se
        s.text _("skew: %0.4f") % skew
        s.text _("kurtosis: %0.4f") % kurtosis
      end
    end
  end
end

#sample_with_replacement(sample = 1) ⇒ Object

Returns a random sample of the given size, with replacement, drawn only from valid data.

In every trial, each item has the same probability of being selected.



# File 'lib/statsample/vector.rb', line 666

def sample_with_replacement(sample=1)
  vds=@valid_data.size
  (0...sample).collect{ @valid_data[rand(vds)] }
end

#sample_without_replacement(sample = 1) ⇒ Object

Returns a random sample of the given size, without replacement, drawn only from valid data.

Every element can be selected only once.

A sample of the same size as the vector is the vector itself.

Raises:

  • (ArgumentError)


# File 'lib/statsample/vector.rb', line 677

def sample_without_replacement(sample=1)
  raise ArgumentError, "Sample size couldn't be greater than n" if sample>@valid_data.size
  out=[]
  size=@valid_data.size
  while out.size<sample
    value=rand(size)
    out.push(value) if !out.include?value
  end
  out.collect{|i| @data[i]}
end
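
The rejection loop above draws distinct positions until enough have been collected. A plain-Ruby sketch of the same idea follows. `sample_without_replacement_sketch` and its `rng` parameter are hypothetical, and, as an assumption of this sketch, it reads the drawn positions from the array of valid values that is passed in.

```ruby
# Hypothetical sketch: keep drawing random positions, skipping repeats,
# until `size` distinct positions have been picked.
def sample_without_replacement_sketch(valid, size, rng = Random.new)
  raise ArgumentError, "Sample size couldn't be greater than n" if size > valid.size
  picked = []
  while picked.size < size
    i = rng.rand(valid.size)
    picked.push(i) unless picked.include?(i)
  end
  picked.map { |i| valid[i] }
end

sample = sample_without_replacement_sketch([1, 2, 3, 4, 5], 5, Random.new(7))
p sample.sort  # [1, 2, 3, 4, 5]
```

Drawing as many items as there are values returns a permutation of them, which is why the sorted sample equals the input.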

#set_valid_data ⇒ Object

Updates valid_data, missing_data, data_with_nils and gsl at the end of an insertion.

Use after Vector#add(v,false). Usage:

v=Statsample::Vector.new
v.add(2,false)
v.add(4,false)
v.data
=> [2,4]
v.valid_data
=> []
v.set_valid_data
v.valid_data
=> [2,4]


# File 'lib/statsample/vector.rb', line 333

def set_valid_data
  @valid_data.clear
  @missing_data.clear
  @data_with_nils.clear
  @date_data_with_nils.clear
  set_valid_data_intern
  set_numeric_data if(@type==:numeric)
  set_date_data if(@type==:date)
end

#set_valid_data_intern ⇒ Object

:nodoc:



# File 'lib/statsample/vector.rb', line 343

def set_valid_data_intern #:nodoc:
  Statsample::STATSAMPLE__.set_valid_data_intern(self)
end

#size ⇒ Object Also known as: n

Size of total data



# File 'lib/statsample/vector.rb', line 388

def size
  @data.size
end

#skew(m = nil) ⇒ Object

Skewness of the sample



# File 'lib/statsample/vector.rb', line 1027

def skew(m=nil)
    check_type :numeric
    m||=mean
    th=@numeric_data.inject(0){|a,x| a+((x-m)**3)}
    th.quo((@numeric_data.size)*sd(m)**3)
end

#split_by_separator(sep = Statsample::SPLIT_TOKEN) ⇒ Object

Returns a hash of Vectors, one for each distinct value found in the split fields. Each vector holds 1 where the value is present in a case and 0 where it is not. Example:

a=Vector.new(["a,b","c,d","a,b"])
a.split_by_separator
=>  {"a"=>Vector(type:object, n:3)[1,0,1],
     "b"=>Vector(type:object, n:3)[1,0,1],
     "c"=>Vector(type:object, n:3)[0,1,0],
     "d"=>Vector(type:object, n:3)[0,1,0]}


# File 'lib/statsample/vector.rb', line 520

def split_by_separator(sep=Statsample::SPLIT_TOKEN)
split_data=splitted(sep)
factors=split_data.flatten.uniq.compact
out=factors.inject({}) {|a,x|
  a[x]=[]
  a
}
split_data.each do |r|
  if r.nil?
    factors.each do |f|
      out[f].push(nil)
    end
  else
    factors.each do |f|
      out[f].push(r.include?(f) ? 1:0)
    end
  end
end
out.inject({}){|s,v|
  s[v[0]]=Vector.new(v[1],:object)
  s
}
end
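
The indicator columns built above can be reproduced with plain Arrays. This is a sketch: `indicator_columns` is a hypothetical helper, it takes "," as the separator, and it returns a Hash of Arrays rather than Vectors.

```ruby
# Hypothetical illustration: one 0/1 column per distinct token, with nil
# rows propagated as nil in every column.
def indicator_columns(data, sep = ",")
  rows = data.map { |x| x.nil? ? nil : x.split(sep) }
  factors = rows.flatten.uniq.compact
  factors.each_with_object({}) do |f, out|
    out[f] = rows.map { |r| r.nil? ? nil : (r.include?(f) ? 1 : 0) }
  end
end

cols = indicator_columns(["a,b", "c,d", "a,b"])
p cols["a"]  # [1, 0, 1]
p cols["c"]  # [0, 1, 0]
```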

#split_by_separator_freq(sep = Statsample::SPLIT_TOKEN) ⇒ Object



# File 'lib/statsample/vector.rb', line 543

def split_by_separator_freq(sep=Statsample::SPLIT_TOKEN)
  split_by_separator(sep).inject({}) {|a,v|
    a[v[0]]=v[1].inject {|s,x| s+x.to_i}
    a
  }
end

#splitted(sep = Statsample::SPLIT_TOKEN) ⇒ Object

Returns an array with the data split by a separator.

a=Vector.new(["a,b","c,d","a,b","d"])
a.splitted
=> [["a","b"],["c","d"],["a","b"],["d"]]


# File 'lib/statsample/vector.rb', line 496

def splitted(sep=Statsample::SPLIT_TOKEN)
@data.collect{|x|
  if x.nil?
    nil
  elsif (x.respond_to? :split)
    x.split(sep)
  else
    [x]
  end
}
end

#standard_deviation_population(m = nil) ⇒ Object Also known as: sdp

Population Standard deviation (denominator N)



# File 'lib/statsample/vector.rb', line 995

def standard_deviation_population(m=nil)
  check_type :numeric
  Math::sqrt( variance_population(m) )
end

#standard_deviation_sample(m = nil) ⇒ Object Also known as: sds, sd

Sample Standard deviation (denominator n-1)



# File 'lib/statsample/vector.rb', line 1021

def standard_deviation_sample(m=nil)
    check_type :numeric
    m||=mean
    Math::sqrt(variance_sample(m))
end

#standard_error ⇒ Object Also known as: se

Standard error of the distribution mean. Calculated using sd/sqrt(n)



# File 'lib/statsample/vector.rb', line 1081

def standard_error
  standard_deviation_sample.quo(Math.sqrt(valid_data.size))
end

#sum ⇒ Object

The sum of values for the data



# File 'lib/statsample/vector.rb', line 961

def sum
  check_type :numeric
  @numeric_data.inject(0){|a,x|x+a} ;
end

#sum_of_squared_deviation ⇒ Object

Sum of squared deviation



# File 'lib/statsample/vector.rb', line 980

def sum_of_squared_deviation
  check_type :numeric
  @numeric_data.inject(0) {|a,x| x.square+a} - (sum.square.quo(n_valid))
end

#sum_of_squares(m = nil) ⇒ Object Also known as: ss

Sum of squares for the data around a value. By default, this value is the mean

ss= sum{(xi-m)^2}


# File 'lib/statsample/vector.rb', line 974

def sum_of_squares(m=nil)
  check_type :numeric
  m||=mean
  @numeric_data.inject(0){|a,x| a+(x-m).square}
end

#to_a ⇒ Object Also known as: to_ary



# File 'lib/statsample/vector.rb', line 423

def to_a
  if @data.is_a? Array
    @data.dup
  else
    @data.to_a
  end
end

#to_matrix(dir = :horizontal) ⇒ Object

Ugly name. It really creates a Matrix from the standard ‘matrix’ library, holding the data as a single row or a single column. dir could be :horizontal or :vertical



# File 'lib/statsample/vector.rb', line 741

def to_matrix(dir=:horizontal)
  case dir
  when :horizontal
    Matrix[@data]
  when :vertical
    Matrix.columns([@data])
  end
end

#to_REXP ⇒ Object



# File 'lib/statsample/rserve_extension.rb', line 6

def to_REXP
  Rserve::REXP::Wrapper.wrap(data_with_nils)
end

#to_s ⇒ Object



# File 'lib/statsample/vector.rb', line 736

def to_s
  sprintf("Vector(type:%s, n:%d)[%s]",@type.to_s,@data.size, @data.collect{|d| d.nil? ? "nil":d}.join(","))
end

#variance_population(m = nil) ⇒ Object

Population variance (denominator N)



# File 'lib/statsample/vector.rb', line 986

def variance_population(m=nil)
  check_type :numeric
  m||=mean
  squares=@numeric_data.inject(0){|a,x| x.square+a}
  squares.quo(n_valid) - m.square
end

#variance_proportion(n_poblation, v = 1) ⇒ Object

Variance of p, according to population size



# File 'lib/statsample/vector.rb', line 833

def variance_proportion(n_poblation, v=1)
  Statsample::proportion_variance_sample(self.proportion(v), @valid_data.size, n_poblation)
end

#variance_sample(m = nil) ⇒ Object Also known as: variance

Sample Variance (denominator n-1)



# File 'lib/statsample/vector.rb', line 1014

def variance_sample(m=nil)
  check_type :numeric
  m||=mean
  sum_of_squares(m).quo(n_valid - 1)
end
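
The sample and population variances differ only in the denominator applied to the sum of squares around the mean. The sketch below shows this on a plain Array. The helper names are hypothetical and the input is assumed to hold only numeric values.

```ruby
# Hypothetical helpers: sum of squared deviations, then the two variances.
def sum_of_squares_sketch(xs, m = nil)
  m ||= xs.sum.to_f / xs.size
  xs.inject(0.0) { |a, x| a + (x - m)**2 }
end

def variance_sample_sketch(xs)
  sum_of_squares_sketch(xs) / (xs.size - 1)   # denominator n-1
end

def variance_population_sketch(xs)
  sum_of_squares_sketch(xs) / xs.size         # denominator N
end

xs = [2, 4, 4, 4, 5, 5, 7, 9]
puts variance_population_sketch(xs)  # 4.0
puts variance_sample_sketch(xs)
```

For this data the mean is 5 and the sum of squares is 32, so the sample variance is 32/7, slightly larger than the population value of 4.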

#variance_total(n_poblation, v = 1) ⇒ Object

Variance of p, according to population size



# File 'lib/statsample/vector.rb', line 837

def variance_total(n_poblation, v=1)
  Statsample::total_variance_sample(self.proportion(v), @valid_data.size, n_poblation)
end

#vector_centered ⇒ Object Also known as: centered

Return a centered vector



# File 'lib/statsample/vector.rb', line 210

def vector_centered
  check_type :numeric
  m=mean
  return ([nil]*size).to_numeric if mean.nil?
  vector=vector_centered_compute(m)
  vector.name=_("%s(centered)") % @name
  vector
end

#vector_centered_compute(m) ⇒ Object

:nodoc:



# File 'lib/statsample/vector.rb', line 206

def vector_centered_compute(m) #:nodoc:
  @data_with_nils.collect {|x| x.nil? ? nil : x.to_f-m }.to_numeric
end

#vector_labeled ⇒ Object

Returns a Vector in which each value that has a label is replaced by that label.



# File 'lib/statsample/vector.rb', line 377

def vector_labeled
  d=@data.collect{|x|
    if @labels.has_key? x
      @labels[x]
    else
      x
    end
  }
  Vector.new(d,@type)
end

#vector_percentil ⇒ Object

Returns a vector with each value replaced by its percentile



# File 'lib/statsample/vector.rb', line 223

def vector_percentil
  check_type :numeric
  c=@valid_data.size
  vector=ranked.map {|i| i.nil? ? nil : (i.quo(c)*100).to_f }.to_vector(@type)
  vector.name=_("%s(percentil)")  % @name
  vector
end

#vector_standarized(use_population = false) ⇒ Object Also known as: standarized

Returns a vector with the standardized values of the data, using the sd with denominator n-1 (or the population sd if use_population is true). With sd=0 or a nil mean, returns a vector of equal size full of nils



# File 'lib/statsample/vector.rb', line 197

def vector_standarized(use_population=false)
  check_type :numeric
  m=mean
  sd=use_population ? sdp : sds
  return ([nil]*size).to_numeric if mean.nil? or sd==0.0
  vector=vector_standarized_compute(m,sd)
  vector.name=_("%s(standarized)")  % @name
  vector
end

#vector_standarized_compute(m, sd) ⇒ Object

:nodoc:



# File 'lib/statsample/vector.rb', line 190

def vector_standarized_compute(m,sd) # :nodoc:
  @data_with_nils.collect{|x| x.nil? ? nil : (x.to_f - m).quo(sd) }.to_vector(:numeric)
end
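
Standardizing subtracts the mean and divides by the standard deviation, leaving missing positions as nil. A plain-Ruby sketch follows. `standardize_sketch` is a hypothetical helper that treats only nil as missing, always uses the sample sd, and assumes at least two valid values.

```ruby
# Hypothetical z-score helper: (x - mean) / sd for each valid value,
# nil where the value is missing, all nils when sd is zero.
def standardize_sketch(data)
  valid = data.compact
  m = valid.sum.to_f / valid.size
  sd = Math.sqrt(valid.inject(0.0) { |a, x| a + (x - m)**2 } / (valid.size - 1))
  return [nil] * data.size if sd == 0.0
  data.map { |x| x.nil? ? nil : (x - m) / sd }
end

p standardize_sketch([1, 2, 3, nil])  # [-1.0, 0.0, 1.0, nil]
p standardize_sketch([3, 3, 3])       # [nil, nil, nil]
```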

#verify ⇒ Object

Reports all values that don’t comply with a condition. Returns a hash with the index of data and the invalid data.



# File 'lib/statsample/vector.rb', line 456

def verify
  h={}
  (0...@data.size).to_a.each{|i|
    if !(yield @data[i])
      h[i]=@data[i]
    end
  }
  h
end