Module: GECS

Defined in:
lib/GECS.rb

Overview

Gem for Experimental Computer Science

Author

David Flater

Copyright

Public domain

License

Unlicense

The Gem for Experimental Computer Science (GECS) is for managing the data resulting from experiments on software and IT systems. It realizes a data model that disambiguates the vocabulary of experimentation and measurement for a computer science audience. It also provides convenience functions to determine confidence intervals, analyze main effects and interactions, and export data in an R-compatible format for further analysis and visualization.

This software is experimental. NIST assumes no responsibility whatsoever for its use by other parties and makes no guarantees, expressed or implied, about its quality, reliability, or any other characteristic.

Conventions used within the GECS module

The following abbreviations are used:

Parm

parameter

Est

estimation or estimate

Inv

interval

Ind

independent

Dep

dependent

Var

variable

Treat

treatment (a vector of factor levels)

Arrays of character strings (or in one case, ParmDef structs) are used as a kind of “enum.” Values of the enum pseudo-type are unsigned integers that simply index the array to provide a concise identifier for whatever the referenced character string (or ParmDef) describes.
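
The pseudo-enum convention can be sketched as follows. This is a minimal illustration, not the library's code: get_or_add is a hypothetical stand-in for the find-or-create lookups (such as getOrAddParm) that GECS performs on its own arrays.

```ruby
# Hypothetical find-or-create helper: return the index of name in the
# pseudo-enum array, appending name first if it is not already present.
def get_or_add(enum, name)
  idx = enum.index(name)
  if idx.nil?
    enum.push(name)
    idx = enum.length - 1
  end
  idx
end

parms = ["mean", "standard deviation", "variance"]
mean_id   = get_or_add(parms, "mean")    # already present
median_id = get_or_add(parms, "median")  # appended on first use

puts parms[mean_id]
puts parms[median_id]
```

The integer is only meaningful together with the array it indexes, which is why each export file carries its own definitions.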

Many helper methods that should not be part of the GECS API are nevertheless exposed and listed by rdoc. They are tagged (Private) in their comments.

Rdoc lists struct definitions in the Constants section.

Dependencies

To use the ::bootstrapMeans method:

  • The Ruby gem parallel must be installed. Tested version: 0.9.2.

  • The R environment for statistical computing must be runnable from a shell command line. Tested version: 3.1.0.

  • The R package bootBCa must be installed. Tested version: 1.0.

To use the ::quickMeans method:

  • The Ruby gem statistics2 must be installed. Tested version: 0.54.

Both ::bootstrapMeans and ::quickMeans use Set from the Ruby standard library (require 'set'), which needs no separate installation.

Bad behaviors

::bootstrapMeans writes temporary files named bootstrap-in-N.txt (N is a serial number) into the current working directory. Normally, each file is deleted once the R invocation that reads it has returned.

The inverse t distribution in statistics2 that is used by ::quickMeans agrees with R and Octave only to about 4 decimal places.

Defined Under Namespace

Classes: BagOfHolding, DoubleBag, Experiment, Key, ParPod, ParmData, ParmDef

Constant Summary

Version =

Integer constant indicating the version of GECS that has been loaded. There is no major/minor/patchlevel encoding. It just increments.

1

Parms =

Array of character strings (pseudo-enum definition) used to identify parameters.

Parameters are theoretically fixed but unknown properties of a population that is larger than the sample. They can only be estimated, and because estimation methods can be computationally expensive and complex, it may be important to preserve the estimates.

  • mean

  • standard deviation

  • variance

This is a non-prescriptive definition for example or default use. Each export file encapsulates its own definitions.

[
  "mean",
  "standard deviation",
  "variance"
]

EstMethods =

Array of character strings (pseudo-enum definition) used to identify estimation methods.

  • original

  • bootstrap, percentile interval

  • bootstrap, BCa interval

  • bootstrap-t interval

This is a non-prescriptive definition for example or default use. Each export file encapsulates its own definitions.

[
  "original",             # Standard formulae using normal approximations.
  "bootstrap, percentile interval",
  "bootstrap, BCa interval",
  "bootstrap-t interval"  # A.k.a. studentized bootstrap
]

EstMethodParms =

Array of character strings (pseudo-enum definition) used to identify parameters of estimation methods.

Different estimation methods will have different parameters. Interval type matters for asymmetrical distributions. Nested bootstrap replica count is used for bootstrap-t. An adaptive bootstrap may vary the replica count to achieve a precision specified as a numerical tolerance.

  • coverage probability

  • interval type

  • bootstrap replica count

  • nested bootstrap replica count

  • numerical tolerance of adaptive bootstrap

  • estimated attained precision of adaptive bootstrap

This is a non-prescriptive definition for example or default use. Each export file encapsulates its own definitions.

[
  "coverage probability",
  "interval type",
  "bootstrap replica count",
  "nested bootstrap replica count",
  "numerical tolerance of adaptive bootstrap",
  "estimated attained precision of adaptive bootstrap"
]

InvTypes =

Array of character strings (pseudo-enum definition) used to identify types of confidence intervals.

  • probabilistically symmetric

  • shortest

This is a non-prescriptive definition for example or default use. Each export file encapsulates its own definitions.

[
  "probabilistically symmetric",     # A.k.a. equi-tailed.
  "shortest"
]

Class Method Details

.bootstrapMeans(bagOfHolding, id, delta) ⇒ Object

Add the following parameter to the results and effects of a specified experiment: the mean, estimated by the bootstrap method with a BCa interval at 95% confidence, with the bootstrap replica count determined adaptively to achieve the requested numerical tolerance for each depvar. The estimate provided is the sample mean, not the bootstrap estimate. The replica count will be at least 50000.

This function is parallelized. All available CPUs will be used to run bootstrap calculations in R.

bagOfHolding

A BagOfHolding.

id

Experiment id.

delta

Array of numbers specifying the desired numerical tolerance for each depvar, one entry per depvar in depvar order. To reduce variation below the resolution of a typical plot 1000 pixels high, set delta to roughly (max(y)-min(y))/2000 for whatever range of y is being plotted.
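
The rule of thumb for delta can be worked through with a short sketch. The measurement values and the plot_height name are made up for illustration, and the y range is assumed here to be the range of the raw measurements for each depvar.

```ruby
# Hypothetical measurements, one array per depvar.
series = [
  [12.0, 15.5, 11.2, 14.8],  # depvar 0
  [0.30, 0.42, 0.35]         # depvar 1
]

plot_height = 1000  # pixels; half a pixel corresponds to range/2000

# One tolerance per depvar: (max(y)-min(y)) / (2 * plot height).
delta = series.map { |ys| (ys.max - ys.min) / (2.0 * plot_height) }

delta.each_with_index { |d, di| puts "depvar #{di}: delta = #{d}" }
```

The resulting array has one entry per depvar, which is the shape that the delta argument requires.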



# File 'lib/GECS.rb', line 682

def GECS.bootstrapMeans(bagOfHolding,id,delta)
  require 'parallel'
  throw "Bag is nil" if bagOfHolding.nil?
  throw "Experiments are nil" if bagOfHolding.experiments.nil?
  throw "No such experiment" if id >= bagOfHolding.experiments.length

  exp = bagOfHolding.experiments[id]
  throw "Null experiment" if exp.nil?
  throw "Null indvars" if exp.indVars.nil?
  throw "Null depvars" if exp.depVars.nil?
  numfacs = exp.indVars.length
  numdeps = exp.depVars.length
  throw "Not enough indvars" if numfacs < 1
  throw "Not enough depvars" if numdeps < 1
  throw "Wrong number of deltas" if delta.length != numdeps

  parmDef = ParmDef.new(bagOfHolding.getOrAddParm("mean"),
    bagOfHolding.getOrAddEstMethod("bootstrap, BCa interval"),
    {bagOfHolding.getOrAddEstMethodParm("coverage probability")=>0.95})
  meanId = bagOfHolding.getOrAddParmDef(parmDef)

  # Make a list of the data and effects that need fixing with a unique
  # serial number assigned to each.
  sernum = 0
  pods = Array.new
  enumerateKeys(bagOfHolding,id).each{|key|
    for di in 0..numdeps-1
      pods.push(ParPod.new(key,di,(sernum+=1),nil))
    end
  }

  # Run numCPUs instances of BCa.R in parallel.
  pods = Parallel.map(pods) do |pod|
    poddata = refactorExtract(bagOfHolding,pod.key,pod.di)
    unless poddata.nil?
      descript = "Treatment " + pod.key.treat.to_s + " depvar " + pod.di.to_s
      fnam = "bootstrap-in-" + pod.sernum.to_s + ".txt"
      fp = File.open(fnam,"w")
      fp.puts(poddata)
      fp.close

      # This R script that has been mangled onto the command line to avoid
      # adding another external dependency is mostly just a wrapper for
      # BCa.R, but the bootstrap estimate of the mean is replaced by the
      # sample mean.
      cmd = "Rscript " +
        "-e 'library(\"bootBCa\")' " +
        "-e 'data <- unlist(read.table(\"" + fnam + "\",header=F,colClasses=\"numeric\"))' " +
        "-e 'out <- BCa(data,as.numeric(" + delta[pod.di].to_s + "),mean)' " +
        "-e 'cat(sprintf(\"%d %0.16f %0.16f %0.16f %0.16f\\n\",out[1],out[2],mean(data),out[4],out[5]))'"

      pod.out = `#{cmd}`.split
      File.delete(fnam)
      if pod.out.length < 5
        # This should never happen.
        print "Bootstrap failure\n"
        print "  ", descript, "\n"
        print "  Depvar: ", bagOfHolding.experiments[id].depVars[pod.di], "\n"
        throw "Bootstrap failure"
      end

      # Verbose progress reporting
      print descript + ", " + pod.out[0] + " iterations done\n"
    end
    pod
  end

  # Copy back parameter estimates.
  bagOfHolding.ests ||= Hash.new
  pods.each{|pod|
    pd = parmRet(bagOfHolding,delta,pod)
    unless pd.nil?
      bagOfHolding.ests[pod.key] ||= Array.new(numdeps,nil)
      bagOfHolding.ests[pod.key][pod.di] ||= Hash.new
      bagOfHolding.ests[pod.key][pod.di][meanId] = pd
    end
  }
end

.describeParms(bagOfHolding, id) ⇒ Object

Print out parameter metadata for an experiment.



# File 'lib/GECS.rb', line 420

def GECS.describeParms(bagOfHolding,id)
  experiment = bagOfHolding.experiments[id]
  ests = bagOfHolding.ests.select{|k,v| k.experimentId==id}
  ests.each{|k,v|
    print "Treatment " + k.treat.to_s + "\n"
    v.each_index{|di|
      print "  Depvar ", experiment.depVars[di], "\n"
      if v[di].nil?
        print "    Not applicable\n"
      else
        v[di].each{|pk,parmData|
          parmDef = bagOfHolding.parmDefs[pk]
          print "    Parameter: ", bagOfHolding.parms[parmDef.parm], "\n"
          print "    Estimation method: ", bagOfHolding.estMethods[parmDef.estMethod], "\n"
          print "    Global estimation method parameters:", "\n"
          printEstMethodParms(bagOfHolding, parmDef.estMethodParms)
          print "    Local estimation method parameters:", "\n"
          printEstMethodParms(bagOfHolding, parmData.estMethodParms)
        }
      end
    }
  }
end

.dumpData(bagOfHolding, id) ⇒ Object

Dump an experiment’s raw data (not parameter estimates) as an R table (with header) with a column for each dependent variable and N rows for each treatment. Short and missing series are padded with NAs.



# File 'lib/GECS.rb', line 311

def GECS.dumpData(bagOfHolding,id)
  throw "Bag is nil" if bagOfHolding.nil?
  throw "Experiments are nil" if bagOfHolding.experiments.nil?
  throw "No such experiment" if id >= bagOfHolding.experiments.length
  exp = bagOfHolding.experiments[id]
  raise "Experiment has no independent variables!" if exp.indVars.empty?
  raise "Experiment has no dependent variables!" if exp.depVars.empty?
  dump = quotesome(exp.indVars) + " " + quotesome(exp.depVars) + "\n"
  results = bagOfHolding.data.select{|k,v| k.experimentId==id}
  results.each{|k,cellarray|
    raise "Null treatment data" if cellarray.nil?
    raise "Bad treatment data" if cellarray.length != exp.depVars.length
    maxlen = cellarray.map{|cell| cell.nil? ? 1 : cell.length}.max
    for iteration in 0..maxlen-1
      dump << quotesome(k.treat)
      cellarray.each{|cell|
        dump << " " + quotemaybe(cell.nil? ? nil : cell[iteration]).to_s
      }
      dump << "\n"
    end
  }
  dump
end
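
The NA padding rule for short and missing series can be illustrated in isolation. This is a simplified sketch, not the listing above: pad_rows is a hypothetical helper that handles one treatment's cells and omits the treatment columns and string quoting.

```ruby
# Given one treatment's cells (one array of measurements per depvar, or nil
# for a missing series), emit one row per iteration, padded with NA.
def pad_rows(cellarray)
  # A treatment whose series are all missing still yields one row.
  maxlen = cellarray.map { |cell| cell.nil? ? 1 : cell.length }.max
  (0...maxlen).map do |i|
    cellarray.map { |cell|
      (cell.nil? || cell[i].nil?) ? "NA" : cell[i].to_s
    }.join(" ")
  end
end

rows = pad_rows([[1.5, 2.5, 3.5], [7], nil])
puts rows
```

Here the first depvar has three measurements, the second has one, and the third is missing, so the table gets three rows with NA filling the gaps.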

.dumpParms(bagOfHolding, id, prec = false) ⇒ Object

Dump parameter estimates for an experiment as an R table (with header) with crudely constructed column names: depVar X parmDefId X [est, lo, hi, optionally prec]. Estimates are assumed to be scalars. If prec is true, add a prec column for each parameter containing the value of the estMethodParm “estimated attained precision of adaptive bootstrap”.



# File 'lib/GECS.rb', line 340

def GECS.dumpParms(bagOfHolding,id,prec=false)
  throw "Bag is nil" if bagOfHolding.nil?
  throw "Experiments are nil" if bagOfHolding.experiments.nil?
  throw "No such experiment" if id >= bagOfHolding.experiments.length
  experiment = bagOfHolding.experiments[id]
  ests = bagOfHolding.ests.select{|k,v| k.experimentId==id}
  parms = ests[ests.keys[0]][0].keys.sort
  precparm = bagOfHolding.getOrAddEstMethodParm("estimated attained precision of adaptive bootstrap") if prec
  dump = "# Key to parameter ID numbers:\n"
  parms.each{|x|
    parmDef = bagOfHolding.parmDefs[x]
    dump += "# " + x.to_s + " = " + bagOfHolding.parms[parmDef.parm] +
                            ", " + bagOfHolding.estMethods[parmDef.estMethod]
    parmDef.estMethodParms.each{|k,v|
      dump += ", " + bagOfHolding.estMethodParms[k] + "=" + v.to_s
    }
    dump += "\n"
  }
  dump += quotesome(experiment.indVars)
  experiment.depVars.each{|d|
    parms.each{|p|
      dump += " \"" + d + " " + p.to_s + " est\"" +
              " \"" + d + " " + p.to_s + " lo\"" +
              " \"" + d + " " + p.to_s + " hi\""
      dump += " \"" + d + " " + p.to_s + " prec\"" if prec
    }
  }
  dump += "\n"
  ests.each{|k,cells|
    dump += quotesome(k.treat)
    cells.each{|cell|
      parms.each{|p|
        if cell.nil?
          dump += " NA NA NA"
          dump += " NA" if prec
        else
          parm = cell[p]
          dump += " " + quotemaybe(parm.est).to_s + " " + quotemaybe(parm.lo).to_s + " " + quotemaybe(parm.hi).to_s
          if prec
            if parm.estMethodParms.nil?
              dump += " NA"
            else
              dump += " " + quotemaybe(parm.estMethodParms[precparm]).to_s
            end
          end
        end
      }
    }
    dump += "\n"
  }
  dump
end
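
The column naming scheme (depVar X parmDefId X suffix) can be sketched on its own. The helper name parm_columns and the sample depvar names are hypothetical, and the sketch leaves out the independent-variable columns that precede these in the real header.

```ruby
# Build the quoted column names for the parameter part of the header:
# for each depvar, for each parameter id, est/lo/hi and optionally prec.
def parm_columns(dep_vars, parm_ids, prec = false)
  suffixes = ["est", "lo", "hi"]
  suffixes += ["prec"] if prec
  dep_vars.flat_map do |d|
    parm_ids.flat_map do |p|
      suffixes.map { |s| "\"#{d} #{p} #{s}\"" }
    end
  end
end

cols = parm_columns(["time", "memory"], [0], true)
puts cols.join(" ")
```

Because the names contain spaces they are quoted. How R's read.table then presents them depends on its check.names setting.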

.enumerateKeys(bagOfHolding, id) ⇒ Object

(Private) Make a list of the Keys for all treatments, main effects, and 2-way interactions for an experiment. An attempt is made to suppress interactions for which there are no data at all (combinations of levels that don’t occur).



# File 'lib/GECS.rb', line 509

def GECS.enumerateKeys(bagOfHolding,id)
  require 'set'
  throw "Bag is nil" if bagOfHolding.nil?
  throw "Experiments are nil" if bagOfHolding.experiments.nil?
  throw "No such experiment" if id >= bagOfHolding.experiments.length

  exp = bagOfHolding.experiments[id]
  throw "Null experiment" if exp.nil?
  throw "Null indvars" if exp.indVars.nil?
  throw "Null depvars" if exp.depVars.nil?
  numfacs = exp.indVars.length
  numdeps = exp.depVars.length
  throw "Not enough indvars" if numfacs < 1
  throw "Not enough depvars" if numdeps < 1

  if numfacs==1
    # Short cut for single-factor experiments.
    bagOfHolding.data.select{|k,v| k.experimentId==id}.keys
  else
    # Enumerate the levels of all of the factors while adding all of the
    # treatments.
    r = Array.new
    levels = Array.new(numfacs){Set.new}
    bagOfHolding.data.each_key{|k|
      if k.experimentId==id
        r.push(k)
        for fac in 0..numfacs-1
          levels[fac].add(k.treat[fac])
        end
      end
    }

    treat = Array.new(numfacs,nil)
    # Main effects.  Single-factor experiments were already excluded.
    for fac in 0..numfacs-1
      treat.fill(nil)
      for lvl in levels[fac]
        treat[fac] = lvl
        r.push(Key.new(id,Array.new(treat)))
      end
    end
    # 2-way interactions.
    if numfacs > 2  # Don't duplicate the treatments when numfacs==2.
      for fac1 in 0..numfacs-2
        for fac2 in fac1+1..numfacs-1
          treat.fill(nil)
          for lvl1 in levels[fac1]
            treat[fac1] = lvl1
            for lvl2 in levels[fac2]
              treat[fac2] = lvl2
              key = Key.new(id,Array.new(treat))
              r.push(key) if matchesSomething(bagOfHolding,key)
            end
          end
        end
      end
    end
    r
  end
end
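
The main-effects step can be sketched separately from the rest of the enumeration. This is an illustrative reduction under stated assumptions: main_effect_treats is a hypothetical helper that works on bare treatment vectors rather than Key structs, and the factor levels are made up.

```ruby
require 'set'

# Collect the levels seen for each factor, then emit one pattern per
# (factor, level) pair with nil in every other position.
def main_effect_treats(treats)
  numfacs = treats.first.length
  levels = Array.new(numfacs) { Set.new }
  treats.each do |t|
    t.each_with_index { |lvl, fac| levels[fac].add(lvl) }
  end
  patterns = []
  levels.each_with_index do |lvls, fac|
    lvls.each do |lvl|
      pattern = Array.new(numfacs, nil)
      pattern[fac] = lvl
      patterns.push(pattern)
    end
  end
  patterns
end

treats = [["gcc", "-O0"], ["gcc", "-O2"], ["clang", "-O2"]]
effects = main_effect_treats(treats)
effects.each { |e| puts e.inspect }
```

Each nil acts as a wildcard when the pattern is later matched against actual treatments.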

.load(filename) ⇒ Object

Load a GECS database from a file. Returns a BagOfHolding.



# File 'lib/GECS.rb', line 290

def GECS.load(filename)
  temp = Marshal.load(File.open(filename,"r"))
  if temp.version > Version
    raise "File format version is later than GECS.rb version"
  end
  temp.bagOfHolding
end

.matchesSomething(bagOfHolding, key) ⇒ Object

(Private) Return true if a given key matches any data at all.



# File 'lib/GECS.rb', line 491

def GECS.matchesSomething(bagOfHolding,key)
  id = key.experimentId
  bagOfHolding.data.each{|k,v|
    if k.experimentId==id
      if treatEq(key.treat,k.treat)
        unless v.nil?
          return true  # Need to check every v[di] too?
        end
      end
    end
  }
  false
end

.newBag(experiments, data) ⇒ Object

Simplify creation of a new database by nilling out the enums. Since enums are looked up using a find-or-create pattern, there is no harm in starting with nils.

experiments

Array of Experiment structs (indexed by experiment id).

data

Hash from Key to array (per depVars) of arrays (measurement values in chronological order).



# File 'lib/GECS.rb', line 304

def GECS.newBag(experiments,data)
  BagOfHolding.new(nil, nil, nil, nil, nil, experiments, data, nil)
end

.parmRet(bagOfHolding, delta, pod) ⇒ Object

(Private) Create ParmData from ParPod return.



# File 'lib/GECS.rb', line 655

def GECS.parmRet(bagOfHolding,delta,pod)
  if pod.out.nil?
    nil
  else
    ParmData.new(pod.out[2].to_f, pod.out[3].to_f, pod.out[4].to_f, {
      bagOfHolding.getOrAddEstMethodParm("bootstrap replica count")=>pod.out[0].to_i,
      bagOfHolding.getOrAddEstMethodParm("numerical tolerance of adaptive bootstrap")=>delta[pod.di],
      bagOfHolding.getOrAddEstMethodParm("estimated attained precision of adaptive bootstrap")=>pod.out[1].to_f})
  end
end

.printEstMethodParms(bagOfHolding, estMethodParms) ⇒ Object

(Private) Helper method for ::describeParms.



# File 'lib/GECS.rb', line 445

def GECS.printEstMethodParms(bagOfHolding,estMethodParms)
  if estMethodParms.nil?
    puts "      nil"
  else
    estMethodParms.each{|k,v|
      print "      ", bagOfHolding.estMethodParms[k], " = ", v, "\n"
    }
  end
end

.quickInterval(data) ⇒ Object

(Private) Calculate the quick interval for the mean of some data. Returns a ParmData.



# File 'lib/GECS.rb', line 591

def GECS.quickInterval(data)
  require 'statistics2'
  if data.nil?
    nil
  else
    count = data.length
    sum = data.reduce(:+)
    mean = sum.to_f/count
    variance = data.map{|x| (mean-x)**2}.inject(:+)/(count-1.0)
    meanU = Math.sqrt(variance/count)*Statistics2::ptdist(count-1,0.975)
    ParmData.new(mean,mean-meanU,mean+meanU,nil)
  end
end
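
The interval arithmetic can be followed with a small worked example. To avoid the statistics2 dependency, this sketch hard-codes the two-sided 95% t critical value for 4 degrees of freedom (about 2.776) as an assumption; the data values are made up.

```ruby
# Two-sided 95% t critical value for 4 degrees of freedom, hard-coded as an
# assumption so that the sketch needs no external gem.
t_crit = 2.776445

data = [10.0, 12.0, 11.0, 13.0, 14.0]
count = data.length
mean = data.sum / count

# Sample variance with the n-1 denominator.
variance = data.map { |x| (x - mean)**2 }.sum / (count - 1.0)

# Half-width of the interval: standard error of the mean times t.
half_width = Math.sqrt(variance / count) * t_crit
lo = mean - half_width
hi = mean + half_width

puts format("mean=%.3f lo=%.3f hi=%.3f", mean, lo, hi)
```

The interval is symmetric about the sample mean, which is why the standard formula is listed as the "original" method with a normal approximation.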

.quickMeans(bagOfHolding, id) ⇒ Object

Add the following parameter to the data and effects of a specified experiment: mean, original method, 95% confidence. This is a quick way to summarize results when more complicated options are not needed. All values are computed as Floats with no respect for the original data type or its precision.

bagOfHolding

A BagOfHolding.

id

Experiment id.



# File 'lib/GECS.rb', line 613

def GECS.quickMeans(bagOfHolding,id)
  throw "Bag is nil" if bagOfHolding.nil?
  throw "Experiments are nil" if bagOfHolding.experiments.nil?
  throw "No such experiment" if id >= bagOfHolding.experiments.length

  exp = bagOfHolding.experiments[id]
  throw "Null experiment" if exp.nil?
  throw "Null indvars" if exp.indVars.nil?
  throw "Null depvars" if exp.depVars.nil?
  numfacs = exp.indVars.length
  numdeps = exp.depVars.length
  throw "Not enough indvars" if numfacs < 1
  throw "Not enough depvars" if numdeps < 1

  parmDef = ParmDef.new(bagOfHolding.getOrAddParm("mean"),
    bagOfHolding.getOrAddEstMethod("original"),
    {bagOfHolding.getOrAddEstMethodParm("coverage probability")=>0.95})
  meanId = bagOfHolding.getOrAddParmDef(parmDef)
  bagOfHolding.ests ||= Hash.new
  enumerateKeys(bagOfHolding,id).each{|key|
    bagOfHolding.ests[key] ||= Array.new(numdeps,nil)
    for di in 0..numdeps-1
      cell = refactorExtract(bagOfHolding,key,di)
      unless cell.nil?
        bagOfHolding.ests[key][di] ||= Hash.new
        bagOfHolding.ests[key][di][meanId] = quickInterval(cell)
      end
    end
  }
end

.quotemaybe(oneval) ⇒ Object

(Private) Helper method to quote values that aren’t numeric.

oneval

A single value to be quoted or not. Nil becomes NA.



# File 'lib/GECS.rb', line 396

def GECS.quotemaybe(oneval)
  if oneval.nil?
    "NA"
  elsif oneval.is_a?(String)
    # R 3.0.2 looks like it is re-escaping strings on input so that \\
    # turns into \\\\, yet this is the minimum amount of escaping that gets
    # everything through read.table without choking.  allowEscapes=F only
    # makes it choke even more.  The worst case seems to be when a string
    # ends with a backslash.
    "\""+oneval.gsub('\\'){'\\\\'}.gsub("\"","\\\"")+"\""
  else
    oneval
  end
end

.quotesome(treat) ⇒ Object

(Private) Helper method to quote values that aren’t numeric. Were they always well-behaved values, k.treat.join(“ ”) would suffice.

treat

An array of values that might need to be quoted.



# File 'lib/GECS.rb', line 415

def GECS.quotesome(treat)
  treat.map{|level| quotemaybe(level)}.join(" ")
end

.refactorExtract(bagOfHolding, key, di) ⇒ Object

(Private) Refactor the data of an experiment according to a specified effect and extract the data from the specified cell. If there are no nils in key.treat, this is just a slow way of doing bagOfHolding.data[key][di].

di

depvar index



# File 'lib/GECS.rb', line 474

def GECS.refactorExtract(bagOfHolding,key,di)
  id = key.experimentId
  r = nil
  bagOfHolding.data.each{|k,v|
    if k.experimentId==id
      if treatEq(key.treat,k.treat)
        unless v.nil? or v[di].nil?
          r ||= Array.new
          r.concat(v[di])
        end
      end
    end
  }
  r
end
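
The pooling behavior can be seen with a self-contained sketch. The names key_type and pool_matching are hypothetical stand-ins (the struct mimics only the experimentId and treat fields of Key), and the measurements are made up.

```ruby
# Hypothetical stand-in for the GECS Key struct (experiment id + treatment).
key_type = Struct.new(:experimentId, :treat)

# Pool the measurements of depvar di from every treatment that matches the
# pattern. A nil level in either treatment vector acts as a wildcard.
def pool_matching(data, pattern, di)
  pooled = nil
  data.each do |k, cells|
    next unless k.experimentId == pattern.experimentId
    next unless pattern.treat.zip(k.treat).all? { |a, b| a.nil? || b.nil? || a == b }
    next if cells.nil? || cells[di].nil?
    pooled ||= []
    pooled.concat(cells[di])
  end
  pooled
end

data = {
  key_type.new(0, ["gcc", "-O0"])   => [[5.1, 5.3]],
  key_type.new(0, ["gcc", "-O2"])   => [[3.2, 3.0]],
  key_type.new(0, ["clang", "-O2"]) => [[3.4]]
}

gcc_effect = pool_matching(data, key_type.new(0, ["gcc", nil]), 0)
o2_effect  = pool_matching(data, key_type.new(0, [nil, "-O2"]), 0)
none       = pool_matching(data, key_type.new(0, ["icc", nil]), 0)

puts gcc_effect.inspect
puts o2_effect.inspect
puts none.inspect
```

A pattern that matches nothing yields nil rather than an empty array, which lets callers skip that cell entirely.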

.save(filename, bagOfHolding) ⇒ Object

Save a GECS database to a file.

bagOfHolding

A BagOfHolding.



# File 'lib/GECS.rb', line 285

def GECS.save(filename, bagOfHolding)
  Marshal.dump(DoubleBag.new(Version,bagOfHolding), open(filename,"w"))
end
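
The save and load pair amounts to a Marshal round trip with a version check. The sketch below shows that pattern under stated assumptions: sketch_save and sketch_load are hypothetical names, a plain Hash stands in for the DoubleBag wrapper, and the files are opened in binary mode with the block form of File.open so that they are closed promptly.

```ruby
require 'tmpdir'

# Wrap the payload together with a format version and serialize it.
def sketch_save(filename, payload, version)
  envelope = { "version" => version, "payload" => payload }
  File.open(filename, "wb") { |f| Marshal.dump(envelope, f) }
end

# Read the envelope back and refuse files written by a newer format.
def sketch_load(filename, reader_version)
  envelope = File.open(filename, "rb") { |f| Marshal.load(f) }
  if envelope["version"] > reader_version
    raise "File format version is later than reader version"
  end
  envelope["payload"]
end

restored = nil
rejected = false
Dir.mktmpdir do |dir|
  path = File.join(dir, "sketch.dat")
  sketch_save(path, { "mean" => 12.0 }, 1)
  restored = sketch_load(path, 1)
  begin
    sketch_load(path, 0)   # a reader older than the file
  rescue RuntimeError
    rejected = true
  end
end

puts restored.inspect
puts rejected
```

As with any Marshal data, files should only be loaded from trusted sources, since Marshal.load can instantiate arbitrary objects.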

.treatEq(a, b) ⇒ Object

(Private) Equality test for treatments that implements nil as wildcard.



# File 'lib/GECS.rb', line 460

def GECS.treatEq(a,b)
  throw "Nil treatment passed to treatEq" if a.nil? or b.nil?
  throw "Length mismatch" if a.length != b.length
  a.each_index{|i|
    return false if !a[i].nil? and !b[i].nil? and a[i]!=b[i]
  }
  true
end
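
The nil-as-wildcard comparison can be tried out with a compact equivalent. This is a sketch rather than the library's method: treat_match? is a hypothetical name, and it raises ArgumentError instead of using throw.

```ruby
# Two treatment vectors match if, position by position, the levels are
# equal or at least one of them is nil (the wildcard).
def treat_match?(a, b)
  raise ArgumentError, "nil treatment" if a.nil? || b.nil?
  raise ArgumentError, "length mismatch" if a.length != b.length
  a.zip(b).all? { |x, y| x.nil? || y.nil? || x == y }
end

puts treat_match?(["gcc", nil], ["gcc", "-O2"])
puts treat_match?(["gcc", "-O2"], ["clang", "-O2"])
puts treat_match?([nil, nil], ["clang", "-O0"])
```

Because the wildcard is honored on both sides, the comparison is symmetric in its arguments.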