daru

Data Analysis in RUby

Gem Version

Introduction

daru (Data Analysis in RUby) is a library for storage, analysis and manipulation of data. It aims to be the preferred data analysis library for Ruby.

Development of daru was started to address the fragmentation of Dataframe-like classes which were created in many ruby gems as per their own needs.

This creates a hurdle in using these gems together to solve a problem. For example, calculating something in statsample and plotting the results in Nyaplot.

daru is heavily inspired by Statsample::Dataset, Nyaplot::DataFrame and the super-awesome pandas, a very mature solution in Python.

Data Structures

daru employs several data structures for storing and manipulating data:

  • Vector - A basic 1-D vector.
  • DataFrame - A 2-D matrix-like structure which is internally composed of named Vector classes.

daru data structures can be constructed by using several Ruby classes. These include Array, Hash, Matrix, NMatrix and MDArray. daru brings a uniform API for handling and manipulating data represented in any of the above Ruby classes.

Testing

Install jruby using rvm install jruby, then run jruby -S gem install mdarray, followed by bundle install. You will need to install mdarray manually because of strange gemspec file behaviour. If anyone can automate this then I'd greatly appreciate it! Then run rspec in JRuby to test for MDArray functionality.

Then switch to MRI, do a normal bundle install followed by rspec for testing everything else with NMatrix functionality.

Roadmap

  • Print dataframe in same order as fields.
  • Return rows in same order as that of fields.
  • Automate testing for both MRI and JRuby.
  • Enable creation of DataFrame by only specifying an NMatrix/MDArray in initialize. Vector naming happens automatically (alphabetic) or is specified in an Array.
  • Add support for missing values in vectors.
  • Add normal and destructive map iterators.
  • Completely test all functionality for NMatrix and MDArray.
  • Basic Data manipulation and analysis operations:
    • Different kinds of join operations
    • Dataframe/vector merge
    • Creation of correlation, covariance matrices
    • Verification of data in a vector
    • Basic vector statistics - mean, median, variance, etc.
  • Add indexing on vectors.
    • Creation of vector by supplying an index-value hash.
    • Auto generation of real numbered indices for any vector.
    • Ability to separately specify index for each element of a vector.
    • Runtime alteration of index.
  • Indexing on DataFrame.
  • Vector arithmetic - elementwise addition, subtraction, multiplication, division.
  • Transpose a dataframe.