daru

Data Analysis in RUby

Introduction

daru (Data Analysis in RUby) is a library for storage, analysis, manipulation and visualization of data.

daru is inspired by pandas, a very mature solution in Python.

daru is written in pure Ruby, so it should work with all Ruby implementations. It has been tested with MRI 2.0, 2.1 and 2.2.

Features

  • Data structures:
    • Vector - A basic 1-D vector.
    • DataFrame - A 2-D spreadsheet-like structure for manipulating and storing data sets. This is daru's primary data structure.
  • Compatible with IRuby notebook and statsample.
  • Singly and hierarchically indexed data structures.
  • Flexible and intuitive API for manipulation and analysis of data.
  • Easy plotting, statistics and arithmetic.
  • Plentiful iterators.
  • Optional speed and space optimization on MRI with NMatrix and GSL.
  • Easy splitting, aggregation and grouping of data.
  • Quick data summaries with pivot tables.
  • Import and export of data sets from and to Excel, CSV, databases and plain text files.

Notebooks

Usage

Case Studies

Blog Posts

Documentation

Docs can be found here.

Roadmap

  • Automate testing for both MRI and JRuby.
  • Enable creation of DataFrame by only specifying an NMatrix/MDArray in initialize. Vector naming happens automatically (alphabetic) or is specified in an Array.
  • Completely test all functionality for MDArray.
  • Basic Data manipulation and analysis operations:
    • DF concat
  • Option to express a DataFrame as an NMatrix or MDArray so as to use more efficient storage techniques.
  • Assignment of a column to a single number should set the entire column to that number.
  • == between daru_vector and string/number.
  • Multiple column assignment with []=
  • Multiple value assignment for vectors with []=.
  • #find_max function which evaluates a block and returns the row for which the value of the block is maximal.
  • Function to check if a value of a row/vector is within a specified range.
  • Create a new vector in map_rows if any of the already present rows don't match the one assigned in the block.
  • Sort by index.
  • Statistics on DataFrame over rows and columns.
  • Cumulative sum.
  • Time series support.
  • Calculate percentage change.
  • Have some sample data sets for users to play around with. Should be able to load these from the code itself.
  • Sorting with missing data present.
  • re_index should re-establish previous index values in the newly supplied index.
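
For contributors picking up the cumulative sum, percentage change or #find_max items, the intended behaviour can be sketched in plain Ruby. This is only an illustration on ordinary arrays, not daru's API: the helper names `cumulative_sum`, `percent_change` and `find_max` are hypothetical, and the sketch assumes no missing values are present.

```ruby
# Plain-Ruby sketch; these helpers are hypothetical and are not part of daru.
# Assumes numeric input with no missing (nil) values.

# Running total: element i of the result is the sum of values[0..i].
def cumulative_sum(values)
  total = 0
  values.map { |v| total += v }
end

# Relative change from each element to the next one.
# The first element has no predecessor, so its entry is nil.
def percent_change(values)
  return [] if values.empty?
  [nil] + values.each_cons(2).map { |prev, curr| (curr - prev) / prev.to_f }
end

# Returns the row for which the block evaluates to the largest value.
def find_max(rows, &block)
  rows.max_by(&block)
end

prices = [10, 12, 9]
p cumulative_sum(prices)  # [10, 22, 31]
p percent_change(prices)  # [nil, 0.2, -0.25]

rows = [{ name: "a", score: 1 }, { name: "b", score: 5 }]
p find_max(rows) { |row| row[:score] }
```

In daru itself these would presumably live on Vector and DataFrame and would need to respect the index and missing data, which this sketch deliberately ignores.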

Contributing

Pick a feature from the Roadmap or the issue tracker, or think of your own, and send me a Pull Request!

Acknowledgements

  • Google and the Ruby Science Foundation for the Google Summer of Code 2015 grant for further developing daru and integrating it with other Ruby gems.
  • Thank you last.fm for making user data accessible to the public.

Copyright (c) 2015, Sameer Deshmukh All rights reserved