daru Lite - Data Analysis in RUby Lite

Simple, straightforward DataFrames for Ruby

Introduction

daru Lite is a library for data analysis and manipulation in Ruby.

This project started as a fork of Daru with the objective of providing:

  • a simple yet powerful interface for manipulating data using DataFrames
  • an API consistent with the one historically provided by daru
  • a focus on the core data-manipulation features: several cumbersome daru dependencies and their associated features were dropped, notably N-Matrix, GSL, R, imagemagick and all plotting libraries, so the project now has no major dependencies
  • a future-proof library that can safely be used in production

Installation

$ gem install daru_lite

or add daru Lite to your Gemfile:

$ bundle add daru_lite

Basic Usage

daru Lite exposes two major data structures: DataFrame and Vector. The Vector is a basic 1-D structure corresponding to a labelled Array, while the DataFrame, daru Lite's primary data structure, is a 2-D spreadsheet-like structure for manipulating and storing data sets.
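
To make the distinction concrete, here is a plain-Ruby sketch of the two shapes. It uses no daru Lite calls and is not how the library is implemented; `labelled` and `row_of` are hypothetical helper names chosen purely for illustration.

```ruby
# Plain-Ruby model only; daru Lite's own classes are not used here.

# A Vector can be pictured as an Array whose values are addressed by label.
# `labelled` is a hypothetical helper, not part of daru Lite.
def labelled(index, values)
  index.zip(values).to_h
end

# A DataFrame can be pictured as named columns that all share one index.
# `row_of` (hypothetical) collects the value stored under one label
# in every column.
def row_of(frame, label)
  frame.transform_values { |column| column.fetch(label) }
end

index = ['India', 'China', 'USA']
frame = {
  'Beer' => labelled(index, ['Kingfisher', 'Snow', 'Bud Light']),
  'Gallons sold' => labelled(index, [500, 400, 450])
}

p frame['Gallons sold']['China'] # one value, looked up by label
p row_of(frame, 'USA')           # one row, spanning both columns
```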

Basic DataFrame initialization.

data_frame = DaruLite::DataFrame.new(
  {
    'Beer' => ['Kingfisher', 'Snow', 'Bud Light', 'Tiger Beer', 'Budweiser'],
    'Gallons sold' => [500, 400, 450, 200, 250]
  },
  index: ['India', 'China', 'USA', 'Malaysia', 'Canada']
)
data_frame

Load data from CSV files.

df = DaruLite::DataFrame.from_csv('TradeoffData.csv')
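
The contents of TradeoffData.csv are not shown, so as a rough illustration of the kind of file involved, the sketch below parses a small, made-up CSV with Ruby's standard csv library rather than daru Lite. It assumes the first row holds the column names; the sample data and the `columns_from_csv` helper are hypothetical.

```ruby
require 'csv'

# Hypothetical helper: parse CSV text into a Hash of column name => values,
# treating the first row as headers and converting numeric-looking cells.
def columns_from_csv(text)
  table = CSV.parse(text, headers: true, converters: :numeric)
  table.headers.to_h { |name| [name, table[name]] }
end

# Made-up sample data, not the contents of TradeoffData.csv.
sample = <<~CSV
  Beer,Gallons sold
  Kingfisher,500
  Snow,400
CSV

p columns_from_csv(sample)
```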

Basic Data Manipulation

Selecting rows.

data_frame.row['USA']

Selecting columns.

data_frame['Beer']

A range of rows.

data_frame.row['India'..'USA']
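
A label range such as 'India'..'USA' is most naturally read as "every row from the India label through the USA label, in index order", not as an alphabetical range of strings. The plain-Ruby sketch below illustrates that reading; it is an assumption about the intended semantics rather than daru Lite's code, and `labels_between` is a hypothetical helper.

```ruby
# Hypothetical helper: return the labels lying between two labels
# (inclusive), following the order of the index rather than string order.
def labels_between(index, first_label, last_label)
  first = index.index(first_label)
  last = index.index(last_label)
  raise KeyError, 'label not in index' if first.nil? || last.nil?

  index[first..last]
end

index = ['India', 'China', 'USA', 'Malaysia', 'Canada']

# 'China' sorts before 'India' alphabetically, yet it is included because
# it sits between the two labels in the index.
p labels_between(index, 'India', 'USA')
```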

The first 2 rows.

data_frame.first(2)

The last 2 rows.

data_frame.last(2)

Adding a new column.

data_frame['Gallons produced'] = [550, 500, 600, 210, 240]

Creating a new column based on data in other columns.

data_frame['Demand supply gap'] = data_frame['Gallons produced'] - data_frame['Gallons sold']
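
Subtracting one column from another works element by element. As a plain-Ruby sketch of that arithmetic (an illustration only, not daru Lite's implementation; `subtract_by_label` is a hypothetical helper), using the figures from the examples above:

```ruby
# Hypothetical helper: subtract two labelled columns, pairing values by label.
def subtract_by_label(left, right)
  left.to_h { |label, value| [label, value - right.fetch(label)] }
end

produced = { 'India' => 550, 'China' => 500, 'USA' => 600, 'Malaysia' => 210, 'Canada' => 240 }
sold     = { 'India' => 500, 'China' => 400, 'USA' => 450, 'Malaysia' => 200, 'Canada' => 250 }

# A negative value means more was sold than produced for that label.
p subtract_by_label(produced, sold)
```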

Condition-based selection

Select countries by the number of gallons sold in each. The syntax is similar to the one defined by Arel and is built around the where clause.

data_frame.where(data_frame['Gallons sold'].lt(300))
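
Conceptually, a comparison such as lt(300) yields one true/false value per row, and where keeps the rows marked true. A plain-Ruby sketch of that idea follows; it is an illustration rather than daru Lite's implementation, and `select_by_mask` is a hypothetical helper.

```ruby
# Hypothetical helper: keep the labels whose mask entry is true.
def select_by_mask(index, mask)
  index.zip(mask).select { |_label, keep| keep }.map(&:first)
end

index = ['India', 'China', 'USA', 'Malaysia', 'Canada']
sold  = [500, 400, 450, 200, 250]

# One true/false entry per row: was less than 300 gallons sold?
mask = sold.map { |gallons| gallons < 300 }

p mask
p select_by_mask(index, mask)
```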

You can also pass a combination of boolean operations to the #where method:

data_frame.where(
  data_frame['Beer']
  .in(['Snow', 'Kingfisher', 'Tiger Beer'])
  .and(
    data_frame['Gallons produced'].gt(520).or(data_frame['Gallons produced'].lt(250))
  )
)
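
Combined conditions can be read as per-row true/false lists merged entry by entry: and keeps a row only when both sides are true, or when either side is. The plain-Ruby sketch below mirrors the query above on the same figures; it is not daru Lite's implementation, and `mask_and` and `mask_or` are hypothetical helpers.

```ruby
# Hypothetical helpers: combine two true/false masks entry by entry.
def mask_and(left, right)
  left.zip(right).map { |a, b| a && b }
end

def mask_or(left, right)
  left.zip(right).map { |a, b| a || b }
end

index    = ['India', 'China', 'USA', 'Malaysia', 'Canada']
beer     = ['Kingfisher', 'Snow', 'Bud Light', 'Tiger Beer', 'Budweiser']
produced = [550, 500, 600, 210, 240]

# Is the beer one of the three named brands?
wanted = beer.map { |name| ['Snow', 'Kingfisher', 'Tiger Beer'].include?(name) }

# Is production above 520 or below 250 gallons?
extremes = mask_or(produced.map { |g| g > 520 }, produced.map { |g| g < 250 })

# A row is kept only when both conditions hold.
keep = mask_and(wanted, extremes)

p index.select.with_index { |_label, i| keep[i] }
```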

Documentation

Docs can be found here.