# daru - Data Analysis in RUby

## Introduction

daru (Data Analysis in RUby) is a library for storage, analysis, manipulation and visualization of data in Ruby.

daru makes it easy and intuitive to process data, predominantly through two data structures: `Daru::DataFrame` and `Daru::Vector`. Written in pure Ruby, it works with all Ruby implementations. Tested with MRI 2.0, 2.1, 2.2, 2.3, and 2.4.

## Features

- Data structures:
  - Vector - A basic 1-D vector.
  - DataFrame - A 2-D spreadsheet-like structure for manipulating and storing data sets. This is daru's primary data structure.

- Compatible with IRuby notebook, statsample, statsample-glm and statsample-timeseries.
- Support for time series.
- Singly and hierarchically indexed data structures.
- Flexible and intuitive API for manipulation and analysis of data.
- Easy plotting, statistics and arithmetic.
- Plentiful iterators.
- Optional speed and space optimization on MRI with NMatrix and GSL.
- Easy splitting, aggregation and grouping of data.
- Pivot tables for quickly summarizing data.
- Import and export data from and to Excel, CSV, SQL Databases, ActiveRecord and plain text files.

## Installation

```
$ gem install daru
```

## Notebooks

#### Notebooks on most use cases

- Overview of most daru functions
- Basic Creation of Vectors and DataFrame
- Detailed Usage of Daru::Vector
- Detailed Usage of Daru::DataFrame
- Searching and combining data in daru
- Grouping, Splitting and Pivoting Data
- Usage of Categorical Data

#### Visualization

- Visualizing Data With Daru::DataFrame
- Plotting using Nyaplot
- Plotting using GnuplotRB
- Vector plotting with Gruff
- DataFrame plotting with Gruff

#### Notebooks on Time series

#### Notebooks on Indexing

### Case Studies

- Logistic Regression Analysis with daru and statsample-glm
- Finding and Plotting most heard artists from a Last.fm dataset
- Analyzing baby names with daru
- Example usage of Categorical Data
- Example usage of Categorical Index

## Blog Posts

- Data Analysis in RUby: Basic data manipulation and plotting
- Data Analysis in RUby: Splitting, sorting, aggregating data and data types
- Finding and Combining data in daru

### Time series

### Categorical Data

## Basic Usage

daru exposes two major data structures: `DataFrame` and `Vector`. The `Vector` is a basic 1-D structure corresponding to a labelled Array, while the `DataFrame`, daru's primary data structure, is a 2-D spreadsheet-like structure for manipulating and storing data sets.

Basic DataFrame initialization.

```
data_frame = Daru::DataFrame.new(
  {
    'Beer' => ['Kingfisher', 'Snow', 'Bud Light', 'Tiger Beer', 'Budweiser'],
    'Gallons sold' => [500, 400, 450, 200, 250]
  },
  index: ['India', 'China', 'USA', 'Malaysia', 'Canada']
)
data_frame
```

Load data from CSV files.

```
df = Daru::DataFrame.from_csv('TradeoffData.csv')
```

*Basic Data Manipulation*

Selecting rows.

```
data_frame.row['USA']
```

Selecting columns.

```
data_frame['Beer']
```

A range of rows.

```
data_frame.row['India'..'USA']
```

The first 2 rows.

```
data_frame.first(2)
```

The last 2 rows.

```
data_frame.last(2)
```

Adding a new column.

```
data_frame['Gallons produced'] = [550, 500, 600, 210, 240]
```

Creating a new column based on data in other columns.

```
data_frame['Demand supply gap'] = data_frame['Gallons produced'] - data_frame['Gallons sold']
```

*Condition based selection*

Selecting countries based on the number of gallons sold in each. We use a syntax similar to that defined by Arel, i.e. by using the `where` clause.

```
data_frame.where(data_frame['Gallons sold'].lt(300))
```

You can also pass a combination of boolean operations into the `#where` method:

```
data_frame.where(
  data_frame['Beer']
    .in(['Snow', 'Kingfisher', 'Tiger Beer'])
    .and(
      data_frame['Gallons produced'].gt(520).or(data_frame['Gallons produced'].lt(250))
    )
)
```

*Plotting*

Daru supports plotting of interactive graphs with nyaplot. You can easily create a plot with the `#plot` method. Here we plot the gallons sold on the Y axis and the name of the brand on the X axis in a bar graph.

```
data_frame.plot type: :bar, x: 'Beer', y: 'Gallons sold' do |plot, diagram|
  plot.x_label "Beer"
  plot.y_label "Gallons Sold"
  plot.yrange [0,600]
  plot.width 500
  plot.height 400
end
```

In addition to nyaplot, daru also supports plotting out of the box with gnuplotrb.

## Documentation

Docs can be found here.

## Contributing

Pick a feature from the Roadmap or the issue tracker or think of your own and send me a Pull Request!

For details see CONTRIBUTING.

## Acknowledgements

- Google and the Ruby Science Foundation for the Google Summer of Code 2016 grant for speed enhancements and implementation of support for categorical data. Special thanks to @lokeshh, @zverok and @agisga for their efforts.
- Google and the Ruby Science Foundation for the Google Summer of Code 2015 grant for further developing daru and integrating it with other ruby gems.
- Thank you last.fm for making user data accessible to the public.

Copyright (c) 2015, Sameer Deshmukh All rights reserved