RedAmber
A simple dataframe library for Ruby (experimental).
Requirements
gem 'red-arrow', '>= 8.0.0'
gem 'red-parquet', '>= 8.0.0' # if you use IO from/to parquet
gem 'rover-df', '~> 0.3.0' # if you use IO from/to Rover::DataFrame
Installation
Install requirements before you install Red Amber.
- Apache Arrow GLib (>= 8.0.0)
- Apache Parquet GLib (>= 8.0.0)
See Apache Arrow install document.
Minimum installation example for the latest Ubuntu is in the 'Prepare the Apache Arrow' section in ci test of Red Amber.
Add this line to your Gemfile:
gem 'red_amber'
And then execute:
bundle install
Or install it yourself as:
gem install red_amber
(From v0.1.6)
RedAmber uses TDR mode for #inspect
and #to_iruby
by default. If you prefer Table mode, please set the environment variable
RED_AMBER_OUTPUT_MODE
to "table"
. See TDR section for detail.
RedAmber::DataFrame
Represents a set of data in 2D-shape. The entity is a Red Arrow's Table object.
require 'red_amber' # require 'red-amber' is also OK.
require 'datasets-arrow'
arrow = Datasets::Penguins.new.to_arrow
penguins = RedAmber::DataFrame.new(arrow)
penguins.table
# =>
#<Arrow::Table:0x111271098 ptr=0x7f9118b3e0b0>
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.100000 18.700000 181 3750 male 2007
1 Adelie Torgersen 39.500000 17.400000 186 3800 female 2007
2 Adelie Torgersen 40.300000 18.000000 195 3250 female 2007
3 Adelie Torgersen (null) (null) (null) (null) (null) 2007
4 Adelie Torgersen 36.700000 19.300000 193 3450 female 2007
5 Adelie Torgersen 39.300000 20.600000 190 3650 male 2007
6 Adelie Torgersen 38.900000 17.800000 181 3625 female 2007
7 Adelie Torgersen 39.200000 19.600000 195 4675 male 2007
8 Adelie Torgersen 34.100000 18.100000 193 3475 (null) 2007
9 Adelie Torgersen 42.000000 20.200000 190 4250 (null) 2007
...
334 Gentoo Biscoe 46.200000 14.100000 217 4375 female 2009
335 Gentoo Biscoe 55.100000 16.000000 230 5850 male 2009
336 Gentoo Biscoe 44.500000 15.700000 217 4875 (null) 2009
337 Gentoo Biscoe 48.800000 16.200000 222 6000 male 2009
338 Gentoo Biscoe 47.200000 13.700000 214 4925 female 2009
339 Gentoo Biscoe (null) (null) (null) (null) (null) 2009
340 Gentoo Biscoe 46.800000 14.300000 215 4850 female 2009
341 Gentoo Biscoe 50.400000 15.700000 222 5750 male 2009
342 Gentoo Biscoe 45.200000 14.800000 212 5200 female 2009
343 Gentoo Biscoe 49.900000 16.100000 213 5400 male 2009
By default, RedAmber shows self by compact transposed style. This unfamiliar style (TDR) is designed for the exploratory data processing. It keeps Vectors as row vectors, shows keys and types at a glance, shows levels for the 'factor-like' variables and shows the number of abnormal values like NaN and nil.
penguins
# =>
RedAmber::DataFrame : 344 x 8 Vectors
Vectors : 5 numeric, 3 strings
# key type level data_preview
1 :species string 3 {"Adelie"=>152, "Chinstrap"=>68, "Gentoo"=>124}
2 :island string 3 {"Torgersen"=>52, "Biscoe"=>168, "Dream"=>124}
3 :bill_length_mm double 165 [39.1, 39.5, 40.3, nil, 36.7, ... ], 2 nils
4 :bill_depth_mm double 81 [18.7, 17.4, 18.0, nil, 19.3, ... ], 2 nils
5 :flipper_length_mm uint8 56 [181, 186, 195, nil, 193, ... ], 2 nils
6 :body_mass_g uint16 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
7 :sex string 3 {"male"=>168, "female"=>165, nil=>11}
8 :year uint16 3 {2007=>110, 2008=>114, 2009=>120}
DataFrame model
For example, DataFrame#pick
accepts keys as an argument and returns a sub DataFrame.
df = penguins.pick(:body_mass_g)
# =>
#<RedAmber::DataFrame : 344 x 1 Vector, 0x000000000000fa14>
Vector : 1 numeric
# key type level data_preview
1 :body_mass_g int64 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
DataFrame#assign
creates new variables (column in the table).
df.assign(:body_mass_kg => df[:body_mass_g] / 1000.0)
# =>
#<RedAmber::DataFrame : 344 x 2 Vectors, 0x000000000000fa28>
Vectors : 2 numeric
# key type level data_preview
1 :body_mass_g int64 95 [3750, 3800, 3250, nil, 3450, ... ], 2 nils
2 :body_mass_kg double 95 [3.75, 3.8, 3.25, nil, 3.45, ... ], 2 nils
DataFrame manipulating methods like pick
, drop
, slice
, remove
, rename
and assign
accept a block.
This is an exaple to eliminate observations (row in the table) containing nil.
# remove all observation contains nil
nil_removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
nil_removed.tdr
# =>
RedAmber::DataFrame : 342 x 8 Vectors
Vectors : 5 numeric, 3 strings
# key type level data_preview
1 :species string 3 {"Adelie"=>151, "Chinstrap"=>68, "Gentoo"=>123}
2 :island string 3 {"Torgersen"=>51, "Biscoe"=>167, "Dream"=>124}
3 :bill_length_mm double 164 [39.1, 39.5, 40.3, 36.7, 39.3, ... ]
4 :bill_depth_mm double 80 [18.7, 17.4, 18.0, 19.3, 20.6, ... ]
5 :flipper_length_mm int64 55 [181, 186, 195, 193, 190, ... ]
6 :body_mass_g int64 94 [3750, 3800, 3250, 3450, 3650, ... ]
7 :sex string 3 {"male"=>168, "female"=>165, ""=>9}
8 :year int64 3 {2007=>109, 2008=>114, 2009=>119}
For this frequently needed task, we can do it much simpler.
penguins.remove_nil # => same result as above
See DataFrame.md for details.
RedAmber::Vector
Class RedAmber::Vector
represents a series of data in the DataFrame.
penguins[:bill_length_mm]
# =>
#<RedAmber::Vector(:double, size=344):0x000000000000f8fc>
[39.1, 39.5, 40.3, nil, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, ... ]
Vectors accepts some functional methods from Arrow.
See Vector.md for details.
TDR
I named the data frame representation style in the model above as TDR (Transposed DataFrame Representation).
This library can be used with both TDR mode and usual Table mode.
If you set the environment variable RED_AMBER_OUTPUT_MODE
to "table"
, output style by inspect
and to_iruby
is the Table mode. Other value including nil will output TDR style.
You can switch the mode in Ruby like this.
ENV['RED_AMBER_OUTPUT_STYLE'] = 'table' # => Table mode
For more detail information about TDR, see TDR.md.
Development
git clone https://github.com/heronshoes/red_amber.git
cd red_amber
bundle install
bundle exec rake test
License
The gem is available as open source under the terms of the MIT License.