Class: DaruLite::DataFrame

Inherits:
Object
Extended by:
Gem::Deprecate
Includes:
Maths::Arithmetic::DataFrame, Maths::Statistics::DataFrame
Defined in:
lib/daru_lite/dataframe.rb,
lib/daru_lite/extensions/which_dsl.rb

Defined Under Namespace

Modules: SetCategoricalIndexStrategy, SetMultiIndexStrategy, SetSingleIndexStrategy

Methods included from Maths::Statistics::DataFrame

#acf, #correlation, #count, #covariance, #cumsum, #describe, #ema, #max, #mean, #median, #min, #mode, #percent_change, #product, #range, #rolling_count, #rolling_max, #rolling_mean, #rolling_median, #rolling_min, #rolling_std, #rolling_variance, #standardize, #std, #sum, #variance_sample

Methods included from Maths::Arithmetic::DataFrame

#%, #*, #**, #+, #-, #/, #exp, #round, #sqrt

Constructor Details

#initialize(source = {}, opts = {}) ⇒ DataFrame

A DataFrame is essentially an Array of Vector objects. Rows are labeled by one Index object (index) and columns by a second Index object (vectors).

Arguments

  • source - Source from which the DataFrame is to be initialized. Can be a Hash of names and vectors (Array or DaruLite::Vector), an Array of Arrays, or an Array of DaruLite::Vectors.

Options

:order - An Array/DaruLite::Index/DaruLite::MultiIndex containing the order in which Vectors should appear in the DataFrame.

:index - An Array/DaruLite::Index/DaruLite::MultiIndex containing the order in which rows of the DataFrame will be named.

:name - A name for the DataFrame.

:clone - Specify as true or false. When set to false and Vector objects are passed as the source, the Vector objects will not be duplicated when creating the DataFrame. Has no effect if an Array is passed as the source, or if the passed DaruLite::Vectors have different indexes. Defaults to true.

Usage

df = DaruLite::DataFrame.new
# =>
# <DaruLite::DataFrame(0x0)>
# Creates an empty DataFrame with no rows or columns.

df = DaruLite::DataFrame.new({}, order: [:a, :b])
# => #<DaruLite::DataFrame(0x2)>
#   a   b
# Creates a DataFrame with no rows and columns :a and :b

df = DaruLite::DataFrame.new({a: [1,2,3,4], b: [6,7,8,9]}, order: [:b, :a],
  index: [:a, :b, :c, :d], name: :spider_man)

# =>
# <DaruLite::DataFrame:80766980 @name = spider_man @size = 4>
#             b          a
#  a          6          1
#  b          7          2
#  c          8          3
#  d          9          4

df = DaruLite::DataFrame.new([[1,2,3,4],[6,7,8,9]], name: :bat_man)

# =>
# #<DaruLite::DataFrame: bat_man (4x2)>
#             0          1
#  0          1          6
#  1          2          7
#  2          3          8
#  3          4          9

# Dataframe having Index name

df = DaruLite::DataFrame.new({a: [1,2,3,4], b: [6,7,8,9]}, order: [:b, :a],
  index: DaruLite::Index.new([:a, :b, :c, :d], name: 'idx_name'),
  name: :spider_man)

# =>
# <DaruLite::DataFrame:80766980 @name = spider_man @size = 4>
# idx_name            b          a
#        a          6          1
#        b          7          2
#        c          8          3
#        d          9          4

idx = DaruLite::Index.new [100, 99, 101, 1, 2], name: "s1"
# => #<DaruLite::Index(5): s1 {100, 99, 101, 1, 2}>

df = DaruLite::DataFrame.new({b: [11,12,13,14,15], a: [1,2,3,4,5],
  c: [11,22,33,44,55]},
  order: [:a, :b, :c],
  index: idx)
 # =>
 #<DaruLite::DataFrame(5x3)>
 #   s1   a   b   c
 #  100   1  11  11
 #   99   2  12  22
 #  101   3  13  33
 #    1   4  14  44
 #    2   5  15  55


# File 'lib/daru_lite/dataframe.rb', line 299

def initialize(source = {}, opts = {})
  vectors = opts[:order]
  index = opts[:index] # FIXME: just keyword args after Ruby 2.1
  @data = []
  @name = opts[:name]

  case source
  when [], {}
    create_empty_vectors(vectors, index)
  when Array
    initialize_from_array source, vectors, index, opts
  when Hash
    initialize_from_hash source, vectors, index, opts
  end

  set_size
  validate
  update
end

Dynamic Method Handling

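This class handles dynamic methods through the method_missing method. A call whose name matches a vector returns that vector, and a call whose name ends in = inserts or modifies the vector of that name.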

#method_missing(name, *args, &block) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 2251

def method_missing(name, *args, &block)
  if /(.+)=/.match?(name)
    name = name[/(.+)=/].delete('=')
    name = name.to_sym unless has_vector?(name)
    insert_or_modify_vector [name], args[0]
  elsif has_vector?(name)
    self[name]
  elsif has_vector?(name.to_s)
    self[name.to_s]
  else
    super
  end
end
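
The dispatch above can be illustrated without the library. The sketch below is a minimal stand-in, not DaruLite code: `TinyFrame` is a hypothetical class that keeps its columns in a plain Hash, and `respond_to_missing?` is added here as a general Ruby convention rather than something this page shows.

```ruby
# Hypothetical stand-in for a DataFrame: columns live in a plain Hash.
class TinyFrame
  def initialize(columns = {})
    @columns = columns
  end

  def has_vector?(name)
    @columns.key?(name)
  end

  def method_missing(name, *args, &block)
    if name.to_s.end_with?('=')
      # Setter form: frame.c = [...] creates or replaces a column.
      key = name.to_s.delete_suffix('=')
      key = key.to_sym unless has_vector?(key)
      @columns[key] = args[0]
    elsif has_vector?(name)
      @columns[name]
    elsif has_vector?(name.to_s)
      @columns[name.to_s]
    else
      super
    end
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.end_with?('=') || has_vector?(name) || has_vector?(name.to_s) || super
  end
end

frame = TinyFrame.new(a: [1, 2, 3], 'b' => [4, 5, 6])
frame.c = [7, 8, 9] # handled by the setter branch
```

Columns keyed by a Symbol or by a String are both reachable as `frame.a` and `frame.b`; any other name falls through to `super` and raises NoMethodError.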

Instance Attribute Details

#data ⇒ Object (readonly)

TOREMOVE



# File 'lib/daru_lite/dataframe.rb', line 199

def data
  @data
end

#index ⇒ Object

The index of the rows of the DataFrame



# File 'lib/daru_lite/dataframe.rb', line 202

def index
  @index
end

#name ⇒ Object (readonly)

The name of the DataFrame



# File 'lib/daru_lite/dataframe.rb', line 205

def name
  @name
end

#size ⇒ Object (readonly)

The number of rows present in the DataFrame



# File 'lib/daru_lite/dataframe.rb', line 208

def size
  @size
end

#vectors ⇒ Object

The vectors (columns) index of the DataFrame



# File 'lib/daru_lite/dataframe.rb', line 197

def vectors
  @vectors
end

Class Method Details

._load(data) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 2184

def self._load(data)
  h = Marshal.load data
  DaruLite::DataFrame.new(h[:data],
                          index: h[:index],
                          order: h[:order],
                          name: h[:name])
end

.crosstab_by_assignation(rows, columns, values) ⇒ Object

Generates a new dataset, using three vectors

  • Rows

  • Columns

  • Values

For example, you have these values

x   y   v
a   a   0
a   b   1
b   a   1
b   b   0

You obtain

id  a   b
 a  0   1
 b  1   0

Useful to process outputs from databases



# File 'lib/daru_lite/dataframe.rb', line 155

def crosstab_by_assignation(rows, columns, values)
  raise 'Three vectors should be equal size' if
    rows.size != columns.size || rows.size != values.size

  data = Hash.new do |h, col|
    h[col] = rows.factors.map { |r| [r, nil] }.to_h
  end
  columns.zip(rows, values).each { |c, r, v| data[c][r] = v }

  # FIXME: in fact, WITHOUT this line you'll obtain more "right"
  # data: with vectors having "rows" as an index...
  data = data.transform_values(&:values)
  data[:_id] = rows.factors

  DataFrame.new(data)
end
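
The pivot performed above can be followed with plain arrays. This is only a sketch of the logic: the `crosstab` helper is hypothetical, `Array#uniq` stands in for the vector's `factors`, and the result is a Hash of columns rather than a DataFrame.

```ruby
# Plain-Ruby sketch of the pivot; arrays stand in for DaruLite::Vector.
def crosstab(rows, columns, values)
  raise ArgumentError, 'Three arrays should be equal size' if
    rows.size != columns.size || rows.size != values.size

  row_labels = rows.uniq
  # Every new column starts out with nil for each row label.
  table = Hash.new do |h, col|
    h[col] = row_labels.map { |r| [r, nil] }.to_h
  end
  columns.zip(rows, values).each { |c, r, v| table[c][r] = v }

  # Flatten each column to an Array ordered by row label, then add the id column.
  result = table.transform_values(&:values)
  result[:_id] = row_labels
  result
end

x = %w[a a b b]
y = %w[a b a b]
v = [0, 1, 1, 0]
pivot = crosstab(x, y, v)
# pivot => {"a"=>[0, 1], "b"=>[1, 0], :_id=>["a", "b"]}
```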

.from_activerecord(relation, *fields) ⇒ Object

Read a dataframe from AR::Relation

USE:

# When Post model is defined as:
class Post < ActiveRecord::Base
  scope :active, -> { where.not(published_at: nil) }
end

# You can load active posts into a dataframe by:
DaruLite::DataFrame.from_activerecord(Post.active, :title, :published_at)

Parameters:

  • relation (ActiveRecord::Relation)

    An AR::Relation object from which data is loaded

  • fields (Array)

    Field names to be loaded (optional)

Returns:

  • A dataframe containing the data loaded from the relation



# File 'lib/daru_lite/dataframe.rb', line 99

def from_activerecord(relation, *fields)
  DaruLite::IO.from_activerecord relation, *fields
end

.from_csv(path, opts = {}, &block) ⇒ Object

Load data from a CSV file. Specify an optional block to grab the CSV object and pre-condition it (for example, use the `convert` or `header_convert` methods).

Arguments

  • path - Local path / Remote URL of the file to load specified as a String.

Options

Accepts the same options as the DaruLite::DataFrame constructor and CSV.open() and uses those to eventually construct the resulting DataFrame.

Verbose Description

You can pass `.from_csv` all the options that you would pass to Ruby's `CSV.read()` function, since this is what is used internally.

For example, if the columns in your CSV file are separated by something other than commas, you can use the `:col_sep` option. If you want to convert numeric values to numbers and not keep them as strings, you can use the `:converters` option and set it to `:numeric`.

The `.from_csv` function uses the following defaults for reading CSV files (that are passed into the `CSV.read()` function):

{
  :col_sep           => ',',
  :converters        => :numeric
}


# File 'lib/daru_lite/dataframe.rb', line 46

def from_csv(path, opts = {}, &block)
  DaruLite::IO.from_csv path, opts, &block
end

.from_excel(path, opts = {}, &block) ⇒ Object

Read data from an Excel file into a DataFrame.

Arguments

  • path - Path of the file to be read.

Options

:worksheet_id - ID of the worksheet that is to be read.



# File 'lib/daru_lite/dataframe.rb', line 59

def from_excel(path, opts = {}, &block)
  DaruLite::IO.from_excel path, opts, &block
end

.from_plaintext(path, fields) ⇒ Object

Read a DataFrame from a plaintext file. For this method to work, the data must be laid out in columns in a plain text file. See spec/fixtures/bank2.dat for an example.

Arguments

  • path - Path of the file to be read.

  • fields - Vector names of the resulting database.

Usage

df = DaruLite::DataFrame.from_plaintext 'spec/fixtures/bank2.dat', [:v1,:v2,:v3,:v4,:v5,:v6]


# File 'lib/daru_lite/dataframe.rb', line 115

def from_plaintext(path, fields)
  DaruLite::IO.from_plaintext path, fields
end

.from_sql(dbh, query) ⇒ Object

Execute a database query and return the result as a DataFrame.

USE:

dbh = DBI.connect("DBI:Mysql:database:localhost", "user", "password")
DaruLite::DataFrame.from_sql(dbh, "SELECT * FROM test")

#Alternatively

require 'dbi'
DaruLite::DataFrame.from_sql("path/to/sqlite.db", "SELECT * FROM test")

Parameters:

  • dbh (DBI::DatabaseHandle, String)

    A DBI connection or the path to a SQLite3 database.

  • query (String)

    The query to be executed

Returns:

  • A dataframe containing the data resulting from the query



# File 'lib/daru_lite/dataframe.rb', line 79

def from_sql(dbh, query)
  DaruLite::IO.from_sql dbh, query
end

.rows(source, opts = {}) ⇒ Object

Create DataFrame by specifying rows as an Array of Arrays or Array of DaruLite::Vector objects.

Raises:

  • (SizeError)

  • (ArgumentError)



# File 'lib/daru_lite/dataframe.rb', line 121

def rows(source, opts = {})
  raise SizeError, 'All vectors must have same length' \
    unless source.all? { |v| v.size == source.first.size }

  opts[:order] ||= guess_order(source)

  if ArrayHelper.array_of?(source, Array) || source.empty?
    DataFrame.new(source.transpose, opts)
  elsif ArrayHelper.array_of?(source, Vector)
    from_vector_rows(source, opts)
  else
    raise ArgumentError, "Can't create DataFrame from #{source}"
  end
end
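
For an Array of Arrays, the listing above checks that all rows have the same length and then transposes them into columns before handing them to the constructor. A minimal sketch of that step with plain arrays (no DaruLite objects involved, and the sample data is made up for illustration):

```ruby
# Each inner array is one row of the future DataFrame.
rows = [[1, 'x'], [2, 'y'], [3, 'z']]

# Same guard as above: every row must have the same number of cells.
raise ArgumentError, 'All rows must have same length' unless
  rows.all? { |r| r.size == rows.first.size }

# Transposing turns row-major data into the column-major layout
# that the constructor expects.
columns = rows.transpose
# columns => [[1, 2, 3], ["x", "y", "z"]]
```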

Instance Method Details

#==(other) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 2226

def ==(other)
  self.class == other.class   &&
    @size    == other.size    &&
    @index   == other.index   &&
    @vectors == other.vectors &&
    @vectors.to_a.all? { |v| self[v] == other[v] }
end

#[](*names) ⇒ Object

Access a row or vector. Specify the name of the row/vector followed by the axis (:row or :vector). The axis defaults to :vector. Using this method to access rows is not recommended; use df.row[:a] to access the row with index :a.



# File 'lib/daru_lite/dataframe.rb', line 322

def [](*names)
  axis = extract_axis(names, :vector)
  dispatch_to_axis axis, :access, *names
end

#[]=(*args) ⇒ Object

Insert a new row/vector of the specified name or modify an existing one. Instead of using this method directly, use df.row[:a] = [1,2,3] to set or create row :a as [1,2,3], or df.vector[:a] = [1,2,3] for vectors.

If a DaruLite::Vector is specified after the equals sign, the indexes of the vector are matched against the row/vector indexes of the DataFrame before the insertion is performed. Unmatched indexes are set to nil.



# File 'lib/daru_lite/dataframe.rb', line 464

def []=(*args)
  vector = args.pop
  axis = extract_axis(args)
  names = args

  dispatch_to_axis axis, :insert_or_modify, names, vector
end

#_dump(_depth) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 2175

def _dump(_depth)
  Marshal.dump(
    data: @data,
    index: @index.to_a,
    order: @vectors.to_a,
    name: @name
  )
end
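
`_dump` and the class-level `_load` form the custom serialization pair that Ruby's Marshal looks for. The round trip can be sketched with a hypothetical `TinyTable` class (not part of DaruLite) that stores a Hash of columns and a name:

```ruby
# Hypothetical class showing the _dump/_load contract used by Marshal.
class TinyTable
  attr_reader :data, :name

  def initialize(data, name: nil)
    @data = data
    @name = name
  end

  # Called by Marshal.dump; must return a String.
  def _dump(_depth)
    Marshal.dump(data: @data, name: @name)
  end

  # Called by Marshal.load with the String produced by _dump.
  def self._load(str)
    h = Marshal.load(str)
    new(h[:data], name: h[:name])
  end
end

original = TinyTable.new({ a: [1, 2], b: [3, 4] }, name: :demo)
copy = Marshal.load(Marshal.dump(original))
```

The copy is a distinct object rebuilt through the constructor, which is the same approach the DataFrame listing takes by passing the dumped data, index, order, and name back to `DataFrame.new`.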

#access_row_tuples_by_indexs(*indexes) ⇒ Array

Returns array of row tuples at given index(s)

Examples:

Using DaruLite::Index

df = DaruLite::DataFrame.new({
  a: [1, 2, 3],
  b: ['a', 'a', 'b']
})

df.access_row_tuples_by_indexs(1,2)
# => [[2, "a"], [3, "b"]]

df.index = DaruLite::Index.new([:one,:two,:three])
df.access_row_tuples_by_indexs(:one,:three)
# => [[1, "a"], [3, "b"]]

Using DaruLite::MultiIndex

mi_idx = DaruLite::MultiIndex.from_tuples [
  [:a,:one,:bar],
  [:a,:one,:baz],
  [:b,:two,:bar],
  [:a,:two,:baz],
]
df_mi = DaruLite::DataFrame.new({
  a: 1..4,
  b: 'a'..'d'
}, index: mi_idx )

df_mi.access_row_tuples_by_indexs(:b, :two, :bar)
# => [[3, "c"]]
df_mi.access_row_tuples_by_indexs(:a)
# => [[1, "a"], [2, "b"], [4, "d"]]

Parameters:

  • indexes (Array)

    index(s) at which row tuples are retrieved

Returns:

  • (Array)

    returns array of row tuples at given index(s)



# File 'lib/daru_lite/dataframe.rb', line 2340

def access_row_tuples_by_indexs(*indexes)
  return get_sub_dataframe(indexes, by_position: false).map_rows(&:to_a) if
  @index.is_a?(DaruLite::MultiIndex)

  positions = @index.pos(*indexes)
  if positions.is_a? Numeric
    row = get_rows_for([positions])
    row.first.is_a?(Array) ? row : [row]
  else
    new_rows = get_rows_for(indexes, by_position: false)
    indexes.map { |index| new_rows.map { |r| r[index] } }
  end
end

#add_level_to_vectors(top_level_label) ⇒ Object

Converts the vectors to a DaruLite::MultiIndex. The argument passed is used as the MultiIndex’s top level



# File 'lib/daru_lite/dataframe.rb', line 1697

def add_level_to_vectors(top_level_label)
  tuples = vectors.map { |label| [top_level_label, *label] }
  self.vectors = DaruLite::MultiIndex.from_tuples(tuples)
end
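
The tuple construction in the first line of the method can be tried on plain labels. In this sketch the labels and the `:top` level are made-up values; the splat is what lets a label that is already a tuple keep its existing levels.

```ruby
# Labels may be single symbols or existing tuples (arrays).
labels = [:a, :b, %i[c d]]

# Prepend the new top level; *label expands an Array and wraps a scalar.
tuples = labels.map { |label| [:top, *label] }
# tuples => [[:top, :a], [:top, :b], [:top, :c, :d]]
```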

#add_row(row, index = nil) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 472

def add_row(row, index = nil)
  self.row[*(index || @size)] = row
end

#add_vector(n, vector) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 476

def add_vector(n, vector)
  self[n] = vector
end

#add_vectors_by_split(name, join = '-', sep = DaruLite::SPLIT_TOKEN) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 1271

def add_vectors_by_split(name, join = '-', sep = DaruLite::SPLIT_TOKEN)
  self[name]
    .split_by_separator(sep)
    .each { |k, v| self[:"#{name}#{join}#{k}"] = v }
end

#add_vectors_by_split_recode(nm, join = '-', sep = DaruLite::SPLIT_TOKEN) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 2001

def add_vectors_by_split_recode(nm, join = '-', sep = DaruLite::SPLIT_TOKEN)
  self[nm]
    .split_by_separator(sep)
    .each_with_index do |(k, v), i|
      v.rename "#{nm}:#{k}"
      self[:"#{nm}#{join}#{i + 1}"] = v
    end
end

#aggregate(options = {}, multi_index_level = -1) ⇒ DaruLite::DataFrame

Function to use for aggregating the data.

Note: the `aggregate` method of the `GroupBy` class uses this `aggregate` method internally.

Examples:

df = DaruLite::DataFrame.new(
   {col: [:a, :b, :c, :d, :e], num: [52,12,07,17,01]})
=> #<DaruLite::DataFrame(5x2)>
     col num
   0   a  52
   1   b  12
   2   c   7
   3   d  17
   4   e   1

 df.aggregate(num_100_times: ->(df) { (df.num*100).first })
=> #<DaruLite::DataFrame(5x1)>
            num_100_ti
          0       5200
          1       1200
          2        700
          3       1700
          4        100

When the index contains duplicates:

idx = DaruLite::CategoricalIndex.new [:a, :b, :a, :a, :c]
df = DaruLite::DataFrame.new({num: [52,12,07,17,01]}, index: idx)
=> #<DaruLite::DataFrame(5x1)>
     num
   a  52
   b  12
   a   7
   a  17
   c   1

df.aggregate(num: :mean)
=> #<DaruLite::DataFrame(3x1)>
                   num
          a 25.3333333
          b         12
          c          1

Parameters:

  • options (Hash) (defaults to: {})

    options for the columns you want in the resultant DataFrame

Returns:

  • (DaruLite::DataFrame)



# File 'lib/daru_lite/dataframe.rb', line 2401

def aggregate(options = {}, multi_index_level = -1)
  if block_given?
    positions_tuples, new_index = yield(@index) # NOTE: use of yield is private for now
  else
    positions_tuples, new_index = group_index_for_aggregation(@index, multi_index_level)
  end

  colmn_value = aggregate_by_positions_tuples(options, positions_tuples)

  DaruLite::DataFrame.new(colmn_value, index: new_index, order: options.keys)
end
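
The duplicate-index example (`num: :mean`) boils down to grouping row positions by index label and reducing each group. The sketch below reproduces that arithmetic with plain arrays; it does not call `group_index_for_aggregation` or any other DaruLite method, and the variable names are illustrative.

```ruby
labels = %i[a b a a c]          # index with duplicates
nums   = [52, 12, 7, 17, 1]     # the :num column

# Group row positions by label, keeping first-seen label order.
positions = labels.each_index.group_by { |i| labels[i] }
# positions => {:a=>[0, 2, 3], :b=>[1], :c=>[4]}

# Reduce each group to its mean.
means = positions.transform_values do |pos|
  pos.sum { |i| nums[i] }.to_f / pos.size
end
```

Here `means[:a]` is 76 / 3, which matches the 25.3333333 shown in the example output above.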

#all?(axis = :vector, &block) ⇒ Boolean

Works like Array#all?

Examples:

Using all?

df = DaruLite::DataFrame.new({a: [1,2,3,4,5], b: ['a', 'b', 'c', 'd', 'e']})
df.all?(:row) do |row|
  row[:a] < 10
end #=> true

Parameters:

  • axis (Symbol) (defaults to: :vector)

    (:vector) The axis to iterate over. Can be :vector or :row. A DaruLite::Vector object is yielded in the block.

Returns:

  • (Boolean)


# File 'lib/daru_lite/dataframe.rb', line 1328

def all?(axis = :vector, &block)
  if %i[vector column].include?(axis)
    @data.all?(&block)
  elsif axis == :row
    each_row.all?(&block)
  else
    raise ArgumentError, "Unidentified axis #{axis}"
  end
end

#any?(axis = :vector, &block) ⇒ Boolean

Works like Array#any?.

Examples:

Using any?

df = DaruLite::DataFrame.new({a: [1,2,3,4,5], b: ['a', 'b', 'c', 'd', 'e']})
df.any?(:row) do |row|
  row[:a] < 3 and row[:b] == 'b'
end #=> true

Parameters:

  • axis (Symbol) (defaults to: :vector)

    (:vector) The axis to iterate over. Can be :vector or :row. A DaruLite::Vector object is yielded in the block.

Returns:

  • (Boolean)


# File 'lib/daru_lite/dataframe.rb', line 1306

def any?(axis = :vector, &block)
  if %i[vector column].include?(axis)
    @data.any?(&block)
  elsif axis == :row
    each_row do |row|
      return true if yield(row)
    end
    false
  else
    raise ArgumentError, "Unidentified axis #{axis}"
  end
end
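
The difference between the two axes for `all?` and `any?` can be seen with plain Ruby collections. In this sketch a Hash of arrays stands in for the DataFrame and each row is rebuilt as a small Hash; none of it is DaruLite API.

```ruby
columns = { a: [1, 2, 3, 4, 5], b: %w[a b c d e] }

# Row axis: rebuild each row as {column_name => value}.
rows = columns.values.transpose.map { |vals| columns.keys.zip(vals).to_h }

all_small = rows.all? { |row| row[:a] < 10 }
any_match = rows.any? { |row| row[:a] < 3 && row[:b] == 'b' }

# Vector axis: the block sees a whole column at a time.
any_integer_column = columns.values.any? { |col| col.all?(Integer) }
```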

#apply_method(method, keys: nil, by_position: true) ⇒ Object Also known as: apply_method_on_sub_df



# File 'lib/daru_lite/dataframe.rb', line 957

def apply_method(method, keys: nil, by_position: true)
  df = keys ? get_sub_dataframe(keys, by_position: by_position) : self

  case method
  when Symbol then df.send(method)
  when Proc   then method.call(df)
  when Array  then method.map(&:to_proc).map { |proc| proc.call(df) } # works with Array of both Symbol and/or Proc
  else raise
  end
end
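
The `case` dispatch above accepts a Symbol, a Proc, or an Array mixing both. A minimal sketch of the same pattern on a plain Array follows; the `apply` helper is hypothetical, and it raises an explicit ArgumentError in the fallback branch, which is a choice made for this sketch.

```ruby
# Dispatch on the kind of "method" given, mirroring the case statement above.
def apply(target, method)
  case method
  when Symbol then target.public_send(method)
  when Proc   then method.call(target)
  # Symbol#to_proc and Proc#to_proc make both kinds callable the same way.
  when Array  then method.map(&:to_proc).map { |fn| fn.call(target) }
  else raise ArgumentError, "Unsupported method spec: #{method.inspect}"
  end
end

data = [3, 1, 2]
by_symbol = apply(data, :max)                       # => 3
by_proc   = apply(data, ->(a) { a.sum })            # => 6
by_array  = apply(data, [:min, ->(a) { a.size }])   # => [1, 3]
```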

#at(*positions) ⇒ DaruLite::Vector, DaruLite::DataFrame

Retrieve vectors by position

Examples:

df = DaruLite::DataFrame.new({
  a: [1, 2, 3],
  b: ['a', 'b', 'c']
})
df.at 0
# => #<DaruLite::Vector(3)>
#       a
#   0   1
#   1   2
#   2   3

Parameters:

  • positions (Array<Integer>)

    positions of the vectors to retrieve

Returns:

  • (DaruLite::Vector, DaruLite::DataFrame)



# File 'lib/daru_lite/dataframe.rb', line 402

def at(*positions)
  if AXES.include? positions.last
    axis = positions.pop
    return row_at(*positions) if axis == :row
  end

  original_positions = positions
  positions = coerce_positions(*positions, ncols)
  validate_positions(*positions, ncols)

  if positions.is_a? Integer
    @data[positions].dup
  else
    DaruLite::DataFrame.new positions.map { |pos| @data[pos].dup },
                            index: @index,
                            order: @vectors.at(*original_positions),
                            name: @name
  end
end

#bootstrap(n = nil) ⇒ DaruLite::DataFrame

Creates a DataFrame of n rows drawn at random, with replacement, from this DataFrame. If n is not given, the original number of rows is used.

Returns:

  • (DaruLite::DataFrame)



# File 'lib/daru_lite/dataframe.rb', line 1052

def bootstrap(n = nil)
  n ||= nrows
  DaruLite::DataFrame.new({}, order: @vectors).tap do |df_boot|
    n.times do
      df_boot.add_row(row[rand(n)])
    end
    df_boot.update
  end
end
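
Bootstrap resampling means drawing rows at random with replacement. The sketch below shows the idea on plain arrays; the seeded `Random` instance and the variable names are assumptions made so the run is repeatable, and positions are drawn from the number of available rows so that n may differ from it.

```ruby
rows = [[1, 'a'], [2, 'b'], [3, 'c']]
rng  = Random.new(42) # fixed seed, only to make the sketch repeatable
n    = 5

# Draw n row positions independently; the same row may appear more than once.
sample = Array.new(n) { rows[rng.rand(rows.size)] }
```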

#clone(*vectors_to_clone) ⇒ Object

Returns a ‘view’ of the DataFrame, i.e. the object IDs of the vectors are preserved.

Arguments

vectors_to_clone - Names of vectors to clone. Optional. Will return a view of the whole data frame otherwise.



# File 'lib/daru_lite/dataframe.rb', line 542

def clone(*vectors_to_clone)
  vectors_to_clone.flatten! if ArrayHelper.array_of?(vectors_to_clone, Array)
  vectors_to_clone = @vectors.to_a if vectors_to_clone.empty?

  h = vectors_to_clone.map { |vec| [vec, self[vec]] }.to_h
  DaruLite::DataFrame.new(h, clone: false, order: vectors_to_clone, name: @name)
end

#clone_only_valid ⇒ Object

Returns a ‘shallow’ copy of DataFrame if missing data is not present, or a full copy of only valid data if missing data is present.



# File 'lib/daru_lite/dataframe.rb', line 552

def clone_only_valid
  if include_values?(*DaruLite::MISSING_VALUES)
    reject_values(*DaruLite::MISSING_VALUES)
  else
    clone
  end
end

#clone_structure ⇒ Object

Only clone the structure of the DataFrame.



# File 'lib/daru_lite/dataframe.rb', line 531

def clone_structure
  DaruLite::DataFrame.new([], order: @vectors.dup, index: @index.dup, name: @name)
end

#collect(axis = :vector, &block) ⇒ Object

Iterate over rows or vectors and return the results in a DaruLite::Vector. Specify the axis with :vector or :row. Defaults to :vector.

Description

The #collect iterator works similarly to #map, the only difference being that it returns a DaruLite::Vector comprising the results of each block run. The resultant Vector has the same index as the axis over which collect has iterated. It also accepts the optional axis argument.

Arguments

  • axis - The axis to iterate over. Can be :vector (or :column) or :row. Defaults to :vector.



# File 'lib/daru_lite/dataframe.rb', line 796

def collect(axis = :vector, &block)
  dispatch_to_axis_pl axis, :collect, &block
end

#collect_matrix ⇒ ::Matrix

Generate a matrix, based on vector names of the DataFrame.

Returns:

  • (::Matrix)



# File 'lib/daru_lite/dataframe.rb', line 1003

def collect_matrix
  return to_enum(:collect_matrix) unless block_given?

  vecs = vectors.to_a
  rows = vecs.collect do |row|
    vecs.collect do |col|
      yield row, col
    end
  end

  Matrix.rows(rows)
end

#collect_row_with_index(&block) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 977

def collect_row_with_index(&block)
  return to_enum(:collect_row_with_index) unless block

  DaruLite::Vector.new(each_row_with_index.map(&block), index: @index)
end

#collect_rows(&block) ⇒ Object

Retrieves a DaruLite::Vector, based on the result of calculation performed on each row.



# File 'lib/daru_lite/dataframe.rb', line 971

def collect_rows(&block)
  return to_enum(:collect_rows) unless block

  DaruLite::Vector.new(each_row.map(&block), index: @index)
end

#collect_vector_with_index(&block) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 991

def collect_vector_with_index(&block)
  return to_enum(:collect_vector_with_index) unless block

  DaruLite::Vector.new(each_vector_with_index.map(&block), index: @vectors)
end

#collect_vectors(&block) ⇒ Object

Retrieves a DaruLite::Vector, based on the result of calculation performed on each vector.



# File 'lib/daru_lite/dataframe.rb', line 985

def collect_vectors(&block)
  return to_enum(:collect_vectors) unless block

  DaruLite::Vector.new(each_vector.map(&block), index: @vectors)
end

#compute(text, &block) ⇒ Object

Returns a vector computed from a string expression that refers to the vectors of the DataFrame.

The expression is evaluated with instance_eval, so any valid Ruby expression can be used.

For example:

a = DaruLite::Vector.new [1,2]
b = DaruLite::Vector.new [3,4]
ds = DaruLite::DataFrame.new({:a => a,:b => b})
ds.compute("a+b")
# => Vector [4,6]


# File 'lib/daru_lite/dataframe.rb', line 1195

def compute(text, &block)
  return instance_eval(&block) if block

  instance_eval(text)
end
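
Because `compute` relies on `instance_eval`, names in the expression resolve against the receiver, which is how `a` and `b` reach the vectors. The mechanism can be sketched with a hypothetical `ExprFrame` class whose columns are plain arrays (so element-wise addition is spelled out with `zip`, unlike DaruLite::Vector arithmetic):

```ruby
# Hypothetical class: exposes Hash keys as methods, then evaluates
# expressions in the context of the instance.
class ExprFrame
  def initialize(columns)
    @columns = columns
  end

  def method_missing(name, *args)
    @columns.key?(name) ? @columns[name] : super
  end

  def respond_to_missing?(name, include_private = false)
    @columns.key?(name) || super
  end

  def compute(text = nil, &block)
    return instance_eval(&block) if block

    instance_eval(text)
  end
end

frame = ExprFrame.new(a: [1, 2], b: [3, 4])
sums  = frame.compute('a.zip(b).map(&:sum)') # => [4, 6]
count = frame.compute { a.size }             # => 2
```

Since the string is executed as Ruby code, it should only come from a trusted source.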

#concat(other_df) ⇒ Object

Concatenate another DataFrame along corresponding columns. If columns do not exist in both dataframes, they are filled with nils



# File 'lib/daru_lite/dataframe.rb', line 1481

def concat(other_df)
  vectors = (@vectors.to_a + other_df.vectors.to_a).uniq

  data = vectors.map do |v|
    get_vector_anyways(v).dup.concat(other_df.get_vector_anyways(v))
  end

  DaruLite::DataFrame.new(data, order: vectors)
end
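
The nil-filling behaviour can be traced with two Hashes of arrays. This is a sketch of the logic only: `concat_columns` is a hypothetical helper, and it returns a Hash instead of building a DataFrame.

```ruby
# Concatenate two column sets; a column missing on one side is padded with nils.
def concat_columns(left, right)
  left_size  = left.values.first.to_a.size
  right_size = right.values.first.to_a.size

  (left.keys + right.keys).uniq.map do |name|
    top    = left.fetch(name)  { Array.new(left_size) }
    bottom = right.fetch(name) { Array.new(right_size) }
    [name, top + bottom]
  end.to_h
end

first  = { a: [1, 2], b: [3, 4] }
second = { b: [5], c: [6] }
merged = concat_columns(first, second)
# merged => {:a=>[1, 2, nil], :b=>[3, 4, 5], :c=>[nil, nil, 6]}
```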

#create_sql(table, charset = 'UTF8') ⇒ Object

Create a SQL CREATE TABLE statement based on the DataFrame.

Arguments

  • table - String specifying the name of the table that will be created in SQL.

  • charset - Character set. Default is “UTF8”.

Examples:


ds = DaruLite::DataFrame.new({
 :id   => DaruLite::Vector.new([1,2,3,4,5]),
 :name => DaruLite::Vector.new(%w{Alex Peter Susan Mary John})
})
ds.create_sql('names')
 #=>"CREATE TABLE names (id INTEGER,\n name VARCHAR (255)) CHARACTER SET=UTF8;"


# File 'lib/daru_lite/dataframe.rb', line 2026

def create_sql(table, charset = 'UTF8')
  sql    = "CREATE TABLE #{table} ("
  fields = vectors.to_a.collect do |f|
    v = self[f]
    "#{f} #{v.db_type}"
  end

  sql + fields.join(",\n ") + ") CHARACTER SET=#{charset};"
end
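
The string assembly above can be reproduced without vectors by supplying the column types directly. In this sketch `create_table_sql` is a hypothetical helper and the type strings are given by hand, whereas the real method asks each vector for its `db_type`.

```ruby
# Build a CREATE TABLE statement from {column_name => sql_type}.
def create_table_sql(table, fields, charset = 'UTF8')
  columns = fields.map { |name, type| "#{name} #{type}" }
  "CREATE TABLE #{table} (" + columns.join(",\n ") + ") CHARACTER SET=#{charset};"
end

sql = create_table_sql('names', { id: 'INTEGER', name: 'VARCHAR (255)' })
# sql => "CREATE TABLE names (id INTEGER,\n name VARCHAR (255)) CHARACTER SET=UTF8;"
```

The table and column names are interpolated as they are, so they need to be trusted or validated before the statement is executed.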

#delete_row(index) ⇒ Object

Delete a row

Raises:

  • (IndexError)


# File 'lib/daru_lite/dataframe.rb', line 1035

def delete_row(index)
  idx = named_index_for index

  raise IndexError, "Index #{index} does not exist." unless @index.include? idx

  @index = DaruLite::Index.new(@index.to_a - [idx])
  each_vector do |vector|
    vector.delete_at idx
  end

  set_size
end

#delete_vector(vector) ⇒ Object

Delete a vector

Raises:

  • (IndexError)


# File 'lib/daru_lite/dataframe.rb', line 1018

def delete_vector(vector)
  raise IndexError, "Vector #{vector} does not exist." unless @vectors.include?(vector)

  @data.delete_at @vectors[vector]
  @vectors = DaruLite::Index.new @vectors.to_a - [vector]

  self
end

#delete_vectors(*vectors) ⇒ Object

Deletes a list of vectors



# File 'lib/daru_lite/dataframe.rb', line 1028

def delete_vectors(*vectors)
  Array(vectors).each { |vec| delete_vector vec }

  self
end

#dup(vectors_to_dup = nil) ⇒ Object

Duplicate the DataFrame entirely.

Arguments

  • vectors_to_dup - An Array specifying the names of Vectors to be duplicated. Will duplicate the entire DataFrame if not specified.



# File 'lib/daru_lite/dataframe.rb', line 521

def dup(vectors_to_dup = nil)
  vectors_to_dup ||= @vectors.to_a

  src = vectors_to_dup.map { |vec| @data[@vectors.pos(vec)].dup }
  new_order = DaruLite::Index.new(vectors_to_dup)

  DaruLite::DataFrame.new src, order: new_order, index: @index.dup, name: @name, clone: true
end

#dup_only_valid(vecs = nil) ⇒ Object

Creates a new duplicate dataframe containing only rows without a single missing value.



# File 'lib/daru_lite/dataframe.rb', line 562

def dup_only_valid(vecs = nil)
  rows_with_nil = @data.map { |vec| vec.indexes(*DaruLite::MISSING_VALUES) }
                       .inject(&:concat)
                       .uniq

  row_indexes = @index.to_a
  (vecs.nil? ? self : dup(vecs)).row[*(row_indexes - rows_with_nil)]
end
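
The row selection above collects every row position that holds a missing value in any column and keeps the rest. A plain-Ruby sketch of the same idea follows; it treats only nil as missing, which is an assumption, since DaruLite::MISSING_VALUES may cover more than nil.

```ruby
columns = { a: [1, nil, 3, 4], b: ['x', 'y', nil, 'z'] }
size    = columns.values.first.size

# Positions where any column holds a missing value.
bad = columns.values
             .flat_map { |col| col.each_index.select { |i| col[i].nil? } }
             .uniq

# Keep only the remaining positions, in order, for every column.
keep  = (0...size).to_a - bad
valid = columns.transform_values { |col| col.values_at(*keep) }
# valid => {:a=>[1, 4], :b=>["x", "z"]}
```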

#each(axis = :vector, &block) ⇒ Object

Iterate over each row or vector of the DataFrame. Specify the axis by passing :vector or :row as the argument. Defaults to :vector.

Description

`#each` works exactly like Array#each. The default mode for `each` is to iterate over the columns of the DataFrame. To iterate over rows you must pass the axis, i.e. `:row`, as an argument.

Arguments

  • axis - The axis to iterate over. Can be :vector (or :column) or :row. Defaults to :vector.



# File 'lib/daru_lite/dataframe.rb', line 777

def each(axis = :vector, &block)
  dispatch_to_axis axis, :each, &block
end

#each_index(&block) ⇒ Object

Iterate over each index of the DataFrame.



# File 'lib/daru_lite/dataframe.rb', line 711

def each_index(&block)
  return to_enum(:each_index) unless block

  @index.each(&block)

  self
end

#each_row ⇒ Object

Iterate over each row



# File 'lib/daru_lite/dataframe.rb', line 744

def each_row
  return to_enum(:each_row) unless block_given?

  @index.size.times do |pos|
    yield row_at(pos)
  end

  self
end

#each_row_with_index ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 754

def each_row_with_index
  return to_enum(:each_row_with_index) unless block_given?

  @index.each do |index|
    yield access_row(index), index
  end

  self
end

#each_vector(&block) ⇒ Object Also known as: each_column

Iterate over each vector



# File 'lib/daru_lite/dataframe.rb', line 720

def each_vector(&block)
  return to_enum(:each_vector) unless block

  @data.each(&block)

  self
end

#each_vector_with_index ⇒ Object Also known as: each_column_with_index

Iterate over each vector along with the name of the vector



# File 'lib/daru_lite/dataframe.rb', line 731

def each_vector_with_index
  return to_enum(:each_vector_with_index) unless block_given?

  @vectors.each do |vector|
    yield @data[@vectors[vector]], vector
  end

  self
end

#filter(axis = :vector, &block) ⇒ Object

Retain vectors or rows if the block returns a truthy value.

Description

For filtering out certain rows/vectors based on their values, use the #filter method. By default it iterates over vectors and keeps those vectors for which the block returns true. It accepts an optional axis argument which lets you specify whether you want to iterate over vectors or rows.

Arguments

  • axis - The axis to map over. Can be :vector (or :column) or :row. Defaults to :vector.

Usage

# Filter vectors

df.filter do |vector|
  vector.type == :numeric and vector.median < 50
end

# Filter rows

df.filter(:row) do |row|
  row[:a] + row[:d] < 100
end


# File 'lib/daru_lite/dataframe.rb', line 885

def filter(axis = :vector, &block)
  dispatch_to_axis_pl axis, :filter, &block
end

#filter_rows ⇒ Object

Iterates over each row and retains it in a new DataFrame if the block returns true for that row.



# File 'lib/daru_lite/dataframe.rb', line 1081

def filter_rows
  return to_enum(:filter_rows) unless block_given?

  keep_rows = @index.map { |index| yield access_row(index) }

  where keep_rows
end

#filter_vector(vec, &block) ⇒ Object

Creates a new vector with the data of a given field, keeping only the rows for which the block returns true.



# File 'lib/daru_lite/dataframe.rb', line 1075

def filter_vector(vec, &block)
  DaruLite::Vector.new(each_row.select(&block).map { |row| row[vec] })
end

#filter_vectors(&block) ⇒ Object

Iterates over each vector and retains it in a new DataFrame if the block returns true for that vector.



# File 'lib/daru_lite/dataframe.rb', line 1091

def filter_vectors(&block)
  return to_enum(:filter_vectors) unless block

  dup.tap { |df| df.keep_vector_if(&block) }
end

#get_sub_dataframe(keys, by_position: true) ⇒ DaruLite::DataFrame

Extract a DataFrame given row indexes or positions.

Parameters:

  • keys (Array)

    can be positions (if by_position is true) or indexes (if by_position is false)

Returns:

  • (DaruLite::DataFrame)


# File 'lib/daru_lite/dataframe.rb', line 504

def get_sub_dataframe(keys, by_position: true)
  return DaruLite::DataFrame.new({}) if keys == []

  keys = @index.pos(*keys) unless by_position

  sub_df = row_at(*keys)
  sub_df = sub_df.to_df.transpose if sub_df.is_a?(DaruLite::Vector)

  sub_df
end

#get_vector_anyways(v) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 1475

def get_vector_anyways(v)
  @vectors.include?(v) ? self[v].to_a : Array.new(size)
end

#group_by(*vectors) ⇒ Object

Group elements by vector to perform operations on them. Returns a DaruLite::Core::GroupBy object. See the DaruLite::Core::GroupBy docs for a detailed list of possible operations.

Arguments

  • vectors - An Array containing names of vectors to group by.

Usage

df = DaruLite::DataFrame.new({
  a: %w{foo bar foo bar   foo bar foo foo},
  b: %w{one one two three two two one three},
  c:   [1  ,2  ,3  ,1    ,3  ,6  ,3  ,8],
  d:   [11 ,22 ,33 ,44   ,55 ,66 ,77 ,88]
})
df.group_by([:a,:b,:c]).groups
#=> {["bar", "one", 2]=>[1],
# ["bar", "three", 1]=>[3],
# ["bar", "two", 6]=>[5],
# ["foo", "one", 1]=>[0],
# ["foo", "one", 3]=>[6],
# ["foo", "three", 8]=>[7],
# ["foo", "two", 3]=>[2, 4]}

Raises:

  • (ArgumentError)


# File 'lib/daru_lite/dataframe.rb', line 1453

def group_by(*vectors)
  vectors.flatten!
  missing = vectors - @vectors.to_a
  raise(ArgumentError, "Vector(s) missing: #{missing.join(', ')}") unless missing.empty?

  vectors = [@vectors.first] if vectors.empty?

  DaruLite::Core::GroupBy.new(self, vectors)
end
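
To make the shape of the groups hash in the usage example concrete, here is a plain-Ruby sketch that groups row positions by the tuple of values in the chosen columns. It assumes nothing about how DaruLite::Core::GroupBy is implemented; the Hash-of-arrays layout and the variable names are illustrative.

```ruby
# Same data as the usage example, held as plain arrays (not a DataFrame).
columns = {
  a: %w[foo bar foo bar foo bar foo foo],
  b: %w[one one two three two two one three],
  c: [1, 2, 3, 1, 3, 6, 3, 8]
}
keys = %i[a b c]
nrows = columns[:a].size

# Group row positions by the tuple of values found in the key columns.
groups = (0...nrows).group_by { |i| keys.map { |k| columns[k][i] } }

# Sort by key tuple only to make the printed order easy to compare.
sorted = groups.sort.to_h
sorted.each { |key, positions| puts "#{key.inspect} => #{positions.inspect}" }
```

Rows 2 and 4 share the tuple ["foo", "two", 3], which is why that key maps to two positions while every other key maps to one.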

#group_by_and_aggregate(*group_by_keys, **aggregation_map) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 2413

def group_by_and_aggregate(*group_by_keys, **aggregation_map)
  group_by(*group_by_keys).aggregate(aggregation_map)
end

#has_missing_data? ⇒ Boolean Also known as: flawed?

Returns:

  • (Boolean)


# File 'lib/daru_lite/dataframe.rb', line 1218

def has_missing_data?
  @data.any? { |vec| vec.include_values?(*DaruLite::MISSING_VALUES) }
end

#has_vector?(vector) ⇒ Boolean

Check if a vector is present

Returns:

  • (Boolean)


# File 'lib/daru_lite/dataframe.rb', line 1293

def has_vector?(vector)
  @vectors.include? vector
end

#head(quantity = 10) ⇒ Object Also known as: first

The first quantity rows of the DataFrame (ten by default).

Parameters:

  • quantity (Fixnum) (defaults to: 10)

    (10) The number of elements to display from the top.



# File 'lib/daru_lite/dataframe.rb', line 1341

def head(quantity = 10)
  row.at 0..(quantity - 1)
end

#include_values?(*values) ⇒ true, false

Check if any of given values occur in the data frame

Examples:

df = DaruLite::DataFrame.new({
  a: [1,    2,          3,   nil,        Float::NAN, nil, 1,   7],
  b: [:a,  :b,          nil, Float::NAN, nil,        3,   5,   8],
  c: ['a',  Float::NAN, 3,   4,          3,          5,   nil, 7]
}, index: 11..18)
df.include_values? nil
# => true

Parameters:

  • values (Array)

    to check for

Returns:

  • (true, false)

    true if any of the given values occur in the dataframe, false otherwise



# File 'lib/daru_lite/dataframe.rb', line 1237

def include_values?(*values)
  @data.any? { |vec| vec.include_values?(*values) }
end

#insert_vector(n, name, source) ⇒ Object

Raises:

  • (ArgumentError)


# File 'lib/daru_lite/dataframe.rb', line 480

def insert_vector(n, name, source)
  raise ArgumentError unless source.is_a? Array

  vector = DaruLite::Vector.new(source, index: @index, name: @name)
  @data << vector
  @vectors = @vectors.add name
  ordr = @vectors.dup.to_a
  elmnt = ordr.pop
  ordr.insert n, elmnt
  self.order = ordr
end
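
The reordering at the end of the source above appends the new name and then moves it to position n. A minimal sketch of that list manipulation on a plain Array, with illustrative names (order, name, n) and no DaruLite objects:

```ruby
# Existing vector order and the vector to insert (illustrative values).
order = %i[a b c]
name = :x
n = 1

# The new vector is first appended at the end...
order += [name]

# ...then popped off and re-inserted at the requested position.
reordered = order.dup
last = reordered.pop
reordered.insert(n, last)

p reordered
```

With n = 1 the new name lands between :a and :b; the relative order of the existing names is unchanged.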

#inspect(spacing = DaruLite.spacing, threshold = DaruLite.max_rows) ⇒ Object

Pretty print in a nice table format for the command line (irb/pry/iruby)



# File 'lib/daru_lite/dataframe.rb', line 2204

def inspect(spacing = DaruLite.spacing, threshold = DaruLite.max_rows)
  name_part = @name ? ": #{@name} " : ''
  spacing = [
    headers.to_a.map { |header| header.try(:length) || header.to_s.length }.max,
    spacing
  ].max

  "#<#{self.class}#{name_part}(#{nrows}x#{ncols})>#{$INPUT_RECORD_SEPARATOR}" +
    Formatters::Table.format(
      each_row.lazy,
      row_headers: row_headers,
      headers: headers,
      threshold: threshold,
      spacing: spacing
    )
end

#interact_code(vector_names, full) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 2269

def interact_code(vector_names, full)
  dfs = vector_names.zip(full).map do |vec_name, f|
    self[vec_name].contrast_code(full: f).each.to_a
  end

  all_vectors = recursive_product(dfs)
  DaruLite::DataFrame.new all_vectors,
                          order: all_vectors.map(&:name)
end

#join(other_df, opts = {}) ⇒ DaruLite::DataFrame

Join 2 DataFrames with SQL style joins. Currently supports inner, left outer, right outer and full outer joins.

Examples:

Inner Join

left = DaruLite::DataFrame.new({
  :id   => [1,2,3,4],
  :name => ['Pirate', 'Monkey', 'Ninja', 'Spaghetti']
})
right = DaruLite::DataFrame.new({
  :id => [1,2,3,4],
  :name => ['Rutabaga', 'Pirate', 'Darth Vader', 'Ninja']
})
left.join(right, how: :inner, on: [:name])
#=>
##<DaruLite::DataFrame:82416700 @name = 74c0811b-76c6-4c42-ac93-e6458e82afb0 @size = 2>
#                 id_1       name       id_2
#         0          1     Pirate          2
#         1          3      Ninja          4

Parameters:

  • other_df (DaruLite::DataFrame)

    Another DataFrame on which the join is to be performed.

  • opts (Hash) (defaults to: {})

    Options Hash

  • :how (Hash)

    a customizable set of options

  • :on (Hash)

    a customizable set of options

  • :indicator (Hash)

    a customizable set of options

Returns:

  • (DaruLite::DataFrame)


# File 'lib/daru_lite/dataframe.rb', line 1949

def join(other_df, opts = {})
  DaruLite::Core::Merge.join(self, other_df, opts)
end

#keep_row_if ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 1062

def keep_row_if
  @index
    .reject { |idx| yield access_row(idx) }
    .each { |idx| delete_row idx }
end

#keep_vector_if ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 1068

def keep_vector_if
  @vectors.each do |vector|
    delete_vector(vector) unless yield(@data[@vectors[vector]], vector)
  end
end

#map(axis = :vector, &block) ⇒ Object

Map over each vector or row of the data frame according to the argument specified. Will return an Array of the resulting elements. To map over each row/vector and get a DataFrame, see #recode.

Description

The #map iterator works like Array#map. The value returned by each run of the block is added to an Array and the Array is returned. This method also accepts an axis argument, like #each. The default is :vector.

Arguments

  • axis - The axis to map over. Can be :vector (or :column) or :row.

Defaults to :vector.



# File 'lib/daru_lite/dataframe.rb', line 816

def map(axis = :vector, &block)
  dispatch_to_axis_pl axis, :map, &block
end

#map!(axis = :vector, &block) ⇒ Object

Destructive map. Modifies the DataFrame. Each run of the block must return a DaruLite::Vector. You can specify the axis to map over as the argument. Defaults to :vector.

Arguments

  • axis - The axis to map over. Can be :vector (or :column) or :row.

Defaults to :vector.



# File 'lib/daru_lite/dataframe.rb', line 828

def map!(axis = :vector, &block)
  if %i[vector column].include?(axis)
    map_vectors!(&block)
  elsif axis == :row
    map_rows!(&block)
  end
end

#map_rows(&block) ⇒ Object

Map each row



# File 'lib/daru_lite/dataframe.rb', line 935

def map_rows(&block)
  return to_enum(:map_rows) unless block

  each_row.map(&block)
end

#map_rows! ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 947

def map_rows!
  return to_enum(:map_rows!) unless block_given?

  index.dup.each do |i|
    row[i] = should_be_vector!(yield(row[i]))
  end

  self
end

#map_rows_with_index(&block) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 941

def map_rows_with_index(&block)
  return to_enum(:map_rows_with_index) unless block

  each_row_with_index.map(&block)
end

#map_vectors(&block) ⇒ Object

Map each vector and return an Array.



# File 'lib/daru_lite/dataframe.rb', line 910

def map_vectors(&block)
  return to_enum(:map_vectors) unless block

  @data.map(&block)
end

#map_vectors! ⇒ Object

Destructive form of #map_vectors



# File 'lib/daru_lite/dataframe.rb', line 917

def map_vectors!
  return to_enum(:map_vectors!) unless block_given?

  vectors.dup.each do |n|
    self[n] = should_be_vector!(yield(self[n]))
  end

  self
end

#map_vectors_with_index(&block) ⇒ Object

Map vectors along with the index.



# File 'lib/daru_lite/dataframe.rb', line 928

def map_vectors_with_index(&block)
  return to_enum(:map_vectors_with_index) unless block

  each_vector_with_index.map(&block)
end

#merge(other_df) ⇒ DaruLite::DataFrame

Merge vectors from two DataFrames. In case of name collision, the vector names are changed to x_1, x_2, ….

Returns:

  • (DaruLite::DataFrame)


# File 'lib/daru_lite/dataframe.rb', line 1904

def merge(other_df)
  unless nrows == other_df.nrows
    raise ArgumentError,
          "Number of rows must be equal in this: #{nrows} and other: #{other_df.nrows}"
  end

  new_fields = (@vectors.to_a + other_df.vectors.to_a)
  new_fields = ArrayHelper.recode_repeated(new_fields)
  DataFrame.new({}, order: new_fields).tap do |df_new|
    (0...nrows).each do |i|
      df_new.add_row row[i].to_a + other_df.row[i].to_a
    end
    df_new.index = @index if @index == other_df.index
    df_new.update
  end
end

#missing_values_rows(missing_values = [nil]) ⇒ Object Also known as: vector_missing_values

Return a vector with the number of missing values in each row.

Arguments

  • missing_values - An Array of the values that should be treated as ‘missing’. The default missing value is nil.



# File 'lib/daru_lite/dataframe.rb', line 1207

def missing_values_rows(missing_values = [nil])
  number_of_missing = each_row.map do |row|
    row.indexes(*missing_values).size
  end

  DaruLite::Vector.new number_of_missing, index: @index, name: "#{@name}_missing_rows"
end

#ncols ⇒ Object

The number of vectors



# File 'lib/daru_lite/dataframe.rb', line 1288

def ncols
  @vectors.size
end

#nest(*tree_keys, &block) ⇒ Object

Return a nested hash that uses the values of the given vectors as keys, with an array of hashes holding the remaining values at the innermost level. If a block is provided, it is used to compute the innermost values instead; it receives the row, the current innermost hash of the hierarchy and the name of the key to set.



# File 'lib/daru_lite/dataframe.rb', line 1245

def nest(*tree_keys, &block)
  tree_keys = tree_keys[0] if tree_keys[0].is_a? Array

  each_row.with_object({}) do |row, current|
    # Create tree
    *keys, last = tree_keys
    current = keys.inject(current) { |c, f| c[row[f]] ||= {} }
    name = row[last]

    if block
      current[name] = yield(row, current, name)
    else
      current[name] ||= []
      current[name].push(row.to_h.delete_if { |key, _value| tree_keys.include? key })
    end
  end
end
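
The tree-building loop in the source above can be followed on plain data. The sketch below ports the no-block branch to an Array of row hashes; the sample rows and the key names (country, city) are made up for illustration, and nothing here calls DaruLite.

```ruby
# Illustrative rows; in the real method these come from the DataFrame's rows.
rows = [
  { country: 'FR', city: 'Paris',  year: 2020, sales: 5 },
  { country: 'FR', city: 'Paris',  year: 2021, sales: 7 },
  { country: 'DE', city: 'Berlin', year: 2020, sales: 3 }
]
tree_keys = %i[country city]

nested = rows.each_with_object({}) do |row, root|
  # Walk (and create) one hash level per key except the last.
  *keys, last = tree_keys
  current = keys.inject(root) { |c, f| c[row[f]] ||= {} }

  # At the innermost level, collect the remaining fields of each row.
  name = row[last]
  current[name] ||= []
  current[name].push(row.reject { |key, _value| tree_keys.include?(key) })
end

p nested
```

The two Paris rows end up in the same innermost array, and the key columns themselves are dropped from the stored hashes.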

#nrows ⇒ Object

The number of rows



# File 'lib/daru_lite/dataframe.rb', line 1283

def nrows
  @index.size
end

#numeric_vector_names ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 1711

def numeric_vector_names
  @vectors.select { |v| self[v].numeric? }
end

#numeric_vectors ⇒ Object

Return the indexes of all the numeric vectors. Will include vectors with nils along with numbers.



# File 'lib/daru_lite/dataframe.rb', line 1704

def numeric_vectors
  # FIXME: Why _with_index ?..
  each_vector_with_index
    .select { |vec, _i| vec.numeric? }
    .map(&:last)
end

#one_to_many(parent_fields, pattern) ⇒ Object

Creates a new dataset for one-to-many relations on a dataset, based on a pattern of field names.

For example, suppose you have a survey on the number of children with this structure:

id, name, child_name_1, child_age_1, child_name_2, child_age_2

With

ds.one_to_many([:id], "child_%v_%n")

the fields named in the first parameter are copied verbatim to the new dataset, and the fields matching the pattern in the second parameter are added as one case for each distinct %n.

Examples:

cases=[
  ['1','george','red',10,'blue',20,nil,nil],
  ['2','fred','green',15,'orange',30,'white',20],
  ['3','alfred',nil,nil,nil,nil,nil,nil]
]
ds=DaruLite::DataFrame.rows(cases, order:
  [:id, :name,
   :car_color1, :car_value1,
   :car_color2, :car_value2,
   :car_color3, :car_value3])
ds.one_to_many([:id],'car_%v%n').to_matrix
#=> Matrix[
#   ["red", "1", 10],
#   ["blue", "1", 20],
#   ["green", "2", 15],
#   ["orange", "2", 30],
#   ["white", "2", 20]
#   ]


# File 'lib/daru_lite/dataframe.rb', line 1984

def one_to_many(parent_fields, pattern)
  vars, numbers = one_to_many_components(pattern)

  DataFrame.new([], order: [*parent_fields, '_col_id', *vars]).tap do |ds|
    each_row do |row|
      verbatim = parent_fields.map { |f| [f, row[f]] }.to_h
      numbers.each do |n|
        generated = one_to_many_row row, n, vars, pattern
        next if generated.values.all?(&:nil?)

        ds.add_row(verbatim.merge(generated).merge('_col_id' => n))
      end
    end
    ds.update
  end
end

#only_numerics(opts = {}) ⇒ Object

Return a DataFrame of only the numerical Vectors. If clone: false is specified as option, only a view of the Vectors will be returned. Defaults to clone: true.



# File 'lib/daru_lite/dataframe.rb', line 1718

def only_numerics(opts = {})
  cln = opts[:clone] != false
  arry = numeric_vectors.map { |v| self[v] }

  order = Index.new(numeric_vectors)
  DaruLite::DataFrame.new(arry, clone: cln, order: order, index: @index)
end

#order=(order_array) ⇒ Object

Reorder the vectors in a dataframe

Examples:

df = DaruLite::DataFrame.new({
  a: [1, 2, 3],
  b: [4, 5, 6]
}, order: [:a, :b])
df.order = [:b, :a]
df
# => #<DaruLite::DataFrame(3x2)>
#       b   a
#   0   4   1
#   1   5   2
#   2   6   3

Parameters:

  • order_array (Array)

    new order of the vectors

Raises:

  • (ArgumentError)


# File 'lib/daru_lite/dataframe.rb', line 1153

def order=(order_array)
  raise ArgumentError, 'Invalid order' unless vectors.to_a.tally == order_array.tally

  initialize(to_h, order: order_array)
end
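
The validity check in the source above compares tallies, so the new order must contain exactly the same vector names as the current one, in any arrangement. A small sketch of that check in isolation; valid_order? is a hypothetical helper name used only here:

```ruby
# Hypothetical helper mirroring the tally comparison in the source.
# Array#tally counts occurrences, so ordering is ignored but
# missing, extra or duplicated names are detected.
def valid_order?(current, proposed)
  current.tally == proposed.tally
end

p valid_order?(%i[a b], %i[b a])     # same names, different order
p valid_order?(%i[a b], %i[b])       # a name is missing
p valid_order?(%i[a b], %i[a b b])   # a name is duplicated
```

Only the first call passes; the other two correspond to the cases in which the setter raises ArgumentError with 'Invalid order'.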

#pivot_table(opts = {}) ⇒ Object

Pivots a data frame on specified vectors and applies an aggregate function to quickly generate a summary.

Options

:index - Keys to group by on the pivot table row index. Pass vector names contained in an Array.

:vectors - Keys to group by on the pivot table column index. Pass vector names contained in an Array.

:agg - Function to aggregate the grouped values. Defaults to :mean. Can be any of the statistics functions applicable to Vectors, which can be found in the DaruLite::Maths::Statistics::Vector module.

:values - Columns to aggregate. Will consider all numeric columns not specified in :index or :vectors. Optional.

Usage

df = DaruLite::DataFrame.new({
  a: ['foo'  ,  'foo',  'foo',  'foo',  'foo',  'bar',  'bar',  'bar',  'bar'],
  b: ['one'  ,  'one',  'one',  'two',  'two',  'one',  'one',  'two',  'two'],
  c: ['small','large','large','small','small','large','small','large','small'],
  d: [1,2,2,3,3,4,5,6,7],
  e: [2,4,4,6,6,8,10,12,14]
})
df.pivot_table(index: [:a], vectors: [:b], agg: :sum, values: :e)

#=>
# #<DaruLite::DataFrame:88342020 @name = 08cdaf4e-b154-4186-9084-e76dd191b2c9 @size = 2>
#            [:e, :one] [:e, :two]
#     [:bar]         18         26
#     [:foo]         10         12

Raises:

  • (ArgumentError)


# File 'lib/daru_lite/dataframe.rb', line 1883

def pivot_table(opts = {})
  raise ArgumentError, 'Specify grouping index' if Array(opts[:index]).empty?

  index               = opts[:index]
  vectors             = opts[:vectors] || []
  aggregate_function  = opts[:agg] || :mean
  values              = prepare_pivot_values index, vectors, opts
  raise IndexError, 'No numeric vectors to aggregate' if values.empty?

  grouped = group_by(index)
  return grouped.send(aggregate_function) if vectors.empty?

  super_hash = make_pivot_hash grouped, vectors, values, aggregate_function

  pivot_dataframe super_hash
end
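
As a cross-check of the numbers in the usage example, the same sum-by-(a, b) aggregation of e can be computed with plain Ruby hashes. This is only an arithmetic sketch under the assumption that :sum adds the values in each cell; it does not reproduce the library's grouping code or its output format.

```ruby
# Columns a, b and e from the usage example, as plain arrays.
a = %w[foo foo foo foo foo bar bar bar bar]
b = %w[one one one two two one one two two]
e = [2, 4, 4, 6, 6, 8, 10, 12, 14]

# Outer hash: row key (a). Inner hash: column key (b) with a running sum.
pivot = Hash.new { |hash, key| hash[key] = Hash.new(0) }
a.zip(b, e).each { |row_key, col_key, value| pivot[row_key][col_key] += value }

p pivot
```

The totals agree with the table shown in the usage example: 18 and 26 for bar, 10 and 12 for foo.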

#recode(axis = :vector, &block) ⇒ Object

Maps over the DataFrame and returns a DataFrame. Each run of the block must return a DaruLite::Vector object. You can specify the axis to map over. Defaults to :vector.

Description

Recode works similarly to #map, but an important difference between the two is that recode returns a modified DaruLite::DataFrame instead of an Array. For this reason, #recode expects every run of the block to return a DaruLite::Vector.

Just like map and each, recode also accepts an optional axis argument.

Arguments

  • axis - The axis to map over. Can be :vector (or :column) or :row.

Defaults to :vector.



# File 'lib/daru_lite/dataframe.rb', line 853

def recode(axis = :vector, &block)
  dispatch_to_axis_pl axis, :recode, &block
end

#recode_rows ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 899

def recode_rows
  block_given? or return to_enum(:recode_rows)

  dup.tap do |df|
    df.each_row_with_index do |r, i|
      df.row[i] = should_be_vector!(yield(r))
    end
  end
end

#recode_vectors ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 889

def recode_vectors
  block_given? or return to_enum(:recode_vectors)

  dup.tap do |df|
    df.each_vector_with_index do |v, i|
      df[*i] = should_be_vector!(yield(v))
    end
  end
end

#reindex(new_index) ⇒ Object

Change the index of the DataFrame and preserve the labels of the previous indexing. New index can be DaruLite::Index or any of its subclasses.

Examples:

Reindexing DataFrame

df = DaruLite::DataFrame.new({a: [1,2,3,4], b: [11,22,33,44]},
  index: ['a','b','c','d'])
#=>
##<DaruLite::DataFrame:83278130 @name = b19277b8-c548-41da-ad9a-2ad8c060e273 @size = 4>
#                    a          b
#         a          1         11
#         b          2         22
#         c          3         33
#         d          4         44
df.reindex DaruLite::Index.new(['b', 0, 'a', 'g'])
#=>
##<DaruLite::DataFrame:83177070 @name = b19277b8-c548-41da-ad9a-2ad8c060e273 @size = 4>
#                    a          b
#         b          2         22
#         0        nil        nil
#         a          1         11
#         g        nil        nil

Parameters:

  • new_index (DaruLite::Index)

    The new Index for reindexing the DataFrame.



# File 'lib/daru_lite/dataframe.rb', line 1587

def reindex(new_index)
  unless new_index.is_a?(DaruLite::Index)
    raise ArgumentError, 'Must pass the new index of type Index or its ' \
                         "subclasses, not #{new_index.class}"
  end

  cl = DaruLite::DataFrame.new({}, order: @vectors, index: new_index, name: @name)
  new_index.each_with_object(cl) do |idx, memo|
    memo.row[idx] = @index.include?(idx) ? row[idx] : Array.new(ncols)
  end
end
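
The lookup rule in the source above (reuse the row when the label exists, otherwise fill with nils) can be sketched on plain arrays. The data matches the reindexing example; the Hash-of-arrays layout and variable names are illustrative and no DaruLite objects are involved.

```ruby
# Current row labels and column data, as in the example above.
index = %w[a b c d]
columns = { a: [1, 2, 3, 4], b: [11, 22, 33, 44] }

# Requested labels: 'b' and 'a' exist, 0 and 'g' do not.
new_index = ['b', 0, 'a', 'g']

reindexed = columns.transform_values do |values|
  lookup = index.zip(values).to_h
  # Hash#[] returns nil for labels that were not in the old index.
  new_index.map { |label| lookup[label] }
end

p reindexed
```

Labels missing from the old index produce nil in every column, which matches the nil rows for 0 and g in the example output.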

#reindex_vectors(new_vectors) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 1463

def reindex_vectors(new_vectors)
  unless new_vectors.is_a?(DaruLite::Index)
    raise ArgumentError, 'Must pass the new index of type Index or its ' \
                         "subclasses, not #{new_vectors.class}"
  end

  cl = DaruLite::DataFrame.new({}, order: new_vectors, index: @index, name: @name)
  new_vectors.each_with_object(cl) do |vec, memo|
    memo[vec] = @vectors.include?(vec) ? self[vec] : Array.new(nrows)
  end
end

#reject_values(*values) ⇒ DaruLite::DataFrame

Returns a dataframe in which rows with any of the mentioned values are ignored.

Examples:

df = DaruLite::DataFrame.new({
  a: [1,    2,          3,   nil,        Float::NAN, nil, 1,   7],
  b: [:a,  :b,          nil, Float::NAN, nil,        3,   5,   8],
  c: ['a',  Float::NAN, 3,   4,          3,          5,   nil, 7]
}, index: 11..18)
df.reject_values nil, Float::NAN
# => #<DaruLite::DataFrame(2x3)>
#       a   b   c
#   11   1   a   a
#   18   7   8   7

Parameters:

  • values (Array)

    to reject to form the new dataframe

Returns:

  • (DaruLite::DataFrame)

    DataFrame with only the rows that do not contain the mentioned values



# File 'lib/daru_lite/dataframe.rb', line 588

def reject_values(*values)
  positions =
    size.times.to_a - @data.flat_map { |vec| vec.positions(*values) }
  # Handle the case when positions size is 1 and #row_at wouldn't return a df
  if positions.size == 1
    pos = positions.first
    row_at(pos..pos)
  else
    row_at(*positions)
  end
end
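
The position arithmetic in the source above (all positions minus the positions where any vector holds a rejected value) is sketched below on plain arrays. How the library matches NaN is not shown in this section, so the same_value? helper, which treats any two NaN floats as equal, is an assumption made for this sketch.

```ruby
# Assumption for this sketch: NaN matches NaN, even though NaN == NaN is false in Ruby.
def same_value?(x, y)
  x == y || (x.is_a?(Float) && y.is_a?(Float) && x.nan? && y.nan?)
end

# Column data from the example above, as plain arrays.
columns = {
  a: [1, 2, 3, nil, Float::NAN, nil, 1, 7],
  b: [:a, :b, nil, Float::NAN, nil, 3, 5, 8],
  c: ['a', Float::NAN, 3, 4, 3, 5, nil, 7]
}
rejected = [nil, Float::NAN]
size = 8

# Positions at which at least one column holds a rejected value.
bad_positions = columns.values.flat_map do |col|
  col.each_index.select { |i| rejected.any? { |r| same_value?(col[i], r) } }
end

# Array difference leaves the positions of the clean rows.
positions = size.times.to_a - bad_positions

p positions
```

Positions 0 and 7 remain, which correspond to index labels 11 and 18 in the example output.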

#rename(new_name) ⇒ Object Also known as: name=

Rename the DataFrame.



# File 'lib/daru_lite/dataframe.rb', line 2122

def rename(new_name)
  @name = new_name
  self
end

#rename_vectors(name_map) ⇒ Object

Renames the vectors

Arguments

  • name_map - A hash where the keys are the existing vector names and the values are the new names. If a vector is renamed to a vector name that is already in use, the existing one is overwritten.

Usage

df = DaruLite::DataFrame.new({ a: [1,2,3,4], b: [:a,:b,:c,:d], c: [11,22,33,44] })
df.rename_vectors :a => :alpha, :c => :gamma
df.vectors.to_a #=> [:alpha, :b, :gamma]


# File 'lib/daru_lite/dataframe.rb', line 1669

def rename_vectors(name_map)
  existing_targets = name_map.reject { |k, v| k == v }.values & vectors.to_a
  delete_vectors(*existing_targets)

  new_names = vectors.to_a.map { |v| name_map[v] || v }
  self.vectors = DaruLite::Index.new new_names
end

#rename_vectors!(name_map) ⇒ Object

Renames the vectors and returns itself

Arguments

  • name_map - A hash where the keys are the existing vector names and the values are the new names. If a vector is renamed to a vector name that is already in use, the existing one is overwritten.

Usage

df = DaruLite::DataFrame.new({ a: [1,2,3,4], b: [:a,:b,:c,:d], c: [11,22,33,44] })
df.rename_vectors! :a => :alpha, :c => :gamma # df


# File 'lib/daru_lite/dataframe.rb', line 1690

def rename_vectors!(name_map)
  rename_vectors(name_map)
  self
end

#replace_values(old_values, new_value) ⇒ DaruLite::DataFrame

Replace specified values with given value

Examples:

df = DaruLite::DataFrame.new({
  a: [1,    2,          3,   nil,        Float::NAN, nil, 1,   7],
  b: [:a,  :b,          nil, Float::NAN, nil,        3,   5,   8],
  c: ['a',  Float::NAN, 3,   4,          3,          5,   nil, 7]
}, index: 11..18)
df.replace_values nil, Float::NAN
# => #<DaruLite::DataFrame(8x3)>
#       a   b   c
#   11   1   a   a
#   12   2   b NaN
#   13   3 NaN   3
#   14 NaN NaN   4
#   15 NaN NaN   3
#   16 NaN   3   5
#   17   1   5 NaN
#   18   7   8   7

Parameters:

  • old_values (Array)

    values to replace with new value

  • new_value (object)

    new value to replace with

Returns:

  • (DaruLite::DataFrame)


# File 'lib/daru_lite/dataframe.rb', line 622

def replace_values(old_values, new_value)
  @data.each { |vec| vec.replace_values old_values, new_value }
  self
end

#reset_index ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 1599

def reset_index
  index_df = index.to_df
  names = index.name
  names = [names] unless names.instance_of?(Array)
  new_vectors = names + vectors.to_a
  self.index = index_df.index
  names.each do |name|
    self[name] = index_df[name]
  end
  self.order = new_vectors
  self
end

#respond_to_missing?(name, include_private = false) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/daru_lite/dataframe.rb', line 2265

def respond_to_missing?(name, include_private = false)
  name.to_s.end_with?('=') || has_vector?(name) || super
end

#rolling_fillna(direction = :forward) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 667

def rolling_fillna(direction = :forward)
  dup.rolling_fillna!(direction)
end

#rolling_fillna!(direction = :forward) ⇒ Object

Rolling fillna replaces all Float::NAN and nil values with the preceding or following value.

Examples:

df = DaruLite::DataFrame.new({
 a: [1,    2,          3,   nil,        Float::NAN, nil, 1,   7],
 b: [:a,  :b,          nil, Float::NAN, nil,        3,   5,   nil],
 c: ['a',  Float::NAN, 3,   4,          3,          5,   nil, 7]
})

# => #<DaruLite::DataFrame(8x3)>
#      a   b   c
#  0   1   a   a
#  1   2   b NaN
#  2   3 nil   3
#  3 nil NaN   4
#  4 NaN nil   3
#  5 nil   3   5
#  6   1   5 nil
#  7   7 nil   7

df.rolling_fillna(:forward)
# => #<DaruLite::DataFrame(8x3)>
#      a   b   c
#  0   1   a   a
#  1   2   b   a
#  2   3   b   3
#  3   3   b   4
#  4   3   b   3
#  5   3   3   5
#  6   1   5   5
#  7   7   5   7

Parameters:

  • direction (Symbol) (defaults to: :forward)

    (:forward, :backward) whether the replacement value is the preceding or the following one



# File 'lib/daru_lite/dataframe.rb', line 662

def rolling_fillna!(direction = :forward)
  @data.each { |vec| vec.rolling_fillna!(direction) }
  self
end
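
The per-vector fill that this method delegates to is not shown in this section. The sketch below is one way to express a forward or backward fill on a plain Array; the helper names missing? and fill are hypothetical, and leaving leading gaps as nil is a choice made for this sketch that may differ from what the library does.

```ruby
# Hypothetical helper: a value counts as missing when it is nil or a NaN float.
def missing?(value)
  value.nil? || (value.is_a?(Float) && value.nan?)
end

# Hypothetical helper: replace each missing value with the last valid value
# seen while walking in the requested direction.
def fill(values, direction = :forward)
  ordered = direction == :forward ? values : values.reverse
  last_seen = nil # sketch choice: gaps before the first valid value stay nil
  filled = ordered.map do |value|
    if missing?(value)
      last_seen
    else
      last_seen = value
    end
  end
  direction == :forward ? filled : filled.reverse
end

p fill([1, 2, 3, nil, Float::NAN, nil, 1, 7])
p fill(['a', Float::NAN, 3, 4, 3, 5, nil, 7])
p fill([nil, 5, nil], :backward)
```

The first two calls reproduce columns a and c of the forward-filled example; the third shows that a backward fill takes its value from the following element.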

#rotate_vectors(count = -1) ⇒ Object

Returns the DataFrame with its vectors rotated, so that the vector at position count becomes the first vector of the DataFrame. If the DataFrame has fewer than two vectors, it is returned unchanged.

Examples:

df = DaruLite::DataFrame.new({
  a: [1, 2, 3],
  b: [4, 5, 6],
  total: [5, 7, 9],
})
df.rotate_vectors(-1)
df
# => #<DaruLite::DataFrame(3x3)>
#       total a   b
#   0   5     1   4
#   1   7     2   5
#   2   9     3   6

Parameters:

  • count (Integer) (defaults to: -1)

    the vector at position count will be the first vector of the DataFrame.



# File 'lib/daru_lite/dataframe.rb', line 1176

def rotate_vectors(count = -1)
  return self unless vectors.many?

  self.order = vectors.to_a.rotate(count)
  self
end
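
Since the source above passes count straight to Array#rotate, the resulting vector order can be previewed on a plain Array of names. The names used here are illustrative.

```ruby
order = %i[a b total]

# Negative counts rotate to the right: the last name moves to the front.
p order.rotate(-1)

# Positive counts rotate to the left: the name at position count comes first.
p order.rotate(1)
p order.rotate(2)
```

For three names, rotate(-1) and rotate(2) give the same order, because position -1 and position 2 refer to the same element.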

#row ⇒ Object

Access a row or set/create a row. Refer to the #[] and #[]= docs for details.

Usage

df.row[:a] # access row named ':a'
df.row[:b] = [1,2,3] # set row ':b' to [1,2,3]


# File 'lib/daru_lite/dataframe.rb', line 497

def row
  DaruLite::Accessors::DataFrameByRow.new(self)
end

#row_at(*positions) ⇒ DaruLite::Vector, DaruLite::DataFrame

Retrieve rows by positions

Examples:

df = DaruLite::DataFrame.new({
  a: [1, 2, 3],
  b: ['a', 'b', 'c']
})
df.row_at 1, 2
# => #<DaruLite::DataFrame(2x2)>
#       a   b
#   1   2   b
#   2   3   c

Parameters:

  • positions (Array<Integer>)

    positions of the rows to retrieve

Returns:

  • (DaruLite::Vector, DaruLite::DataFrame)


# File 'lib/daru_lite/dataframe.rb', line 340

def row_at(*positions)
  original_positions = positions
  positions = coerce_positions(*positions, nrows)
  validate_positions(*positions, nrows)

  if positions.is_a? Integer
    row = get_rows_for([positions])
    DaruLite::Vector.new row, index: @vectors
  else
    new_rows = get_rows_for(original_positions)
    DaruLite::DataFrame.new new_rows, index: @index.at(*original_positions), order: @vectors
  end
end

#save(filename) ⇒ Object

Use marshalling to save dataframe to a file.



# File 'lib/daru_lite/dataframe.rb', line 2171

def save(filename)
  DaruLite::IO.save self, filename
end

#set_at(positions, vector) ⇒ Object

Set vectors by positions

Examples:

df = DaruLite::DataFrame.new({
  a: [1, 2, 3],
  b: ['a', 'b', 'c']
})
df.set_at [0], ['x', 'y', 'z']
df
#=> #<DaruLite::DataFrame(3x2)>
#       a   b
#   0   x   a
#   1   y   b
#   2   z   c

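Parameters:

  • positions (Array<Integer>)

    positions of the vectors to set

  • vector (Array, DaruLite::Vector)

    the values to assign

Raises:

  • (SizeError)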



# File 'lib/daru_lite/dataframe.rb', line 437

def set_at(positions, vector)
  if positions.last == :row
    positions.pop
    return set_row_at(positions, vector)
  end

  validate_positions(*positions, ncols)
  vector =
    if vector.is_a? DaruLite::Vector
      vector.reindex @index
    else
      DaruLite::Vector.new vector
    end

  raise SizeError, 'Vector length should match index length' if
    vector.size != @index.size

  positions.each { |pos| @data[pos] = vector }
end

#set_index(new_index_col, keep: false, categorical: false) ⇒ Object

Set a particular column as the new index of the DataFrame.



# File 'lib/daru_lite/dataframe.rb', line 1545

def set_index(new_index_col, keep: false, categorical: false)
  if categorical
    strategy = SetCategoricalIndexStrategy
  elsif new_index_col.respond_to?(:to_a)
    strategy = SetMultiIndexStrategy
    new_index_col = new_index_col.to_a
  else
    strategy = SetSingleIndexStrategy
  end

  unless categorical
    uniq_size = strategy.uniq_size(self, new_index_col)
    raise ArgumentError, 'All elements in new index must be unique.' if @size != uniq_size
  end

  self.index = strategy.new_index(self, new_index_col)
  strategy.delete_vector(self, new_index_col) unless keep
  self
end

#set_row_at(positions, vector) ⇒ Object

Set rows by positions

Examples:

df = DaruLite::DataFrame.new({
  a: [1, 2, 3],
  b: ['a', 'b', 'c']
})
df.set_row_at [0, 1], ['x', 'x']
df
#=> #<DaruLite::DataFrame(3x2)>
#       a   b
#   0   x   x
#   1   x   x
#   2   3   c

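Parameters:

  • positions (Array<Integer>)

    positions of the rows to set

  • vector (Array, DaruLite::Vector)

    the values to assign to each of those rows

Raises:

  • (SizeError)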



# File 'lib/daru_lite/dataframe.rb', line 369

def set_row_at(positions, vector)
  validate_positions(*positions, nrows)
  vector =
    if vector.is_a? DaruLite::Vector
      vector.reindex @vectors
    else
      DaruLite::Vector.new vector
    end

  raise SizeError, 'Vector length should match row length' if
    vector.size != @vectors.size

  @data.each_with_index do |vec, pos|
    vec.set_at(positions, vector.at(pos))
  end
  @index = @data[0].index
  set_size
end

#shape ⇒ Object

Return the number of rows and columns of the DataFrame in an Array.



# File 'lib/daru_lite/dataframe.rb', line 1278

def shape
  [nrows, ncols]
end

#sort(vector_order, opts = {}) ⇒ Object

Non-destructive version of #sort!



# File 'lib/daru_lite/dataframe.rb', line 1845

def sort(vector_order, opts = {})
  dup.sort! vector_order, opts
end

#sort!(vector_order, opts = {}) ⇒ Object

Sorts a dataframe (ascending/descending) in the given priority sequence of vectors, with or without a block.

Examples:

Sort a dataframe with a vector sequence.


df = DaruLite::DataFrame.new({a: [1,2,1,2,3], b: [5,4,3,2,1]})

df.sort [:a, :b]
# =>
# <DaruLite::DataFrame:30604000 @name = d6a9294e-2c09-418f-b646-aa9244653444 @size = 5>
#                   a          b
#        2          1          3
#        0          1          5
#        3          2          2
#        1          2          4
#        4          3          1

Sort a dataframe without a block. Here nils will be handled automatically.


df = DaruLite::DataFrame.new({a: [-3,nil,-1,nil,5], b: [4,3,2,1,4]})

df.sort([:a])
# =>
# <DaruLite::DataFrame:14810920 @name = c07fb5c7-2201-458d-b679-6a1f7ebfe49f @size = 5>
#                    a          b
#         1        nil          3
#         3        nil          1
#         0         -3          4
#         2         -1          2
#         4          5          4

Sort a dataframe with a block with nils handled automatically.


df = DaruLite::DataFrame.new({a: [nil,-1,1,nil,-1,1], b: ['aaa','aa',nil,'baaa','x',nil] })

df.sort [:b], by: {b: lambda { |a| a.length } }
# NoMethodError: undefined method `length' for nil:NilClass
# from (pry):8:in `block in __pry__'

df.sort [:b], by: {b: lambda { |a| a.length } }, handle_nils: true

# =>
# <DaruLite::DataFrame:28469540 @name = 5f986508-556f-468b-be0c-88cc3534445c @size = 6>
#                    a          b
#         2          1        nil
#         5          1        nil
#         4         -1          x
#         1         -1         aa
#         0        nil        aaa
#         3        nil       baaa

Sort a dataframe with a block with nils handled manually.


df = DaruLite::DataFrame.new({a: [nil,-1,1,nil,-1,1], b: ['aaa','aa',nil,'baaa','x',nil] })

# To print nils at the bottom one can use lambda { |a| (a.nil?)?[1]:[0,a.length] }
df.sort [:b], by: {b: lambda { |a| (a.nil?)?[1]:[0,a.length] } }, handle_nils: true

# =>
#<DaruLite::DataFrame:22214180 @name = cd7703c7-1dca-4560-840b-5ea51a852ef9 @size = 6>
#                 a          b
#      4         -1          x
#      1         -1         aa
#      0        nil        aaa
#      3        nil       baaa
#      2          1        nil
#      5          1        nil
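
The sort key used in the manual example can be tried on a plain Ruby array, without DaruLite. This is only a sketch of how such a key orders values; `values` and `nils_last` are illustrative names and not part of the library.

```ruby
# Plain-Ruby sketch: the values of vector :b from the example above.
values = ['aaa', 'aa', nil, 'baaa', 'x', nil]

# Non-nil values map to [0, length] and nils map to [1], so every nil key
# compares greater than every non-nil key and nils end up at the bottom.
nils_last = lambda { |a| a.nil? ? [1] : [0, a.length] }

sorted = values.sort_by(&nils_last)
# sorted == ["x", "aa", "aaa", "baaa", nil, nil]
```

Swapping the leading 0 and 1 in the two keys would place the nils at the top instead.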

Parameters:

  • vector_order (Array)

    The order of vector names in which the DataFrame should be sorted.

  • opts (Hash) (defaults to: {})

    The options to sort with.

Options Hash (opts):

  • :ascending (TrueClass, FalseClass, Array) — default: true

    Sort in ascending or descending order. Specify Array corresponding to order for multiple sort orders.

  • :by (Hash) — default: lambda{|a| a }

    Specify attributes of objects to be used for sorting, for each vector name in order, as a hash of vector names and lambda expressions. If a lambda is not specified for a vector, the default is used.

  • :handle_nils (TrueClass, FalseClass, Array) — default: false

    Handle nils automatically or not when a block is provided. If set to true, nils will appear at the top after sorting.

Raises:

  • (ArgumentError)


# File 'lib/daru_lite/dataframe.rb', line 1821

def sort!(vector_order, opts = {})
  raise ArgumentError, 'Required atleast one vector name' if vector_order.empty?

  # To enable sorting with categorical data,
  # map categories to integers preserving their order
  old = convert_categorical_vectors vector_order
  block = sort_prepare_block vector_order, opts

  order = @index.size.times.sort(&block)
  new_index = @index.reorder order

  # To reverse map mapping of categorical data to integers
  restore_categorical_vectors old

  @data.each do |vector|
    vector.reorder! order
  end

  self.index = new_index

  self
end
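
The source above sorts row positions (`@index.size.times.sort(&block)`) and then reorders every vector by the resulting permutation. A minimal plain-Ruby sketch of that idea follows, assuming columns are ordinary arrays; `columns`, `order` and `sorted` are illustrative names, and the comparison block stands in for whatever `sort_prepare_block` builds internally.

```ruby
# Plain-Ruby sketch (no DaruLite): each column is an ordinary array.
columns = { a: [1, 2, 1, 2, 3], b: [5, 4, 3, 2, 1] }
vector_order = [:a, :b]
size = columns[:a].size

# Sort the row positions 0...size, comparing rows by the values of the
# vectors in priority order (:a first, then :b to break ties).
order = size.times.sort do |i, j|
  vector_order.map { |v| columns[v][i] } <=> vector_order.map { |v| columns[v][j] }
end

# Apply the same permutation to every column.
sorted = columns.transform_values { |col| order.map { |pos| col[pos] } }
# order  == [2, 0, 3, 1, 4]
# sorted == { a: [1, 1, 2, 2, 3], b: [3, 5, 2, 4, 1] }
```

The permutation matches the row index shown in the first example (2, 0, 3, 1, 4).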

#split_by_category(cat_name) ⇒ Array

Split the dataframe into many dataframes based on category vector

Examples:

df = DaruLite::DataFrame.new({
  a: [1, 2, 3],
  b: ['a', 'a', 'b']
})
df.to_category :b
df.split_by_category :b
# => [#<DaruLite::DataFrame: a (2x1)>
#       a
#   0   1
#   1   2,
# #<DaruLite::DataFrame: b (1x1)>
#       a
#   2   3]

Parameters:

  • cat_name (object)

    name of category vector to split the dataframe

Returns:

  • (Array)

    array of dataframes split by category with category vector used to split not included

Raises:

  • (ArgumentError)


# File 'lib/daru_lite/dataframe.rb', line 2297

def split_by_category(cat_name)
  cat_dv = self[cat_name]
  raise ArgumentError, "#{cat_name} is not a category vector" unless
    cat_dv.category?

  cat_dv.categories.map do |cat|
    where(cat_dv.eq cat)
      .rename(cat)
      .delete_vector cat_name
  end
end
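
The split can be pictured with plain Ruby hashes standing in for rows. This is a rough sketch of the grouping only, not the DaruLite implementation; `rows`, `cat_name` and `parts` are illustrative names.

```ruby
# Plain-Ruby sketch: each row is a hash of vector name => value.
rows = [{ a: 1, b: 'a' }, { a: 2, b: 'a' }, { a: 3, b: 'b' }]
cat_name = :b

# Group rows by their category value, then drop the category column
# from every row, mirroring the delete_vector call in the source above.
parts = rows.group_by { |row| row[cat_name] }
            .transform_values do |group|
              group.map { |row| row.reject { |name, _| name == cat_name } }
            end
# parts == { 'a' => [{ a: 1 }, { a: 2 }], 'b' => [{ a: 3 }] }
```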

#summary ⇒ String

Generate a summary of this DataFrame based on individual vectors in the DataFrame

Returns:

  • (String)

    String containing the summary of the DataFrame



# File 'lib/daru_lite/dataframe.rb', line 1728

def summary
  summary = "= #{name}"
  summary << "\n  Number of rows: #{nrows}"
  @vectors.each do |v|
    summary << "\n  Element:[#{v}]\n"
    summary << self[v].summary(1)
  end
  summary
end

#tail(quantity = 10) ⇒ Object Also known as: last

The last quantity rows of the DataFrame (ten by default).

Parameters:

  • quantity (Integer) (defaults to: 10)

    The number of rows to return from the bottom.



# File 'lib/daru_lite/dataframe.rb', line 1350

def tail(quantity = 10)
  start = [-quantity, -size].max
  row.at start..-1
end
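
The `[-quantity, -size].max` expression clamps the start position so that asking for more rows than exist returns everything rather than going out of range. A small plain-Ruby sketch of that clamping, using an array in place of the DataFrame; `tail_start` and `rows` are hypothetical names.

```ruby
# Plain-Ruby sketch of the start-position clamping used by #tail.
def tail_start(quantity, size)
  # Negative positions count from the end; never reach back past the first row.
  [-quantity, -size].max
end

rows = %w[r0 r1 r2]

all_rows = rows[tail_start(10, rows.size)..-1] # quantity larger than size
last_two = rows[tail_start(2, rows.size)..-1]
# all_rows == ["r0", "r1", "r2"]
# last_two == ["r1", "r2"]
```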

#to_a ⇒ Object

Converts the DataFrame into a two-element array. Element 0 is an array of hashes, one per row, where each key is a vector name and each value is the corresponding element. Element 1 is an array of the row indexes; each entry corresponds to the hash at the same position in element 0.



# File 'lib/daru_lite/dataframe.rb', line 2053

def to_a
  [each_row.map(&:to_h), @index.to_a]
end

#to_category(*names) ⇒ DaruLite::DataFrame

Converts the specified non category type vectors to category type vectors

Examples:

df = DaruLite::DataFrame.new({
  a: [1, 2, 3],
  b: ['a', 'a', 'b']
})
df.to_category :b
df[:b].type
# => :category

Parameters:

  • names (Array)

    names of the non-category vectors to be converted

Returns:

  • (DaruLite::DataFrame)

    data frame in which specified vectors have been converted to category type



# File 'lib/daru_lite/dataframe.rb', line 2246

def to_category(*names)
  names.each { |n| self[n] = self[n].to_category }
  self
end

#to_df ⇒ self

Returns the dataframe. This can be convenient when the user does not know whether the object is a vector or a dataframe.

Returns:

  • (self)

    the dataframe



# File 'lib/daru_lite/dataframe.rb', line 2039

def to_df
  self
end

#to_h ⇒ Object

Converts DataFrame to a hash (explicit) with keys as vector names and values as the corresponding vectors.



# File 'lib/daru_lite/dataframe.rb', line 2069

def to_h
  @vectors
    .each_with_index
    .map { |vec_name, idx| [vec_name, @data[idx]] }.to_h
end

#to_html(threshold = DaruLite.max_rows) ⇒ Object

Convert to html for IRuby.



# File 'lib/daru_lite/dataframe.rb', line 2076

def to_html(threshold = DaruLite.max_rows)
  table_thead = to_html_thead
  table_tbody = to_html_tbody(threshold)
  path = if index.is_a?(MultiIndex)
           File.expand_path('iruby/templates/dataframe_mi.html.erb', __dir__)
         else
           File.expand_path('iruby/templates/dataframe.html.erb', __dir__)
         end
  ERB.new(File.read(path).strip).result(binding)
end

#to_html_tbody(threshold = DaruLite.max_rows) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 2097

def to_html_tbody(threshold = DaruLite.max_rows)
  threshold ||= @size
  table_tbody_path =
    if index.is_a?(MultiIndex)
      File.expand_path('iruby/templates/dataframe_mi_tbody.html.erb', __dir__)
    else
      File.expand_path('iruby/templates/dataframe_tbody.html.erb', __dir__)
    end
  ERB.new(File.read(table_tbody_path).strip).result(binding)
end

#to_html_thead ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 2087

def to_html_thead
  table_thead_path =
    if index.is_a?(MultiIndex)
      File.expand_path('iruby/templates/dataframe_mi_thead.html.erb', __dir__)
    else
      File.expand_path('iruby/templates/dataframe_thead.html.erb', __dir__)
    end
  ERB.new(File.read(table_thead_path).strip).result(binding)
end

#to_json(no_index = true) ⇒ Object

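Convert to JSON. If no_index is true (the default), the index is NOT included in the generated JSON and only the array of row hashes is serialized. If no_index is false, the full two-element result of #to_a, including the index, is serialized.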



# File 'lib/daru_lite/dataframe.rb', line 2059

def to_json(no_index = true)
  if no_index
    to_a[0].to_json
  else
    to_a.to_json
  end
end
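
To make the two output shapes concrete, here is a plain-Ruby sketch that builds the same structure #to_a returns (row hashes plus index) by hand and serializes it both ways. The data and the names `rows`, `index`, `as_array` are made up for illustration.

```ruby
require 'json'

# Hand-built stand-in for the result of #to_a: [row hashes, row index].
rows  = [{ a: 1, b: 'x' }, { a: 2, b: 'y' }]
index = [:r1, :r2]
as_array = [rows, index]

# no_index = true: only the row hashes are serialized.
without_index = as_array[0].to_json

# no_index = false: row hashes and index are serialized together.
with_index = as_array.to_json

puts without_index
puts with_index
```

Parsing `with_index` back yields a two-element array whose second element is the list of index labels (as strings).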

#to_matrix ⇒ Object

Convert all vectors of type :numeric into a Matrix.



# File 'lib/daru_lite/dataframe.rb', line 2044

def to_matrix
  Matrix.columns each_vector.select(&:numeric?).map(&:to_a)
end

#to_s ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 2108

def to_s
  "#<#{self.class}#{": #{@name}" if @name}(#{nrows}x#{ncols})>"
end

#transpose ⇒ Object

Transpose a DataFrame, transposing elements and swapping the row and column indexing.



# File 'lib/daru_lite/dataframe.rb', line 2193

def transpose
  DaruLite::DataFrame.new(
    each_vector.map(&:to_a).transpose,
    index: @vectors,
    order: @index,
    dtype: @dtype,
    name: @name
  )
end
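
The swap of data and labels can be sketched with plain arrays: the columns are transposed into rows, the old row index names the new columns, and the old vector names become the new row index. The names below (`columns`, `index`, `new_columns`, `new_index`) are illustrative only.

```ruby
# Plain-Ruby sketch of the transpose: columns keyed by vector name,
# plus a separate list of row index labels.
columns = { a: [1, 2, 3], b: [4, 5, 6] }
index   = [:x, :y, :z]

# Array#transpose turns the list of columns into the list of rows.
rows = columns.values.transpose

# Old row labels name the new columns; old vector names label the new rows.
new_columns = index.zip(rows).to_h
new_index   = columns.keys
# new_columns == { x: [1, 4], y: [2, 5], z: [3, 6] }
# new_index   == [:a, :b]
```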

#union(other_df) ⇒ Object

Concatenates another DataFrame, like #concat, but additionally tries to preserve the index. If the indices contain common elements, #union overwrites the corresponding rows of the first dataframe with those of other_df.



# File 'lib/daru_lite/dataframe.rb', line 1495

def union(other_df)
  index = (@index.to_a + other_df.index.to_a).uniq
  df = row[*(@index.to_a - other_df.index.to_a)]

  df = df.concat(other_df)
  df.index = DaruLite::Index.new(index)
  df
end

#uniq(*vtrs) ⇒ Object

Return unique rows by vector specified or all vectors

Examples:


df
=> #<DaruLite::DataFrame(6x2)>
     a   b
 0   1   a
 1   2   b
 2   3   c
 3   4   d
 2   3   c
 3   4   f

df.uniq
=> #<DaruLite::DataFrame(5x2)>
     a   b
 0   1   a
 1   2   b
 2   3   c
 3   4   d
 3   4   f

df.uniq(:a)
=> #<DaruLite::DataFrame(4x2)>
     a   b
 0   1   a
 1   2   b
 2   3   c
 3   4   d

Parameters:

  • vtrs (String, Symbol)

    vector name(s) that should be considered



# File 'lib/daru_lite/dataframe.rb', line 703

def uniq(*vtrs)
  vecs = vtrs.empty? ? vectors.to_a : Array(vtrs)
  grouped = group_by(vecs)
  indexes = grouped.groups.values.map { |v| v[0] }.sort
  row[*indexes]
end
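
The source above groups rows by the chosen vectors and keeps the first member of each group. A plain-Ruby sketch of the same first-occurrence idea, using arrays of `[a, b]` pairs in place of DataFrame rows; `rows` and the helper `first_positions` are hypothetical.

```ruby
# Plain-Ruby sketch: rows as [a, b] pairs, matching the example data above.
rows = [[1, 'a'], [2, 'b'], [3, 'c'], [4, 'd'], [3, 'c'], [4, 'f']]

# Group row positions by a key and keep the first position of each group.
def first_positions(rows)
  rows.each_index.group_by { |i| yield rows[i] }.values.map(&:first).sort
end

by_all = first_positions(rows) { |row| row }     # consider every column
by_a   = first_positions(rows) { |row| row[0] }  # consider column a only

unique_all = by_all.map { |i| rows[i] }
unique_a   = by_a.map { |i| rows[i] }
# unique_all has 5 rows, unique_a has 4 rows
```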

#update ⇒ Object

Updates the metadata (i.e. missing value positions) of the vectors after assignment, deletion, etc. are complete. This is provided so that time is not wasted recreating the metadata for each vector every time elements are assigned or deleted. Updating data this way is called lazy loading. To set or unset lazy loading, see the .lazy_update= method.



# File 'lib/daru_lite/dataframe.rb', line 2117

def update
  @data.each(&:update) if DaruLite.lazy_update
end

#vector_by_calculation(&block) ⇒ Object

DSL for yielding each row and returning a DaruLite::Vector based on the value each run of the block returns.

Usage

a1 = DaruLite::Vector.new([1, 2, 3, 4, 5, 6, 7])
a2 = DaruLite::Vector.new([10, 20, 30, 40, 50, 60, 70])
a3 = DaruLite::Vector.new([100, 200, 300, 400, 500, 600, 700])
ds = DaruLite::DataFrame.new({ :a => a1, :b => a2, :c => a3 })
total = ds.vector_by_calculation { a + b + c }
# <DaruLite::Vector:82314050 @name = nil @size = 7 >
#   nil
# 0 111
# 1 222
# 2 333
# 3 444
# 4 555
# 5 666
# 6 777


# File 'lib/daru_lite/dataframe.rb', line 1133

def vector_by_calculation(&block)
  a = each_row.map { |r| r.instance_eval(&block) }

  DaruLite::Vector.new a, index: @index
end
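
The block is evaluated with `instance_eval`, so bare names such as `a`, `b` and `c` resolve to methods on the row rather than to local variables. The following plain-Ruby sketch shows that mechanism with a Struct standing in for a DataFrame row; `row_class`, `rows` and `calc` are illustrative names, and a real DaruLite row is not a Struct.

```ruby
# Plain-Ruby sketch: a Struct plays the role of a row with readers a, b, c.
row_class = Struct.new(:a, :b, :c)
rows = [row_class.new(1, 10, 100), row_class.new(2, 20, 200)]

# Inside instance_eval, self is the row, so a, b and c call its readers.
calc = proc { a + b + c }
totals = rows.map { |r| r.instance_eval(&calc) }
# totals == [111, 222]
```

One consequence of this design is that a local variable with the same name as a vector, defined where the block is written, would shadow the row's value.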

#vector_count_characters(vecs = nil) ⇒ Object



# File 'lib/daru_lite/dataframe.rb', line 1263

def vector_count_characters(vecs = nil)
  vecs ||= @vectors.to_a

  collect_rows do |row|
    vecs.sum { |v| row[v].to_s.size }
  end
end

#vector_mean(max_missing = 0) ⇒ Object

Calculate mean of the rows of the dataframe.

Arguments

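  • max_missing - The maximum number of elements in the row that can be missing (one of DaruLite::MISSING_VALUES) for the mean calculation to happen. If a row has more missing elements than this, its mean is nil. Defaults to 0.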



# File 'lib/daru_lite/dataframe.rb', line 1419

def vector_mean(max_missing = 0)
  # FIXME: in vector_sum we preserve created vector dtype, but
  # here we are not. Is this by design or ...? - zverok, 2016-05-18
  mean_vec = DaruLite::Vector.new [0] * @size, index: @index, name: "mean_#{@name}"

  each_row_with_index.with_object(mean_vec) do |(row, i), memo|
    memo[i] = row.indexes(*DaruLite::MISSING_VALUES).size > max_missing ? nil : row.mean
  end
end
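
The effect of max_missing can be sketched on plain arrays. This is not the DaruLite implementation: `row_mean` is a hypothetical helper, missing values are assumed to be nil or NaN, and the mean is assumed to be taken over the non-missing elements only.

```ruby
# Plain-Ruby sketch of a row mean that tolerates up to max_missing gaps.
def row_mean(row, max_missing = 0)
  # Assumption: nil and Float::NAN count as missing.
  missing, present = row.partition { |v| v.nil? || (v.is_a?(Float) && v.nan?) }
  return nil if missing.size > max_missing

  present.sum.to_f / present.size
end

complete  = row_mean([1, 2, 3])        # no missing values
rejected  = row_mean([1, nil, 3])      # one missing, none allowed
tolerated = row_mean([1, nil, 3], 1)   # one missing, one allowed
# complete == 2.0, rejected == nil, tolerated == 2.0
```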

#vector_sum(*args) ⇒ Object

Sum all numeric/specified vectors in the DataFrame.

Returns a new vector containing the sum of all numeric or specified vectors of the DataFrame. By default, if a row contains a nil in any of the summed vectors, the sum for that row is nil. With the :skipnil option set to true, nil values are treated as 0 (zero) when summing.

Examples:

df = DaruLite::DataFrame.new({
   a: [1, 2, nil],
   b: [2, 1, 3],
   c: [1, 1, 1]
 })
=> #<DaruLite::DataFrame(3x3)>
       a   b   c
   0   1   2   1
   1   2   1   1
   2 nil   3   1
df.vector_sum [:a, :c]
=> #<DaruLite::Vector(3)>
   0   2
   1   3
   2 nil
df.vector_sum
=> #<DaruLite::Vector(3)>
   0   4
   1   4
   2 nil
df.vector_sum skipnil: true
=> #<DaruLite::Vector(3)>
       c
   0   4
   1   4
   2   4

Parameters:

  • args (Array)

    List of vectors to sum. Default is nil in which case all numeric vectors are summed.

  • opts (Hash)

    a customizable set of options

Returns:

  • Vector with the sum of all vectors specified in the argument. If no vectors are specified, all numeric vectors are summed.



# File 'lib/daru_lite/dataframe.rb', line 1401

def vector_sum(*args)
  defaults = { vecs: nil, skipnil: false }
  options = args.last.is_a?(::Hash) ? args.pop : {}
  options = defaults.merge(options)
  vecs = args[0] || options[:vecs]
  skipnil = args[1] || options[:skipnil]

  vecs ||= numeric_vectors
  sum = DaruLite::Vector.new [0] * @size, index: @index, name: @name, dtype: @dtype
  vecs.inject(sum) { |memo, n| self[n].add(memo, skipnil: skipnil) }
end
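
The nil handling described above can be reproduced on plain arrays. This sketch only mirrors the documented behaviour and does not use DaruLite; `row_sums` is a hypothetical helper and the columns are the ones from the example.

```ruby
# Plain-Ruby sketch of row-wise sums with and without skipnil.
def row_sums(columns, skipnil: false)
  columns.transpose.map do |row|
    if skipnil
      row.compact.sum                    # nils are treated as 0
    else
      row.any?(&:nil?) ? nil : row.sum   # any nil makes the row sum nil
    end
  end
end

columns = [[1, 2, nil], [2, 1, 3], [1, 1, 1]] # vectors a, b, c

strict  = row_sums(columns)
lenient = row_sums(columns, skipnil: true)
# strict  == [4, 4, nil]
# lenient == [4, 4, 4]
```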

#verify(*tests) ⇒ Object

Test each row with one or more tests. The function returns an array with all errors.

FIXME: description here is too sparse. As far as I can get, it should tell something about that each test is [descr, fields, block], and that first value may be column name to output. - zverok, 2016-05-18

Parameters:

  • tests (Proc)

    Each test is a Proc with the form *Proc.new {|row| row > 0}*



# File 'lib/daru_lite/dataframe.rb', line 1105

def verify(*tests)
  id = tests.first.is_a?(Symbol) ? tests.shift : @vectors.first

  each_row_with_index.map do |row, i|
    tests.reject { |*_, block| block.call(row) }
         .map { |test| verify_error_message row, test, id, i }
  end.flatten
end

#where(bool_array) ⇒ Object

Query a DataFrame by passing a DaruLite::Core::Query::BoolArray object.



# File 'lib/daru_lite/dataframe.rb', line 2222

def where(bool_array)
  DaruLite::Core::Query.df_where self, bool_array
end

#which(&block) ⇒ Object

A simple query DSL for accessing where(), inspired by the gem "squeel". E.g.: df.which{ `FamilySize` == `FamilySize`.max } equals df.where( df[:FamilySize].eq( df[:FamilySize].max ) )

E.g.: df.which{ (`NameTitle` == 'Dr') & (`Sex` == 'female') } equals df.where( df[:NameTitle].eq('Dr') & df[:Sex].eq('female') )



# File 'lib/daru_lite/extensions/which_dsl.rb', line 15

def which(&block)
  WhichQuery.new(self, &block).exec
end

#write_csv(filename, opts = {}) ⇒ Object

Write this DataFrame to a CSV file.

Arguments

  • filename - Path of CSV file where the DataFrame is to be saved.

Options

  • convert_comma - If set to true, will convert any commas in any of the data to full stops ('.').

All the options accepted by CSV.read() can also be passed into this function.



# File 'lib/daru_lite/dataframe.rb', line 2141

def write_csv(filename, opts = {})
  DaruLite::IO.dataframe_write_csv self, filename, opts
end

#write_excel(filename, opts = {}) ⇒ Object

Write this dataframe to an Excel Spreadsheet

Arguments

  • filename - The path of the file where the DataFrame should be written.



# File 'lib/daru_lite/dataframe.rb', line 2150

def write_excel(filename, opts = {})
  DaruLite::IO.dataframe_write_excel self, filename, opts
end

#write_sql(dbh, table) ⇒ Object

Insert each row of the DataFrame into the selected table.

Arguments

  • dbh - DBI database connection object.

  • table - Name of the table to insert the rows into.

Usage

ds = DaruLite::DataFrame.new({:id=>DaruLite::Vector.new([1,2,3]), :name=>DaruLite::Vector.new(["a","b","c"])})
dbh = DBI.connect("DBI:Mysql:database:localhost", "user", "password")
ds.write_sql(dbh,"test")


# File 'lib/daru_lite/dataframe.rb', line 2166

def write_sql(dbh, table)
  DaruLite::IO.dataframe_write_sql self, dbh, table
end