Class: Polars::DataFrame

Inherits:
Object
Defined in:
lib/polars/data_frame.rb

Overview

Two-dimensional data structure representing data as a table with rows and columns.

Constructor Details

#initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: N_INFER_DEFAULT, nan_to_null: false) ⇒ DataFrame

Create a new DataFrame.

Parameters:

  • data (Object) (defaults to: nil)

    Two-dimensional data in various forms; hash input must contain arrays or a range. Arrays may contain Series or other arrays.

  • schema (Object) (defaults to: nil)

    The schema of the resulting DataFrame. The schema may be declared in several ways:

    • As a hash of {name: type} pairs; if type is nil, it will be auto-inferred.
    • As an array of column names; in this case types are automatically inferred.
    • As an array of (name, type) pairs; this is equivalent to the hash form.

    If you supply an array of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.

    If set to nil (default), the schema is inferred from the data.

  • schema_overrides (Hash) (defaults to: nil)

    Support type specification or override of one or more columns; note that any dtypes inferred from the schema param will be overridden.

    The number of entries in the schema should match the underlying data dimensions, unless an array of hashes is being passed, in which case a partial schema can be declared to prevent specific fields from being loaded.

  • strict (Boolean) (defaults to: true)

    Throw an error if any data value does not exactly match the given or inferred data type for that column. If set to false, values that do not match the data type are cast to that data type or, if casting is not possible, set to null instead.

  • orient ("col", "row") (defaults to: nil)

    Whether to interpret two-dimensional data as columns or as rows. If nil, the orientation is inferred by matching the columns and data dimensions. If this does not yield conclusive results, column orientation is used.

  • infer_schema_length (Integer) (defaults to: N_INFER_DEFAULT)

    The maximum number of rows to scan for schema inference. If set to nil, the full data may be scanned (this can be slow). This parameter only applies if the input data is an array or generator of rows; other input is read as-is.

  • nan_to_null (Boolean) (defaults to: false)

    If the data comes from one or more Numo arrays, can optionally convert input data NaN values to null instead. This is a no-op for all other input data.
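
The effect of orient can be pictured with plain Ruby arrays. The sketch below does not call Polars; to_columns is a hypothetical helper that only models how the same nested array would map onto named columns under each orientation.

```ruby
# Hypothetical helper (not part of Polars): models how two-dimensional
# array data maps onto named columns for each value of `orient`.
def to_columns(data, names, orient:)
  # "col": each inner array is already a column.
  # "row": each inner array is a row, so transpose to get columns.
  columns = orient == "row" ? data.transpose : data
  names.zip(columns).to_h
end

data = [[1, 2], [3, 4]]

# Column orientation: "a" gets [1, 2] and "b" gets [3, 4].
by_col = to_columns(data, ["a", "b"], orient: "col")
# Row orientation: "a" gets [1, 3] and "b" gets [2, 4].
by_row = to_columns(data, ["a", "b"], orient: "row")

p by_col
p by_row
```

A square input like this one cannot be disambiguated from its dimensions alone, which is the kind of case where passing orient explicitly is likely to help.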



# File 'lib/polars/data_frame.rb', line 48

def initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: N_INFER_DEFAULT, nan_to_null: false)
  if defined?(ActiveRecord) && (data.is_a?(ActiveRecord::Relation) || data.is_a?(ActiveRecord::Result))
    raise ArgumentError, "Use read_database instead"
  end

  if data.nil?
    self._df = Utils.hash_to_rbdf({}, schema: schema, schema_overrides: schema_overrides)
  elsif data.is_a?(Hash)
    data = data.transform_keys { |v| v.is_a?(Symbol) ? v.to_s : v }
    self._df = Utils.hash_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict, nan_to_null: nan_to_null)
  elsif data.is_a?(::Array)
    self._df = Utils.sequence_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict, orient: orient, infer_schema_length: infer_schema_length)
  elsif data.is_a?(Series)
    self._df = Utils.series_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict)
  elsif data.respond_to?(:arrow_c_stream)
    # This uses the fact that RbSeries.from_arrow_c_stream will create a
    # struct-typed Series. Then we unpack that to a DataFrame.
    tmp_col_name = ""
    s = Utils.wrap_s(RbSeries.from_arrow_c_stream(data))
    self._df = s.to_frame(tmp_col_name).unnest(tmp_col_name)._df
  else
    raise ArgumentError, "DataFrame constructor called with unsupported type; got #{data.class.name}"
  end
end

Class Method Details

.deserialize(source) ⇒ DataFrame

Note:

Serialization is not stable across Polars versions: a DataFrame serialized in one Polars version may not be deserializable in another Polars version.

Read a serialized DataFrame from a file.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => [4.0, 5.0, 6.0]})
bytes = df.serialize
Polars::DataFrame.deserialize(StringIO.new(bytes))
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ f64 │
# ╞═════╪═════╡
# │ 1   ┆ 4.0 │
# │ 2   ┆ 5.0 │
# │ 3   ┆ 6.0 │
# └─────┴─────┘

Parameters:

  • source (Object)

    Path to a file or a file-like object, meaning any object that has a read method, such as a file handle or StringIO.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 100

def self.deserialize(source)
  if Utils.pathlike?(source)
    source = Utils.normalize_filepath(source)
  end

  deserializer = RbDataFrame.method(:deserialize_binary)

  _from_rbdf(deserializer.(source))
end

Instance Method Details

#!=(other) ⇒ DataFrame

Not equal.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 299

def !=(other)
  _comp(other, "neq")
end

#%(other) ⇒ DataFrame

Returns the modulo.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 382

def %(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.rem_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.rem(other._s))
end

#*(other) ⇒ DataFrame

Performs multiplication.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 334

def *(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.mul_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.mul(other._s))
end

#+(other) ⇒ DataFrame

Performs addition.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 358

def +(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.add_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.add(other._s))
end

#-(other) ⇒ DataFrame

Performs subtraction.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 370

def -(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.sub_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.sub(other._s))
end

#/(other) ⇒ DataFrame

Performs division.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 346

def /(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.div_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.div(other._s))
end

#<(other) ⇒ DataFrame

Less than.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 313

def <(other)
  _comp(other, "lt")
end

#<=(other) ⇒ DataFrame

Less than or equal.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 327

def <=(other)
  _comp(other, "lt_eq")
end

#==(other) ⇒ DataFrame

Equal.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 292

def ==(other)
  _comp(other, "eq")
end

#>(other) ⇒ DataFrame

Greater than.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 306

def >(other)
  _comp(other, "gt")
end

#>=(other) ⇒ DataFrame

Greater than or equal.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 320

def >=(other)
  _comp(other, "gt_eq")
end

#[](*key) ⇒ Object

Returns subset of the DataFrame.

Examples:

df = Polars::DataFrame.new(
  {"a" => [1, 2, 3], "d" => [4, 5, 6], "c" => [1, 3, 2], "b" => [7, 8, 9]}
)
df[0]
# =>
# shape: (1, 4)
# ┌─────┬─────┬─────┬─────┐
# │ a   ┆ d   ┆ c   ┆ b   │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╪═════╡
# │ 1   ┆ 4   ┆ 1   ┆ 7   │
# └─────┴─────┴─────┴─────┘
df[0, "a"]
# => 1
df["a"]
# =>
# shape: (3,)
# Series: 'a' [i64]
# [
#         1
#         2
#         3
# ]
df[0..1]
# =>
# shape: (2, 4)
# ┌─────┬─────┬─────┬─────┐
# │ a   ┆ d   ┆ c   ┆ b   │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╪═════╡
# │ 1   ┆ 4   ┆ 1   ┆ 7   │
# │ 2   ┆ 5   ┆ 3   ┆ 8   │
# └─────┴─────┴─────┴─────┘
df[0..1, "a"]
# =>
# shape: (2,)
# Series: 'a' [i64]
# [
#         1
#         2
# ]
df[0..1, 0]
# =>
# shape: (2,)
# Series: 'a' [i64]
# [
#         1
#         2
# ]
df[[0, 1], [0, 1, 2]]
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ d   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 4   ┆ 1   │
# │ 2   ┆ 5   ┆ 3   │
# └─────┴─────┴─────┘
df[0..1, ["a", "c"]]
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ c   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 1   │
# │ 2   ┆ 3   │
# └─────┴─────┘
df[0.., 0..1]
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ d   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 4   │
# │ 2   ┆ 5   │
# │ 3   ┆ 6   │
# └─────┴─────┘
df[0.., "a".."c"]
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ d   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 4   ┆ 1   │
# │ 2   ┆ 5   ┆ 3   │
# │ 3   ┆ 6   ┆ 2   │
# └─────┴─────┴─────┘

Returns:

  • (Object)

# File 'lib/polars/data_frame.rb', line 540

def [](*key)
  get_df_item_by_key(self, key)
end

#[]=(*key, value) ⇒ Object

Set item.

Examples:

df[["a", "b"]] = value:

df = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => [4, 5, 6]})
df[["a", "b"]] = [[10, 40], [20, 50], [30, 60]]
df
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 10  ┆ 40  │
# │ 20  ┆ 50  │
# │ 30  ┆ 60  │
# └─────┴─────┘

df[row_idx, "a"] = value:

df[1, "a"] = 100
df
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 10  ┆ 40  │
# │ 100 ┆ 50  │
# │ 30  ┆ 60  │
# └─────┴─────┘

df[row_idx, col_idx] = value:

df[0, 1] = 30
df
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 10  ┆ 30  │
# │ 100 ┆ 50  │
# │ 30  ┆ 60  │
# └─────┴─────┘

Returns:

  • (Object)

# File 'lib/polars/data_frame.rb', line 593

def []=(*key, value)
  if key.empty? || key.length > 2
    raise ArgumentError, "wrong number of arguments (given #{key.length + 1}, expected 2..3)"
  end

  if key.length == 1 && Utils.strlike?(key[0])
    key = key[0]

    if value.is_a?(::Array) || (defined?(Numo::NArray) && value.is_a?(Numo::NArray))
      value = Series.new(value)
    elsif !value.is_a?(Series)
      value = Polars.lit(value)
    end
    self._df = with_columns(value.alias(key.to_s))._df

  # df[["C", "D"]]
  elsif key.length == 1 && key[0].is_a?(::Array)
    key = key[0]

    if !value.is_a?(::Array) || !value.all? { |v| v.is_a?(::Array) }
      msg = "can only set multiple columns with 2D matrix"
      raise ArgumentError, msg
    end
    if value.any? { |v| v.size != key.length }
      msg = "matrix columns should be equal to list used to determine column names"
      raise ArgumentError, msg
    end

    columns = []
    key.each_with_index do |name, i|
      columns << Series.new(name, value.map { |v| v[i] })
    end
    self._df = with_columns(columns)._df

  # df[a, b]
  else
    row_selection, col_selection = key

    if (row_selection.is_a?(Series) && row_selection.dtype == Boolean) || Utils.is_bool_sequence(row_selection)
      msg = (
        "not allowed to set DataFrame by boolean mask in the row position" +
        "\n\nConsider using `DataFrame.with_columns`."
      )
      raise TypeError, msg
    end

    # get series column selection
    if Utils.strlike?(col_selection)
      s = self[col_selection]
    elsif col_selection.is_a?(Integer)
      s = self[0.., col_selection]
    else
      msg = "unexpected column selection #{col_selection.inspect}"
      raise TypeError, msg
    end

    # dispatch to []= of Series to do modification
    s[row_selection] = value

    # now find the location to place series
    # df[idx]
    if col_selection.is_a?(Integer)
      replace_column(col_selection, s)
    # df["foo"]
    elsif Utils.strlike?(col_selection)
      _replace(col_selection.to_s, s)
    end
  end
end
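
The multi-column branch above receives the new values row by row and regroups them by column before building one Series per name. A plain-Ruby sketch of that regrouping step, with no Polars calls (rows_to_columns is a hypothetical name used only for illustration):

```ruby
# Hypothetical helper mirroring the regrouping done in the multi-column
# branch of []=: a matrix given as rows becomes one array per column name.
def rows_to_columns(names, matrix)
  unless matrix.is_a?(Array) && matrix.all? { |row| row.is_a?(Array) }
    raise ArgumentError, "can only set multiple columns with 2D matrix"
  end
  if matrix.any? { |row| row.size != names.length }
    raise ArgumentError, "each row needs one value per column name"
  end

  # For column i, collect the i-th value of every row.
  names.each_with_index.to_h do |name, i|
    [name, matrix.map { |row| row[i] }]
  end
end

# Same input as the df[["a", "b"]] example: three rows, two columns.
columns = rows_to_columns(["a", "b"], [[10, 40], [20, 50], [30, 60]])
p columns
```

Each inner array is therefore one row of the assignment, which is why its length must equal the number of column names.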

#bottom_k(k, by:, reverse: false) ⇒ DataFrame

Return the k smallest rows.

Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort after this function if you need sorted output.

Examples:

Get the rows which contain the 4 smallest values in column b.

df = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [2, 1, 1, 3, 2, 1]
  }
)
df.bottom_k(4, by: "b")
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ b   ┆ 1   │
# │ a   ┆ 1   │
# │ c   ┆ 1   │
# │ a   ┆ 2   │
# └─────┴─────┘

Get the rows which contain the 4 smallest values when sorting on column a and b.

df.bottom_k(4, by: ["a", "b"])
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ a   ┆ 1   │
# │ a   ┆ 2   │
# │ b   ┆ 1   │
# │ b   ┆ 2   │
# └─────┴─────┘

Parameters:

  • k (Integer)

    Number of rows to return.

  • by (Object)

    Column(s) used to determine the bottom rows. Accepts expression input. Strings are parsed as column names.

  • reverse (Object) (defaults to: false)

    Consider the k largest elements of the by column(s) (instead of the k smallest). This can be specified per column by passing an array of booleans.
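
The null-handling rule can be modeled on a plain Ruby array. This sketch does not call Polars: bottom_k_values is a hypothetical helper, nil stands in for null, and the padding behavior is an assumption drawn from the description above rather than from the library's code.

```ruby
# Hypothetical model of the documented semantics, using nil for null:
# non-nil values are always preferred, whichever direction is requested.
def bottom_k_values(values, k, reverse: false)
  present = values.compact
  nils = values.length - present.length
  picked = reverse ? present.max(k) : present.min(k)
  # Fall back to nils only when there are not enough non-nil values.
  picked + [nil] * [[k - picked.length, 0].max, nils].min
end

values = [2, nil, 1, 3, nil]

smallest = bottom_k_values(values, 2)                # the two smallest non-nil values
largest = bottom_k_values(values, 2, reverse: true)  # the two largest non-nil values
padded = bottom_k_values(values, 4)                  # only three non-nil values exist

p smallest, largest, padded
```

Because the real method makes no ordering promise, compare results only after sorting them.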

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2404

def bottom_k(
  k,
  by:,
  reverse: false
)
  lazy
  .bottom_k(k, by: by, reverse: reverse)
  .collect(
    optimizations: QueryOptFlags.new(
      projection_pushdown: false,
      predicate_pushdown: false,
      comm_subplan_elim: false,
      slice_pushdown: true
    )
  )
end

#cast(dtypes, strict: true) ⇒ DataFrame

Cast DataFrame column(s) to the specified dtype(s).

Examples:

Cast specific frame columns to the specified dtypes:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => [Date.new(2020, 1, 2), Date.new(2021, 3, 4), Date.new(2022, 5, 6)]
  }
)
df.cast({"foo" => Polars::Float32, "bar" => Polars::UInt8})
# =>
# shape: (3, 3)
# ┌─────┬─────┬────────────┐
# │ foo ┆ bar ┆ ham        │
# │ --- ┆ --- ┆ ---        │
# │ f32 ┆ u8  ┆ date       │
# ╞═════╪═════╪════════════╡
# │ 1.0 ┆ 6   ┆ 2020-01-02 │
# │ 2.0 ┆ 7   ┆ 2021-03-04 │
# │ 3.0 ┆ 8   ┆ 2022-05-06 │
# └─────┴─────┴────────────┘

Cast all frame columns matching one dtype (or dtype group) to another dtype:

df.cast({Polars::Date => Polars::Datetime})
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────────────────────┐
# │ foo ┆ bar ┆ ham                 │
# │ --- ┆ --- ┆ ---                 │
# │ i64 ┆ f64 ┆ datetime[μs]        │
# ╞═════╪═════╪═════════════════════╡
# │ 1   ┆ 6.0 ┆ 2020-01-02 00:00:00 │
# │ 2   ┆ 7.0 ┆ 2021-03-04 00:00:00 │
# │ 3   ┆ 8.0 ┆ 2022-05-06 00:00:00 │
# └─────┴─────┴─────────────────────┘

Cast all frame columns to the specified dtype:

df.cast(Polars::String).to_h(as_series: false)
# => {"foo"=>["1", "2", "3"], "bar"=>["6.0", "7.0", "8.0"], "ham"=>["2020-01-02", "2021-03-04", "2022-05-06"]}

Parameters:

  • dtypes (Object)

    Mapping of column names (or selector) to dtypes, or a single dtype to which all columns will be cast.

  • strict (Boolean) (defaults to: true)

    Throw an error if a cast could not be done (for instance, due to an overflow).

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4022

def cast(dtypes, strict: true)
  lazy.cast(dtypes, strict: strict).collect(optimizations: QueryOptFlags._eager)
end

#clear(n = 0) ⇒ DataFrame

Create an empty copy of the current DataFrame.

Returns a DataFrame with an identical schema but no data. If n is greater than zero, the result instead contains n rows filled with nulls.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [nil, 2, 3, 4],
    "b" => [0.5, nil, 2.5, 13],
    "c" => [true, true, false, nil]
  }
)
df.clear
# =>
# shape: (0, 3)
# ┌─────┬─────┬──────┐
# │ a   ┆ b   ┆ c    │
# │ --- ┆ --- ┆ ---  │
# │ i64 ┆ f64 ┆ bool │
# ╞═════╪═════╪══════╡
# └─────┴─────┴──────┘
df.clear(2)
# =>
# shape: (2, 3)
# ┌──────┬──────┬──────┐
# │ a    ┆ b    ┆ c    │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ f64  ┆ bool │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ null ┆ null ┆ null │
# └──────┴──────┴──────┘

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4062

def clear(n = 0)
  if n == 0
    _from_rbdf(_df.clear)
  elsif n > 0 || len > 0
    self.class.new(
      schema.to_h { |nm, tp| [nm, Series.new(nm, [], dtype: tp).extend_constant(nil, n)] }
    )
  else
    clone
  end
end

#collect_schema ⇒ Schema

Note:

This method is included to facilitate writing code that is generic for both DataFrame and LazyFrame.

Get an ordered mapping of column names to their data type.

Examples:

Determine the schema.

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.collect_schema
# => Polars::Schema({"foo"=>Polars::Int64, "bar"=>Polars::Float64, "ham"=>Polars::String})

Access various properties of the schema using the Schema object.

schema = df.collect_schema
schema["bar"]
# => Polars::Float64
schema.names
# => ["foo", "bar", "ham"]
schema.dtypes
# => [Polars::Int64, Polars::Float64, Polars::String]
schema.length
# => 3

Returns:

  • (Schema)

# File 'lib/polars/data_frame.rb', line 703

def collect_schema
  Schema.new(columns.zip(dtypes), check_dtypes: false)
end

#columns ⇒ Array

Get column names.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.columns
# => ["foo", "bar", "ham"]

Returns:

  • (Array)

# File 'lib/polars/data_frame.rb', line 209

def columns
  _df.columns
end

#columns=(columns) ⇒ Object

Change the column names of the DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.columns = ["apple", "banana", "orange"]
df
# =>
# shape: (3, 3)
# ┌───────┬────────┬────────┐
# │ apple ┆ banana ┆ orange │
# │ ---   ┆ ---    ┆ ---    │
# │ i64   ┆ i64    ┆ str    │
# ╞═══════╪════════╪════════╡
# │ 1     ┆ 6      ┆ a      │
# │ 2     ┆ 7      ┆ b      │
# │ 3     ┆ 8      ┆ c      │
# └───────┴────────┴────────┘

Parameters:

  • columns (Array)

    A list with new names for the DataFrame. The length of the list should be equal to the width of the DataFrame.

Returns:

  • (Object)

# File 'lib/polars/data_frame.rb', line 242

def columns=(columns)
  _df.set_column_names(columns)
end

#delete(name) ⇒ Series

Drop a single column in place if it exists and return it as a Series; return nil if the column is missing.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.delete("ham")
# =>
# shape: (3,)
# Series: 'ham' [str]
# [
#         "a"
#         "b"
#         "c"
# ]
df.delete("missing")
# => nil

Parameters:

  • name (Object)

    Column to drop.

Returns:

  • (Series)

# File 'lib/polars/data_frame.rb', line 3969

def delete(name)
  drop_in_place(name) if include?(name)
end

#describe(percentiles: [0.25, 0.5, 0.75], interpolation: "nearest") ⇒ DataFrame

Summary statistics for a DataFrame.

Examples:

Show default frame statistics:

df = Polars::DataFrame.new(
  {
    "float" => [1.0, 2.8, 3.0],
    "int" => [40, 50, nil],
    "bool" => [true, false, true],
    "str" => ["zz", "xx", "yy"],
    "date" => [Date.new(2020, 1, 1), Date.new(2021, 7, 5), Date.new(2022, 12, 31)]
  }
)
df.describe
# =>
# shape: (9, 6)
# ┌────────────┬──────────┬──────────┬──────────┬──────┬─────────────────────────┐
# │ statistic  ┆ float    ┆ int      ┆ bool     ┆ str  ┆ date                    │
# │ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---                     │
# │ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str                     │
# ╞════════════╪══════════╪══════════╪══════════╪══════╪═════════════════════════╡
# │ count      ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 3    ┆ 3                       │
# │ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 0    ┆ 0                       │
# │ mean       ┆ 2.266667 ┆ 45.0     ┆ 0.666667 ┆ null ┆ 2021-07-02 16:00:00 UTC │
# │ std        ┆ 1.101514 ┆ 7.071068 ┆ null     ┆ null ┆ null                    │
# │ min        ┆ 1.0      ┆ 40.0     ┆ 0.0      ┆ xx   ┆ 2020-01-01              │
# │ 25%        ┆ 2.8      ┆ 40.0     ┆ null     ┆ null ┆ 2021-07-05              │
# │ 50%        ┆ 2.8      ┆ 50.0     ┆ null     ┆ null ┆ 2021-07-05              │
# │ 75%        ┆ 3.0      ┆ 50.0     ┆ null     ┆ null ┆ 2022-12-31              │
# │ max        ┆ 3.0      ┆ 50.0     ┆ 1.0      ┆ zz   ┆ 2022-12-31              │
# └────────────┴──────────┴──────────┴──────────┴──────┴─────────────────────────┘

Customize which percentiles are displayed, applying linear interpolation:

df.describe(
  percentiles: [0.1, 0.3, 0.5, 0.7, 0.9],
  interpolation: "linear"
)
# =>
# shape: (11, 6)
# ┌────────────┬──────────┬──────────┬──────────┬──────┬─────────────────────────┐
# │ statistic  ┆ float    ┆ int      ┆ bool     ┆ str  ┆ date                    │
# │ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---                     │
# │ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str                     │
# ╞════════════╪══════════╪══════════╪══════════╪══════╪═════════════════════════╡
# │ count      ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 3    ┆ 3                       │
# │ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 0    ┆ 0                       │
# │ mean       ┆ 2.266667 ┆ 45.0     ┆ 0.666667 ┆ null ┆ 2021-07-02 16:00:00 UTC │
# │ std        ┆ 1.101514 ┆ 7.071068 ┆ null     ┆ null ┆ null                    │
# │ min        ┆ 1.0      ┆ 40.0     ┆ 0.0      ┆ xx   ┆ 2020-01-01              │
# │ …          ┆ …        ┆ …        ┆ …        ┆ …    ┆ …                       │
# │ 30%        ┆ 2.08     ┆ 43.0     ┆ null     ┆ null ┆ 2020-11-26              │
# │ 50%        ┆ 2.8      ┆ 45.0     ┆ null     ┆ null ┆ 2021-07-05              │
# │ 70%        ┆ 2.88     ┆ 47.0     ┆ null     ┆ null ┆ 2022-02-07              │
# │ 90%        ┆ 2.96     ┆ 49.0     ┆ null     ┆ null ┆ 2022-09-13              │
# │ max        ┆ 3.0      ┆ 50.0     ┆ 1.0      ┆ zz   ┆ 2022-12-31              │
# └────────────┴──────────┴──────────┴──────────┴──────┴─────────────────────────┘

Parameters:

  • percentiles (Array) (defaults to: [0.25, 0.5, 0.75])

    One or more percentiles to include in the summary statistics. All values must be in the range [0, 1].

  • interpolation ('nearest', 'higher', 'lower', 'midpoint', 'linear', 'equiprobable') (defaults to: "nearest")

    Interpolation method used when calculating percentiles.
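
To make the difference between two of the interpolation modes concrete, the following plain-Ruby sketch computes percentiles of the float column from the examples. It does not call Polars; percentile is a hypothetical helper, and its formulas are an assumption that happens to agree with the tables above (the library's own rounding rules may differ in edge cases).

```ruby
# Hypothetical helper (not Polars code): percentile of a numeric array
# under the "nearest" and "linear" interpolation modes.
def percentile(values, q, interpolation: "nearest")
  sorted = values.compact.sort
  pos = q * (sorted.length - 1) # fractional index into the sorted data
  case interpolation
  when "nearest"
    # Take the element whose index is closest to the fractional position.
    sorted[pos.round]
  when "linear"
    # Blend the two neighbouring elements by the fractional part.
    lo = sorted[pos.floor]
    hi = sorted[pos.ceil]
    lo + (hi - lo) * (pos - pos.floor)
  else
    raise ArgumentError, "unsupported interpolation: #{interpolation}"
  end
end

floats = [1.0, 2.8, 3.0]

# Compare with the 25%, 50% and 75% rows of the default summary.
nearest = [0.25, 0.5, 0.75].map { |q| percentile(floats, q) }
# Compare with the 30%, 70% and 90% rows of the "linear" summary.
linear = [0.3, 0.7, 0.9].map { |q| percentile(floats, q, interpolation: "linear").round(2) }

p nearest, linear
```

With "nearest" every reported percentile is an actual value from the column, while "linear" can return values that never occur in the data.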

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2044

def describe(
  percentiles: [0.25, 0.5, 0.75],
  interpolation: "nearest"
)
  if columns.empty?
    msg = "cannot describe a DataFrame that has no columns"
    raise TypeError, msg
  end

  lazy.describe(
    percentiles: percentiles, interpolation: interpolation
  )
end

#drop(*columns, strict: true) ⇒ DataFrame

Remove columns from the DataFrame and return the result as a new DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.drop("ham")
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ f64 │
# ╞═════╪═════╡
# │ 1   ┆ 6.0 │
# │ 2   ┆ 7.0 │
# │ 3   ┆ 8.0 │
# └─────┴─────┘

Drop multiple columns by passing a list of column names.

df.drop(["bar", "ham"])
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# │ 2   │
# │ 3   │
# └─────┘

Use positional arguments to drop multiple columns.

df.drop("foo", "ham")
# =>
# shape: (3, 1)
# ┌─────┐
# │ bar │
# │ --- │
# │ f64 │
# ╞═════╡
# │ 6.0 │
# │ 7.0 │
# │ 8.0 │
# └─────┘

Parameters:

  • columns (Object)

    Column(s) to drop.

  • strict (Boolean) (defaults to: true)

    Validate that all column names exist in the current schema, and throw an exception if any do not.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 3909

def drop(*columns, strict: true)
  lazy.drop(*columns, strict: strict).collect(optimizations: QueryOptFlags._eager)
end

#drop_in_place(name) ⇒ Series

Drop a single column in place and return the dropped column as a Series.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.drop_in_place("ham")
# =>
# shape: (3,)
# Series: 'ham' [str]
# [
#         "a"
#         "b"
#         "c"
# ]

Parameters:

  • name (Object)

    Column to drop.

Returns:

  • (Series)

# File 'lib/polars/data_frame.rb', line 3937

def drop_in_place(name)
  Utils.wrap_s(_df.drop_in_place(name))
end

#drop_nans(subset: nil) ⇒ DataFrame

Drop all rows that contain one or more NaN values.

The original order of the remaining rows is preserved.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [-20.5, Float::NAN, 80.0],
    "bar" => [Float::NAN, 110.0, 25.5],
    "ham" => ["xxx", "yyy", nil]
  }
)
df.drop_nans
# =>
# shape: (1, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ f64  ┆ f64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ 80.0 ┆ 25.5 ┆ null │
# └──────┴──────┴──────┘
df.drop_nans(subset: ["bar"])
# =>
# shape: (2, 3)
# ┌──────┬───────┬──────┐
# │ foo  ┆ bar   ┆ ham  │
# │ ---  ┆ ---   ┆ ---  │
# │ f64  ┆ f64   ┆ str  │
# ╞══════╪═══════╪══════╡
# │ NaN  ┆ 110.0 ┆ yyy  │
# │ 80.0 ┆ 25.5  ┆ null │
# └──────┴───────┴──────┘

Parameters:

  • subset (Object) (defaults to: nil)

    Column name(s) for which NaN values are considered; if set to nil (default), use all columns (note that only floating-point columns can contain NaNs).
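
drop_nans and drop_nulls look for different things: NaN is a valid floating-point value, whereas null marks a missing value. A plain-Ruby sketch of the two row filters, without Polars (nil plays the role of null, and nan_value? is a hypothetical helper):

```ruby
# Rows modeled as plain arrays, using the data from the example above.
rows = [
  [-20.5, Float::NAN, "xxx"],
  [Float::NAN, 110.0, "yyy"],
  [80.0, 25.5, nil]
]

# Hypothetical helper: only Float values can be NaN.
def nan_value?(value)
  value.is_a?(Float) && value.nan?
end

# Model of dropping rows with NaN: the row holding nil survives.
without_nans = rows.reject { |row| row.any? { |v| nan_value?(v) } }
# Model of dropping rows with null: the rows holding NaN survive.
without_nulls = rows.reject { |row| row.any?(&:nil?) }

p without_nans.length, without_nulls.length
```

On this data the two filters keep disjoint sets of rows, so applying one is never a substitute for the other.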

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2623

def drop_nans(subset: nil)
  lazy.drop_nans(subset: subset).collect(optimizations: QueryOptFlags._eager)
end

#drop_nulls(subset: nil) ⇒ DataFrame

Drop all rows that contain one or more null values.

The original order of the remaining rows is preserved.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, nil, 8],
    "ham" => ["a", "b", nil]
  }
)
df.drop_nulls
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘
df.drop_nulls(subset: Polars.cs.integer)
# =>
# shape: (2, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ i64 ┆ i64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 1   ┆ 6   ┆ a    │
# │ 3   ┆ 8   ┆ null │
# └─────┴─────┴──────┘

Parameters:

  • subset (Object) (defaults to: nil)

    Column name(s) for which null values are considered. If set to nil (default), use all columns.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2668

def drop_nulls(subset: nil)
  lazy.drop_nulls(subset: subset).collect(optimizations: QueryOptFlags._eager)
end

#dtypes ⇒ Array

Get dtypes of columns in DataFrame. Dtypes can also be found in column headers when printing the DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.dtypes
# => [Polars::Int64, Polars::Float64, Polars::String]

Returns:

  • (Array)

# File 'lib/polars/data_frame.rb', line 260

def dtypes
  _df.dtypes
end

#each(&block) ⇒ Object

Iterates over the columns of the DataFrame, yielding each one as a Series. Returns an enumerator if no block is given.

Returns:

  • (Object)

# File 'lib/polars/data_frame.rb', line 416

def each(&block)
  get_columns.each(&block)
end

#each_row(named: true, buffer_size: 500, &block) ⇒ Object

Returns an iterator over the rows of the DataFrame, with each row given as Ruby-native values.

Parameters:

  • named (Boolean) (defaults to: true)

    Return hashes instead of arrays. The hashes are a mapping of column name to row value. This is more expensive than returning an array, but allows for accessing values by column name.

  • buffer_size (Integer) (defaults to: 500)

    Determines the number of rows that are buffered internally while iterating over the data; you should only modify this in very specific cases where the default value is determined not to be a good fit to your access pattern, as the speedup from using the buffer is significant (~2-4x). Setting this value to zero disables row buffering.
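
The two row shapes selected by named can be illustrated with plain Ruby collections. This is only a model of the shapes described above, not a call into Polars, and the column data is made up for the illustration.

```ruby
# Column-oriented data, keyed by column name.
columns = {"foo" => [1, 2, 3], "ham" => ["a", "b", "c"]}
names = columns.keys

# Array rows: values only, in column order (the shape of named: false).
array_rows = columns.values.transpose

# Hash rows: column name => value (the shape of named: true).
hash_rows = array_rows.map { |row| names.zip(row).to_h }

# Array rows are read by position, hash rows by column name.
p array_rows.first[1]
p hash_rows.first["ham"]
```

Building a hash per row is the extra work behind the note that named rows are more expensive than array rows.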

Returns:

  • (Object)

# File 'lib/polars/data_frame.rb', line 6062

def each_row(named: true, buffer_size: 500, &block)
  iter_rows(named: named, buffer_size: buffer_size, &block)
end

#equals(other, null_equal: true) ⇒ Boolean

Check if DataFrame is equal to other.

Examples:

df1 = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df2 = Polars::DataFrame.new(
  {
    "foo" => [3, 2, 1],
    "bar" => [8.0, 7.0, 6.0],
    "ham" => ["c", "b", "a"]
  }
)
df1.equals(df1)
# => true
df1.equals(df2)
# => false

Parameters:

  • other (DataFrame)

    DataFrame to compare with.

  • null_equal (Boolean) (defaults to: true)

    Consider null values as equal.

Returns:

  • (Boolean)

# File 'lib/polars/data_frame.rb', line 2449

def equals(other, null_equal: true)
  _df.equals(other._df, null_equal)
end

#estimated_size(unit = "b") ⇒ Numeric

Return an estimation of the total (heap) allocated size of the DataFrame.

Estimated size is given in the specified unit (bytes by default).

This estimation is the sum of the sizes of its buffers and validity bitmaps, including nested arrays. Multiple arrays may share buffers and bitmaps. Therefore, the size of 2 arrays is not the sum of the sizes computed from this function. In particular, StructArray's size is an upper bound.

When an array is sliced, its allocated size remains constant because the buffer is unchanged. However, this function will yield a smaller number, because it returns the visible size of the buffer, not its total capacity.

FFI buffers are included in this estimation.

Examples:

df = Polars::DataFrame.new(
  {
    "x" => 1_000_000.times.to_a.reverse,
    "y" => 1_000_000.times.map { |v| v / 1000.0 },
    "z" => 1_000_000.times.map(&:to_s)
  },
  schema: {"x" => Polars::UInt32, "y" => Polars::Float64, "z" => Polars::String}
)
df.estimated_size
# => 25888898
df.estimated_size("mb")
# => 17.0601749420166

Parameters:

  • unit ("b", "kb", "mb", "gb", "tb") (defaults to: "b")

    Scale the returned size to the given unit.
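
A minimal sketch of what scaling a byte count to a unit can look like. scale_size is a hypothetical helper, not the library's implementation, and the use of binary factors (1 kb = 1024 bytes) is an assumption.

```ruby
# Hypothetical helper: scales a byte count to one of the accepted units.
# Assumes binary factors (powers of 1024); this is not confirmed library code.
def scale_size(bytes, unit = "b")
  exponents = {"b" => 0, "kb" => 1, "mb" => 2, "gb" => 3, "tb" => 4}
  exponent = exponents.fetch(unit) do
    raise ArgumentError, "unit must be one of #{exponents.keys.join(', ')}"
  end
  # Bytes are returned unchanged; larger units become floating-point values.
  exponent.zero? ? bytes : bytes / (1024.0**exponent)
end

in_bytes = scale_size(1_048_576)
in_kb = scale_size(1_048_576, "kb")
in_mb = scale_size(1_048_576, "mb")

p in_bytes, in_kb, in_mb
```

Under this assumption the byte count stays an Integer, while every other unit yields a Float.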

Returns:

  • (Numeric)


# File 'lib/polars/data_frame.rb', line 1537

def estimated_size(unit = "b")
  sz = _df.estimated_size
  Utils.scale_bytes(sz, to: unit)
end

#explode(columns, *more_columns) ⇒ DataFrame

Explode DataFrame to long format by exploding a column with Lists.

Examples:

df = Polars::DataFrame.new(
  {
    "letters" => ["a", "a", "b", "c"],
    "numbers" => [[1], [2, 3], [4, 5], [6, 7, 8]]
  }
)
df.explode("numbers")
# =>
# shape: (8, 2)
# ┌─────────┬─────────┐
# │ letters ┆ numbers │
# │ ---     ┆ ---     │
# │ str     ┆ i64     │
# ╞═════════╪═════════╡
# │ a       ┆ 1       │
# │ a       ┆ 2       │
# │ a       ┆ 3       │
# │ b       ┆ 4       │
# │ b       ┆ 5       │
# │ c       ┆ 6       │
# │ c       ┆ 7       │
# │ c       ┆ 8       │
# └─────────┴─────────┘

Parameters:

  • columns (Object)

    Column of LargeList type.

  • more_columns (Array)

    Additional names of columns to explode, specified as positional arguments.
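
The row expansion can be pictured with plain Ruby arrays. This is only an analogy, not the library's implementation: explode_rows is a hypothetical helper, rows are [letter, list] pairs, and every list is assumed to be non-empty.

```ruby
# Plain-Ruby illustration; rows are [letter, list] pairs, not a Polars DataFrame.
def explode_rows(rows)
  rows.flat_map do |letter, numbers|
    # Emit one output row per list element, repeating the other column.
    numbers.map { |n| [letter, n] }
  end
end

explode_rows([["a", [1]], ["a", [2, 3]]])
# => [["a", 1], ["a", 2], ["a", 3]]
```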

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 4329

def explode(columns, *more_columns)
  lazy.explode(columns, *more_columns).collect(optimizations: QueryOptFlags._eager)
end

#extend(other) ⇒ DataFrame

Extend the memory backed by this DataFrame with the values from other.

Unlike vstack, which adds the chunks from other to the chunks of this DataFrame, extend appends the data from other to the underlying memory locations and thus may cause a reallocation.

If this does not cause a reallocation, the resulting data structure will not have any extra chunks and thus will yield faster queries.

Prefer extend over vstack when you want to do a query after a single append. For instance during online operations where you add n rows and rerun a query.

Prefer vstack over extend when you want to append many times before doing a query. For instance, when you read in multiple files and want to store them in a single DataFrame. In the latter case, finish the sequence of vstack operations with a rechunk.
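
The difference in chunk layout can be sketched with a toy model in plain Ruby. Everything here is an assumption for illustration: a column is modeled as an array of chunks, and toy_vstack, toy_extend, and toy_rechunk are hypothetical names, not library calls.

```ruby
# Toy model: a column is an array of chunks; each chunk is an array of values.
def toy_vstack(chunks, other_chunks)
  # Cheap: keep the incoming chunks as separate chunks.
  chunks + other_chunks
end

def toy_extend(chunks, other_chunks)
  # Copy everything into one contiguous chunk (may reallocate).
  [chunks.flatten(1) + other_chunks.flatten(1)]
end

def toy_rechunk(chunks)
  # Merge all chunks into a single contiguous chunk.
  [chunks.flatten(1)]
end

toy_vstack([[1, 2, 3]], [[10, 20, 30]]) # => [[1, 2, 3], [10, 20, 30]]
toy_extend([[1, 2, 3]], [[10, 20, 30]]) # => [[1, 2, 3, 10, 20, 30]]
```

In this model, repeated toy_vstack calls followed by one toy_rechunk end in the same single-chunk layout that toy_extend produces directly.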

Examples:

df1 = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
df2 = Polars::DataFrame.new({"foo" => [10, 20, 30], "bar" => [40, 50, 60]})
df1.extend(df2)
# =>
# shape: (6, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 4   │
# │ 2   ┆ 5   │
# │ 3   ┆ 6   │
# │ 10  ┆ 40  │
# │ 20  ┆ 50  │
# │ 30  ┆ 60  │
# └─────┴─────┘

Parameters:

  • other (DataFrame)

    DataFrame to vertically add.

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 3846

def extend(other)
  _df.extend(other._df)
  self
end

#fill_nan(value) ⇒ DataFrame

Note:

Note that floating point NaNs (Not a Number) are not missing values! To replace missing values, use fill_null.

Fill floating point NaN values by an Expression evaluation.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1.5, 2, Float::NAN, 4],
    "b" => [0.5, 4, Float::NAN, 13]
  }
)
df.fill_nan(99)
# =>
# shape: (4, 2)
# ┌──────┬──────┐
# │ a    ┆ b    │
# │ ---  ┆ ---  │
# │ f64  ┆ f64  │
# ╞══════╪══════╡
# │ 1.5  ┆ 0.5  │
# │ 2.0  ┆ 4.0  │
# │ 99.0 ┆ 99.0 │
# │ 4.0  ┆ 13.0 │
# └──────┴──────┘

Parameters:

  • value (Object)

    Value to fill NaN with.
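
The distinction between NaN and null can be shown with plain Ruby values, where nil stands in for null. fill_nan_sketch is a hypothetical helper for illustration only.

```ruby
def fill_nan_sketch(values, fill)
  values.map do |v|
    # Only Float NaN is replaced; nil (null) passes through untouched.
    v.is_a?(Float) && v.nan? ? fill : v
  end
end

fill_nan_sketch([1.5, Float::NAN, nil], 99.0) # => [1.5, 99.0, nil]
```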

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 4292

def fill_nan(value)
  lazy.fill_nan(value).collect(optimizations: QueryOptFlags._eager)
end

#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ DataFrame

Fill null values using the specified value or strategy.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, nil, 4],
    "b" => [0.5, 4, nil, 13]
  }
)
df.fill_null(99)
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 99  ┆ 99.0 │
# │ 4   ┆ 13.0 │
# └─────┴──────┘
df.fill_null(strategy: "forward")
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 2   ┆ 4.0  │
# │ 4   ┆ 13.0 │
# └─────┴──────┘
df.fill_null(strategy: "max")
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 4   ┆ 13.0 │
# │ 4   ┆ 13.0 │
# └─────┴──────┘
df.fill_null(strategy: "zero")
# =>
# shape: (4, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ 0.5  │
# │ 2   ┆ 4.0  │
# │ 0   ┆ 0.0  │
# │ 4   ┆ 13.0 │
# └─────┴──────┘

Parameters:

  • value (Numeric) (defaults to: nil)

    Value used to fill null values.

  • strategy (nil, "forward", "backward", "min", "max", "mean", "zero", "one") (defaults to: nil)

    Strategy used to fill null values.

  • limit (Integer) (defaults to: nil)

    Number of consecutive null values to fill when using the 'forward' or 'backward' strategy.

  • matches_supertype (Boolean) (defaults to: true)

    Fill all columns matching a supertype of the fill value.
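
A few of the strategies can be sketched on a single column in plain Ruby, with nil standing in for null. fill_null_sketch is a hypothetical helper that covers only "forward", "max", and "zero"; it is not the library's implementation.

```ruby
def fill_null_sketch(values, strategy:, limit: nil)
  case strategy
  when "forward"
    last = nil
    run = 0 # length of the current run of consecutive nils
    values.map do |v|
      if v.nil?
        run += 1
        # Fill with the last seen value unless the run exceeds the limit.
        (limit.nil? || run <= limit) ? last : nil
      else
        last = v
        run = 0
        v
      end
    end
  when "max"
    max = values.compact.max
    values.map { |v| v.nil? ? max : v }
  when "zero"
    values.map { |v| v.nil? ? 0 : v }
  else
    raise ArgumentError, "strategy not covered by this sketch: #{strategy.inspect}"
  end
end

fill_null_sketch([1, 2, nil, 4], strategy: "forward") # => [1, 2, 2, 4]
fill_null_sketch([1, nil, nil, 4], strategy: "forward", limit: 1) # => [1, 1, nil, 4]
```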

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 4252

def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true)
  _from_rbdf(
    lazy
      .fill_null(value, strategy: strategy, limit: limit, matches_supertype: matches_supertype)
      .collect(optimizations: QueryOptFlags._eager)
      ._df
  )
end

#filter(*predicates, **constraints) ⇒ DataFrame

Filter the rows in the DataFrame based on a predicate expression.

Examples:

Filter on one condition:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.filter(Polars.col("foo") < 3)
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# └─────┴─────┴─────┘

Filter on multiple conditions:

df.filter((Polars.col("foo") < 3) & (Polars.col("ham") == "a"))
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘

Parameters:

  • predicates (Array)

    Expression(s) that evaluate to a boolean Series.

  • constraints (Hash)

    Column filters; use name: value to filter columns by the supplied value. Each constraint behaves the same as Polars.col(name).eq(value) and is implicitly joined with the other filter conditions using &.
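
How predicates and keyword constraints combine can be sketched in plain Ruby. This is an analogy only: filter_rows is a hypothetical helper, rows are hashes with string keys, and predicates are lambdas rather than Polars expressions.

```ruby
def filter_rows(rows, *predicates, **constraints)
  rows.select do |row|
    # Every predicate and every name: value constraint must hold (logical AND).
    predicates.all? { |predicate| predicate.call(row) } &&
      constraints.all? { |name, value| row[name.to_s] == value }
  end
end

rows = [
  {"foo" => 1, "bar" => 6, "ham" => "a"},
  {"foo" => 2, "bar" => 7, "ham" => "b"},
  {"foo" => 3, "bar" => 8, "ham" => "c"}
]
filter_rows(rows, ->(row) { row["foo"] < 3 }, ham: "a")
# keeps only the row with foo 1
```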

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 1763

def filter(*predicates, **constraints)
  lazy.filter(*predicates, **constraints).collect(optimizations: QueryOptFlags._eager)
end

#flags ⇒ Hash

Get flags that are set on the columns of this DataFrame.

Returns:

  • (Hash)


# File 'lib/polars/data_frame.rb', line 267

def flags
  columns.to_h { |name| [name, self[name].flags] }
end

#fold ⇒ Series

Apply a horizontal reduction on a DataFrame.

This can be used to effectively determine aggregations on a row level, and can be applied to any DataType that can be supercast (cast to a similar parent type).

For instance, the supercast rules when applying an arithmetic operation on two DataTypes are:

  • i8 + str = str
  • f32 + i64 = f32
  • f32 + f64 = f64
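
The reduction itself is a left fold over the columns. The plain-Ruby sketch below mirrors that loop on arrays instead of Series; fold_columns is a hypothetical helper, and Ruby's own Integer-to-Float promotion stands in for supercasting.

```ruby
def fold_columns(columns)
  # Start with the first column and combine it with each following column.
  acc = columns[0]
  columns[1..].each { |column| acc = yield(acc, column) }
  acc
end

columns = [[2, 1, 3], [1, 2, 3], [1.0, 2.0, 3.0]]
fold_columns(columns) { |a, b| a.zip(b).map { |x, y| x + y } }
# => [4.0, 5.0, 9.0]
```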

Examples:

A horizontal sum operation:

df = Polars::DataFrame.new(
  {
    "a" => [2, 1, 3],
    "b" => [1, 2, 3],
    "c" => [1.0, 2.0, 3.0]
  }
)
df.fold { |s1, s2| s1 + s2 }
# =>
# shape: (3,)
# Series: 'a' [f64]
# [
#         4.0
#         5.0
#         9.0
# ]

A horizontal minimum operation:

df = Polars::DataFrame.new({"a" => [2, 1, 3], "b" => [1, 2, 3], "c" => [1.0, 2.0, 3.0]})
df.fold { |s1, s2| s1.zip_with(s1 < s2, s2) }
# =>
# shape: (3,)
# Series: 'a' [f64]
# [
#         1.0
#         1.0
#         3.0
# ]

A horizontal string concatenation:

df = Polars::DataFrame.new(
  {
    "a" => ["foo", "bar", nil],
    "b" => [1, 2, 3],
    "c" => [1.0, 2.0, 3.0]
  }
)
df.fold { |s1, s2| s1 + s2 }
# =>
# shape: (3,)
# Series: 'a' [str]
# [
#         "foo11.0"
#         "bar22.0"
#         null
# ]

A horizontal boolean or, similar to a row-wise .any:

df = Polars::DataFrame.new(
  {
    "a" => [false, false, true],
    "b" => [false, true, false]
  }
)
df.fold { |s1, s2| s1 | s2 }
# =>
# shape: (3,)
# Series: 'a' [bool]
# [
#         false
#         true
#         true
# ]

Returns:

  • (Series)


# File 'lib/polars/data_frame.rb', line 5792

def fold
  acc = to_series(0)

  1.upto(width - 1) do |i|
    acc = yield(acc, to_series(i))
  end
  acc
end

#gather_every(n, offset = 0) ⇒ DataFrame

Take every nth row in the DataFrame and return as a new DataFrame.

Examples:

s = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [5, 6, 7, 8]})
s.gather_every(2)
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 5   │
# │ 3   ┆ 7   │
# └─────┴─────┘
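
The row selection can be pictured in plain Ruby as taking indices offset, offset + n, offset + 2n, and so on. gather_every_sketch is a hypothetical helper operating on an array of rows, not on a DataFrame.

```ruby
def gather_every_sketch(rows, n, offset = 0)
  # Walk the row indices starting at offset in steps of n.
  (offset...rows.length).step(n).map { |i| rows[i] }
end

rows = [[1, 5], [2, 6], [3, 7], [4, 8]]
gather_every_sketch(rows, 2)    # => [[1, 5], [3, 7]]
gather_every_sketch(rows, 2, 1) # => [[2, 6], [4, 8]]
```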

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 6183

def gather_every(n, offset = 0)
  select(F.col("*").gather_every(n, offset))
end

#get_column(name, default: NO_DEFAULT) ⇒ Series

Get a single column by name.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
df.get_column("foo")
# =>
# shape: (3,)
# Series: 'foo' [i64]
# [
#         1
#         2
#         3
# ]
df.get_column("baz", default: Polars::Series.new("baz", ["?", "?", "?"]))
# =>
# shape: (3,)
# Series: 'baz' [str]
# [
#         "?"
#         "?"
#         "?"
# ]

Parameters:

  • name (String)

    Name of the column to retrieve.

  • default (Object) (defaults to: NO_DEFAULT)

    Value to return if the column does not exist; if not explicitly set and the column is not present, a ColumnNotFoundError exception is raised.
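
The default handling follows a sentinel pattern, which lets nil itself be passed as a legitimate default. The sketch below reproduces the pattern on a plain Hash; fetch_column, NO_DEFAULT_SKETCH, and the use of KeyError are stand-ins chosen for illustration.

```ruby
# Unique sentinel object: distinguishes "no default given" from default: nil.
NO_DEFAULT_SKETCH = Object.new.freeze

def fetch_column(columns, name, default: NO_DEFAULT_SKETCH)
  columns.fetch(name.to_s)
rescue KeyError
  # Re-raise the lookup error unless the caller supplied a default.
  raise if default.equal?(NO_DEFAULT_SKETCH)
  default
end

columns = {"foo" => [1, 2, 3]}
fetch_column(columns, "foo")                # => [1, 2, 3]
fetch_column(columns, "baz", default: nil)  # => nil
```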

Returns:

  • (Series)


# File 'lib/polars/data_frame.rb', line 4166

def get_column(name, default: NO_DEFAULT)
  Utils.wrap_s(_df.get_column(name.to_s))
rescue ColumnNotFoundError
  raise if default.eql?(NO_DEFAULT)
  default
end

#get_column_index(name) ⇒ Integer

Find the index of a column by name.

Examples:

df = Polars::DataFrame.new(
  {"foo" => [1, 2, 3], "bar" => [6, 7, 8], "ham" => ["a", "b", "c"]}
)
df.get_column_index("ham")
# => 2

Parameters:

  • name (String)

    Name of the column to find.

Returns:

  • (Integer)


# File 'lib/polars/data_frame.rb', line 2071

def get_column_index(name)
  _df.get_column_index(name)
end

#get_columns ⇒ Array

Get the DataFrame as an Array of Series.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
df.get_columns
# =>
# [shape: (3,)
# Series: 'foo' [i64]
# [
#         1
#         2
#         3
# ], shape: (3,)
# Series: 'bar' [i64]
# [
#         4
#         5
#         6
# ]]
df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
)
df.get_columns
# =>
# [shape: (4,)
# Series: 'a' [i64]
# [
#         1
#         2
#         3
#         4
# ], shape: (4,)
# Series: 'b' [f64]
# [
#         0.5
#         4.0
#         10.0
#         13.0
# ], shape: (4,)
# Series: 'c' [bool]
# [
#         true
#         true
#         false
#         true
# ]]

Returns:

  • (Array<Series>)


# File 'lib/polars/data_frame.rb', line 4130

def get_columns
  _df.get_columns.map { |s| Utils.wrap_s(s) }
end

#glimpse(max_items_per_column: 10, max_colname_length: 50, return_type: nil) ⇒ Object

Return a dense preview of the DataFrame.

The formatting shows one line per column so that wide dataframes display cleanly. Each line shows the column name, the data type, and the first few values.

Examples:

Return the glimpse output as a DataFrame:

df = Polars::DataFrame.new(
  {
    "a" => [1.0, 2.8, 3.0],
    "b" => [4, 5, nil],
    "c" => [true, false, true],
    "d" => [nil, "b", "c"],
    "e" => ["usd", "eur", nil],
    "f" => [Date.new(2020, 1, 1), Date.new(2021, 1, 2), Date.new(2022, 1, 1)]
  }
)
df.glimpse(return_type: "frame")
# =>
# shape: (6, 3)
# ┌────────┬───────┬─────────────────────────────────┐
# │ column ┆ dtype ┆ values                          │
# │ ---    ┆ ---   ┆ ---                             │
# │ str    ┆ str   ┆ list[str]                       │
# ╞════════╪═══════╪═════════════════════════════════╡
# │ a      ┆ f64   ┆ ["1.0", "2.8", "3.0"]           │
# │ b      ┆ i64   ┆ ["4", "5", null]                │
# │ c      ┆ bool  ┆ ["true", "false", "true"]       │
# │ d      ┆ str   ┆ [null, ""b"", ""c""]            │
# │ e      ┆ str   ┆ [""usd"", ""eur"", null]        │
# │ f      ┆ date  ┆ ["2020-01-01", "2021-01-02", "… │
# └────────┴───────┴─────────────────────────────────┘

Parameters:

  • max_items_per_column (Integer) (defaults to: 10)

    Maximum number of items to show per column.

  • max_colname_length (Integer) (defaults to: 50)

    Maximum length of the displayed column names; values that exceed this value are truncated with a trailing ellipsis.

  • return_type (nil, 'self', 'frame', 'string') (defaults to: nil)

    Modify the return format:

    • nil (default): Print the glimpse output to stdout, returning nil.
    • "self": Print the glimpse output to stdout, returning the original frame.
    • "frame": Return the glimpse output as a new DataFrame.
    • "string": Return the glimpse output as a string.

Returns:

  • (Object)


# File 'lib/polars/data_frame.rb', line 1934

def glimpse(
  max_items_per_column: 10,
  max_colname_length: 50,
  return_type: nil
)
  if return_type.nil?
    return_frame = false
  else
    return_frame = return_type == "frame"
    if !return_frame && !["self", "string"].include?(return_type)
      msg = "invalid `return_type`; found #{return_type.inspect}, expected one of 'string', 'frame', 'self', or nil"
      raise ArgumentError, msg
    end
  end

  # always print at most this number of values (mainly ensures that
  # we do not cast long arrays to strings, which would be slow)
  max_n_values = [max_items_per_column, height].min
  schema = self.schema

  _column_to_row_output = lambda do |col_name, dtype|
    fn = schema[col_name] == String ? :inspect : :to_s
    values = self[0...max_n_values, col_name].to_a
    if col_name.length > max_colname_length
      col_name = col_name[0...(max_colname_length - 1)] + "…"
    end
    dtype_str = Plr.dtype_str_repr(dtype)
    if !return_frame
      dtype_str = "<#{dtype_str}>"
    end
    [col_name, dtype_str, values.map { |v| !v.nil? ? v.send(fn) : nil }]
  end

  data = self.schema.map { |s, dtype| _column_to_row_output.(s, dtype) }

  # output one row per column
  if return_frame
    DataFrame.new(
      data,
      orient: "row",
      schema: {"column" => String, "dtype" => String, "values" => List.new(String)}
    )
  else
    raise Todo
  end
end

#group_by(by, maintain_order: false, **named_by) ⇒ GroupBy

Start a group by operation.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
)
df.group_by("a").agg(Polars.col("b").sum).sort("a")
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ a   ┆ 4   │
# │ b   ┆ 11  │
# │ c   ┆ 6   │
# └─────┴─────┘

Parameters:

  • by (Object)

    Column(s) to group by.

  • maintain_order (Boolean) (defaults to: false)

    Make sure that the order of the groups remains consistent. This is more expensive than a default group by. Note that this only works in expression aggregations.

  • named_by (Hash)

    Additional columns to group by, specified as keyword arguments. The columns will be renamed to the keyword used.
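
As a rough plain-Ruby analogy for the group-then-aggregate flow, the sketch below groups [key, value] pairs and sums each group. group_sum is a hypothetical helper; it uses Ruby's Enumerable#group_by, not the Polars GroupBy object.

```ruby
def group_sum(rows)
  # rows are [key, value] pairs; group by key, sum the values, sort by key.
  rows.group_by(&:first)
      .transform_values { |group| group.sum { |_key, value| value } }
      .sort
      .to_h
end

group_sum([["a", 1], ["b", 2], ["a", 3], ["b", 4], ["b", 5], ["c", 6]])
# sums per key: a is 4, b is 11, c is 6
```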

Returns:

  • (GroupBy)


# File 'lib/polars/data_frame.rb', line 2778

def group_by(by, maintain_order: false, **named_by)
  named_by.each do |_, value|
    if !(value.is_a?(::String) || value.is_a?(Expr) || value.is_a?(Series))
      msg = "Expected Polars expression or object convertible to one, got #{value.class.name}."
      raise TypeError, msg
    end
  end
  GroupBy.new(
    self,
    by,
    **named_by,
    maintain_order: maintain_order
  )
end

#group_by_dynamic(index_column, every:, period: nil, offset: nil, include_boundaries: false, closed: "left", label: "left", group_by: nil, start_by: "window") ⇒ DynamicGroupBy

Group based on a time value (or index value of type Int32, Int64).

Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. The time/index window can be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.

A window is defined by:

  • every: interval of the window
  • period: length of the window
  • offset: offset of the window

The every, period and offset arguments are created with the following string language:

  • 1ns (1 nanosecond)
  • 1us (1 microsecond)
  • 1ms (1 millisecond)
  • 1s (1 second)
  • 1m (1 minute)
  • 1h (1 hour)
  • 1d (1 day)
  • 1w (1 week)
  • 1mo (1 calendar month)
  • 1y (1 calendar year)
  • 1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

In case of a group_by_dynamic on an integer column, the windows are defined by:

  • "1i" # length 1
  • "10i" # length 10
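
How a single value can land in several windows is easiest to see on an integer index. The plain-Ruby sketch below is an assumption-laden illustration, not the library's algorithm: windows_for and dynamic_groups are hypothetical helpers, window lower bounds are assumed to sit at offset + k * every, and start_by is ignored.

```ruby
def windows_for(value, every:, period: every, offset: 0, closed: "left")
  # Only lower bounds between value - period and value can cover the value.
  k_min = (value - offset - period).div(every)
  k_max = (value - offset).div(every)
  (k_min..k_max).filter_map do |k|
    lower = offset + k * every
    upper = lower + period
    inside =
      case closed
      when "left" then lower <= value && value < upper
      when "right" then lower < value && value <= upper
      when "both" then lower <= value && value <= upper
      when "none" then lower < value && value < upper
      else raise ArgumentError, "unknown closed: #{closed.inspect}"
      end
    [lower, upper] if inside
  end
end

def dynamic_groups(index, labels, **window_opts)
  groups = Hash.new { |hash, window| hash[window] = [] }
  index.zip(labels).each do |i, label|
    # A label is appended to every window that contains its index value.
    windows_for(i, **window_opts).each { |window| groups[window] << label }
  end
  groups.sort.to_h
end

dynamic_groups([0, 1, 2, 3, 4, 5], %w[A A B B B C], every: 2, period: 3, closed: "right")
# windows (-2, 1], (0, 3], (2, 5], (4, 7] with overlapping members
```

Because period (3) is larger than every (2), consecutive windows overlap, so index values 1, 3, and 5 each fall into two windows.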

Examples:

df = Polars::DataFrame.new(
  {
    "time" => Polars.datetime_range(
      DateTime.new(2021, 12, 16),
      DateTime.new(2021, 12, 16, 3),
      "30m",
      time_unit: "us",
      eager: true
    ),
    "n" => 0..6
  }
)
# =>
# shape: (7, 2)
# ┌─────────────────────┬─────┐
# │ time                ┆ n   │
# │ ---                 ┆ --- │
# │ datetime[μs]        ┆ i64 │
# ╞═════════════════════╪═════╡
# │ 2021-12-16 00:00:00 ┆ 0   │
# │ 2021-12-16 00:30:00 ┆ 1   │
# │ 2021-12-16 01:00:00 ┆ 2   │
# │ 2021-12-16 01:30:00 ┆ 3   │
# │ 2021-12-16 02:00:00 ┆ 4   │
# │ 2021-12-16 02:30:00 ┆ 5   │
# │ 2021-12-16 03:00:00 ┆ 6   │
# └─────────────────────┴─────┘

Group by windows of 1 hour starting at 2021-12-16 00:00:00.

df.group_by_dynamic("time", every: "1h", closed: "right").agg(
  [
    Polars.col("time").min.alias("time_min"),
    Polars.col("time").max.alias("time_max")
  ]
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬─────────────────────┬─────────────────────┐
# │ time                ┆ time_min            ┆ time_max            │
# │ ---                 ┆ ---                 ┆ ---                 │
# │ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        │
# ╞═════════════════════╪═════════════════════╪═════════════════════╡
# │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 00:00:00 │
# │ 2021-12-16 00:00:00 ┆ 2021-12-16 00:30:00 ┆ 2021-12-16 01:00:00 │
# │ 2021-12-16 01:00:00 ┆ 2021-12-16 01:30:00 ┆ 2021-12-16 02:00:00 │
# │ 2021-12-16 02:00:00 ┆ 2021-12-16 02:30:00 ┆ 2021-12-16 03:00:00 │
# └─────────────────────┴─────────────────────┴─────────────────────┘

The window boundaries can also be added to the aggregation result.

df.group_by_dynamic(
  "time", every: "1h", include_boundaries: true, closed: "right"
).agg([Polars.col("time").count.alias("time_count")])
# =>
# shape: (4, 4)
# ┌─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
# │ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
# │ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
# │ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
# ╞═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
# │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
# │ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 2          │
# │ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
# │ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
# └─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

When closed: "left", the right end of the interval is not included.

df.group_by_dynamic("time", every: "1h", closed: "left").agg(
  [
    Polars.col("time").count.alias("time_count"),
    Polars.col("time").alias("time_agg_list")
  ]
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬────────────┬─────────────────────────────────┐
# │ time                ┆ time_count ┆ time_agg_list                   │
# │ ---                 ┆ ---        ┆ ---                             │
# │ datetime[μs]        ┆ u32        ┆ list[datetime[μs]]              │
# ╞═════════════════════╪════════════╪═════════════════════════════════╡
# │ 2021-12-16 00:00:00 ┆ 2          ┆ [2021-12-16 00:00:00, 2021-12-… │
# │ 2021-12-16 01:00:00 ┆ 2          ┆ [2021-12-16 01:00:00, 2021-12-… │
# │ 2021-12-16 02:00:00 ┆ 2          ┆ [2021-12-16 02:00:00, 2021-12-… │
# │ 2021-12-16 03:00:00 ┆ 1          ┆ [2021-12-16 03:00:00]           │
# └─────────────────────┴────────────┴─────────────────────────────────┘

When closed: "both", the time values at the window boundaries belong to 2 groups.

df.group_by_dynamic("time", every: "1h", closed: "both").agg(
  [Polars.col("time").count.alias("time_count")]
)
# =>
# shape: (5, 2)
# ┌─────────────────────┬────────────┐
# │ time                ┆ time_count │
# │ ---                 ┆ ---        │
# │ datetime[μs]        ┆ u32        │
# ╞═════════════════════╪════════════╡
# │ 2021-12-15 23:00:00 ┆ 1          │
# │ 2021-12-16 00:00:00 ┆ 3          │
# │ 2021-12-16 01:00:00 ┆ 3          │
# │ 2021-12-16 02:00:00 ┆ 3          │
# │ 2021-12-16 03:00:00 ┆ 1          │
# └─────────────────────┴────────────┘

Dynamic group bys can also be combined with grouping on normal keys.

df = Polars::DataFrame.new(
  {
    "time" => Polars.datetime_range(
      DateTime.new(2021, 12, 16),
      DateTime.new(2021, 12, 16, 3),
      "30m",
      time_unit: "us",
      eager: true
    ),
    "groups" => ["a", "a", "a", "b", "b", "a", "a"]
  }
)
df.group_by_dynamic(
  "time",
  every: "1h",
  closed: "both",
  group_by: "groups",
  include_boundaries: true
).agg([Polars.col("time").count.alias("time_count")])
# =>
# shape: (7, 5)
# ┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
# │ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
# │ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
# │ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
# ╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
# │ a      ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
# │ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3          │
# │ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1          │
# │ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
# │ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1          │
# │ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
# │ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1          │
# └────────┴─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

Dynamic group by on an index column.

df = Polars::DataFrame.new(
  {
    "idx" => Polars.arange(0, 6, eager: true),
    "A" => ["A", "A", "B", "B", "B", "C"]
  }
)
df.group_by_dynamic(
  "idx",
  every: "2i",
  period: "3i",
  include_boundaries: true,
  closed: "right"
).agg(Polars.col("A").alias("A_agg_list"))
# =>
# shape: (4, 4)
# ┌─────────────────┬─────────────────┬─────┬─────────────────┐
# │ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list      │
# │ ---             ┆ ---             ┆ --- ┆ ---             │
# │ i64             ┆ i64             ┆ i64 ┆ list[str]       │
# ╞═════════════════╪═════════════════╪═════╪═════════════════╡
# │ -2              ┆ 1               ┆ -2  ┆ ["A", "A"]      │
# │ 0               ┆ 3               ┆ 0   ┆ ["A", "B", "B"] │
# │ 2               ┆ 5               ┆ 2   ┆ ["B", "B", "C"] │
# │ 4               ┆ 7               ┆ 4   ┆ ["C"]           │
# └─────────────────┴─────────────────┴─────┴─────────────────┘

Parameters:

  • index_column

    Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; otherwise, the output will not make sense.

    In case of a dynamic group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.

  • every

    Interval of the window.

  • period (defaults to: nil)

    Length of the window; if nil, it is equal to every.

  • offset (defaults to: nil)

    Offset of the window; if nil and period is nil, it is equal to negative every.

  • include_boundaries (defaults to: false)

    Add the lower and upper bound of the window to the "_lower_boundary" and "_upper_boundary" columns. This will impact performance because it's harder to parallelize.

  • closed ("right", "left", "both", "none") (defaults to: "left")

    Define whether the temporal window interval is closed or not.

  • label ('left', 'right', 'datapoint') (defaults to: "left")

    Define which label to use for the window:

    • 'left': lower boundary of the window
    • 'right': upper boundary of the window
    • 'datapoint': the first value of the index column in the given window. If you don't need the label to be at one of the boundaries, choose this option for maximum performance.

  • group_by (defaults to: nil)

    Also group by this column/these columns.

  • start_by ('window', 'datapoint', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday') (defaults to: "window")

    The strategy to determine the start of the first window by.

    • 'window': Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
    • 'datapoint': Start from the first encountered data point.
    • a day of the week (only takes effect if every contains 'w'):

      • 'monday': Start the window on the Monday before the first data point.
      • 'tuesday': Start the window on the Tuesday before the first data point.
      • ...
      • 'sunday': Start the window on the Sunday before the first data point.

    The resulting window is then shifted back until the earliest datapoint is in or in front of it.

Returns:

  • (DynamicGroupBy)


# File 'lib/polars/data_frame.rb', line 3138

def group_by_dynamic(
  index_column,
  every:,
  period: nil,
  offset: nil,
  include_boundaries: false,
  closed: "left",
  label: "left",
  group_by: nil,
  start_by: "window"
)
  DynamicGroupBy.new(
    self,
    index_column,
    every,
    period,
    offset,
    include_boundaries,
    closed,
    label,
    group_by,
    start_by
  )
end

#hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil) ⇒ Series

Hash and combine the rows in this DataFrame.

The hash value is of type UInt64.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, nil, 3, 4],
    "ham" => ["a", "b", nil, "d"]
  }
)
df.hash_rows(seed: 42)
# =>
# shape: (4,)
# Series: '' [u64]
# [
#         4238614331852490969
#         17976148875586754089
#         4702262519505526977
#         18144177983981041107
# ]

Parameters:

  • seed (Integer) (defaults to: 0)

    Random seed parameter. Defaults to 0.

  • seed_1 (Integer) (defaults to: nil)

    Random seed parameter. Defaults to seed if not set.

  • seed_2 (Integer) (defaults to: nil)

    Random seed parameter. Defaults to seed if not set.

  • seed_3 (Integer) (defaults to: nil)

    Random seed parameter. Defaults to seed if not set.

Returns:

  • (Series)


# File 'lib/polars/data_frame.rb', line 6219

def hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil)
  k0 = seed
  k1 = seed_1.nil? ? seed : seed_1
  k2 = seed_2.nil? ? seed : seed_2
  k3 = seed_3.nil? ? seed : seed_3
  Utils.wrap_s(_df.hash_rows(k0, k1, k2, k3))
end

#head(n = 5) ⇒ DataFrame

Get the first n rows.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3, 4, 5],
    "bar" => [6, 7, 8, 9, 10],
    "ham" => ["a", "b", "c", "d", "e"]
  }
)
df.head(3)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# │ 3   ┆ 8   ┆ c   │
# └─────┴─────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows to return.

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 2546

def head(n = 5)
  _from_rbdf(_df.head(n))
end

#height ⇒ Integer Also known as: count, length, size

Get the height of the DataFrame.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3, 4, 5]})
df.height
# => 5

Returns:

  • (Integer)


# File 'lib/polars/data_frame.rb', line 176

def height
  _df.height
end

#hstack(columns, in_place: false) ⇒ DataFrame

Return a new DataFrame grown horizontally by stacking multiple Series to it.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
x = Polars::Series.new("apple", [10, 20, 30])
df.hstack([x])
# =>
# shape: (3, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ i64 ┆ str ┆ i64   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6   ┆ a   ┆ 10    │
# │ 2   ┆ 7   ┆ b   ┆ 20    │
# │ 3   ┆ 8   ┆ c   ┆ 30    │
# └─────┴─────┴─────┴───────┘

Parameters:

  • columns (Object)

    Series to stack.

  • in_place (Boolean) (defaults to: false)

    Modify in place.

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 3748

def hstack(columns, in_place: false)
  if !columns.is_a?(::Array)
    columns = columns.get_columns
  end
  if in_place
    _df.hstack_mut(columns.map(&:_s))
    self
  else
    _from_rbdf(_df.hstack(columns.map(&:_s)))
  end
end

#include?(name) ⇒ Boolean

Check if the DataFrame includes a column with the given name.

Returns:

  • (Boolean)


# File 'lib/polars/data_frame.rb', line 409

def include?(name)
  columns.include?(name)
end

#insert_column(index, column) ⇒ DataFrame

Insert a Series at a certain column index. This operation is in place.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
s = Polars::Series.new("baz", [97, 98, 99])
df.insert_column(1, s)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ baz ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 97  ┆ 4   │
# │ 2   ┆ 98  ┆ 5   │
# │ 3   ┆ 99  ┆ 6   │
# └─────┴─────┴─────┘
df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
)
s = Polars::Series.new("d", [-2.5, 15, 20.5, 0])
df.insert_column(3, s)
# =>
# shape: (4, 4)
# ┌─────┬──────┬───────┬──────┐
# │ a   ┆ b    ┆ c     ┆ d    │
# │ --- ┆ ---  ┆ ---   ┆ ---  │
# │ i64 ┆ f64  ┆ bool  ┆ f64  │
# ╞═════╪══════╪═══════╪══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ -2.5 │
# │ 2   ┆ 4.0  ┆ true  ┆ 15.0 │
# │ 3   ┆ 10.0 ┆ false ┆ 20.5 │
# │ 4   ┆ 13.0 ┆ true  ┆ 0.0  │
# └─────┴──────┴───────┴──────┘

Parameters:

  • index (Integer)

    Column index at which to insert the new Series.

  • column (Series)

    Series to insert.
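
The source listing for this method normalizes a negative index by adding the frame's width before inserting. A plain-Ruby sketch of the same arithmetic on an array of column names (insert_column_sketch is a hypothetical helper):

```ruby
def insert_column_sketch(columns, index, column)
  # A negative index counts from the end: -1 becomes width - 1.
  index = columns.length + index if index < 0
  # Array#insert mutates in place and returns the array itself.
  columns.insert(index, column)
end

insert_column_sketch(["foo", "bar"], 1, "baz")  # => ["foo", "baz", "bar"]
insert_column_sketch(["foo", "bar"], -1, "baz") # => ["foo", "baz", "bar"]
```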

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 1713

def insert_column(index, column)
  if index < 0
    index = width + index
  end
  _df.insert_column(index, column._s)
  self
end

#interpolate ⇒ DataFrame

Interpolate intermediate values. The interpolation method is linear.
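
Linear interpolation on one column can be sketched in plain Ruby, with nil standing in for null. interpolate_linear is a hypothetical helper for illustration; it fills only gaps that lie between two known values and leaves leading or trailing nils alone.

```ruby
def interpolate_linear(values)
  result = values.map { |v| v&.to_f }
  known = result.each_index.select { |i| !result[i].nil? }
  # Fill each gap between neighboring known values along a straight line.
  known.each_cons(2) do |left, right|
    step = (result[right] - result[left]) / (right - left)
    ((left + 1)...right).each { |i| result[i] = result[left] + step * (i - left) }
  end
  result
end

interpolate_linear([1, nil, 9, 10]) # => [1.0, 5.0, 9.0, 10.0]
interpolate_linear([6, 7, 9, nil])  # => [6.0, 7.0, 9.0, nil]
```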

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, nil, 9, 10],
    "bar" => [6, 7, 9, nil],
    "baz" => [1, nil, nil, 9]
  }
)
df.interpolate
# =>
# shape: (4, 3)
# ┌──────┬──────┬──────────┐
# │ foo  ┆ bar  ┆ baz      │
# │ ---  ┆ ---  ┆ ---      │
# │ f64  ┆ f64  ┆ f64      │
# ╞══════╪══════╪══════════╡
# │ 1.0  ┆ 6.0  ┆ 1.0      │
# │ 5.0  ┆ 7.0  ┆ 3.666667 │
# │ 9.0  ┆ 9.0  ┆ 6.333333 │
# │ 10.0 ┆ null ┆ 9.0      │
# └──────┴──────┴──────────┘

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 6252

def interpolate
  select(F.col("*").interpolate)
end

#is_duplicated ⇒ Series

Get a mask of all duplicated rows in this DataFrame.
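
The mask marks every occurrence of a row that appears more than once, not just the repeats. A plain-Ruby sketch of that rule on an array of rows (duplicate_mask and unique_mask are hypothetical helpers; unique_mask is simply the complement):

```ruby
def duplicate_mask(rows)
  counts = rows.tally # row => number of occurrences
  rows.map { |row| counts[row] > 1 }
end

def unique_mask(rows)
  counts = rows.tally
  rows.map { |row| counts[row] == 1 }
end

rows = [[1, "x"], [2, "y"], [3, "z"], [1, "x"]]
duplicate_mask(rows) # => [true, false, false, true]
unique_mask(rows)    # => [false, true, true, false]
```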

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 1],
    "b" => ["x", "y", "z", "x"],
  }
)
df.is_duplicated
# =>
# shape: (4,)
# Series: '' [bool]
# [
#         true
#         false
#         false
#         true
# ]
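
The mask marks every row that occurs more than once, including its first occurrence. A plain-Ruby sketch of the same idea on an array of rows; the helpers `duplicated_mask` and `unique_mask` are hypothetical and only illustrate the semantics:

```ruby
# Hypothetical helpers: count how often each row occurs, then flag the rows.
def duplicated_mask(rows)
  counts = rows.tally
  rows.map { |row| counts[row] > 1 }
end

# The complementary mask: true only for rows that occur exactly once.
def unique_mask(rows)
  counts = rows.tally
  rows.map { |row| counts[row] == 1 }
end

rows = [[1, "x"], [2, "y"], [3, "z"], [1, "x"]]
duplicated_mask(rows)  # => [true, false, false, true]
unique_mask(rows)      # => [false, true, true, false]
```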

Returns:



# File 'lib/polars/data_frame.rb', line 4775

def is_duplicated
  Utils.wrap_s(_df.is_duplicated)
end

#is_empty ⇒ Boolean Also known as: empty?

Check if the dataframe is empty.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
df.is_empty
# => false
df.filter(Polars.col("foo") > 99).is_empty
# => true

Returns:



# File 'lib/polars/data_frame.rb', line 6266

def is_empty
  height == 0
end

#is_unique ⇒ Series

Get a mask of all unique rows in this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 1],
    "b" => ["x", "y", "z", "x"]
  }
)
df.is_unique
# =>
# shape: (4,)
# Series: '' [bool]
# [
#         false
#         true
#         true
#         false
# ]

Returns:



# File 'lib/polars/data_frame.rb', line 4800

def is_unique
  Utils.wrap_s(_df.is_unique)
end

#item(row = nil, column = nil) ⇒ Object

Note:

If row/col not provided, this is equivalent to df[0,0], with a check that the shape is (1,1). With row/col, this is equivalent to df[row,col].

Return the DataFrame as a scalar, or return the element at the given row/column.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => [4, 5, 6]})
df.select((Polars.col("a") * Polars.col("b")).sum).item
# => 32
df.item(1, 1)
# => 5
df.item(2, "b")
# => 6

Parameters:

  • row (Integer) (defaults to: nil)

    Optional row index.

  • column (Integer, String) (defaults to: nil)

    Optional column index or name.

Returns:



# File 'lib/polars/data_frame.rb', line 732

def item(row = nil, column = nil)
  if row.nil? && column.nil?
    if shape != [1, 1]
      msg = (
        "can only call `.item()` if the dataframe is of shape (1, 1)," +
        " or if explicit row/col values are provided;" +
        " frame has shape #{shape.inspect}"
      )
      raise ArgumentError, msg
    end
    return _df.to_series(0).get_index(0)

  elsif row.nil? || column.nil?
    msg = "cannot call `.item()` with only one of `row` or `column`"
    raise ArgumentError, msg
  end

  s =
    if column.is_a?(Integer)
      _df.to_series(column)
    else
      _df.get_column(column)
    end
  s.get_index_signed(row)
end

#iter_columns ⇒ Object

Note:

Consider whether you can use all instead. If you can, it will be more efficient.

Returns an iterator over the columns of this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
)
df.iter_columns.map { |s| s.name }
# => ["a", "b"]

If you're using this to modify a dataframe's columns, e.g.

# Do NOT do this
Polars::DataFrame.new(df.iter_columns.map { |column| column * 2 })
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 4   │
# │ 6   ┆ 8   │
# │ 10  ┆ 12  │
# └─────┴─────┘

then consider whether you can use all instead:

df.select(Polars.all * 2)
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 4   │
# │ 6   ┆ 8   │
# │ 10  ┆ 12  │
# └─────┴─────┘

Returns:



# File 'lib/polars/data_frame.rb', line 6112

def iter_columns
  return to_enum(:iter_columns) unless block_given?

  _df.get_columns.each do |s|
    yield Utils.wrap_s(s)
  end
end

#iter_rows(named: false, buffer_size: 512, &block) ⇒ Object

Returns an iterator over the rows of the DataFrame as Ruby-native values.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
)
df.iter_rows.map { |row| row[0] }
# => [1, 3, 5]
df.iter_rows(named: true).map { |row| row["b"] }
# => [2, 4, 6]
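
The buffer_size and named options can be pictured with a plain-Ruby sketch: rows are fetched in chunks of buffer_size, and named rows are built by zipping column names with values. The `RowSource` class and `each_row_buffered` helper are hypothetical stand-ins, not Polars API:

```ruby
# Hypothetical stand-in for a DataFrame: it only knows its height and how to
# hand out slices of rows, and it counts how often a slice is requested.
class RowSource
  attr_reader :slice_calls

  def initialize(rows)
    @rows = rows
    @slice_calls = 0
  end

  def height
    @rows.length
  end

  def slice(offset, length)
    @slice_calls += 1
    @rows[offset, length] || []
  end
end

# Yields every row, fetching buffer_size rows per slice call.
def each_row_buffered(source, columns, named: false, buffer_size: 512)
  raise ArgumentError, "buffer_size must be positive" unless buffer_size.positive?

  offset = 0
  while offset < source.height
    source.slice(offset, buffer_size).each do |row|
      yield(named ? columns.zip(row).to_h : row)
    end
    offset += buffer_size
  end
end
```

With 1,200 rows and the default buffer of 512, this sketch requests three slices instead of making 1,200 single-row calls, which is the kind of saving the buffer is meant to provide.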

Parameters:

  • named (Boolean) (defaults to: false)

    Return hashes instead of arrays. The hashes are a mapping of column name to row value. This is more expensive than returning an array, but allows for accessing values by column name.

  • buffer_size (Integer) (defaults to: 512)

    Determines the number of rows that are buffered internally while iterating over the data. Change this only in very specific cases where the default value is known to be a poor fit for your access pattern, as the speedup from using the buffer is significant (~2-4x). Setting this value to nil disables row buffering.

Returns:



# File 'lib/polars/data_frame.rb', line 6015

def iter_rows(named: false, buffer_size: 512, &block)
  return to_enum(:iter_rows, named: named, buffer_size: buffer_size) unless block_given?

  # load into the local namespace for a modest performance boost in the hot loops
  columns = self.columns

  # note: buffering rows results in a 2-4x speedup over individual calls
  # to ".row(i)", so it should only be disabled in extremely specific cases.
  if buffer_size
    offset = 0
    while offset < height
      zerocopy_slice = slice(offset, buffer_size)
      rows_chunk = zerocopy_slice.rows(named: false)
      if named
        rows_chunk.each do |row|
          yield columns.zip(row).to_h
        end
      else
        rows_chunk.each(&block)
      end
      offset += buffer_size
    end
  elsif named
    height.times do |i|
      yield columns.zip(row(i)).to_h
    end
  else
    height.times do |i|
      yield row(i)
    end
  end
end

#iter_slices(n_rows: 10_000) ⇒ Object

Returns a non-copying iterator of slices over the underlying DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => 0...17_500,
    "b" => Date.new(2023, 1, 1),
    "c" => "klmnoopqrstuvwxyz"
  },
  schema_overrides: {"a" => Polars::Int32}
)
df.iter_slices.map.with_index do |frame, idx|
  "#{frame.class.name}:[#{idx}]:#{frame.length}"
end
# => ["Polars::DataFrame:[0]:10000", "Polars::DataFrame:[1]:7500"]
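
The slice sizes in the example follow from stepping an offset by n_rows until the frame height is reached. A small plain-Ruby sketch of that arithmetic; the helper `slice_bounds` is hypothetical:

```ruby
# Hypothetical helper: returns [offset, length] pairs for each slice.
def slice_bounds(height, n_rows: 10_000)
  bounds = []
  offset = 0
  while offset < height
    # The last slice is shorter when height is not a multiple of n_rows.
    bounds << [offset, [n_rows, height - offset].min]
    offset += n_rows
  end
  bounds
end

slice_bounds(17_500)
# => [[0, 10000], [10000, 7500]]
```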

Parameters:

  • n_rows (Integer) (defaults to: 10_000)

    Determines the number of rows contained in each DataFrame slice.

Returns:



# File 'lib/polars/data_frame.rb', line 6140

def iter_slices(n_rows: 10_000)
  return to_enum(:iter_slices, n_rows: n_rows) unless block_given?

  offset = 0
  while offset < height
    yield slice(offset, n_rows)
    offset += n_rows
  end
end

#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", nulls_equal: false, coalesce: nil, maintain_order: nil) ⇒ DataFrame

Join in SQL-like fashion.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
other_df = Polars::DataFrame.new(
  {
    "apple" => ["x", "y", "z"],
    "ham" => ["a", "b", "d"]
  }
)
df.join(other_df, on: "ham")
# =>
# shape: (2, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ str ┆ str   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6.0 ┆ a   ┆ x     │
# │ 2   ┆ 7.0 ┆ b   ┆ y     │
# └─────┴─────┴─────┴───────┘
df.join(other_df, on: "ham", how: "full")
# =>
# shape: (4, 5)
# ┌──────┬──────┬──────┬───────┬───────────┐
# │ foo  ┆ bar  ┆ ham  ┆ apple ┆ ham_right │
# │ ---  ┆ ---  ┆ ---  ┆ ---   ┆ ---       │
# │ i64  ┆ f64  ┆ str  ┆ str   ┆ str       │
# ╞══════╪══════╪══════╪═══════╪═══════════╡
# │ 1    ┆ 6.0  ┆ a    ┆ x     ┆ a         │
# │ 2    ┆ 7.0  ┆ b    ┆ y     ┆ b         │
# │ null ┆ null ┆ null ┆ z     ┆ d         │
# │ 3    ┆ 8.0  ┆ c    ┆ null  ┆ null      │
# └──────┴──────┴──────┴───────┴───────────┘
df.join(other_df, on: "ham", how: "left")
# =>
# shape: (3, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ str ┆ str   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6.0 ┆ a   ┆ x     │
# │ 2   ┆ 7.0 ┆ b   ┆ y     │
# │ 3   ┆ 8.0 ┆ c   ┆ null  │
# └─────┴─────┴─────┴───────┘
df.join(other_df, on: "ham", how: "semi")
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6.0 ┆ a   │
# │ 2   ┆ 7.0 ┆ b   │
# └─────┴─────┴─────┘
df.join(other_df, on: "ham", how: "anti")
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# └─────┴─────┴─────┘
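
The semi and anti strategies keep only columns from the left frame and differ in which rows they retain. A plain-Ruby sketch on arrays of hashes, assuming a single join key; `semi_join` and `anti_join` here are hypothetical helpers, not Polars methods:

```ruby
# Hypothetical helper: keep left rows whose key appears in the right rows.
def semi_join(left, right, on)
  keys = right.map { |row| row[on] }
  left.select { |row| keys.include?(row[on]) }
end

# Hypothetical helper: keep left rows whose key does NOT appear on the right.
def anti_join(left, right, on)
  keys = right.map { |row| row[on] }
  left.reject { |row| keys.include?(row[on]) }
end

left = [
  {"foo" => 1, "ham" => "a"},
  {"foo" => 2, "ham" => "b"},
  {"foo" => 3, "ham" => "c"}
]
right = [
  {"apple" => "x", "ham" => "a"},
  {"apple" => "y", "ham" => "b"},
  {"apple" => "z", "ham" => "d"}
]

semi_join(left, right, "ham").map { |row| row["foo"] }  # => [1, 2]
anti_join(left, right, "ham").map { |row| row["foo"] }  # => [3]
```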

Parameters:

  • other (DataFrame)

    DataFrame to join with.

  • left_on (Object) (defaults to: nil)

    Name(s) of the left join column(s).

  • right_on (Object) (defaults to: nil)

    Name(s) of the right join column(s).

  • on (Object) (defaults to: nil)

    Name(s) of the join columns in both DataFrames.

  • how ("inner", "left", "full", "semi", "anti", "cross") (defaults to: "inner")

    Join strategy.

  • suffix (String) (defaults to: "_right")

    Suffix to append to columns with a duplicate name.

  • validate ('m:m', 'm:1', '1:m', '1:1') (defaults to: "m:m")

    Checks if join is of specified type.

    • many_to_many - “m:m”: default, does not result in checks
    • one_to_one - “1:1”: check if join keys are unique in both left and right datasets
    • one_to_many - “1:m”: check if join keys are unique in left dataset
    • many_to_one - “m:1”: check if join keys are unique in right dataset
  • nulls_equal (Boolean) (defaults to: false)

    Join on null values. By default null values will never produce matches.

  • coalesce (Boolean) (defaults to: nil)

    Coalescing behavior (merging of join columns).

    • nil: -> join specific.
    • true: -> Always coalesce join columns.
    • false: -> Never coalesce join columns. Note that joining on any other expressions than col will turn off coalescing.
  • maintain_order ('none', 'left', 'right', 'left_right', 'right_left') (defaults to: nil)

    Which DataFrame row order to preserve, if any. Do not rely on any observed ordering without explicitly setting this parameter, as your code may break in a future release. Not specifying any ordering can improve performance. Supported for inner, left, right and full joins.

    • none No specific ordering is desired. The ordering might differ across Polars versions or even between different runs.
    • left Preserves the order of the left DataFrame.
    • right Preserves the order of the right DataFrame.
    • left_right First preserves the order of the left DataFrame, then the right.
    • right_left First preserves the order of the right DataFrame, then the left.
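
The validate modes listed above reduce to uniqueness checks on the join keys of each side. A plain-Ruby sketch of those checks on arrays of key values; the helper `join_keys_valid?` is hypothetical and only mirrors the documented rules:

```ruby
# Hypothetical helper: applies the documented uniqueness rule for each mode.
def join_keys_valid?(left_keys, right_keys, validate)
  left_unique = left_keys.uniq.length == left_keys.length
  right_unique = right_keys.uniq.length == right_keys.length

  case validate
  when "m:m" then true                          # no checks
  when "1:1" then left_unique && right_unique   # unique on both sides
  when "1:m" then left_unique                   # unique on the left
  when "m:1" then right_unique                  # unique on the right
  else raise ArgumentError, "unknown validate mode: #{validate.inspect}"
  end
end

join_keys_valid?(%w[a b c], %w[a b d], "1:1")  # => true
join_keys_valid?(%w[a a b], %w[a b], "1:m")    # => false
join_keys_valid?(%w[a a b], %w[a b], "m:1")    # => true
```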

Returns:



# File 'lib/polars/data_frame.rb', line 3526

def join(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  how: "inner",
  suffix: "_right",
  validate: "m:m",
  nulls_equal: false,
  coalesce: nil,
  maintain_order: nil
)
  lazy
    .join(
      other.lazy,
      left_on: left_on,
      right_on: right_on,
      on: on,
      how: how,
      suffix: suffix,
      validate: validate,
      nulls_equal: nulls_equal,
      coalesce: coalesce,
      maintain_order: maintain_order
    )
    .collect(optimizations: QueryOptFlags._eager)
end

#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ DataFrame

Perform an asof join.

This is similar to a left-join except that we match on nearest key rather than equal keys.

Both DataFrames must be sorted by the asof_join key.

For each row in the left DataFrame:

  • A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
  • A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.

The default is "backward".
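
The two search directions can be sketched in plain Ruby on a sorted array of right-hand keys. The helper `asof_index` is hypothetical; it returns the position of the matching right row, or nil when no row qualifies, and says nothing about how Polars implements the search:

```ruby
require "date"

# Hypothetical helper: sorted_keys must be sorted in ascending order.
def asof_index(sorted_keys, key, strategy: "backward")
  case strategy
  when "backward"
    # Last right key that is less than or equal to the left key.
    sorted_keys.rindex { |k| k <= key }
  when "forward"
    # First right key that is greater than or equal to the left key.
    sorted_keys.index { |k| k >= key }
  else
    raise ArgumentError, "unknown strategy: #{strategy.inspect}"
  end
end

gdp_dates = (2016..2019).map { |year| Date.new(year, 1, 1) }
asof_index(gdp_dates, Date.new(2016, 5, 12))                       # => 0
asof_index(gdp_dates, Date.new(2016, 5, 12), strategy: "forward")  # => 1
asof_index(gdp_dates, Date.new(2019, 5, 12), strategy: "forward")  # => nil
```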

Examples:

gdp = Polars::DataFrame.new(
  {
    "date" => [
      DateTime.new(2016, 1, 1),
      DateTime.new(2017, 1, 1),
      DateTime.new(2018, 1, 1),
      DateTime.new(2019, 1, 1),
    ],  # note record date: Jan 1st (sorted!)
    "gdp" => [4164, 4411, 4566, 4696]
  }
).set_sorted("date")
population = Polars::DataFrame.new(
  {
    "date" => [
      DateTime.new(2016, 5, 12),
      DateTime.new(2017, 5, 12),
      DateTime.new(2018, 5, 12),
      DateTime.new(2019, 5, 12),
    ],  # note record date: May 12th (sorted!)
    "population" => [82.19, 82.66, 83.12, 83.52]
  }
).set_sorted("date")
population.join_asof(
  gdp, left_on: "date", right_on: "date", strategy: "backward"
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬────────────┬──────┐
# │ date                ┆ population ┆ gdp  │
# │ ---                 ┆ ---        ┆ ---  │
# │ datetime[ns]        ┆ f64        ┆ i64  │
# ╞═════════════════════╪════════════╪══════╡
# │ 2016-05-12 00:00:00 ┆ 82.19      ┆ 4164 │
# │ 2017-05-12 00:00:00 ┆ 82.66      ┆ 4411 │
# │ 2018-05-12 00:00:00 ┆ 83.12      ┆ 4566 │
# │ 2019-05-12 00:00:00 ┆ 83.52      ┆ 4696 │
# └─────────────────────┴────────────┴──────┘

Parameters:

  • other (DataFrame)

    DataFrame to join with.

  • left_on (String) (defaults to: nil)

    Join column of the left DataFrame.

  • right_on (String) (defaults to: nil)

    Join column of the right DataFrame.

  • on (String) (defaults to: nil)

    Join column of both DataFrames. If set, left_on and right_on should be nil.

  • by_left (Object) (defaults to: nil)

    Join on these columns before doing the asof join.

  • by_right (Object) (defaults to: nil)

    Join on these columns before doing the asof join.

  • by (Object) (defaults to: nil)

    Join on these columns before doing the asof join.

  • strategy ("backward", "forward") (defaults to: "backward")

    Join strategy.

  • suffix (String) (defaults to: "_right")

    Suffix to append to columns with a duplicate name.

  • tolerance (Object) (defaults to: nil)

    Numeric tolerance. If set, a match is made only when the nearest keys are within this distance. For an asof join on columns of dtype "Date", "Datetime", "Duration" or "Time", use the following string language:

    • 1ns (1 nanosecond)
    • 1us (1 microsecond)
    • 1ms (1 millisecond)
    • 1s (1 second)
    • 1m (1 minute)
    • 1h (1 hour)
    • 1d (1 day)
    • 1w (1 week)
    • 1mo (1 calendar month)
    • 1y (1 calendar year)
    • 1i (1 index count)

    Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

  • allow_parallel (Boolean) (defaults to: true)

    Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

  • force_parallel (Boolean) (defaults to: false)

    Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

  • coalesce (Boolean) (defaults to: true)

    Coalescing behavior (merging of join columns).

    • true: -> Always coalesce join columns.
    • false: -> Never coalesce join columns. Note that joining on any other expressions than col will turn off coalescing.
  • allow_exact_matches (Boolean) (defaults to: true)

    Whether exact matches are valid join predicates.

    • If true, allow matching with the same on value (i.e. less-than-or-equal-to / greater-than-or-equal-to).
    • If false, don't match the same on value (i.e., strictly less-than / strictly greater-than).
  • check_sortedness (Boolean) (defaults to: true)

    Check the sortedness of the asof keys. If the keys are not sorted Polars will error, or in case of 'by' argument raise a warning. This might become a hard error in the future.
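
To make the tolerance string language concrete, here is a plain-Ruby sketch that converts a combined string such as "3d12h4m25s" into seconds. It assumes fixed-length days and weeks and rejects the calendar and index units (mo, y, i), which have no fixed length. The helper `duration_to_seconds` and the `UNIT_SECONDS` table are hypothetical and are not part of Polars:

```ruby
# Hypothetical lookup table: seconds per unit, for fixed-length units only.
UNIT_SECONDS = {
  "ns" => 1e-9, "us" => 1e-6, "ms" => 1e-3,
  "s" => 1, "m" => 60, "h" => 3_600, "d" => 86_400, "w" => 604_800
}.freeze

# Hypothetical helper: parses strings like "3d12h4m25s" into seconds.
def duration_to_seconds(text)
  # Longer unit names come first so "ms" and "mo" are not read as "m".
  parts = text.scan(/(\d+)(ns|us|ms|mo|s|m|h|d|w|y|i)/)
  if parts.empty? || parts.map(&:join).join != text
    raise ArgumentError, "cannot parse duration: #{text.inspect}"
  end

  parts.sum do |count, unit|
    factor = UNIT_SECONDS.fetch(unit) do
      raise ArgumentError, "unit #{unit.inspect} has no fixed length"
    end
    count.to_i * factor
  end
end

duration_to_seconds("3d12h4m25s")  # => 302665
```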

Returns:



# File 'lib/polars/data_frame.rb', line 3360

def join_asof(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  by_left: nil,
  by_right: nil,
  by: nil,
  strategy: "backward",
  suffix: "_right",
  tolerance: nil,
  allow_parallel: true,
  force_parallel: false,
  coalesce: true,
  allow_exact_matches: true,
  check_sortedness: true
)
  lazy
    .join_asof(
      other.lazy,
      left_on: left_on,
      right_on: right_on,
      on: on,
      by_left: by_left,
      by_right: by_right,
      by: by,
      strategy: strategy,
      suffix: suffix,
      tolerance: tolerance,
      allow_parallel: allow_parallel,
      force_parallel: force_parallel,
      coalesce: coalesce,
      allow_exact_matches: allow_exact_matches,
      check_sortedness: check_sortedness
    )
    .collect(optimizations: QueryOptFlags._eager)
end

#join_where(other, *predicates, suffix: "_right") ⇒ DataFrame

Note:

The row order of the input DataFrames is not preserved.

Note:

This functionality is experimental. It may be changed at any point without it being considered a breaking change.

Perform a join based on one or multiple (in)equality predicates.

This performs an inner join, so only rows where all predicates are true are included in the result, and a row from either DataFrame may be included multiple times in the result.

Examples:

Join two dataframes together based on two predicates which get AND-ed together.

east = Polars::DataFrame.new(
  {
    "id": [100, 101, 102],
    "dur": [120, 140, 160],
    "rev": [12, 14, 16],
    "cores": [2, 8, 4]
  }
)
west = Polars::DataFrame.new(
  {
    "t_id": [404, 498, 676, 742],
    "time": [90, 130, 150, 170],
    "cost": [9, 13, 15, 16],
    "cores": [4, 2, 1, 4]
  }
)
east.join_where(
  west,
  Polars.col("dur") < Polars.col("time"),
  Polars.col("rev") < Polars.col("cost")
)
# =>
# shape: (5, 8)
# ┌─────┬─────┬─────┬───────┬──────┬──────┬──────┬─────────────┐
# │ id  ┆ dur ┆ rev ┆ cores ┆ t_id ┆ time ┆ cost ┆ cores_right │
# │ --- ┆ --- ┆ --- ┆ ---   ┆ ---  ┆ ---  ┆ ---  ┆ ---         │
# │ i64 ┆ i64 ┆ i64 ┆ i64   ┆ i64  ┆ i64  ┆ i64  ┆ i64         │
# ╞═════╪═════╪═════╪═══════╪══════╪══════╪══════╪═════════════╡
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 498  ┆ 130  ┆ 13   ┆ 2           │
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
# │ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
# │ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
# └─────┴─────┴─────┴───────┴──────┴──────┴──────┴─────────────┘

To OR them together, use a single expression and the | operator.

east.join_where(
  west,
  (Polars.col("dur") < Polars.col("time")) | (Polars.col("rev") < Polars.col("cost"))
)
# =>
# shape: (6, 8)
# ┌─────┬─────┬─────┬───────┬──────┬──────┬──────┬─────────────┐
# │ id  ┆ dur ┆ rev ┆ cores ┆ t_id ┆ time ┆ cost ┆ cores_right │
# │ --- ┆ --- ┆ --- ┆ ---   ┆ ---  ┆ ---  ┆ ---  ┆ ---         │
# │ i64 ┆ i64 ┆ i64 ┆ i64   ┆ i64  ┆ i64  ┆ i64  ┆ i64         │
# ╞═════╪═════╪═════╪═══════╪══════╪══════╪══════╪═════════════╡
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 498  ┆ 130  ┆ 13   ┆ 2           │
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
# │ 100 ┆ 120 ┆ 12  ┆ 2     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
# │ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 676  ┆ 150  ┆ 15   ┆ 1           │
# │ 101 ┆ 140 ┆ 14  ┆ 8     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
# │ 102 ┆ 160 ┆ 16  ┆ 4     ┆ 742  ┆ 170  ┆ 16   ┆ 4           │
# └─────┴─────┴─────┴───────┴──────┴──────┴──────┴─────────────┘
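
The row counts above (5 for the AND form, 6 for the OR form) can be checked with a naive nested-loop sketch in plain Ruby. This is only an illustration of the semantics, with predicates written as hypothetical lambdas over a pair of row hashes; it is not how Polars evaluates the join, and unlike this sketch Polars does not guarantee row order:

```ruby
# Hypothetical helper: keeps every (left, right) pair for which all
# predicates return true. Each predicate is a lambda taking both rows.
def join_where_pairs(left, right, *predicates)
  left.product(right).select do |l, r|
    predicates.all? { |predicate| predicate.call(l, r) }
  end
end

east = [
  {"id" => 100, "dur" => 120, "rev" => 12},
  {"id" => 101, "dur" => 140, "rev" => 14},
  {"id" => 102, "dur" => 160, "rev" => 16}
]
west = [
  {"t_id" => 404, "time" => 90, "cost" => 9},
  {"t_id" => 498, "time" => 130, "cost" => 13},
  {"t_id" => 676, "time" => 150, "cost" => 15},
  {"t_id" => 742, "time" => 170, "cost" => 16}
]

shorter = ->(l, r) { l["dur"] < r["time"] }
cheaper = ->(l, r) { l["rev"] < r["cost"] }

# Separate predicates are AND-ed together.
join_where_pairs(east, west, shorter, cheaper).length  # => 5
# A single predicate can express OR.
either = ->(l, r) { shorter.call(l, r) || cheaper.call(l, r) }
join_where_pairs(east, west, either).length            # => 6
```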

Parameters:

  • other (DataFrame)

    DataFrame to join with.

  • predicates (Array)

    (In)Equality condition to join the two tables on. When a column name occurs in both tables, the proper suffix must be applied in the predicate.

  • suffix (String) (defaults to: "_right")

    Suffix to append to columns with a duplicate name.

Returns:



# File 'lib/polars/data_frame.rb', line 3633

def join_where(
  other,
  *predicates,
  suffix: "_right"
)
  Utils.require_same_type(self, other)

  lazy
  .join_where(
    other.lazy,
    *predicates,
    suffix: suffix
  )
  .collect(optimizations: QueryOptFlags._eager)
end

#lazy ⇒ LazyFrame

Start a lazy query from this point.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [nil, 2, 3, 4],
    "b" => [0.5, nil, 2.5, 13],
    "c" => [true, true, false, nil]
  }
)
df.lazy

Returns:



# File 'lib/polars/data_frame.rb', line 4817

def lazy
  wrap_ldf(_df.lazy)
end

#limit(n = 5) ⇒ DataFrame

Get the first n rows.

Alias for #head.

Examples:

df = Polars::DataFrame.new(
  {"foo" => [1, 2, 3, 4, 5, 6], "bar" => ["a", "b", "c", "d", "e", "f"]}
)
df.limit(4)
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ a   │
# │ 2   ┆ b   │
# │ 3   ┆ c   │
# │ 4   ┆ d   │
# └─────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows to return.

Returns:



# File 'lib/polars/data_frame.rb', line 2515

def limit(n = 5)
  head(n)
end

#map_rows(return_dtype: nil, inference_size: 256, &function) ⇒ Object

Note:

The frame-level apply cannot track column names (as the UDF is a black-box that may arbitrarily drop, rearrange, transform, or add new columns); if you want to apply a UDF such that column names are preserved, you should use the expression-level apply syntax instead.

Apply a custom/user-defined function (UDF) over the rows of the DataFrame.

The UDF will receive each row as a tuple of values: udf(row).

Implementing logic using a Ruby function is almost always significantly slower and more memory intensive than implementing the same logic using the native expression API because:

  • The native expression engine runs in Rust; UDFs run in Ruby.
  • Use of Ruby UDFs forces the DataFrame to be materialized in memory.
  • Polars-native expressions can be parallelised (UDFs cannot).
  • Polars-native expressions can be logically optimised (UDFs cannot).

Wherever possible you should strongly prefer the native expression API to achieve the best performance.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [-1, 5, 8]})

Return a DataFrame by mapping each row to a tuple:

df.map_rows { |t| [t[0] * 2, t[1] * 3] }
# =>
# shape: (3, 2)
# ┌──────────┬──────────┐
# │ column_0 ┆ column_1 │
# │ ---      ┆ ---      │
# │ i64      ┆ i64      │
# ╞══════════╪══════════╡
# │ 2        ┆ -3       │
# │ 4        ┆ 15       │
# │ 6        ┆ 24       │
# └──────────┴──────────┘

Return a single-column DataFrame by mapping each row to a scalar:

df.map_rows { |t| t[0] * 2 + t[1] }
# =>
# shape: (3, 1)
# ┌─────┐
# │ map │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# │ 9   │
# │ 14  │
# └─────┘

Parameters:

  • return_dtype (Symbol) (defaults to: nil)

    Output type of the operation. If none given, Polars tries to infer the type.

  • inference_size (Integer) (defaults to: 256)

    Only used when the custom function returns rows. The first n rows are used to determine the output schema.

Returns:



# File 'lib/polars/data_frame.rb', line 3709

def map_rows(return_dtype: nil, inference_size: 256, &function)
  out, is_df = _df.map_rows(function, return_dtype, inference_size)
  if is_df
    _from_rbdf(out)
  else
    _from_rbdf(Utils.wrap_s(out).to_frame._df)
  end
end

#max ⇒ DataFrame

Aggregate the columns of this DataFrame to their maximum value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.max
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8   ┆ c   │
# └─────┴─────┴─────┘

Returns:



# File 'lib/polars/data_frame.rb', line 5122

def max
  lazy.max.collect(optimizations: QueryOptFlags._eager)
end

#max_horizontal ⇒ Series

Get the maximum value horizontally across columns.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [4.0, 5.0, 6.0]
  }
)
df.max_horizontal
# =>
# shape: (3,)
# Series: 'max' [f64]
# [
#         4.0
#         5.0
#         6.0
# ]

Returns:



# File 'lib/polars/data_frame.rb', line 5146

def max_horizontal
  select(max: F.max_horizontal(F.all)).to_series
end

#mean ⇒ DataFrame

Aggregate the columns of this DataFrame to their mean value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.mean
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ f64 ┆ f64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 2.0 ┆ 7.0 ┆ null │
# └─────┴─────┴──────┘

Returns:



# File 'lib/polars/data_frame.rb', line 5278

def mean
  lazy.mean.collect(optimizations: QueryOptFlags._eager)
end

#mean_horizontal(ignore_nulls: true) ⇒ Series

Take the mean of all values horizontally across columns.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [4.0, 5.0, 6.0]
  }
)
df.mean_horizontal
# =>
# shape: (3,)
# Series: 'mean' [f64]
# [
#         2.5
#         3.5
#         4.5
# ]

Parameters:

  • ignore_nulls (Boolean) (defaults to: true)

    Ignore null values (default). If set to false, any null value in the input will lead to a null output.
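
The effect of ignore_nulls on a single row can be sketched in plain Ruby, with nil standing in for null. The helper `row_mean` is hypothetical, and returning nil for a row with no values at all is an assumption of this sketch:

```ruby
# Hypothetical helper: mean of one row's values, with nil standing in for null.
def row_mean(values, ignore_nulls: true)
  # With ignore_nulls: false, a single nil makes the whole result nil.
  return nil if !ignore_nulls && values.any?(&:nil?)

  present = values.compact
  return nil if present.empty? # assumption: nothing to average yields nil

  present.sum.to_f / present.length
end

row_mean([1, 4.0])                         # => 2.5
row_mean([1, nil])                         # => 1.0
row_mean([1, nil], ignore_nulls: false)    # => nil
```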

Returns:



# File 'lib/polars/data_frame.rb', line 5306

def mean_horizontal(ignore_nulls: true)
  select(
    mean: F.mean_horizontal(F.all, ignore_nulls: ignore_nulls)
  ).to_series
end

#median ⇒ DataFrame

Aggregate the columns of this DataFrame to their median value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.median
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ f64 ┆ f64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 2.0 ┆ 7.0 ┆ null │
# └─────┴─────┴──────┘

Returns:



# File 'lib/polars/data_frame.rb', line 5416

def median
  lazy.median.collect(optimizations: QueryOptFlags._eager)
end

#merge_sorted(other, key) ⇒ DataFrame

Take two sorted DataFrames and merge them by the sorted key.

The output of this operation will also be sorted. It is the caller's responsibility to ensure that both frames are sorted by that key; otherwise the output will not make sense.

The schemas of both DataFrames must be equal.

Examples:

df0 = Polars::DataFrame.new(
  {"name" => ["steve", "elise", "bob"], "age" => [42, 44, 18]}
).sort("age")
df1 = Polars::DataFrame.new(
  {"name" => ["anna", "megan", "steve", "thomas"], "age" => [21, 33, 42, 20]}
).sort("age")
df0.merge_sorted(df1, "age")
# =>
# shape: (7, 2)
# ┌────────┬─────┐
# │ name   ┆ age │
# │ ---    ┆ --- │
# │ str    ┆ i64 │
# ╞════════╪═════╡
# │ bob    ┆ 18  │
# │ thomas ┆ 20  │
# │ anna   ┆ 21  │
# │ megan  ┆ 33  │
# │ steve  ┆ 42  │
# │ steve  ┆ 42  │
# │ elise  ┆ 44  │
# └────────┴─────┘
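
The ordering in the result is what a classic two-way merge produces. A plain-Ruby sketch on arrays of row hashes, assuming both inputs are already sorted by the key and that ties take the left row first; the helper `merge_sorted_rows` is hypothetical and not the Polars implementation:

```ruby
# Hypothetical helper: merges two arrays of rows that are each sorted by key.
def merge_sorted_rows(left, right, key)
  merged = []
  i = 0
  j = 0
  while i < left.length && j < right.length
    # On ties the left row goes first (an assumption of this sketch).
    if left[i][key] <= right[j][key]
      merged << left[i]
      i += 1
    else
      merged << right[j]
      j += 1
    end
  end
  # Append whatever remains on either side.
  merged + left.drop(i) + right.drop(j)
end

df0_rows = [
  {"name" => "bob", "age" => 18},
  {"name" => "steve", "age" => 42},
  {"name" => "elise", "age" => 44}
]
df1_rows = [
  {"name" => "thomas", "age" => 20},
  {"name" => "anna", "age" => 21},
  {"name" => "megan", "age" => 33},
  {"name" => "steve", "age" => 42}
]

merge_sorted_rows(df0_rows, df1_rows, "age").map { |row| row["age"] }
# => [18, 20, 21, 33, 42, 42, 44]
```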

Parameters:

  • other (DataFrame)

    Other DataFrame to merge with.

  • key (String)

    Key that is sorted.

Returns:



# File 'lib/polars/data_frame.rb', line 6383

def merge_sorted(other, key)
  lazy.merge_sorted(other.lazy, key).collect(optimizations: QueryOptFlags._eager)
end

#min ⇒ DataFrame

Aggregate the columns of this DataFrame to their minimum value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.min
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘

Returns:



# File 'lib/polars/data_frame.rb', line 5172

def min
  lazy.min.collect(optimizations: QueryOptFlags._eager)
end

#min_horizontal ⇒ Series

Get the minimum value horizontally across columns.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [4.0, 5.0, 6.0]
  }
)
df.min_horizontal
# =>
# shape: (3,)
# Series: 'min' [f64]
# [
#         1.0
#         2.0
#         3.0
# ]

Returns:



# File 'lib/polars/data_frame.rb', line 5196

def min_horizontal
  select(min: F.min_horizontal(F.all)).to_series
end

#n_chunks(strategy: "first") ⇒ Object

Get number of chunks used by the ChunkedArrays of this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
)
df.n_chunks
# => 1
df.n_chunks(strategy: "all")
# => [1, 1, 1]

Parameters:

  • strategy ("first", "all") (defaults to: "first")

    Return the number of chunks of the 'first' column, or 'all' columns in this DataFrame.

Returns:



# File 'lib/polars/data_frame.rb', line 5090

def n_chunks(strategy: "first")
  if strategy == "first"
    _df.n_chunks
  elsif strategy == "all"
    get_columns.map(&:n_chunks)
  else
    raise ArgumentError, "Strategy: '#{strategy}' not understood. Choose one of {'first', 'all'}"
  end
end

#n_unique(subset: nil) ⇒ Integer

Return the number of unique rows, or the number of unique row-subsets.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 1, 2, 3, 4, 5],
    "b" => [0.5, 0.5, 1.0, 2.0, 3.0, 3.0],
    "c" => [true, true, true, false, true, true]
  }
)
df.n_unique
# => 5

Simple columns subset

df.n_unique(subset: ["b", "c"])
# => 4

Expression subset

df.n_unique(
  subset: [
    (Polars.col("a").floordiv(2)),
    (Polars.col("c") | (Polars.col("b") >= 2))
  ]
)
# => 3
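
The three counts above can be reproduced in plain Ruby by turning the chosen columns into rows and counting the distinct ones. The helper `count_unique_rows` is hypothetical; the expression subset is emulated with ordinary Ruby blocks rather than Polars expressions:

```ruby
# Hypothetical helper: columns is an array of equally long value arrays.
def count_unique_rows(columns)
  columns.transpose.uniq.length
end

a = [1, 1, 2, 3, 4, 5]
b = [0.5, 0.5, 1.0, 2.0, 3.0, 3.0]
c = [true, true, true, false, true, true]

count_unique_rows([a, b, c])  # => 5
count_unique_rows([b, c])     # => 4

# Emulates the expression subset: a floor-divided by 2, and c OR (b >= 2).
halved = a.map { |v| v / 2 }
flag = c.zip(b).map { |cv, bv| cv || bv >= 2 }
count_unique_rows([halved, flag])  # => 3
```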

Parameters:

  • subset (Object) (defaults to: nil)

    One or more columns/expressions that define what to count; omit to return the count of unique rows.

Returns:



# File 'lib/polars/data_frame.rb', line 5595

def n_unique(subset: nil)
  if subset.is_a?(::String)
    subset = [Polars.col(subset)]
  elsif subset.is_a?(Expr)
    subset = [subset]
  end

  if subset.is_a?(::Array) && subset.length == 1
    expr = Utils.wrap_expr(Utils.parse_into_expression(subset[0], str_as_lit: false))
  else
    struct_fields = subset.nil? ? Polars.all : subset
    expr = Polars.struct(struct_fields)
  end

  df = lazy.select(expr.n_unique).collect
  df.is_empty ? 0 : df.row(0)[0]
end

#null_count ⇒ DataFrame

Create a new DataFrame that shows the null counts per column.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, nil, 3],
    "bar" => [6, 7, nil],
    "ham" => ["a", "b", "c"]
  }
)
df.null_count
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ u32 ┆ u32 ┆ u32 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 1   ┆ 0   │
# └─────┴─────┴─────┘

Returns:



# File 'lib/polars/data_frame.rb', line 5645

def null_count
  _from_rbdf(_df.null_count)
end

#partition_by(by, *more_by, maintain_order: true, include_key: true, as_dict: false) ⇒ Object

Split into multiple DataFrames partitioned by groups.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => ["A", "A", "B", "B", "C"],
    "N" => [1, 2, 2, 4, 2],
    "bar" => ["k", "l", "m", "m", "l"]
  }
)
df.partition_by("foo", maintain_order: true)
# =>
# [shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ A   ┆ 1   ┆ k   │
# │ A   ┆ 2   ┆ l   │
# └─────┴─────┴─────┘, shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ B   ┆ 2   ┆ m   │
# │ B   ┆ 4   ┆ m   │
# └─────┴─────┴─────┘, shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ C   ┆ 2   ┆ l   │
# └─────┴─────┴─────┘]
df.partition_by("foo", maintain_order: true, as_dict: true)
# =>
# {["A"]=>shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ A   ┆ 1   ┆ k   │
# │ A   ┆ 2   ┆ l   │
# └─────┴─────┴─────┘, ["B"]=>shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ B   ┆ 2   ┆ m   │
# │ B   ┆ 4   ┆ m   │
# └─────┴─────┴─────┘, ["C"]=>shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ N   ┆ bar │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ C   ┆ 2   ┆ l   │
# └─────┴─────┴─────┘}

Parameters:

  • by (Object)

    Groups to partition by.

  • more_by (Array)

    Additional names of columns to group by, specified as positional arguments.

  • maintain_order (Boolean) (defaults to: true)

    Keep predictable output order. This is slower as it requires an extra sort operation.

  • include_key (Boolean) (defaults to: true)

    Include the columns used to partition the DataFrame in the output.

  • as_dict (Boolean) (defaults to: false)

    If true, return the partitions in a hash keyed by the distinct group values instead of an array.
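
The grouping semantics can be modeled in plain Ruby. The sketch below is not how Polars implements `partition_by`; `partition_rows` and its array-of-hashes input are hypothetical stand-ins, used only to illustrate how `include_key` and `as_dict` shape the result.

```ruby
# Plain-Ruby model of partition_by over an array of row hashes.
# partition_rows is a hypothetical helper, not part of Polars.
def partition_rows(rows, by, include_key: true, as_dict: false)
  keys = Array(by)
  # group_by keeps groups in order of first appearance,
  # which mirrors maintain_order: true.
  groups = rows.group_by { |row| row.values_at(*keys) }
  unless include_key
    groups = groups.transform_values do |part|
      part.map { |row| row.reject { |name, _| keys.include?(name) } }
    end
  end
  as_dict ? groups : groups.values
end

rows = [
  {"foo" => "A", "N" => 1},
  {"foo" => "A", "N" => 2},
  {"foo" => "B", "N" => 2},
  {"foo" => "C", "N" => 2}
]
parts = partition_rows(rows, "foo")
by_key = partition_rows(rows, "foo", include_key: false, as_dict: true)
```

As in the documented output, the hash keys are arrays even when there is a single partition column.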

Returns:

  • (Object)


# File 'lib/polars/data_frame.rb', line 4685

def partition_by(by, *more_by, maintain_order: true, include_key: true, as_dict: false)
  by_parsed = Utils._expand_selectors(self, by, *more_by)

  partitions = _df.partition_by(by_parsed, maintain_order, include_key).map { |df| _from_rbdf(df) }

  if as_dict
    if include_key
      names = partitions.map { |p| p.select(by_parsed).row(0) }
    else
      if !maintain_order
        msg = "cannot use `partition_by` with `maintain_order: false, include_key: false, as_dict: true`"
        raise ArgumentError, msg
      end
      names = select(by_parsed).unique(maintain_order: true).rows
    end

    return names.zip(partitions).to_h
  end

  partitions
end

#pipe(function, *args, **kwargs, &block) ⇒ Object

Note:

It is recommended to use LazyFrame when piping operations, in order to fully take advantage of query optimization and parallelization. See #lazy.

Offers a structured way to apply a sequence of user-defined functions (UDFs).

Examples:

cast_str_to_int = lambda do |data, col_name:|
  data.with_columns(Polars.col(col_name).cast(Polars::Int64))
end

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => ["10", "20", "30", "40"]})
df.pipe(cast_str_to_int, col_name: "b")
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 10  │
# │ 2   ┆ 20  │
# │ 3   ┆ 30  │
# │ 4   ┆ 40  │
# └─────┴─────┘

Parameters:

  • function (Object)

    Callable; will receive the frame as the first parameter, followed by any given args/kwargs.

  • args (Object)

    Arguments to pass to the UDF.

  • kwargs (Object)

    Keyword arguments to pass to the UDF.
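
The method source calls the given callable with the frame as its first argument. The plain-Ruby sketch below reproduces that dispatch on a hypothetical `Table` class (not a Polars type) to show how positional and keyword arguments are forwarded when calls are chained.

```ruby
# Table is a hypothetical stand-in for a frame; it only wraps an array.
class Table
  attr_reader :values

  def initialize(values)
    @values = values
  end

  # Same dispatch as DataFrame#pipe: the receiver becomes the first argument.
  def pipe(function, *args, **kwargs, &block)
    function.(self, *args, **kwargs, &block)
  end
end

add = ->(table, n) { Table.new(table.values.map { |v| v + n }) }
scale = ->(table, factor:) { Table.new(table.values.map { |v| v * factor }) }

result = Table.new([1, 2, 3]).pipe(add, 1).pipe(scale, factor: 10)
puts result.values.inspect # prints [20, 30, 40]
```

Because each step returns a new object that also responds to `pipe`, the calls read top to bottom in the order they run.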

Returns:

  • (Object)


# File 'lib/polars/data_frame.rb', line 2708

def pipe(function, *args, **kwargs, &block)
  function.(self, *args, **kwargs, &block)
end

#pivot(on, index: nil, values: nil, aggregate_function: nil, maintain_order: true, sort_columns: false, separator: "_") ⇒ DataFrame

Create a spreadsheet-style pivot table as a DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => ["one", "one", "two", "two", "one", "two"],
    "bar" => ["y", "y", "y", "x", "x", "x"],
    "baz" => [1, 2, 3, 4, 5, 6]
  }
)
df.pivot("bar", index: "foo", values: "baz", aggregate_function: "sum")
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ y   ┆ x   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ one ┆ 3   ┆ 5   │
# │ two ┆ 3   ┆ 10  │
# └─────┴─────┴─────┘

Parameters:

  • on (Object)

    Columns whose values will be used as the header of the output DataFrame.

  • index (Object) (defaults to: nil)

    One or multiple keys to group by.

  • values (Object) (defaults to: nil)

    Column values to aggregate. Can be multiple columns if the on argument contains multiple columns as well.

  • aggregate_function ("first", "sum", "max", "min", "mean", "median", "last", "len") (defaults to: nil)

    A predefined aggregate function string or an expression. "count" is still accepted but deprecated in favor of "len".

  • maintain_order (Object) (defaults to: true)

    Sort the grouped keys so that the output order is predictable.

  • sort_columns (Object) (defaults to: false)

    Sort the transposed columns by name. Default is by order of discovery.

  • separator (String) (defaults to: "_")

    Used as separator/delimiter in generated column names in case of multiple values columns.
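
To make the roles of `on`, `index`, and `values` concrete, here is a plain-Ruby model of a pivot with a sum aggregate. It is only a sketch: `pivot_sum` is a hypothetical helper that works on row hashes, and it does not reflect how Polars computes the result.

```ruby
# Plain-Ruby model of pivot with aggregate_function: "sum".
# pivot_sum is a hypothetical helper, not part of Polars.
def pivot_sum(rows, on:, index:, values:)
  # Output headers appear in order of discovery (sort_columns: false).
  headers = rows.map { |row| row[on] }.uniq
  rows.group_by { |row| row[index] }.map do |key, group|
    headers.each_with_object({index => key}) do |header, out|
      cells = group.select { |row| row[on] == header }.map { |row| row[values] }
      out[header] = cells.empty? ? nil : cells.sum
    end
  end
end

foo = ["one", "one", "two", "two", "one", "two"]
bar = ["y", "y", "y", "x", "x", "x"]
baz = [1, 2, 3, 4, 5, 6]
rows = foo.zip(bar, baz).map { |f, b, z| {"foo" => f, "bar" => b, "baz" => z} }

pivoted = pivot_sum(rows, on: "bar", index: "foo", values: "baz")
# pivoted == [{"foo"=>"one", "y"=>3, "x"=>5}, {"foo"=>"two", "y"=>3, "x"=>10}]
```

The result has one row per distinct `index` value and one column per distinct `on` value, matching the table in the example above.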

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 4373

def pivot(
  on,
  index: nil,
  values: nil,
  aggregate_function: nil,
  maintain_order: true,
  sort_columns: false,
  separator: "_"
)
  index = Utils._expand_selectors(self, index)
  on = Utils._expand_selectors(self, on)
  if !values.nil?
    values = Utils._expand_selectors(self, values)
  end

  if aggregate_function.is_a?(::String)
    case aggregate_function
    when "first"
      aggregate_expr = F.element.first._rbexpr
    when "sum"
      aggregate_expr = F.element.sum._rbexpr
    when "max"
      aggregate_expr = F.element.max._rbexpr
    when "min"
      aggregate_expr = F.element.min._rbexpr
    when "mean"
      aggregate_expr = F.element.mean._rbexpr
    when "median"
      aggregate_expr = F.element.median._rbexpr
    when "last"
      aggregate_expr = F.element.last._rbexpr
    when "len"
      aggregate_expr = F.len._rbexpr
    when "count"
      warn "`aggregate_function: \"count\"` input for `pivot` is deprecated. Use `aggregate_function: \"len\"` instead."
      aggregate_expr = F.len._rbexpr
    else
      raise ArgumentError, "Argument aggregate_function: '#{aggregate_function}' was not expected."
    end
  elsif aggregate_function.nil?
    aggregate_expr = nil
  else
    aggregate_expr = aggregate_function._rbexpr
  end

  _from_rbdf(
    _df.pivot_expr(
      on,
      index,
      values,
      maintain_order,
      sort_columns,
      aggregate_expr,
      separator
    )
  )
end

#plot(x = nil, y = nil, type: nil, group: nil, stacked: nil) ⇒ Object

Plot data.

Returns:

  • (Object)

Raises:

  • (ArgumentError)


# File 'lib/polars/data_frame.rb', line 120

def plot(x = nil, y = nil, type: nil, group: nil, stacked: nil)
  plot = DataFramePlot.new(self)
  return plot if x.nil? && y.nil?

  raise ArgumentError, "Must specify columns" if x.nil? || y.nil?
  type ||= begin
    if self[x].dtype.numeric? && self[y].dtype.numeric?
      "scatter"
    elsif self[x].dtype == String && self[y].dtype.numeric?
      "column"
    elsif (self[x].dtype == Date || self[x].dtype == Datetime) && self[y].dtype.numeric?
      "line"
    else
      raise "Cannot determine type. Use the type option."
    end
  end

  case type
  when "line"
    plot.line(x, y, color: group)
  when "area"
    plot.area(x, y, color: group)
  when "pie"
    raise ArgumentError, "Cannot use group option with pie chart" unless group.nil?
    plot.pie(x, y)
  when "column"
    plot.column(x, y, color: group, stacked: stacked)
  when "bar"
    plot.bar(x, y, color: group, stacked: stacked)
  when "scatter"
    plot.scatter(x, y, color: group)
  else
    raise ArgumentError, "Invalid type: #{type}"
  end
end

#product ⇒ DataFrame

Aggregate the columns of this DataFrame to their product values.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3],
    "b" => [0.5, 4, 10],
    "c" => [true, true, false]
  }
)
df.product
# =>
# shape: (1, 3)
# ┌─────┬──────┬─────┐
# │ a   ┆ b    ┆ c   │
# │ --- ┆ ---  ┆ --- │
# │ i64 ┆ f64  ┆ i64 │
# ╞═════╪══════╪═════╡
# │ 6   ┆ 20.0 ┆ 0   │
# └─────┴──────┴─────┘

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 5442

def product
  select(Polars.all.product)
end

#quantile(quantile, interpolation: "nearest") ⇒ DataFrame

Aggregate the columns of this DataFrame to their quantile value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.quantile(0.5, interpolation: "nearest")
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ f64 ┆ f64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 2.0 ┆ 7.0 ┆ null │
# └─────┴─────┴──────┘

Parameters:

  • quantile (Float)

    Quantile between 0.0 and 1.0.

  • interpolation ("nearest", "higher", "lower", "midpoint", "linear") (defaults to: "nearest")

    Interpolation method.
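
The interpolation options differ only in how they resolve a quantile position that falls between two sorted values. The sketch below uses the common textbook definitions on a plain array; `quantile_of` is a hypothetical helper, and Polars may break exact ties for "nearest" differently, so treat it as an illustration rather than a specification.

```ruby
# Plain-Ruby model of quantile interpolation on a sorted array.
# quantile_of is a hypothetical helper, not part of Polars.
def quantile_of(values, q, interpolation: "nearest")
  sorted = values.sort
  pos = (sorted.length - 1) * q # fractional index into the sorted data
  lo = sorted[pos.floor]
  hi = sorted[pos.ceil]
  case interpolation
  when "lower" then lo.to_f
  when "higher" then hi.to_f
  when "midpoint" then (lo + hi) / 2.0
  when "linear" then lo + (hi - lo) * (pos - pos.floor)
  when "nearest" then sorted[pos.round].to_f
  else raise ArgumentError, "unknown interpolation: #{interpolation}"
  end
end

data = [1, 2, 3, 4]
lower = quantile_of(data, 0.5, interpolation: "lower")       # 2.0
higher = quantile_of(data, 0.5, interpolation: "higher")     # 3.0
midpoint = quantile_of(data, 0.5, interpolation: "midpoint") # 2.5
linear = quantile_of(data, 0.25, interpolation: "linear")    # 1.75
```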

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 5473

def quantile(quantile, interpolation: "nearest")
  lazy.quantile(quantile, interpolation: interpolation).collect(optimizations: QueryOptFlags._eager)
end

#rechunk ⇒ DataFrame

Rechunk the data in this DataFrame to a contiguous allocation. This will make sure all subsequent operations have optimal and predictable performance.

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 5619

def rechunk
  _from_rbdf(_df.rechunk)
end

#remove(*predicates, **constraints) ⇒ DataFrame

Remove rows, dropping those that match the given predicate expression(s).

The original order of the remaining rows is preserved.

Rows where the filter predicate does not evaluate to true are retained (this includes rows where the predicate evaluates as null).
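
The null handling described above can be modeled in plain Ruby with a three-valued predicate, where `nil` stands for a null comparison result. This is a sketch on row hashes, not Polars code; `bar_at_least_5` is a hypothetical predicate.

```ruby
rows = [
  {"foo" => 2, "bar" => 5},
  {"foo" => 3, "bar" => 6},
  {"foo" => nil, "bar" => nil},
  {"foo" => 4, "bar" => nil},
  {"foo" => 0, "bar" => 0}
]

# Comparing a null yields null (nil here), neither true nor false.
bar_at_least_5 = ->(row) { row["bar"].nil? ? nil : row["bar"] >= 5 }

# remove drops only the rows where the predicate is exactly true ...
after_remove = rows.reject { |row| bar_at_least_5.(row) == true }
# ... whereas a filter keeps only those rows.
after_filter = rows.select { |row| bar_at_least_5.(row) == true }

puts after_remove.map { |row| row["foo"] }.inspect # prints [nil, 4, 0]
puts after_filter.map { |row| row["foo"] }.inspect # prints [2, 3]
```

Rows whose predicate result is null survive `remove` and are dropped by a filter, so the two operations are complements of each other even in the presence of nulls.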

Examples:

Remove rows matching a condition:

df = Polars::DataFrame.new(
  {
    "foo" => [2, 3, nil, 4, 0],
    "bar" => [5, 6, nil, nil, 0],
    "ham" => ["a", "b", nil, "c", "d"]
  }
)
df.remove(Polars.col("bar") >= 5)
# =>
# shape: (3, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ 4    ┆ null ┆ c    │
# │ 0    ┆ 0    ┆ d    │
# └──────┴──────┴──────┘

Discard rows based on multiple conditions, combined with and/or operators:

df.remove(
  (Polars.col("foo") >= 0) & (Polars.col("bar") >= 0),
)
# =>
# shape: (2, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ 4    ┆ null ┆ c    │
# └──────┴──────┴──────┘
df.remove(
  (Polars.col("foo") >= 0) | (Polars.col("bar") >= 0),
)
# =>
# shape: (1, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# └──────┴──────┴──────┘

Provide multiple constraints using *args syntax:

df.remove(
  Polars.col("ham").is_not_null,
  Polars.col("bar") >= 0
)
# =>
# shape: (2, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ 4    ┆ null ┆ c    │
# └──────┴──────┴──────┘

Provide constraint(s) using **kwargs syntax:

df.remove(foo: 0, bar: 0)
# =>
# shape: (4, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ 2    ┆ 5    ┆ a    │
# │ 3    ┆ 6    ┆ b    │
# │ null ┆ null ┆ null │
# │ 4    ┆ null ┆ c    │
# └──────┴──────┴──────┘

Remove rows by comparing two columns against each other:

df.remove(
  Polars.col("foo").ne_missing(Polars.col("bar"))
)
# =>
# shape: (2, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ 0    ┆ 0    ┆ d    │
# └──────┴──────┴──────┘

Parameters:

  • predicates (Array)

    Expression(s) that evaluate to a boolean Series.

  • constraints (Hash)

    Column filters; use name: value to filter columns using the supplied value. Each constraint behaves the same as Polars.col(name).eq(value), and is implicitly joined with the other filter conditions using &.

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 1878

def remove(
  *predicates,
  **constraints
)
  lazy
    .remove(*predicates, **constraints)
    .collect(optimizations: QueryOptFlags._eager)
end

#rename(mapping, strict: true) ⇒ DataFrame

Rename column names.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.rename({"foo" => "apple"})
# =>
# shape: (3, 3)
# ┌───────┬─────┬─────┐
# │ apple ┆ bar ┆ ham │
# │ ---   ┆ --- ┆ --- │
# │ i64   ┆ i64 ┆ str │
# ╞═══════╪═════╪═════╡
# │ 1     ┆ 6   ┆ a   │
# │ 2     ┆ 7   ┆ b   │
# │ 3     ┆ 8   ┆ c   │
# └───────┴─────┴─────┘

Parameters:

  • mapping (Hash)

    Key value pairs that map from old name to new name.

  • strict (Boolean) (defaults to: true)

    Validate that all column names exist in the current schema, and throw an exception if any do not. (Note that this parameter is a no-op when passing a function to mapping).

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 1662

def rename(mapping, strict: true)
  lazy.rename(mapping, strict: strict).collect(optimizations: QueryOptFlags._eager)
end

#replace_column(index, column) ⇒ DataFrame

Replace a column at an index location.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
s = Polars::Series.new("apple", [10, 20, 30])
df.replace_column(0, s)
# =>
# shape: (3, 3)
# ┌───────┬─────┬─────┐
# │ apple ┆ bar ┆ ham │
# │ ---   ┆ --- ┆ --- │
# │ i64   ┆ i64 ┆ str │
# ╞═══════╪═════╪═════╡
# │ 10    ┆ 6   ┆ a   │
# │ 20    ┆ 7   ┆ b   │
# │ 30    ┆ 8   ┆ c   │
# └───────┴─────┴─────┘

Parameters:

  • index (Integer)

    Column index.

  • column (Series)

    Series that will replace the column.

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 2105

def replace_column(index, column)
  if index < 0
    index = width + index
  end
  _df.replace_column(index, column._s)
  self
end

#reverse ⇒ DataFrame

Reverse the DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "key" => ["a", "b", "c"],
    "val" => [1, 2, 3]
  }
)
df.reverse
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ key ┆ val │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ c   ┆ 3   │
# │ b   ┆ 2   │
# │ a   ┆ 1   │
# └─────┴─────┘

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 1627

def reverse
  select(Polars.col("*").reverse)
end

#rolling(index_column:, period:, offset: nil, closed: "right", group_by: nil) ⇒ RollingGroupBy

Create rolling groups based on a time column.

Unlike group_by_dynamic, the windows are determined by the individual values and are not of constant intervals. For constant intervals, use group_by_dynamic.

The period and offset arguments are created either from a timedelta, or by using the following string language:

  • 1ns (1 nanosecond)
  • 1us (1 microsecond)
  • 1ms (1 millisecond)
  • 1s (1 second)
  • 1m (1 minute)
  • 1h (1 hour)
  • 1d (1 day)
  • 1w (1 week)
  • 1mo (1 calendar month)
  • 1y (1 calendar year)
  • 1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
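
As a rough illustration of how such a combined string decomposes, the sketch below converts it to seconds. `duration_seconds` and `UNIT_SECONDS` are hypothetical names, not Polars APIs, and the sketch assumes every day is exactly 24 hours; calendar units (`mo`, `y`) and index counts (`i`) are rejected because they have no fixed length in seconds.

```ruby
# Hypothetical helper: fixed-length units only.
UNIT_SECONDS = {
  "ns" => 1e-9, "us" => 1e-6, "ms" => 1e-3,
  "s" => 1, "m" => 60, "h" => 3600, "d" => 86_400, "w" => 604_800
}.freeze

def duration_seconds(text)
  # Longer unit names come first so "ms" and "mo" are not read as "m".
  parts = text.scan(/(\d+)(ns|us|ms|mo|s|m|h|d|w|y|i)/)
  raise ArgumentError, "cannot parse #{text.inspect}" if parts.map(&:join).join != text

  parts.sum do |count, unit|
    factor = UNIT_SECONDS.fetch(unit) do
      raise ArgumentError, "unit #{unit.inspect} has no fixed length"
    end
    count.to_i * factor
  end
end

puts duration_seconds("3d12h4m25s") # prints 302665
```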

In case of a rolling operation on an integer column, the windows are defined by:

  • "1i" # length 1
  • "10i" # length 10

Examples:

dates = [
  "2020-01-01 13:45:48",
  "2020-01-01 16:42:13",
  "2020-01-01 16:45:09",
  "2020-01-02 18:12:48",
  "2020-01-03 19:45:32",
  "2020-01-08 23:16:43"
]
df = Polars::DataFrame.new({"dt" => dates, "a" => [3, 7, 5, 9, 2, 1]}).with_columns(
  Polars.col("dt").str.strptime(Polars::Datetime).set_sorted
)
df.rolling(index_column: "dt", period: "2d").agg(
  [
    Polars.sum("a").alias("sum_a"),
    Polars.min("a").alias("min_a"),
    Polars.max("a").alias("max_a")
  ]
)
# =>
# shape: (6, 4)
# ┌─────────────────────┬───────┬───────┬───────┐
# │ dt                  ┆ sum_a ┆ min_a ┆ max_a │
# │ ---                 ┆ ---   ┆ ---   ┆ ---   │
# │ datetime[μs]        ┆ i64   ┆ i64   ┆ i64   │
# ╞═════════════════════╪═══════╪═══════╪═══════╡
# │ 2020-01-01 13:45:48 ┆ 3     ┆ 3     ┆ 3     │
# │ 2020-01-01 16:42:13 ┆ 10    ┆ 3     ┆ 7     │
# │ 2020-01-01 16:45:09 ┆ 15    ┆ 3     ┆ 7     │
# │ 2020-01-02 18:12:48 ┆ 24    ┆ 3     ┆ 9     │
# │ 2020-01-03 19:45:32 ┆ 11    ┆ 2     ┆ 9     │
# │ 2020-01-08 23:16:43 ┆ 1     ┆ 1     ┆ 1     │
# └─────────────────────┴───────┴───────┴───────┘

Parameters:

  • index_column (Object)

    Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; otherwise, the output will not make sense.

    In case of a rolling operation on indices, dtype needs to be one of {UInt32, UInt64, Int32, Int64}. Note that the first three get temporarily cast to Int64, so if performance matters use an Int64 column.

  • period (Object)

    Length of the window.

  • offset (Object) (defaults to: nil)

    Offset of the window. Default is -period.

  • closed ("right", "left", "both", "none") (defaults to: "right")

    Define whether the temporal window interval is closed or not.

  • group_by (Object) (defaults to: nil)

    Also group by this column/these columns.
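
To show how per-row windows arise, here is a plain-Ruby model of a rolling sum with `closed: "right"` and the default offset of `-period`, so that each row's window is `(t - period, t]`. `rolling_sums` is a hypothetical helper and does not reflect the Polars implementation; it reproduces the `sum_a` column from the example above.

```ruby
# Plain-Ruby model of rolling windows that are open on the left
# and closed on the right: (t - period, t].
def rolling_sums(times, values, period_seconds)
  times.map do |t|
    window_start = t - period_seconds
    times.each_index
         .select { |j| times[j] > window_start && times[j] <= t }
         .sum { |j| values[j] }
  end
end

times = [
  "2020-01-01 13:45:48",
  "2020-01-01 16:42:13",
  "2020-01-01 16:45:09",
  "2020-01-02 18:12:48",
  "2020-01-03 19:45:32",
  "2020-01-08 23:16:43"
].map { |s| Time.utc(*s.scan(/\d+/).map(&:to_i)) }

sums = rolling_sums(times, [3, 7, 5, 9, 2, 1], 2 * 86_400)
puts sums.inspect # prints [3, 10, 15, 24, 11, 1]
```

Each row gets its own window ending at that row's timestamp, which is why the windows are not of constant intervals.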

Returns:

  • (RollingGroupBy)


# File 'lib/polars/data_frame.rb', line 2875

def rolling(
  index_column:,
  period:,
  offset: nil,
  closed: "right",
  group_by: nil
)
  RollingGroupBy.new(self, index_column, period, offset, closed, group_by)
end

#row(index = nil, by_predicate: nil, named: false) ⇒ Object

Note:

The index and by_predicate params are mutually exclusive. Additionally, to ensure clarity, the by_predicate parameter must be supplied by keyword.

When using by_predicate it is an error condition if anything other than one row is returned; more than one row raises TooManyRowsReturned, and zero rows will raise NoRowsReturned (both inherit from RowsException).

Get a row as an array, either by index or by predicate.

Examples:

Return the row at the given index

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.row(2)
# => [3, 8, "c"]

Get a hash instead with a mapping of column names to row values

df.row(2, named: true)
# => {"foo"=>3, "bar"=>8, "ham"=>"c"}

Return the row that matches the given predicate

df.row(by_predicate: Polars.col("ham") == "b")
# => [2, 7, "b"]

Parameters:

  • index (Object) (defaults to: nil)

    Row index.

  • by_predicate (Object) (defaults to: nil)

    Select the row according to a given expression/predicate.

  • named (Boolean) (defaults to: false)

    Return a hash instead of an array. The hash is a mapping of column name to row value. This is more expensive than returning an array, but allows for accessing values by column name.
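
The "exactly one row" contract of `by_predicate` can be sketched in plain Ruby. The module and exception classes below are local stand-ins defined only for this illustration; they are not the Polars classes of similar name.

```ruby
# RowSketch and its classes are hypothetical stand-ins for illustration.
module RowSketch
  class RowsError < StandardError; end
  class TooManyRows < RowsError; end
  class NoRows < RowsError; end

  # Return the single row matching the block, or raise.
  def self.single_row(rows, &predicate)
    matches = rows.select(&predicate)
    raise TooManyRows, "predicate returned #{matches.length} rows" if matches.length > 1
    raise NoRows, "predicate returned no rows" if matches.empty?

    matches.first
  end
end

rows = [[1, 6, "a"], [2, 7, "b"], [3, 8, "c"]]
match = RowSketch.single_row(rows) { |row| row[2] == "b" }
puts match.inspect # prints [2, 7, "b"]
```

Rescuing the shared parent class handles both failure modes at once, which is the reason for giving the two errors a common ancestor.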

Returns:

  • (Object)


# File 'lib/polars/data_frame.rb', line 5840

def row(index = nil, by_predicate: nil, named: false)
  if !index.nil? && !by_predicate.nil?
    raise ArgumentError, "Cannot set both 'index' and 'by_predicate'; mutually exclusive"
  elsif index.is_a?(Expr)
    raise TypeError, "Expressions should be passed to the 'by_predicate' param"
  end

  if !index.nil?
    row = _df.row_tuple(index)
    if named
      columns.zip(row).to_h
    else
      row
    end
  elsif !by_predicate.nil?
    if !by_predicate.is_a?(Expr)
      raise TypeError, "Expected by_predicate to be an expression; found #{by_predicate.class.name}"
    end
    rows = filter(by_predicate).rows
    n_rows = rows.length
    if n_rows > 1
      raise TooManyRowsReturned, "Predicate #{by_predicate} returned #{n_rows} rows"
    elsif n_rows == 0
      raise NoRowsReturned, "Predicate #{by_predicate} returned no rows"
    end
    row = rows[0]
    if named
      columns.zip(row).to_h
    else
      row
    end
  else
    raise ArgumentError, "One of 'index' or 'by_predicate' must be set"
  end
end

#rows(named: false) ⇒ Array

Convert columnar data to rows as Ruby arrays.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
)
df.rows
# => [[1, 2], [3, 4], [5, 6]]
df.rows(named: true)
# => [{"a"=>1, "b"=>2}, {"a"=>3, "b"=>4}, {"a"=>5, "b"=>6}]

Parameters:

  • named (Boolean) (defaults to: false)

    Return hashes instead of arrays. The hashes are a mapping of column name to row value. This is more expensive than returning an array, but allows for accessing values by column name.

Returns:

  • (Array)


# File 'lib/polars/data_frame.rb', line 5897

def rows(named: false)
  if named
    columns = self.columns
    _df.row_tuples.map do |v|
      columns.zip(v).to_h
    end
  else
    _df.row_tuples
  end
end

#rows_by_key(key, named: false, include_key: false, unique: false) ⇒ Hash

Convert columnar data to rows as Ruby arrays in a hash keyed by some column.

This method is like rows, but instead of returning rows in a flat list, rows are grouped by the values in the key column(s) and returned as a hash.

Note that this method should not be used in place of native operations, due to the high cost of materializing all frame data out into a hash; it should be used only when you need to move the values out into a Ruby data structure or other object that cannot operate directly with Polars/Arrow.

Examples:

Group rows by the given key column(s):

df = Polars::DataFrame.new(
  {
    "w" => ["a", "b", "b", "a"],
    "x" => ["q", "q", "q", "k"],
    "y" => [1.0, 2.5, 3.0, 4.5],
    "z" => [9, 8, 7, 6]
  }
)
df.rows_by_key(["w"])
# => {"a"=>[["q", 1.0, 9], ["k", 4.5, 6]], "b"=>[["q", 2.5, 8], ["q", 3.0, 7]]}

Return the same row groupings as hashes:

df.rows_by_key(["w"], named: true)
# => {"a"=>[{"x"=>"q", "y"=>1.0, "z"=>9}, {"x"=>"k", "y"=>4.5, "z"=>6}], "b"=>[{"x"=>"q", "y"=>2.5, "z"=>8}, {"x"=>"q", "y"=>3.0, "z"=>7}]}

Return row groupings, assuming keys are unique:

df.rows_by_key(["z"], unique: true)
# => {9=>["a", "q", 1.0], 8=>["b", "q", 2.5], 7=>["b", "q", 3.0], 6=>["a", "k", 4.5]}

Return row groupings as hashes, assuming keys are unique:

df.rows_by_key(["z"], named: true, unique: true)
# => {9=>{"w"=>"a", "x"=>"q", "y"=>1.0}, 8=>{"w"=>"b", "x"=>"q", "y"=>2.5}, 7=>{"w"=>"b", "x"=>"q", "y"=>3.0}, 6=>{"w"=>"a", "x"=>"k", "y"=>4.5}}

Return hash rows grouped by a compound key, including key values:

df.rows_by_key(["w", "x"], named: true, include_key: true)
# => {["a", "q"]=>[{"w"=>"a", "x"=>"q", "y"=>1.0, "z"=>9}], ["b", "q"]=>[{"w"=>"b", "x"=>"q", "y"=>2.5, "z"=>8}, {"w"=>"b", "x"=>"q", "y"=>3.0, "z"=>7}], ["a", "k"]=>[{"w"=>"a", "x"=>"k", "y"=>4.5, "z"=>6}]}

Parameters:

  • key (Object)

    The column(s) to use as the key for the returned hash. If multiple columns are specified, the key will be an array of those values; otherwise it will be the single value from that column.

  • named (Boolean) (defaults to: false)

    Return hashes instead of arrays. The hashes are a mapping of column name to row value. This is more expensive than returning an array, but allows for accessing values by column name.

  • include_key (Boolean) (defaults to: false)

    Include key values inline with the associated data (by default the key values are omitted as a memory/performance optimisation, as they can be reconstructed from the key).

  • unique (Boolean) (defaults to: false)

    Indicate that the key is unique; this will result in a 1:1 mapping from key to a single associated row. Note that if the key is not actually unique the last row with the given key will be returned.
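
The effect of `unique` on duplicate keys can be seen in a plain-Ruby model. `group_rows_by_key` is a hypothetical helper working on row hashes with a single key column; it mirrors the two branches at the end of the method source but is not the Polars implementation.

```ruby
# Plain-Ruby model of rows_by_key for one key column (key omitted from rows).
def group_rows_by_key(rows, key, unique: false)
  pairs = rows.map { |row| [row[key], row.reject { |name, _| name == key }] }
  if unique
    pairs.to_h # a later duplicate key overwrites the earlier one
  else
    pairs.each_with_object({}) { |(k, data), h| (h[k] ||= []) << data }
  end
end

rows = [
  {"w" => "a", "z" => 9},
  {"w" => "b", "z" => 8},
  {"w" => "b", "z" => 7}
]
grouped = group_rows_by_key(rows, "w")
last_wins = group_rows_by_key(rows, "w", unique: true)
# grouped   == {"a"=>[{"z"=>9}], "b"=>[{"z"=>8}, {"z"=>7}]}
# last_wins == {"a"=>{"z"=>9}, "b"=>{"z"=>7}}
```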

Returns:

  • (Hash)


# File 'lib/polars/data_frame.rb', line 5964

def rows_by_key(key, named: false, include_key: false, unique: false)
  key = Utils._expand_selectors(self, key)

  keys = key.size == 1 ? get_column(key[0]) : select(key).iter_rows

  if include_key
    values = self
  else
    data_cols = schema.names - key
    values = select(data_cols)
  end

  zipped = keys.each.zip(values.iter_rows(named: named))

  # if unique, we expect to write just one entry per key; otherwise, we're
  # returning a list of rows for each key, so append into a hash of arrays.
  if unique
    zipped.to_h
  else
    zipped.each_with_object({}) { |(key, data), h| (h[key] ||= []) << data }
  end
end

#sample(n: nil, fraction: nil, with_replacement: false, shuffle: false, seed: nil) ⇒ DataFrame

Sample from this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.sample(n: 2, seed: 0)
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# └─────┴─────┴─────┘

Parameters:

  • n (Integer) (defaults to: nil)

    Number of items to return. Cannot be used with fraction. Defaults to 1 if fraction is nil.

  • fraction (Float) (defaults to: nil)

    Fraction of items to return. Cannot be used with n.

  • with_replacement (Boolean) (defaults to: false)

    Allow values to be sampled more than once.

  • shuffle (Boolean) (defaults to: false)

    Shuffle the order of sampled data points.

  • seed (Integer) (defaults to: nil)

    Seed for the random number generator. If set to nil (default), a random seed is used.
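
How the parameters interact can be sketched on a plain array. `sample_rows` is a hypothetical helper: it uses Ruby's own random number generator, so the rows it picks will not match what Polars returns for the same seed, and truncating `length * fraction` to an integer is an assumption made for this sketch.

```ruby
# Plain-Ruby model of the n / fraction / with_replacement / seed options.
def sample_rows(rows, n: nil, fraction: nil, with_replacement: false, seed: nil)
  raise ArgumentError, "cannot specify both `n` and `fraction`" if n && fraction

  n = (rows.length * fraction).to_i if fraction # assumption: truncate
  n ||= 1
  rng = seed ? Random.new(seed) : Random.new
  if with_replacement
    Array.new(n) { rows.sample(random: rng) } # the same row may repeat
  else
    rows.sample(n, random: rng) # each row is picked at most once
  end
end

rows = [[1, 6, "a"], [2, 7, "b"], [3, 8, "c"], [4, 9, "d"]]
first = sample_rows(rows, n: 2, seed: 0)
again = sample_rows(rows, n: 2, seed: 0)
half = sample_rows(rows, fraction: 0.5, seed: 1)
```

With a fixed seed, `first` and `again` contain the same rows, which is the property that makes seeded sampling useful in tests.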

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 5685

def sample(
  n: nil,
  fraction: nil,
  with_replacement: false,
  shuffle: false,
  seed: nil
)
  if !n.nil? && !fraction.nil?
    raise ArgumentError, "cannot specify both `n` and `fraction`"
  end

  if n.nil? && !fraction.nil?
    fraction = Series.new("fraction", [fraction]) unless fraction.is_a?(Series)

    return _from_rbdf(
      _df.sample_frac(fraction._s, with_replacement, shuffle, seed)
    )
  end

  if n.nil?
    n = 1
  end

  n = Series.new("", [n]) unless n.is_a?(Series)

  _from_rbdf(_df.sample_n(n._s, with_replacement, shuffle, seed))
end

#schema ⇒ Hash

Get the schema.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.schema
# => Polars::Schema({"foo"=>Polars::Int64, "bar"=>Polars::Float64, "ham"=>Polars::String})

Returns:

  • (Hash)


# File 'lib/polars/data_frame.rb', line 285

def schema
  Schema.new(columns.zip(dtypes).to_h)
end

#select(*exprs, **named_exprs) ⇒ DataFrame

Select columns from this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.select("foo")
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# │ 2   │
# │ 3   │
# └─────┘
df.select(["foo", "bar"])
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 6   │
# │ 2   ┆ 7   │
# │ 3   ┆ 8   │
# └─────┴─────┘
df.select(Polars.col("foo") + 1)
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 2   │
# │ 3   │
# │ 4   │
# └─────┘
df.select([Polars.col("foo") + 1, Polars.col("bar") + 1])
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 7   │
# │ 3   ┆ 8   │
# │ 4   ┆ 9   │
# └─────┴─────┘
df.select(Polars.when(Polars.col("foo") > 2).then(10).otherwise(0))
# =>
# shape: (3, 1)
# ┌─────────┐
# │ literal │
# │ ---     │
# │ i32     │
# ╞═════════╡
# │ 0       │
# │ 0       │
# │ 10      │
# └─────────┘

Parameters:

  • exprs (Array)

    Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.

  • named_exprs (Hash)

    Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 4909

def select(*exprs, **named_exprs)
  lazy.select(*exprs, **named_exprs).collect(optimizations: QueryOptFlags._eager)
end

#select_seq(*exprs, **named_exprs) ⇒ DataFrame

Select columns from this DataFrame.

This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.

Parameters:

  • exprs (Array)

    Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.

  • named_exprs (Hash)

    Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 4927

def select_seq(*exprs, **named_exprs)
  lazy
    .select_seq(*exprs, **named_exprs)
    .collect(optimizations: QueryOptFlags._eager)
end

#serialize(file = nil) ⇒ Object

Note:

Serialization is not stable across Polars versions: a DataFrame serialized in one Polars version may not be deserializable in another Polars version.

Serialize this DataFrame to a file or string.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8]
  }
)
bytes = df.serialize
Polars::DataFrame.deserialize(StringIO.new(bytes))
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 6   │
# │ 2   ┆ 7   │
# │ 3   ┆ 8   │
# └─────┴─────┘

Parameters:

  • file (Object) (defaults to: nil)

    File path or writable file-like object to which the result will be written. If set to nil (default), the output is returned as a string instead.

Returns:

  • (Object)


# File 'lib/polars/data_frame.rb', line 870

def serialize(file = nil)
  serializer = _df.method(:serialize_binary)

  Utils.serialize_polars_object(serializer, file)
end

#set_sorted(column, descending: false) ⇒ DataFrame

Note:

This can lead to incorrect results if the data is NOT sorted! Use with care!

Flag a column as sorted.

This can speed up future operations.

Parameters:

  • column (Object)

    Column that is sorted.

  • descending (Boolean) (defaults to: false)

    Whether the column is sorted in descending order.

Returns:

  • (DataFrame)


# File 'lib/polars/data_frame.rb', line 6400

def set_sorted(
  column,
  descending: false
)
  lazy
    .set_sorted(column, descending: descending)
    .collect(optimizations: QueryOptFlags._eager)
end

#shape ⇒ Array

Get the shape of the DataFrame.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3, 4, 5]})
df.shape
# => [5, 1]

Returns:

  • (Array)


# File 'lib/polars/data_frame.rb', line 164

def shape
  _df.shape
end

#shift(n = 1, fill_value: nil) ⇒ DataFrame

Shift values by the given period.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.shift(1)
# =>
# shape: (3, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ null ┆ null ┆ null │
# │ 1    ┆ 6    ┆ a    │
# │ 2    ┆ 7    ┆ b    │
# └──────┴──────┴──────┘
df.shift(-1)
# =>
# shape: (3, 3)
# ┌──────┬──────┬──────┐
# │ foo  ┆ bar  ┆ ham  │
# │ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ i64  ┆ str  │
# ╞══════╪══════╪══════╡
# │ 2    ┆ 7    ┆ b    │
# │ 3    ┆ 8    ┆ c    │
# │ null ┆ null ┆ null │
# └──────┴──────┴──────┘

Parameters:

  • n (Integer) (defaults to: 1)

    Number of places to shift (may be negative).

  • fill_value (Object) (defaults to: nil)

    Fill the resulting null values with this value.
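
The effect of the sign of `n` and of `fill_value` on a single column can be modeled on a plain array. `shift_values` is a hypothetical helper written for this illustration, not a Polars method.

```ruby
# Plain-Ruby model of shifting one column; vacated slots get fill_value.
def shift_values(values, n = 1, fill_value: nil)
  return values.dup if n.zero?

  pad = [fill_value] * [n.abs, values.length].min
  if n > 0
    pad + values[0, values.length - pad.length] # move values down
  else
    values[pad.length..] + pad                  # move values up
  end
end

puts shift_values([1, 2, 3], 1).inspect                 # prints [nil, 1, 2]
puts shift_values([1, 2, 3], -1).inspect                # prints [2, 3, nil]
puts shift_values([1, 2, 3], 2, fill_value: 0).inspect  # prints [0, 0, 1]
```

The result always has the same length as the input; values shifted past either end are dropped.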

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4750

def shift(n = 1, fill_value: nil)
  lazy.shift(n, fill_value: fill_value).collect(optimizations: QueryOptFlags._eager)
end
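
The effect of `n` and `fill_value` on a single column can be sketched in plain Ruby. This only illustrates the semantics shown in the examples above; `shift_values` is a hypothetical helper, not how Polars implements the operation.

```ruby
# Hypothetical helper: shift an array by n places, padding the vacated
# positions with fill_value (nil by default).
def shift_values(values, n = 1, fill_value: nil)
  len = values.length
  return Array.new(len, fill_value) if n.abs >= len

  if n >= 0
    Array.new(n, fill_value) + values.first(len - n)
  else
    values.last(len + n) + Array.new(-n, fill_value)
  end
end

shift_values([1, 2, 3], 1)                 # => [nil, 1, 2]
shift_values([1, 2, 3], -1)                # => [2, 3, nil]
shift_values([1, 2, 3], 1, fill_value: 0)  # => [0, 1, 2]
```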

#shrink_to_fit(in_place: false) ⇒ DataFrame

Shrink DataFrame memory usage.

Shrinks to fit the exact capacity needed to hold the data.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 6155

def shrink_to_fit(in_place: false)
  if in_place
    _df.shrink_to_fit
    self
  else
    df = clone
    df._df.shrink_to_fit
    df
  end
end

#slice(offset, length = nil) ⇒ DataFrame

Get a slice of this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.slice(1, 2)
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 2   ┆ 7.0 ┆ b   │
# │ 3   ┆ 8.0 ┆ c   │
# └─────┴─────┴─────┘

Parameters:

  • offset (Integer)

    Start index. Negative indexing is supported.

  • length (Integer, nil) (defaults to: nil)

    Length of the slice. If set to nil, all rows starting at the offset will be selected.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2482

def slice(offset, length = nil)
  if !length.nil? && length < 0
    length = height - offset + length
  end
  _from_rbdf(_df.slice(offset, length))
end
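
The source above also accepts a negative `length`, which it rewrites relative to the end of the frame. A small plain-Ruby sketch of that arithmetic, assuming a non-negative `offset`; `resolve_slice_length` is a hypothetical name used only for illustration.

```ruby
# Mirrors the length adjustment in the method body above:
# a negative length is converted to height - offset + length.
def resolve_slice_length(height, offset, length)
  return length if length.nil? || length >= 0

  height - offset + length
end

resolve_slice_length(3, 1, 2)    # => 2 (unchanged)
resolve_slice_length(3, 1, -1)   # => 1 (stop one row before the end)
resolve_slice_length(3, 1, nil)  # => nil (all rows from the offset)
```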

#sort(by, *more_by, descending: false, nulls_last: false, multithreaded: true, maintain_order: false) ⇒ DataFrame

Sort the dataframe by the given columns.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
)
df.sort("foo", descending: true)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# │ 2   ┆ 7.0 ┆ b   │
# │ 1   ┆ 6.0 ┆ a   │
# └─────┴─────┴─────┘

Sort by multiple columns.

df.sort(
  [Polars.col("foo"), Polars.col("bar")**2],
  descending: [true, false]
)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# │ 2   ┆ 7.0 ┆ b   │
# │ 1   ┆ 6.0 ┆ a   │
# └─────┴─────┴─────┘

Parameters:

  • by (Object)

    Column(s) to sort by. Accepts expression input, including selectors. Strings are parsed as column names.

  • more_by (Array)

    Additional columns to sort by, specified as positional arguments.

  • descending (Boolean) (defaults to: false)

    Sort in descending order. When sorting by multiple columns, can be specified per column by passing an array of booleans.

  • nulls_last (Boolean) (defaults to: false)

    Place null values last; can specify a single boolean applying to all columns or an array of booleans for per-column control.

  • multithreaded (Boolean) (defaults to: true)

    Sort using multiple threads.

  • maintain_order (Boolean) (defaults to: false)

    Whether the order should be maintained if elements are equal.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2170

def sort(
  by,
  *more_by,
  descending: false,
  nulls_last: false,
  multithreaded: true,
  maintain_order: false
)
  lazy
    .sort(
      by,
      *more_by,
      descending: descending,
      nulls_last: nulls_last,
      multithreaded: multithreaded,
      maintain_order: maintain_order
    )
    .collect(optimizations: QueryOptFlags._eager)
end
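
How a per-column `descending` array interacts with several sort keys can be sketched in plain Ruby on an array of row hashes. This is an illustration only, assuming no nil values; `sort_rows` is a hypothetical helper, and unlike `maintain_order: true` it makes no promise about the order of fully equal rows.

```ruby
# Hypothetical helper: sort row hashes by several keys, each with its own
# direction. A single boolean applies to every key.
def sort_rows(rows, by, descending: false)
  flags = descending.is_a?(Array) ? descending : Array.new(by.length, descending)
  rows.sort do |x, y|
    result = 0
    by.zip(flags).each do |key, desc|
      result = desc ? (y[key] <=> x[key]) : (x[key] <=> y[key])
      break unless result.zero? # later keys only break ties
    end
    result
  end
end

rows = [
  { "g" => 1, "v" => 2 },
  { "g" => 1, "v" => 1 },
  { "g" => 2, "v" => 0 }
]
sort_rows(rows, ["g", "v"], descending: [true, false])
# => g descending first, then v ascending within equal g
```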

#sort!(by, descending: false, nulls_last: false) ⇒ DataFrame

Sort the DataFrame by column in-place.

Parameters:

  • by (String)

    By which column to sort.

  • descending (Boolean) (defaults to: false)

    Reverse/descending sort.

  • nulls_last (Boolean) (defaults to: false)

    Place null values last. Can only be used if sorted by a single column.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2200

def sort!(by, descending: false, nulls_last: false)
  self._df = sort(by, descending: descending, nulls_last: nulls_last)._df
end

#sql(query, table_name: "self") ⇒ DataFrame

Note:

This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.

Note:
  • The calling frame is automatically registered as a table in the SQL context under the name "self". To query other DataFrames and LazyFrames as well, use the top-level sql function (pl.sql in the Python API).
  • More control over registration and execution behaviour is available by using the SQLContext object.
  • The SQL query executes in lazy mode before being collected and returned as a DataFrame.

Execute a SQL query against the DataFrame.

Examples:

Query the DataFrame using SQL:

df1 = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3],
    "b" => ["zz", "yy", "xx"],
    "c" => [Date.new(1999, 12, 31), Date.new(2010, 10, 10), Date.new(2077, 8, 8)]
  }
)
df1.sql("SELECT c, b FROM self WHERE a > 1")
# =>
# shape: (2, 2)
# ┌────────────┬─────┐
# │ c          ┆ b   │
# │ ---        ┆ --- │
# │ date       ┆ str │
# ╞════════════╪═════╡
# │ 2010-10-10 ┆ yy  │
# │ 2077-08-08 ┆ xx  │
# └────────────┴─────┘

Apply transformations to a DataFrame using SQL, aliasing "self" to "frame".

df1.sql(
  "
    SELECT
        a,
        (a % 2 == 0) AS a_is_even,
        CONCAT_WS(':', b, b) AS b_b,
        EXTRACT(year FROM c) AS year,
        0::float4 AS \"zero\",
    FROM frame
  ",
  table_name: "frame"
)
# =>
# shape: (3, 5)
# ┌─────┬───────────┬───────┬──────┬──────┐
# │ a   ┆ a_is_even ┆ b_b   ┆ year ┆ zero │
# │ --- ┆ ---       ┆ ---   ┆ ---  ┆ ---  │
# │ i64 ┆ bool      ┆ str   ┆ i32  ┆ f32  │
# ╞═════╪═══════════╪═══════╪══════╪══════╡
# │ 1   ┆ false     ┆ zz:zz ┆ 1999 ┆ 0.0  │
# │ 2   ┆ true      ┆ yy:yy ┆ 2010 ┆ 0.0  │
# │ 3   ┆ false     ┆ xx:xx ┆ 2077 ┆ 0.0  │
# └─────┴───────────┴───────┴──────┴──────┘

Parameters:

  • query (String)

    SQL query to execute.

  • table_name (String) (defaults to: "self")

    Optionally provide an explicit name for the table that represents the calling frame (defaults to "self").

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2272

def sql(query, table_name: "self")
  ctx = SQLContext.new(eager: true)
  name = table_name || "self"
  ctx.register(name, self)
  ctx.execute(query)
end

#std(ddof: 1) ⇒ DataFrame

Aggregate the columns of this DataFrame to their standard deviation value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.std
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ f64 ┆ f64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 1.0 ┆ 1.0 ┆ null │
# └─────┴─────┴──────┘
df.std(ddof: 0)
# =>
# shape: (1, 3)
# ┌──────────┬──────────┬──────┐
# │ foo      ┆ bar      ┆ ham  │
# │ ---      ┆ ---      ┆ ---  │
# │ f64      ┆ f64      ┆ str  │
# ╞══════════╪══════════╪══════╡
# │ 0.816497 ┆ 0.816497 ┆ null │
# └──────────┴──────────┴──────┘

Parameters:

  • ddof (Integer) (defaults to: 1)

    Delta degrees of freedom. The divisor used in the calculation is N - ddof, where N is the number of elements.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 5349

def std(ddof: 1)
  lazy.std(ddof: ddof).collect(optimizations: QueryOptFlags._eager)
end
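
The role of `ddof` can be checked by hand. The sketch below recomputes the standard deviation of a single column in plain Ruby, assuming the usual definition with divisor N - ddof; `std_with_ddof` is a hypothetical helper, not a Polars method.

```ruby
# Hypothetical helper: standard deviation with divisor (n - ddof).
def std_with_ddof(values, ddof: 1)
  n = values.length
  mean = values.sum.to_f / n
  sum_sq = values.sum { |v| (v - mean)**2 } # sum of squared deviations
  Math.sqrt(sum_sq / (n - ddof))
end

std_with_ddof([1, 2, 3])                    # => 1.0
std_with_ddof([1, 2, 3], ddof: 0).round(6)  # => 0.816497
```

Both values agree with the "foo" column in the examples above.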

#sum ⇒ DataFrame

Aggregate the columns of this DataFrame to their sum value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"],
  }
)
df.sum
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ i64 ┆ i64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 6   ┆ 21  ┆ null │
# └─────┴─────┴──────┘

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 5222

def sum
  lazy.sum.collect(optimizations: QueryOptFlags._eager)
end

#sum_horizontal(ignore_nulls: true) ⇒ Series

Sum all values horizontally across columns.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [4.0, 5.0, 6.0]
  }
)
df.sum_horizontal
# =>
# shape: (3,)
# Series: 'sum' [f64]
# [
#         5.0
#         7.0
#         9.0
# ]

Parameters:

  • ignore_nulls (Boolean) (defaults to: true)

    Ignore null values (default). If set to false, any null value in the input will lead to a null output.

Returns:

  • (Series)

# File 'lib/polars/data_frame.rb', line 5250

def sum_horizontal(ignore_nulls: true)
  select(
    sum: F.sum_horizontal(F.all, ignore_nulls: ignore_nulls)
  ).to_series
end
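
The difference between the two `ignore_nulls` settings can be sketched in plain Ruby, with each row given as an array and nil standing in for null. `sum_rows` is a hypothetical helper that only illustrates the documented behaviour.

```ruby
# Hypothetical helper: row-wise sums over arrays that may contain nil.
def sum_rows(rows, ignore_nulls: true)
  rows.map do |row|
    if ignore_nulls
      row.compact.sum                     # nil entries are skipped
    else
      row.any?(&:nil?) ? nil : row.sum    # any nil makes the result nil
    end
  end
end

rows = [[1.0, 4.0], [2.0, nil], [3.0, 6.0]]
sum_rows(rows)                       # => [5.0, 2.0, 9.0]
sum_rows(rows, ignore_nulls: false)  # => [5.0, nil, 9.0]
```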

#tail(n = 5) ⇒ DataFrame

Get the last n rows.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3, 4, 5],
    "bar" => [6, 7, 8, 9, 10],
    "ham" => ["a", "b", "c", "d", "e"]
  }
)
df.tail(3)
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8   ┆ c   │
# │ 4   ┆ 9   ┆ d   │
# │ 5   ┆ 10  ┆ e   │
# └─────┴─────┴─────┘

Parameters:

  • n (Integer) (defaults to: 5)

    Number of rows to return.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2577

def tail(n = 5)
  _from_rbdf(_df.tail(n))
end

#to_a ⇒ Array

Returns an array of hashes, one per row, representing the DataFrame.

Returns:

  • (Array)

# File 'lib/polars/data_frame.rb', line 402

def to_a
  rows(named: true)
end

#to_csv(**options) ⇒ String

Write to comma-separated values (CSV) string.

Returns:

  • (String)

# File 'lib/polars/data_frame.rb', line 1081

def to_csv(**options)
  write_csv(**options)
end

#to_dummies(columns: nil, separator: "_", drop_first: false, drop_nulls: false) ⇒ DataFrame

Get one hot encoded dummy variables.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2],
    "bar" => [3, 4],
    "ham" => ["a", "b"]
  }
)
df.to_dummies
# =>
# shape: (2, 6)
# ┌───────┬───────┬───────┬───────┬───────┬───────┐
# │ foo_1 ┆ foo_2 ┆ bar_3 ┆ bar_4 ┆ ham_a ┆ ham_b │
# │ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
# │ u8    ┆ u8    ┆ u8    ┆ u8    ┆ u8    ┆ u8    │
# ╞═══════╪═══════╪═══════╪═══════╪═══════╪═══════╡
# │ 1     ┆ 0     ┆ 1     ┆ 0     ┆ 1     ┆ 0     │
# │ 0     ┆ 1     ┆ 0     ┆ 1     ┆ 0     ┆ 1     │
# └───────┴───────┴───────┴───────┴───────┴───────┘

Parameters:

  • columns (Array) (defaults to: nil)

    A subset of columns to convert to dummy variables. nil means "all columns".

  • separator (String) (defaults to: "_")

    Separator/delimiter used when generating column names.

  • drop_first (Boolean) (defaults to: false)

    Remove the first category from the variables being encoded.

  • drop_nulls (Boolean) (defaults to: false)

    If set to true, no dummy column is generated for nil values in the series.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 5510

def to_dummies(columns: nil, separator: "_", drop_first: false, drop_nulls: false)
  if columns.is_a?(::String)
    columns = [columns]
  end
  _from_rbdf(_df.to_dummies(columns, separator, drop_first, drop_nulls))
end
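
One-hot encoding of a single column, including the `separator` and `drop_first` options, can be sketched in plain Ruby. `one_hot` is a hypothetical helper; it takes categories in order of first appearance, which is an assumption of this sketch rather than a statement about Polars.

```ruby
# Hypothetical helper: one 0/1 indicator column per distinct value.
def one_hot(name, values, separator: "_", drop_first: false)
  categories = values.uniq
  categories = categories.drop(1) if drop_first
  categories.to_h do |category|
    ["#{name}#{separator}#{category}", values.map { |v| v == category ? 1 : 0 }]
  end
end

one_hot("ham", ["a", "b"])
# => {"ham_a" => [1, 0], "ham_b" => [0, 1]}
one_hot("ham", ["a", "b"], separator: ":", drop_first: true)
# => {"ham:b" => [0, 1]}
```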

#to_h(as_series: true) ⇒ Hash

Convert DataFrame to a hash mapping column name to values.

Returns:

  • (Hash)


# File 'lib/polars/data_frame.rb', line 763

def to_h(as_series: true)
  if as_series
    get_columns.to_h { |s| [s.name, s] }
  else
    get_columns.to_h { |s| [s.name, s.to_a] }
  end
end

#to_hashes ⇒ Array

Convert every row to a hash.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]})
df.to_hashes
# =>
# [{"foo"=>1, "bar"=>4}, {"foo"=>2, "bar"=>5}, {"foo"=>3, "bar"=>6}]

Returns:

  • (Array)

# File 'lib/polars/data_frame.rb', line 780

def to_hashes
  rows(named: true)
end
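
The row-oriented result relates to a column-oriented hash (the shape produced by `to_h(as_series: false)`) as follows. This is a plain-Ruby sketch of the reshaping only, using the same data as the example above.

```ruby
# Column-oriented data: column name => array of values.
columns = { "foo" => [1, 2, 3], "bar" => [4, 5, 6] }

# Row-oriented data: one hash per row, keyed by column name.
rows = columns.values.transpose.map do |row_values|
  columns.keys.zip(row_values).to_h
end
# rows now holds {"foo" => 1, "bar" => 4}, {"foo" => 2, "bar" => 5}, ...
```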

#to_numo ⇒ Numo::NArray

Convert DataFrame to a 2D Numo array.

This operation clones data.

Examples:

df = Polars::DataFrame.new(
  {"foo" => [1, 2, 3], "bar" => [6, 7, 8], "ham" => ["a", "b", "c"]}
)
df.to_numo.class
# => Numo::RObject

Returns:

  • (Numo::NArray)


# File 'lib/polars/data_frame.rb', line 796

def to_numo
  out = _df.to_numo
  if out.nil?
    Numo::NArray.vstack(width.times.map { |i| to_series(i).to_numo }).transpose
  else
    out
  end
end

#to_s ⇒ String

Also known as: inspect

Returns a string representing the DataFrame.

Returns:

  • (String)

# File 'lib/polars/data_frame.rb', line 394

def to_s
  _df.to_s
end

#to_series(index = 0) ⇒ Series

Select column as Series at index location.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.to_series(1)
# =>
# shape: (3,)
# Series: 'bar' [i64]
# [
#         6
#         7
#         8
# ]

Parameters:

  • index (Integer) (defaults to: 0)

    Location of selection.

Returns:

  • (Series)

# File 'lib/polars/data_frame.rb', line 831

def to_series(index = 0)
  if index < 0
    index = columns.length + index
  end
  Utils.wrap_s(_df.to_series(index))
end

#to_struct(name = "") ⇒ Series

Convert a DataFrame to a Series of type Struct.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4, 5],
    "b" => ["one", "two", "three", "four", "five"]
  }
)
df.to_struct("nums")
# =>
# shape: (5,)
# Series: 'nums' [struct[2]]
# [
#         {1,"one"}
#         {2,"two"}
#         {3,"three"}
#         {4,"four"}
#         {5,"five"}
# ]

Parameters:

  • name (String) (defaults to: "")

    Name for the struct Series.

Returns:

  • (Series)

# File 'lib/polars/data_frame.rb', line 6296

def to_struct(name = "")
  Utils.wrap_s(_df.to_struct(name))
end

#top_k(k, by:, reverse: false) ⇒ DataFrame

Return the k largest rows.

Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort after this function if you need sorted output.

Examples:

Get the rows which contain the 4 largest values in column b.

df = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [2, 1, 1, 3, 2, 1]
  }
)
df.top_k(4, by: "b")
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ b   ┆ 3   │
# │ a   ┆ 2   │
# │ b   ┆ 2   │
# │ b   ┆ 1   │
# └─────┴─────┘

Get the rows which contain the 4 largest values when sorting on columns b and a.

df.top_k(4, by: ["b", "a"])
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ b   ┆ 3   │
# │ b   ┆ 2   │
# │ a   ┆ 2   │
# │ c   ┆ 1   │
# └─────┴─────┘

Parameters:

  • k (Integer)

    Number of rows to return.

  • by (Object)

    Column(s) used to determine the top rows. Accepts expression input. Strings are parsed as column names.

  • reverse (Object) (defaults to: false)

    Consider the k smallest elements of the by column(s) (instead of the k largest). This can be specified per column by passing an array of booleans.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2333

def top_k(
  k,
  by:,
  reverse: false
)
  lazy
  .top_k(k, by: by, reverse: reverse)
  .collect(
    optimizations: QueryOptFlags.new(
      projection_pushdown: false,
      predicate_pushdown: false,
      comm_subplan_elim: false,
      slice_pushdown: true
    )
  )
end

#transpose(include_header: false, header_name: "column", column_names: nil) ⇒ DataFrame

Note:

This is a very expensive operation. Perhaps you can do it differently.

Transpose a DataFrame over the diagonal.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => [1, 2, 3]})
df.transpose(include_header: true)
# =>
# shape: (2, 4)
# ┌────────┬──────────┬──────────┬──────────┐
# │ column ┆ column_0 ┆ column_1 ┆ column_2 │
# │ ---    ┆ ---      ┆ ---      ┆ ---      │
# │ str    ┆ i64      ┆ i64      ┆ i64      │
# ╞════════╪══════════╪══════════╪══════════╡
# │ a      ┆ 1        ┆ 2        ┆ 3        │
# │ b      ┆ 1        ┆ 2        ┆ 3        │
# └────────┴──────────┴──────────┴──────────┘

Replace the auto-generated column names with a list:

df.transpose(include_header: false, column_names: ["a", "b", "c"])
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 2   ┆ 3   │
# │ 1   ┆ 2   ┆ 3   │
# └─────┴─────┴─────┘

Include the header as a separate column:

df.transpose(
  include_header: true, header_name: "foo", column_names: ["a", "b", "c"]
)
# =>
# shape: (2, 4)
# ┌─────┬─────┬─────┬─────┐
# │ foo ┆ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╪═════╡
# │ a   ┆ 1   ┆ 2   ┆ 3   │
# │ b   ┆ 1   ┆ 2   ┆ 3   │
# └─────┴─────┴─────┴─────┘

Parameters:

  • include_header (Boolean) (defaults to: false)

    If set, the column names will be added as first column.

  • header_name (String) (defaults to: "column")

    If include_header is set, this determines the name of the column that will be inserted.

  • column_names (Array) (defaults to: nil)

    Optional array of column names. Will be used to replace the auto-generated column names of the transposed DataFrame.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 1599

def transpose(include_header: false, header_name: "column", column_names: nil)
  keep_names_as = include_header ? header_name : nil
  _from_rbdf(_df.transpose(keep_names_as, column_names))
end

#unique(maintain_order: false, subset: nil, keep: "any") ⇒ DataFrame

Note:

Note that this fails if there is a column of type List in the DataFrame or subset.

Drop duplicate rows from this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 1, 2, 3, 4, 5],
    "b" => [0.5, 0.5, 1.0, 2.0, 3.0, 3.0],
    "c" => [true, true, true, false, true, true]
  }
)
df.unique(maintain_order: true)
# =>
# shape: (5, 3)
# ┌─────┬─────┬───────┐
# │ a   ┆ b   ┆ c     │
# │ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ bool  │
# ╞═════╪═════╪═══════╡
# │ 1   ┆ 0.5 ┆ true  │
# │ 2   ┆ 1.0 ┆ true  │
# │ 3   ┆ 2.0 ┆ false │
# │ 4   ┆ 3.0 ┆ true  │
# │ 5   ┆ 3.0 ┆ true  │
# └─────┴─────┴───────┘

Parameters:

  • maintain_order (Boolean) (defaults to: false)

    Keep the same order as the original DataFrame. This requires more work to compute.

  • subset (Object) (defaults to: nil)

    Subset to use to compare rows.

  • keep ("first", "last", "any", "none") (defaults to: "any")

    Which of the duplicate rows to keep (in conjunction with subset).

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 5555

def unique(maintain_order: false, subset: nil, keep: "any")
  self._from_rbdf(
    lazy
      .unique(maintain_order: maintain_order, subset: subset, keep: keep)
      .collect(optimizations: QueryOptFlags._eager)
      ._df
  )
end
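
The interaction of `subset` and `keep` can be sketched in plain Ruby for the "first" and "last" strategies. `unique_rows` is a hypothetical helper operating on row hashes; it always preserves input order, which corresponds to `maintain_order: true`.

```ruby
# Hypothetical helper: drop rows whose subset values were already seen.
def unique_rows(rows, subset:, keep: "first")
  key = ->(row) { row.values_at(*subset) }
  case keep
  when "first" then rows.uniq(&key)
  when "last"  then rows.reverse.uniq(&key).reverse
  else raise ArgumentError, "unsupported keep: #{keep}"
  end
end

rows = [
  { "a" => 1, "b" => "x" },
  { "a" => 1, "b" => "y" },
  { "a" => 2, "b" => "z" }
]
unique_rows(rows, subset: ["a"])                # keeps the "x" row for a = 1
unique_rows(rows, subset: ["a"], keep: "last")  # keeps the "y" row for a = 1
```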

#unnest(columns, *more_columns, separator: nil) ⇒ DataFrame

Decompose a struct into its fields.

The fields are inserted into the DataFrame at the position of the struct column.

Examples:

df = Polars::DataFrame.new(
  {
    "before" => ["foo", "bar"],
    "t_a" => [1, 2],
    "t_b" => ["a", "b"],
    "t_c" => [true, nil],
    "t_d" => [[1, 2], [3]],
    "after" => ["baz", "womp"]
  }
).select(["before", Polars.struct(Polars.col("^t_.$")).alias("t_struct"), "after"])
df.unnest("t_struct")
# =>
# shape: (2, 6)
# ┌────────┬─────┬─────┬──────┬───────────┬───────┐
# │ before ┆ t_a ┆ t_b ┆ t_c  ┆ t_d       ┆ after │
# │ ---    ┆ --- ┆ --- ┆ ---  ┆ ---       ┆ ---   │
# │ str    ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str   │
# ╞════════╪═════╪═════╪══════╪═══════════╪═══════╡
# │ foo    ┆ 1   ┆ a   ┆ true ┆ [1, 2]    ┆ baz   │
# │ bar    ┆ 2   ┆ b   ┆ null ┆ [3]       ┆ womp  │
# └────────┴─────┴─────┴──────┴───────────┴───────┘

Parameters:

  • columns (Object)

    Name of the struct column(s) that should be unnested.

  • more_columns (Array)

    Additional columns to unnest, specified as positional arguments.

  • separator (String) (defaults to: nil)

    If given, name each output column as the struct column name, the separator, and the field name joined together.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 6337

def unnest(columns, *more_columns, separator: nil)
  lazy.unnest(columns, *more_columns, separator: separator).collect(optimizations: QueryOptFlags._eager)
end

#unpivot(on = nil, index: nil, variable_name: nil, value_name: nil) ⇒ DataFrame

Unpivot a DataFrame from wide to long format.

Optionally leaves identifiers set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are "unpivoted" to the row axis leaving just two non-identifier columns, 'variable' and 'value'.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["x", "y", "z"],
    "b" => [1, 3, 5],
    "c" => [2, 4, 6]
  }
)
df.unpivot(Polars.cs.numeric, index: "a")
# =>
# shape: (6, 3)
# ┌─────┬──────────┬───────┐
# │ a   ┆ variable ┆ value │
# │ --- ┆ ---      ┆ ---   │
# │ str ┆ str      ┆ i64   │
# ╞═════╪══════════╪═══════╡
# │ x   ┆ b        ┆ 1     │
# │ y   ┆ b        ┆ 3     │
# │ z   ┆ b        ┆ 5     │
# │ x   ┆ c        ┆ 2     │
# │ y   ┆ c        ┆ 4     │
# │ z   ┆ c        ┆ 6     │
# └─────┴──────────┴───────┘

Parameters:

  • on (Object) (defaults to: nil)

    Column(s) or selector(s) to use as value variables; if on is empty, all columns that are not in index will be used.

  • index (Object) (defaults to: nil)

    Column(s) or selector(s) to use as identifier variables.

  • variable_name (Object) (defaults to: nil)

    Name to give to the variable column. Defaults to "variable".

  • value_name (Object) (defaults to: nil)

    Name to give to the value column. Defaults to "value".

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4475

def unpivot(on = nil, index: nil, variable_name: nil, value_name: nil)
  on = on.nil? ? [] : Utils._expand_selectors(self, on)
  index = index.nil? ? [] : Utils._expand_selectors(self, index)

  _from_rbdf(_df.unpivot(on, index, value_name, variable_name))
end
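
The wide-to-long reshaping can be sketched in plain Ruby on a hash of columns. `unpivot_rows` is a hypothetical helper that returns row hashes and requires explicit `on` and `index` arrays; it reproduces the row order of the example above.

```ruby
# Hypothetical helper: one output row per (on column, input row) pair.
def unpivot_rows(columns, on:, index:, variable_name: "variable", value_name: "value")
  height = columns.values.first.length
  on.flat_map do |col|
    (0...height).map do |i|
      row = index.to_h { |name| [name, columns[name][i]] } # identifier values
      row.merge(variable_name => col, value_name => columns[col][i])
    end
  end
end

columns = { "a" => ["x", "y", "z"], "b" => [1, 3, 5], "c" => [2, 4, 6] }
long = unpivot_rows(columns, on: ["b", "c"], index: ["a"])
long.first  # => {"a" => "x", "variable" => "b", "value" => 1}
long.length # => 6
```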

#unstack(step:, how: "vertical", columns: nil, fill_values: nil) ⇒ DataFrame

Note:

This functionality is experimental and may be subject to changes without it being considered a breaking change.

Unstack a long table to a wide form without doing an aggregation.

This can be much faster than a pivot, because it can skip the grouping phase.

Examples:

df = Polars::DataFrame.new(
  {
    "col1" => "A".."I",
    "col2" => Polars.arange(0, 9, eager: true)
  }
)
# =>
# shape: (9, 2)
# ┌──────┬──────┐
# │ col1 ┆ col2 │
# │ ---  ┆ ---  │
# │ str  ┆ i64  │
# ╞══════╪══════╡
# │ A    ┆ 0    │
# │ B    ┆ 1    │
# │ C    ┆ 2    │
# │ D    ┆ 3    │
# │ E    ┆ 4    │
# │ F    ┆ 5    │
# │ G    ┆ 6    │
# │ H    ┆ 7    │
# │ I    ┆ 8    │
# └──────┴──────┘
df.unstack(step: 3, how: "vertical")
# =>
# shape: (3, 6)
# ┌────────┬────────┬────────┬────────┬────────┬────────┐
# │ col1_0 ┆ col1_1 ┆ col1_2 ┆ col2_0 ┆ col2_1 ┆ col2_2 │
# │ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
# │ str    ┆ str    ┆ str    ┆ i64    ┆ i64    ┆ i64    │
# ╞════════╪════════╪════════╪════════╪════════╪════════╡
# │ A      ┆ D      ┆ G      ┆ 0      ┆ 3      ┆ 6      │
# │ B      ┆ E      ┆ H      ┆ 1      ┆ 4      ┆ 7      │
# │ C      ┆ F      ┆ I      ┆ 2      ┆ 5      ┆ 8      │
# └────────┴────────┴────────┴────────┴────────┴────────┘
df.unstack(step: 3, how: "horizontal")
# =>
# shape: (3, 6)
# ┌────────┬────────┬────────┬────────┬────────┬────────┐
# │ col1_0 ┆ col1_1 ┆ col1_2 ┆ col2_0 ┆ col2_1 ┆ col2_2 │
# │ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
# │ str    ┆ str    ┆ str    ┆ i64    ┆ i64    ┆ i64    │
# ╞════════╪════════╪════════╪════════╪════════╪════════╡
# │ A      ┆ B      ┆ C      ┆ 0      ┆ 1      ┆ 2      │
# │ D      ┆ E      ┆ F      ┆ 3      ┆ 4      ┆ 5      │
# │ G      ┆ H      ┆ I      ┆ 6      ┆ 7      ┆ 8      │
# └────────┴────────┴────────┴────────┴────────┴────────┘

Parameters:

  • step (Integer)

    Number of rows in the unstacked frame.

  • how ("vertical", "horizontal") (defaults to: "vertical")

    Direction of the unstack.

  • columns (Object) (defaults to: nil)

    Column(s) to include in the operation.

  • fill_values (Object) (defaults to: nil)

    Fill values that don't fit the new size with this value.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 4553

def unstack(step:, how: "vertical", columns: nil, fill_values: nil)
  if !columns.nil?
    df = select(columns)
  else
    df = self
  end

  height = df.height
  if how == "vertical"
    n_rows = step
    n_cols = (height / n_rows.to_f).ceil
  else
    n_cols = step
    n_rows = (height / n_cols.to_f).ceil
  end

  n_fill = n_cols * n_rows - height

  if n_fill > 0
    if !fill_values.is_a?(::Array)
      fill_values = [fill_values] * df.width
    end

    df = df.select(
      df.get_columns.zip(fill_values).map do |s, next_fill|
        s.extend_constant(next_fill, n_fill)
      end
    )
  end

  if how == "horizontal"
    df = (
      df.with_columns(
        (Polars.arange(0, n_cols * n_rows, eager: true) % n_cols).alias(
          "__sort_order"
        )
      )
      .sort("__sort_order")
      .drop("__sort_order")
    )
  end

  zfill_val = Math.log10(n_cols).floor + 1
  slices =
    df.get_columns.flat_map do |s|
      n_cols.times.map do |slice_nbr|
        s.slice(slice_nbr * n_rows, n_rows).alias("%s_%0#{zfill_val}d" % [s.name, slice_nbr])
      end
    end

  _from_rbdf(DataFrame.new(slices)._df)
end
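
The shape arithmetic at the top of the method body can be isolated in a few lines of plain Ruby, which makes it easier to see when `fill_values` comes into play. `unstack_dims` is a hypothetical helper that mirrors the source above.

```ruby
# Mirrors the n_rows / n_cols / n_fill computation in the method body.
def unstack_dims(height, step, how: "vertical")
  if how == "vertical"
    n_rows = step
    n_cols = (height / n_rows.to_f).ceil
  else
    n_cols = step
    n_rows = (height / n_cols.to_f).ceil
  end
  { n_rows: n_rows, n_cols: n_cols, n_fill: n_cols * n_rows - height }
end

unstack_dims(9, 3)                      # 3 rows, 3 slices per column, no fill
unstack_dims(10, 3)                     # 3 rows, 4 slices per column, 2 filled cells
unstack_dims(10, 4, how: "horizontal")  # 3 rows, 4 slices per column, 2 filled cells
```

Whenever `n_fill` is positive, each column is padded with the corresponding entry of `fill_values` before it is cut into slices.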

#update(other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left") ⇒ DataFrame

Note:

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

Note:

This is syntactic sugar for a left/inner join that preserves the order of the left DataFrame by default, with an optional coalesce when include_nulls: false.

Update the values in this DataFrame with the values in other.

Examples:

Update df values with the non-null values in new_df, by row index:

df = Polars::DataFrame.new(
  {
    "A" => [1, 2, 3, 4],
    "B" => [400, 500, 600, 700]
  }
)
new_df = Polars::DataFrame.new(
  {
    "B" => [-66, nil, -99],
    "C" => [5, 3, 1]
  }
)
df.update(new_df)
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ A   ┆ B   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ -66 │
# │ 2   ┆ 500 │
# │ 3   ┆ -99 │
# │ 4   ┆ 700 │
# └─────┴─────┘

Update df values with the non-null values in new_df, by row index, but only keeping those rows that are common to both frames:

df.update(new_df, how: "inner")
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ A   ┆ B   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ -66 │
# │ 2   ┆ 500 │
# │ 3   ┆ -99 │
# └─────┴─────┘

Update df values with the non-null values in new_df, using a full outer join strategy that defines explicit join columns in each frame:

df.update(new_df, left_on: ["A"], right_on: ["C"], how: "full")
# =>
# shape: (5, 2)
# ┌─────┬─────┐
# │ A   ┆ B   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ -99 │
# │ 2   ┆ 500 │
# │ 3   ┆ 600 │
# │ 4   ┆ 700 │
# │ 5   ┆ -66 │
# └─────┴─────┘

Update df values including null values in new_df, using a full outer join strategy that defines explicit join columns in each frame:

df.update(new_df, left_on: "A", right_on: "C", how: "full", include_nulls: true)
# =>
# shape: (5, 2)
# ┌─────┬──────┐
# │ A   ┆ B    │
# │ --- ┆ ---  │
# │ i64 ┆ i64  │
# ╞═════╪══════╡
# │ 1   ┆ -99  │
# │ 2   ┆ 500  │
# │ 3   ┆ null │
# │ 4   ┆ 700  │
# │ 5   ┆ -66  │
# └─────┴──────┘

Parameters:

  • other (DataFrame)

    DataFrame that will be used to update the values.

  • on (Object) (defaults to: nil)

    Column names that will be joined on. If set to nil (default), the implicit row index of each frame is used as a join key.

  • how ('left', 'inner', 'full') (defaults to: "left")

    • 'left' will keep all rows from the left table; rows may be duplicated if multiple rows in the right frame match the left row's key.
    • 'inner' keeps only those rows where the key exists in both frames.
    • 'full' will update existing rows where the key matches while also adding any new rows contained in the given frame.

  • left_on (Object) (defaults to: nil)

    Join column(s) of the left DataFrame.

  • right_on (Object) (defaults to: nil)

    Join column(s) of the right DataFrame.

  • include_nulls (Boolean) (defaults to: false)

    Overwrite values in the left frame with null values from the right frame. If set to false (default), null values in the right frame are ignored.

  • maintain_order ('none', 'left', 'right', 'left_right', 'right_left') (defaults to: "left")

    Which order of rows from the inputs to preserve. See DataFrame.join for details. Unlike join this function preserves the left order by default.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 6517

def update(
  other,
  on: nil,
  how: "left",
  left_on: nil,
  right_on: nil,
  include_nulls: false,
  maintain_order: "left"
)
  Utils.require_same_type(self, other)
  lazy
  .update(
    other.lazy,
    on: on,
    how: how,
    left_on: left_on,
    right_on: right_on,
    include_nulls: include_nulls,
    maintain_order: maintain_order
  )
  .collect(optimizations: QueryOptFlags._eager)
end
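
For the default case (left strategy, rows matched by position), the effect of `include_nulls` on one column can be sketched in plain Ruby. `update_column` is a hypothetical helper and covers only this simple case, not joins on key columns.

```ruby
# Hypothetical helper: overwrite left values with right values by position.
def update_column(left, right, include_nulls: false)
  left.each_with_index.map do |value, i|
    next value if i >= right.length # no matching row in `right`

    replacement = right[i]
    replacement.nil? && !include_nulls ? value : replacement
  end
end

left_b  = [400, 500, 600, 700]
right_b = [-66, nil, -99]

update_column(left_b, right_b)                       # => [-66, 500, -99, 700]
update_column(left_b, right_b, include_nulls: true)  # => [-66, nil, -99, 700]
```

The first result matches column B in the first example above.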

#upsample(time_column:, every:, group_by: nil, maintain_order: false) ⇒ DataFrame

Upsample a DataFrame at a regular frequency.

The every argument is written in the following duration string language:

  • 1ns (1 nanosecond)
  • 1us (1 microsecond)
  • 1ms (1 millisecond)
  • 1s (1 second)
  • 1m (1 minute)
  • 1h (1 hour)
  • 1d (1 day)
  • 1w (1 week)
  • 1mo (1 calendar month)
  • 1y (1 calendar year)
  • 1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
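
To make the string language concrete, the sketch below splits a duration string into (count, unit) pairs in plain Ruby. `split_duration` is a hypothetical helper for illustration; it is not the parser Polars uses and does not handle signs or whitespace.

```ruby
# Two-letter units come first so that "ms" and "mo" are not read as "m".
DURATION_UNITS = %w[ns us ms mo s m h d w y i].freeze
DURATION_TOKEN = /(\d+)(#{DURATION_UNITS.join("|")})/

# Hypothetical helper: "3d12h" -> [[3, "d"], [12, "h"]].
def split_duration(text)
  tokens = text.scan(DURATION_TOKEN)
  # Reject input with characters the tokens do not account for.
  raise ArgumentError, "invalid duration: #{text}" unless tokens.map(&:join).join == text

  tokens.map { |count, unit| [count.to_i, unit] }
end

split_duration("3d12h4m25s")  # => [[3, "d"], [12, "h"], [4, "m"], [25, "s"]]
split_duration("1mo")         # => [[1, "mo"]]
```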

Examples:

Upsample a DataFrame by a certain interval.

df = Polars::DataFrame.new(
  {
    "time" => [
      DateTime.new(2021, 2, 1),
      DateTime.new(2021, 4, 1),
      DateTime.new(2021, 5, 1),
      DateTime.new(2021, 6, 1)
    ],
    "groups" => ["A", "B", "A", "B"],
    "values" => [0, 1, 2, 3]
  }
).set_sorted("time")
df.upsample(
  time_column: "time", every: "1mo", group_by: "groups", maintain_order: true
).select(Polars.all.forward_fill)
# =>
# shape: (7, 3)
# ┌─────────────────────┬────────┬────────┐
# │ time                ┆ groups ┆ values │
# │ ---                 ┆ ---    ┆ ---    │
# │ datetime[ns]        ┆ str    ┆ i64    │
# ╞═════════════════════╪════════╪════════╡
# │ 2021-02-01 00:00:00 ┆ A      ┆ 0      │
# │ 2021-03-01 00:00:00 ┆ A      ┆ 0      │
# │ 2021-04-01 00:00:00 ┆ A      ┆ 0      │
# │ 2021-05-01 00:00:00 ┆ A      ┆ 2      │
# │ 2021-04-01 00:00:00 ┆ B      ┆ 1      │
# │ 2021-05-01 00:00:00 ┆ B      ┆ 1      │
# │ 2021-06-01 00:00:00 ┆ B      ┆ 3      │
# └─────────────────────┴────────┴────────┘

Parameters:

  • time_column (Object)

    Time column used to determine the date range. This column must be sorted for the output to make sense.

  • every (String)

    Interval between consecutive rows in the upsampled output, given as a duration string.

  • group_by (Object) (defaults to: nil)

    First group by these columns, then upsample each group.

  • maintain_order (Boolean) (defaults to: false)

    Keep the ordering predictable. This is slower.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 3226

def upsample(
  time_column:,
  every:,
  group_by: nil,
  maintain_order: false
)
  if group_by.nil?
    group_by = []
  end
  if group_by.is_a?(::String)
    group_by = [group_by]
  end

  every = Utils.parse_as_duration_string(every)

  _from_rbdf(
    _df.upsample(group_by, time_column, every, maintain_order)
  )
end

#var(ddof: 1) ⇒ DataFrame

Aggregate the columns of this DataFrame to their variance value.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.var
# =>
# shape: (1, 3)
# ┌─────┬─────┬──────┐
# │ foo ┆ bar ┆ ham  │
# │ --- ┆ --- ┆ ---  │
# │ f64 ┆ f64 ┆ str  │
# ╞═════╪═════╪══════╡
# │ 1.0 ┆ 1.0 ┆ null │
# └─────┴─────┴──────┘
df.var(ddof: 0)
# =>
# shape: (1, 3)
# ┌──────────┬──────────┬──────┐
# │ foo      ┆ bar      ┆ ham  │
# │ ---      ┆ ---      ┆ ---  │
# │ f64      ┆ f64      ┆ str  │
# ╞══════════╪══════════╪══════╡
# │ 0.666667 ┆ 0.666667 ┆ null │
# └──────────┴──────────┴──────┘

Parameters:

  • ddof (Integer) (defaults to: 1)

    Delta degrees of freedom: the divisor used in the calculation is N - ddof, where N is the number of elements.
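The two results in the example above can be checked by hand. The sketch below is plain Ruby, not the Polars implementation, and assumes the usual convention that the sum of squared deviations is divided by N - ddof. `variance` is a hypothetical helper written for this example.

```ruby
# Hypothetical helper for illustration only; Polars computes this natively.
def variance(values, ddof: 1)
  n = values.length
  mean = values.sum.to_f / n
  squared_deviations = values.sum { |v| (v - mean)**2 }
  squared_deviations / (n - ddof)
end

puts variance([1, 2, 3])                   # divisor is 3 - 1 = 2
# 1.0
puts variance([1, 2, 3], ddof: 0).round(6) # divisor is 3 - 0 = 3
# 0.666667
```

With ddof: 1 the result is the sample variance; with ddof: 0 it is the population variance.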

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 5390

def var(ddof: 1)
  lazy.var(ddof: ddof).collect(optimizations: QueryOptFlags._eager)
end

#vstack(other, in_place: false) ⇒ DataFrame

Grow this DataFrame vertically by stacking a DataFrame to it.

Examples:

df1 = Polars::DataFrame.new(
  {
    "foo" => [1, 2],
    "bar" => [6, 7],
    "ham" => ["a", "b"]
  }
)
df2 = Polars::DataFrame.new(
  {
    "foo" => [3, 4],
    "bar" => [8, 9],
    "ham" => ["c", "d"]
  }
)
df1.vstack(df2)
# =>
# shape: (4, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# │ 3   ┆ 8   ┆ c   │
# │ 4   ┆ 9   ┆ d   │
# └─────┴─────┴─────┘

Parameters:

  • other (DataFrame)

    DataFrame to stack.

  • in_place (Boolean) (defaults to: false)

    If true, modify this DataFrame in place and return it instead of returning a new DataFrame.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 3797

def vstack(other, in_place: false)
  if in_place
    _df.vstack_mut(other._df)
    self
  else
    _from_rbdf(_df.vstack(other._df))
  end
end

#width ⇒ Integer

Get the width of the DataFrame.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 2, 3, 4, 5]})
df.width
# => 1

Returns:

  • (Integer)


# File 'lib/polars/data_frame.rb', line 191

def width
  _df.width
end

#with_columns(*exprs, **named_exprs) ⇒ DataFrame

Add columns to this DataFrame.

Added columns will replace existing columns with the same name.

Examples:

Pass an expression to add it as a new column.

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
)
df.with_columns((Polars.col("a") ** 2).alias("a^2"))
# =>
# shape: (4, 4)
# ┌─────┬──────┬───────┬─────┐
# │ a   ┆ b    ┆ c     ┆ a^2 │
# │ --- ┆ ---  ┆ ---   ┆ --- │
# │ i64 ┆ f64  ┆ bool  ┆ i64 │
# ╞═════╪══════╪═══════╪═════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 1   │
# │ 2   ┆ 4.0  ┆ true  ┆ 4   │
# │ 3   ┆ 10.0 ┆ false ┆ 9   │
# │ 4   ┆ 13.0 ┆ true  ┆ 16  │
# └─────┴──────┴───────┴─────┘

Added columns will replace existing columns with the same name.

df.with_columns(Polars.col("a").cast(Polars::Float64))
# =>
# shape: (4, 3)
# ┌─────┬──────┬───────┐
# │ a   ┆ b    ┆ c     │
# │ --- ┆ ---  ┆ ---   │
# │ f64 ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╡
# │ 1.0 ┆ 0.5  ┆ true  │
# │ 2.0 ┆ 4.0  ┆ true  │
# │ 3.0 ┆ 10.0 ┆ false │
# │ 4.0 ┆ 13.0 ┆ true  │
# └─────┴──────┴───────┘

Multiple columns can be added by passing a list of expressions.

df.with_columns(
  [
    (Polars.col("a") ** 2).alias("a^2"),
    (Polars.col("b") / 2).alias("b/2"),
    (Polars.col("c").not_).alias("not c"),
  ]
)
# =>
# shape: (4, 6)
# ┌─────┬──────┬───────┬─────┬──────┬───────┐
# │ a   ┆ b    ┆ c     ┆ a^2 ┆ b/2  ┆ not c │
# │ --- ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ ---   │
# │ i64 ┆ f64  ┆ bool  ┆ i64 ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╪═════╪══════╪═══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 1   ┆ 0.25 ┆ false │
# │ 2   ┆ 4.0  ┆ true  ┆ 4   ┆ 2.0  ┆ false │
# │ 3   ┆ 10.0 ┆ false ┆ 9   ┆ 5.0  ┆ true  │
# │ 4   ┆ 13.0 ┆ true  ┆ 16  ┆ 6.5  ┆ false │
# └─────┴──────┴───────┴─────┴──────┴───────┘

Multiple columns can also be added using positional arguments instead of a list.

df.with_columns(
  (Polars.col("a") ** 2).alias("a^2"),
  (Polars.col("b") / 2).alias("b/2"),
  (Polars.col("c").not_).alias("not c"),
)
# =>
# shape: (4, 6)
# ┌─────┬──────┬───────┬─────┬──────┬───────┐
# │ a   ┆ b    ┆ c     ┆ a^2 ┆ b/2  ┆ not c │
# │ --- ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ ---   │
# │ i64 ┆ f64  ┆ bool  ┆ i64 ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╪═════╪══════╪═══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 1   ┆ 0.25 ┆ false │
# │ 2   ┆ 4.0  ┆ true  ┆ 4   ┆ 2.0  ┆ false │
# │ 3   ┆ 10.0 ┆ false ┆ 9   ┆ 5.0  ┆ true  │
# │ 4   ┆ 13.0 ┆ true  ┆ 16  ┆ 6.5  ┆ false │
# └─────┴──────┴───────┴─────┴──────┴───────┘

Use keyword arguments to easily name your expression inputs.

df.with_columns(
  ab: Polars.col("a") * Polars.col("b"),
  not_c: Polars.col("c").not_
)
# =>
# shape: (4, 5)
# ┌─────┬──────┬───────┬──────┬───────┐
# │ a   ┆ b    ┆ c     ┆ ab   ┆ not_c │
# │ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---   │
# │ i64 ┆ f64  ┆ bool  ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╪══════╪═══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 0.5  ┆ false │
# │ 2   ┆ 4.0  ┆ true  ┆ 8.0  ┆ false │
# │ 3   ┆ 10.0 ┆ false ┆ 30.0 ┆ true  │
# │ 4   ┆ 13.0 ┆ true  ┆ 52.0 ┆ false │
# └─────┴──────┴───────┴──────┴───────┘

Parameters:

  • exprs (Array)

    Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.

  • named_exprs (Hash)

    Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 5041

def with_columns(*exprs, **named_exprs)
  lazy.with_columns(*exprs, **named_exprs).collect(optimizations: QueryOptFlags._eager)
end

#with_columns_seq(*exprs, **named_exprs) ⇒ DataFrame

Add columns to this DataFrame.

Added columns will replace existing columns with the same name.

This runs all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.

Parameters:

  • exprs (Array)

    Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.

  • named_exprs (Hash)

    Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 5061

def with_columns_seq(
  *exprs,
  **named_exprs
)
  lazy
    .with_columns_seq(*exprs, **named_exprs)
    .collect(optimizations: QueryOptFlags._eager)
end

#with_row_index(name: "index", offset: 0) ⇒ DataFrame

Add a column at index 0 that counts the rows.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
)
df.with_row_index
# =>
# shape: (3, 3)
# ┌───────┬─────┬─────┐
# │ index ┆ a   ┆ b   │
# │ ---   ┆ --- ┆ --- │
# │ u32   ┆ i64 ┆ i64 │
# ╞═══════╪═════╪═════╡
# │ 0     ┆ 1   ┆ 2   │
# │ 1     ┆ 3   ┆ 4   │
# │ 2     ┆ 5   ┆ 6   │
# └───────┴─────┴─────┘

Parameters:

  • name (String) (defaults to: "index")

    Name of the column to add.

  • offset (Integer) (defaults to: 0)

    Start the row count at this offset.

Returns:

  • (DataFrame)

# File 'lib/polars/data_frame.rb', line 2740

def with_row_index(name: "index", offset: 0)
  _from_rbdf(_df.with_row_index(name, offset))
end

#write_avro(file, compression = "uncompressed", name: "") ⇒ nil

Write to Apache Avro file.

Parameters:

  • file (String)

    File path to which the file should be written.

  • compression ("uncompressed", "snappy", "deflate") (defaults to: "uncompressed")

    Compression method. Defaults to "uncompressed".

  • name (String) (defaults to: "")

    Schema name. Defaults to empty string.

Returns:

  • (nil)


# File 'lib/polars/data_frame.rb', line 1095

def write_avro(file, compression = "uncompressed", name: "")
  if compression.nil?
    compression = "uncompressed"
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  if name.nil?
    name = ""
  end

  _df.write_avro(file, compression, name)
end

#write_csv(file = nil, include_bom: false, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, decimal_comma: false, null_value: nil, quote_style: nil, storage_options: nil, credential_provider: "auto", retries: 2) ⇒ String?

Write to comma-separated values (CSV) file.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3, 4, 5],
    "bar" => [6, 7, 8, 9, 10],
    "ham" => ["a", "b", "c", "d", "e"]
  }
)
df.write_csv("file.csv")

Parameters:

  • file (String, nil) (defaults to: nil)

    File path to which the result should be written. If set to nil (default), the output is returned as a string instead.

  • include_header (Boolean) (defaults to: true)

    Whether to include header in the CSV output.

  • separator (String) (defaults to: ",")

    Separate CSV fields with this symbol.

  • quote_char (String) (defaults to: '"')

    Byte to use as quoting character.

  • batch_size (Integer) (defaults to: 1024)

    Number of rows that will be processed per thread.

  • datetime_format (String, nil) (defaults to: nil)

    A format string, with the specifiers defined by the chrono Rust crate. If no format is specified, the default fractional-second precision is inferred from the maximum time unit found in the frame's Datetime columns (if any).

  • date_format (String, nil) (defaults to: nil)

    A format string, with the specifiers defined by the chrono Rust crate.

  • time_format (String, nil) (defaults to: nil)

    A format string, with the specifiers defined by the chrono Rust crate.

  • float_precision (Integer, nil) (defaults to: nil)

    Number of decimal places to write, applied to both Float32 and Float64 datatypes.

  • null_value (String, nil) (defaults to: nil)

    A string representing null values (defaulting to the empty string).

Returns:

  • (String, nil)

# File 'lib/polars/data_frame.rb', line 999

def write_csv(
  file = nil,
  include_bom: false,
  include_header: true,
  separator: ",",
  line_terminator: "\n",
  quote_char: '"',
  batch_size: 1024,
  datetime_format: nil,
  date_format: nil,
  time_format: nil,
  float_scientific: nil,
  float_precision: nil,
  decimal_comma: false,
  null_value: nil,
  quote_style: nil,
  storage_options: nil,
  credential_provider: "auto",
  retries: 2
)
  Utils._check_arg_is_1byte("separator", separator, false)
  Utils._check_arg_is_1byte("quote_char", quote_char, true)
  if null_value == ""
    null_value = nil
  end

  if file.nil?
    buffer = StringIO.new
    buffer.set_encoding(Encoding::BINARY)
    lazy.sink_csv(
      buffer,
      include_bom: include_bom,
      include_header: include_header,
      separator: separator,
      line_terminator: line_terminator,
      quote_char: quote_char,
      batch_size: batch_size,
      datetime_format: datetime_format,
      date_format: date_format,
      time_format: time_format,
      float_scientific: float_scientific,
      float_precision: float_precision,
      decimal_comma: decimal_comma,
      null_value: null_value,
      quote_style: quote_style,
      storage_options: storage_options,
      credential_provider: credential_provider,
      retries: retries
    )
    return buffer.string.force_encoding(Encoding::UTF_8)
  end

  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  lazy.sink_csv(
    file,
    include_bom: include_bom,
    include_header: include_header,
    separator: separator,
    line_terminator: line_terminator,
    quote_char: quote_char,
    batch_size: batch_size,
    datetime_format: datetime_format,
    date_format: date_format,
    time_format: time_format,
    float_scientific: float_scientific,
    float_precision: float_precision,
    decimal_comma: decimal_comma,
    null_value: null_value,
    quote_style: quote_style,
    storage_options: storage_options,
    credential_provider: credential_provider,
    retries: retries
  )
  nil
end
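The return-value contract of the method above (a UTF-8 string when file is nil, otherwise nil after writing to the given path) can be mimicked in plain Ruby. `write_lines` is a hypothetical stand-in that writes pre-formatted lines instead of a DataFrame; it borrows only the binary StringIO buffer pattern from the source and none of the Polars calls.

```ruby
require "stringio"

# Hypothetical helper for illustration only; not part of Polars.
# Writes each line followed by a newline. Returns the text when no file
# is given, otherwise writes to the path and returns nil.
def write_lines(lines, file = nil)
  if file.nil?
    buffer = StringIO.new
    buffer.set_encoding(Encoding::BINARY)
    lines.each { |line| buffer << line << "\n" }
    # The bytes were collected as binary; relabel them as UTF-8 text.
    return buffer.string.force_encoding(Encoding::UTF_8)
  end

  File.open(file, "wb") do |f|
    lines.each { |line| f << line << "\n" }
  end
  nil
end

text = write_lines(["foo,bar", "1,6"])
puts text.encoding
# UTF-8
```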

#write_database(table_name, connection = nil, if_table_exists: "fail") ⇒ Integer

Note:

This functionality is experimental. It may be changed at any point without it being considered a breaking change.

Write the data in a Polars DataFrame to a database.

Parameters:

  • table_name (String)

    Schema-qualified name of the table to create or append to in the target SQL database.

  • connection (Object) (defaults to: nil)

    An existing Active Record connection against the target database.

  • if_table_exists ('append', 'replace', 'fail') (defaults to: "fail")

    The insert mode:

    • 'replace' will create a new database table, overwriting an existing one.
    • 'append' will append to an existing table.
    • 'fail' will fail if table already exists.
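The three modes can be summarised as a small decision function. The sketch below is plain Ruby and does not touch a database. `plan_write` and its return symbols are hypothetical names for this example, and the logic follows the method source shown below, where a missing table is created in every mode.

```ruby
# Hypothetical helper for illustration only; not part of Polars.
VALID_WRITE_MODES = %w[append replace fail].freeze

# Returns :create_then_insert when the table must be (re)created first,
# or :insert_only when rows are added to the existing table.
def plan_write(table_exists, if_table_exists)
  unless VALID_WRITE_MODES.include?(if_table_exists)
    raise ArgumentError, "if_table_exists must be one of #{VALID_WRITE_MODES.inspect}"
  end
  if table_exists && if_table_exists == "fail"
    raise ArgumentError, "Table already exists"
  end

  create_table = !table_exists || if_table_exists == "replace"
  create_table ? :create_then_insert : :insert_only
end

p plan_write(true, "append")
# => :insert_only
p plan_write(true, "replace")
# => :create_then_insert
```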

Returns:

  • (Integer)


# File 'lib/polars/data_frame.rb', line 1300

def write_database(table_name, connection = nil, if_table_exists: "fail")
  if !defined?(ActiveRecord)
    raise Error, "Active Record not available"
  elsif ActiveRecord::VERSION::MAJOR < 7
    raise Error, "Requires Active Record 7+"
  end

  valid_write_modes = ["append", "replace", "fail"]
  if !valid_write_modes.include?(if_table_exists)
    msg = "write_database `if_table_exists` must be one of #{valid_write_modes.inspect}, got #{if_table_exists.inspect}"
    raise ArgumentError, msg
  end

  with_connection(connection) do |connection|
    table_exists = connection.table_exists?(table_name)
    if table_exists && if_table_exists == "fail"
      raise ArgumentError, "Table already exists"
    end

    create_table = !table_exists || if_table_exists == "replace"
    maybe_transaction(connection, create_table) do
      if create_table
        mysql = connection.adapter_name.match?(/mysql|trilogy/i)
        force = if_table_exists == "replace"
        connection.create_table(table_name, id: false, force: force) do |t|
          schema.each do |c, dtype|
            options = {}
            column_type =
              case dtype
              when Binary
                :binary
              when Boolean
                :boolean
              when Date
                :date
              when Datetime
                :datetime
              when Decimal
                if mysql
                  options[:precision] = dtype.precision || 65
                  options[:scale] = dtype.scale || 30
                end
                :decimal
              when Float32
                options[:limit] = 24
                :float
              when Float64
                options[:limit] = 53
                :float
              when Int8
                options[:limit] = 1
                :integer
              when Int16
                options[:limit] = 2
                :integer
              when Int32
                options[:limit] = 4
                :integer
              when Int64
                options[:limit] = 8
                :integer
              when UInt8
                if mysql
                  options[:limit] = 1
                  options[:unsigned] = true
                else
                  options[:limit] = 2
                end
                :integer
              when UInt16
                if mysql
                  options[:limit] = 2
                  options[:unsigned] = true
                else
                  options[:limit] = 4
                end
                :integer
              when UInt32
                if mysql
                  options[:limit] = 4
                  options[:unsigned] = true
                else
                  options[:limit] = 8
                end
                :integer
              when UInt64
                if mysql
                  options[:limit] = 8
                  options[:unsigned] = true
                  :integer
                else
                  options[:precision] = 20
                  options[:scale] = 0
                  :decimal
                end
              when String
                :text
              when Time
                :time
              else
                raise ArgumentError, "column type not supported yet: #{dtype}"
              end
            t.column c, column_type, **options
          end
        end
      end

      quoted_table = connection.quote_table_name(table_name)
      quoted_columns = columns.map { |c| connection.quote_column_name(c) }
      rows = cast({Polars::UInt64 => Polars::String}).rows(named: false).map { |row| "(#{row.map { |v| connection.quote(v) }.join(", ")})" }
      connection.exec_update("INSERT INTO #{quoted_table} (#{quoted_columns.join(", ")}) VALUES #{rows.join(", ")}")
    end
  end
end
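The integer branches of the case statement above follow one rule: signed types keep their byte width, and unsigned types either use an unsigned column (MySQL-style adapters) or move up to the next larger signed size. The sketch below condenses that rule. `integer_column` is a hypothetical helper for this example, and `bits` is assumed to be 8, 16, 32, or 64.

```ruby
# Hypothetical helper for illustration only; not part of Polars.
# Returns [column_type, options] for an integer dtype of the given width.
def integer_column(bits, unsigned:, mysql:)
  bytes = bits / 8
  if !unsigned
    [:integer, {limit: bytes}]
  elsif mysql
    [:integer, {limit: bytes, unsigned: true}]
  elsif bits < 64
    # The source treats other adapters as lacking unsigned columns,
    # so the next larger signed size is used instead.
    [:integer, {limit: bytes * 2}]
  else
    # An unsigned 64-bit value can exceed a signed 8-byte integer;
    # 2**64 - 1 has 20 decimal digits.
    [:decimal, {precision: 20, scale: 0}]
  end
end

p integer_column(32, unsigned: true, mysql: false)
# => [:integer, {:limit=>8}] (printed as {limit: 8} on Ruby 3.4+)
```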

#write_delta(target, mode: "error", storage_options: nil, delta_write_options: nil, delta_merge_options: nil) ⇒ nil

Write the DataFrame as a Delta Lake table.

Parameters:

  • target (Object)

    URI of a table or a DeltaTable object.

  • mode ("error", "append", "overwrite", "ignore", "merge") (defaults to: "error")

    How to handle existing data.

  • storage_options (Hash) (defaults to: nil)

    Extra options for the storage backends supported by deltalake-rb.

  • delta_write_options (Hash) (defaults to: nil)

    Additional keyword arguments to pass when writing a Delta Lake table.

  • delta_merge_options (Hash) (defaults to: nil)

    Keyword arguments required to MERGE into a Delta Lake table. Must be given, with a :predicate entry, when mode is "merge".

Returns:

  • (nil)


# File 'lib/polars/data_frame.rb', line 1463

def write_delta(
  target,
  mode: "error",
  storage_options: nil,
  delta_write_options: nil,
  delta_merge_options: nil
)
  Polars.send(:_check_if_delta_available)

  if Utils.pathlike?(target)
    target = Polars.send(:_resolve_delta_lake_uri, target.to_s, strict: false)
  end

  data = self

  if mode == "merge"
    if delta_merge_options.nil?
      msg = "You need to pass delta_merge_options with at least a given predicate for `MERGE` to work."
      raise ArgumentError, msg
    end
    if target.is_a?(::String)
      dt = DeltaLake::Table.new(target, storage_options: storage_options)
    else
      dt = target
    end

    predicate = delta_merge_options.delete(:predicate)
    dt.merge(data, predicate, **delta_merge_options)
  else
    delta_write_options ||= {}

    DeltaLake.write(
      target,
      data,
      mode: mode,
      storage_options: storage_options,
      **delta_write_options
    )
  end
end

#write_iceberg(target, mode:) ⇒ nil

Note:

This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.

Write DataFrame to an Iceberg table.

Parameters:

  • target (Object)

    Name of the table or the Table object representing an Iceberg table.

  • mode ('append', 'overwrite')

    How to handle existing data.

    • If 'append', will add new data.
    • If 'overwrite', will replace table with new data.

Returns:

  • (nil)


# File 'lib/polars/data_frame.rb', line 1430

def write_iceberg(target, mode:)
  require "iceberg"

  table =
    if target.is_a?(Iceberg::Table)
      target
    else
      raise Todo
    end

  data = self

  if mode == "append"
    table.append(data)
  else
    raise Todo
  end
end

#write_ipc(file, compression: "uncompressed", compat_level: nil, storage_options: nil, credential_provider: "auto", retries: 2) ⇒ nil

Write to Arrow IPC binary stream or Feather file.

Parameters:

  • file (String)

    File path to which the file should be written.

  • compression ("uncompressed", "lz4", "zstd") (defaults to: "uncompressed")

    Compression method. Defaults to "uncompressed".

  • compat_level (Object) (defaults to: nil)

    Use a specific compatibility level when exporting Polars' internal data structures.

  • storage_options (Hash) (defaults to: nil)

    Options that indicate how to connect to a cloud provider.

    The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

    • aws
    • gcp
    • azure
    • Hugging Face (hf://): Accepts an API key under the token parameter: {'token': '...'}, or by setting the HF_TOKEN environment variable.

    If storage_options is not provided, Polars will try to infer the information from environment variables.

  • credential_provider (Object) (defaults to: "auto")

    Provide a function that can be called to provide cloud storage credentials. The function is expected to return a hash of credential keys along with an optional credential expiry time.

  • retries (Integer) (defaults to: 2)

    Number of retries if accessing a cloud instance fails.

Returns:

  • (nil)


# File 'lib/polars/data_frame.rb', line 1139

def write_ipc(
  file,
  compression: "uncompressed",
  compat_level: nil,
  storage_options: nil,
  credential_provider: "auto",
  retries: 2
)
  return_bytes = file.nil?
  target = nil
  if file.nil?
    target = StringIO.new
    target.set_encoding(Encoding::BINARY)
  else
    target = file
  end

  lazy.sink_ipc(
    target,
    compression: compression,
    compat_level: compat_level,
    storage_options: storage_options,
    credential_provider: credential_provider,
    retries: retries
  )
  return_bytes ? target.string : nil
end

#write_ipc_stream(file, compression: "uncompressed", compat_level: nil) ⇒ Object

Write to Arrow IPC record batch stream.

See "Streaming format" in https://arrow.apache.org/docs/python/ipc.html.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3, 4, 5],
    "bar" => [6, 7, 8, 9, 10],
    "ham" => ["a", "b", "c", "d", "e"]
  }
)
df.write_ipc_stream("new_file.arrow")

Parameters:

  • file (Object)

    Path or writable file-like object to which the IPC record batch data will be written. If set to nil, the output is returned as a binary string.

  • compression ('uncompressed', 'lz4', 'zstd') (defaults to: "uncompressed")

    Compression method. Defaults to "uncompressed".

  • compat_level (Object) (defaults to: nil)

    Use a specific compatibility level when exporting Polars' internal data structures.

Returns:

  • (Object)

# File 'lib/polars/data_frame.rb', line 1191

def write_ipc_stream(
  file,
  compression: "uncompressed",
  compat_level: nil
)
  return_bytes = file.nil?
  if return_bytes
    file = StringIO.new
    file.set_encoding(Encoding::BINARY)
  elsif Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  if compat_level.nil?
    compat_level = true
  end

  if compression.nil?
    compression = "uncompressed"
  end

  _df.write_ipc_stream(file, compression, compat_level)
  return_bytes ? file.string : nil
end

#write_json(file = nil) ⇒ String?

Serialize to JSON representation.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8]
  }
)
df.write_json
# => "[{\"foo\":1,\"bar\":6},{\"foo\":2,\"bar\":7},{\"foo\":3,\"bar\":8}]"

Parameters:

  • file (String, nil) (defaults to: nil)

    File path to which the result should be written. If set to nil (default), the output is returned as a string instead.

Returns:

  • (String, nil)


# File 'lib/polars/data_frame.rb', line 892

def write_json(file = nil)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  to_string_io = !file.nil? && file.is_a?(StringIO)
  if file.nil? || to_string_io
    buf = StringIO.new
    buf.set_encoding(Encoding::BINARY)
    _df.write_json(buf)
    json_bytes = buf.string

    json_str = json_bytes.force_encoding(Encoding::UTF_8)
    if to_string_io
      file.write(json_str)
    else
      return json_str
    end
  else
    _df.write_json(file)
  end
  nil
end

#write_ndjson(file = nil) ⇒ String?

Serialize to newline delimited JSON representation.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8]
  }
)
df.write_ndjson
# => "{\"foo\":1,\"bar\":6}\n{\"foo\":2,\"bar\":7}\n{\"foo\":3,\"bar\":8}\n"
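The difference between the JSON and newline-delimited JSON layouts can be reproduced with Ruby's standard json library. The sketch below builds both strings from plain hashes; it illustrates the two layouts only and does not call the Polars serializer.

```ruby
require "json"

# The same three rows as in the example above, as plain Ruby hashes.
rows = [
  {"foo" => 1, "bar" => 6},
  {"foo" => 2, "bar" => 7},
  {"foo" => 3, "bar" => 8}
]

# JSON: one array containing every row.
json = JSON.generate(rows)

# NDJSON: one JSON object per line, each terminated by a newline.
ndjson = rows.map { |row| JSON.generate(row) + "\n" }.join

puts json
# [{"foo":1,"bar":6},{"foo":2,"bar":7},{"foo":3,"bar":8}]
puts ndjson
# {"foo":1,"bar":6}
# {"foo":2,"bar":7}
# {"foo":3,"bar":8}
```

Because every NDJSON line is a complete document, a reader can parse the output one line at a time without loading the whole file.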

Parameters:

  • file (String, nil) (defaults to: nil)

    File path to which the result should be written. If set to nil (default), the output is returned as a string instead.

Returns:

  • (String, nil)


# File 'lib/polars/data_frame.rb', line 931

def write_ndjson(file = nil)
  should_return_buffer = false
  target = nil
  if file.nil?
    target = StringIO.new
    target.set_encoding(Encoding::BINARY)
    should_return_buffer = true
  elsif Utils.pathlike?(file)
    target = Utils.normalize_filepath(file)
  else
    target = file
  end

  lazy.sink_ndjson(
    target
  )

  if should_return_buffer
    return target.string.force_encoding(Encoding::UTF_8)
  end

  nil
end

#write_parquet(file, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_page_size: nil, partition_by: nil, partition_chunk_size_bytes: 4_294_967_296, storage_options: nil, credential_provider: "auto", retries: 2, metadata: nil, mkdir: false) ⇒ nil

Write to Apache Parquet file.

Parameters:

  • file (String, Pathname, StringIO)

    File path to which the file should be written.

  • compression ("lz4", "uncompressed", "snappy", "gzip", "lzo", "brotli", "zstd") (defaults to: "zstd")

    Choose "zstd" for good compression performance. Choose "lz4" for fast compression/decompression. Choose "snappy" for more backwards compatibility guarantees when you deal with older parquet readers.

  • compression_level (Integer, nil) (defaults to: nil)

    The level of compression to use. Higher compression means smaller files on disk.

    • "gzip" : min-level: 0, max-level: 10.
    • "brotli" : min-level: 0, max-level: 11.
    • "zstd" : min-level: 1, max-level: 22.

  • statistics (Boolean) (defaults to: true)

    Write statistics to the parquet headers. This requires extra compute.

  • row_group_size (Integer, nil) (defaults to: nil)

    Size of the row groups in number of rows. Defaults to 512^2 rows.

  • data_page_size (Integer, nil) (defaults to: nil)

    Size of the data page in bytes. Defaults to 1024^2 bytes.
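The compression_level ranges listed above can be checked before calling the writer. The sketch below is plain Ruby; `valid_compression_level?` and `COMPRESSION_LEVELS` are hypothetical names, and the ranges are copied from the parameter description, not read from Polars. It assumes that nil defers to the writer's default and that codecs without a listed range accept no explicit level.

```ruby
# Hypothetical helper for illustration only; not part of Polars.
# Ranges copied from the compression_level description above.
COMPRESSION_LEVELS = {
  "gzip" => 0..10,
  "brotli" => 0..11,
  "zstd" => 1..22
}.freeze

def valid_compression_level?(compression, level)
  # Assumption: nil means "use the writer's default level".
  return true if level.nil?

  range = COMPRESSION_LEVELS[compression]
  # Assumption: codecs without a listed range take no explicit level.
  !range.nil? && range.cover?(level)
end

p valid_compression_level?("zstd", 22)
# => true
p valid_compression_level?("zstd", 0)
# => false
```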

Returns:

  • (nil)


# File 'lib/polars/data_frame.rb', line 1240

def write_parquet(
  file,
  compression: "zstd",
  compression_level: nil,
  statistics: true,
  row_group_size: nil,
  data_page_size: nil,
  partition_by: nil,
  partition_chunk_size_bytes: 4_294_967_296,
  storage_options: nil,
  credential_provider: "auto",
  retries: 2,
  metadata: nil,
  mkdir: false
)
  if compression.nil?
    compression = "uncompressed"
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  target = file
  if !partition_by.nil?
    raise Todo
  end

  lazy.sink_parquet(
    target,
    compression: compression,
    compression_level: compression_level,
    statistics: statistics,
    row_group_size: row_group_size,
    data_page_size: data_page_size,
    storage_options: storage_options,
    credential_provider: credential_provider,
    retries: retries,
    metadata: metadata,
    mkdir: mkdir
  )
end