Class: Polars::LazyFrame

Inherits: Object

Defined in: lib/polars/lazy_frame.rb

Overview

Representation of a Lazy computation graph/query against a DataFrame.

Class Method Summary

.read_json(file) ⇒ LazyFrame
Read a logical plan from a JSON file to construct a LazyFrame.

Instance Method Summary

#cache ⇒ LazyFrame
Cache the result once the execution of the physical plan hits this node.
#cleared ⇒ LazyFrame
Create an empty copy of the current LazyFrame.
#collect(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, allow_streaming: false, _eager: false) ⇒ DataFrame
Collect into a DataFrame.
#columns ⇒ Array
Get or set column names.
#describe_optimized_plan(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, common_subplan_elimination: true, allow_streaming: false) ⇒ String
Create a string representation of the optimized query plan.
#describe_plan ⇒ String
Create a string representation of the unoptimized query plan.
#drop(columns) ⇒ LazyFrame
Remove one or multiple columns from a DataFrame.
#drop_nulls(subset: nil) ⇒ LazyFrame
Drop rows with null values from this LazyFrame.
#dtypes ⇒ Array
Get dtypes of columns in LazyFrame.
#explode(columns) ⇒ LazyFrame
Explode lists to long format.
#fetch(n_rows = 500, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, allow_streaming: false) ⇒ DataFrame
Collect a small number of rows for debugging purposes.
#fill_nan(fill_value) ⇒ LazyFrame
Fill floating point NaN values.
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil) ⇒ LazyFrame
Fill null values using the specified value or strategy.
#filter(predicate) ⇒ LazyFrame
Filter the rows in the DataFrame based on a predicate expression.
#first ⇒ LazyFrame
Get the first row of the DataFrame.
#group_by(by, maintain_order: false) ⇒ LazyGroupBy (also: #groupby, #group)
Start a group by operation.
#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: nil, include_boundaries: false, closed: "left", label: "left", by: nil, start_by: "window", check_sorted: true) ⇒ LazyGroupBy (also: #groupby_dynamic)
Group based on a time value (or index value of type :i32, :i64).
#group_by_rolling(index_column:, period:, offset: nil, closed: "right", by: nil, check_sorted: true) ⇒ LazyFrame (also: #groupby_rolling)
Create rolling groups based on a time column.
#head(n = 5) ⇒ LazyFrame
Get the first n rows.
#include?(key) ⇒ Boolean
Check if LazyFrame includes key.
#initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ LazyFrame constructor
Create a new LazyFrame.
#interpolate ⇒ LazyFrame
Interpolate intermediate values.
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", allow_parallel: true, force_parallel: false) ⇒ LazyFrame
Add a join operation to the Logical Plan.
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false) ⇒ LazyFrame
Perform an asof join.
#last ⇒ LazyFrame
Get the last row of the DataFrame.
#lazy ⇒ LazyFrame
Return lazy representation, i.e.
#limit(n = 5) ⇒ LazyFrame
Get the first n rows.
#max ⇒ LazyFrame
Aggregate the columns in the DataFrame to their maximum value.
#mean ⇒ LazyFrame
Aggregate the columns in the DataFrame to their mean value.
#median ⇒ LazyFrame
Aggregate the columns in the DataFrame to their median value.
#melt(id_vars: nil, value_vars: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame
Unpivot a DataFrame from wide to long format.
#min ⇒ LazyFrame
Aggregate the columns in the DataFrame to their minimum value.
#pipe(func, *args, **kwargs, &block) ⇒ LazyFrame
Offers a structured way to apply a sequence of user-defined functions (UDFs).
#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame
Aggregate the columns in the DataFrame to their quantile value.
#rename(mapping) ⇒ LazyFrame
Rename column names.
#reverse ⇒ LazyFrame
Reverse the DataFrame.
#schema ⇒ Hash
Get the schema.
#select(exprs) ⇒ LazyFrame
Select columns from this DataFrame.
#set_sorted(column, *more_columns, descending: false) ⇒ LazyFrame
Indicate that one or multiple columns are sorted.
#shift(n, fill_value: nil) ⇒ LazyFrame
Shift the values by a given period.
#shift_and_fill(periods, fill_value) ⇒ LazyFrame
Shift the values by a given period and fill the resulting null values.
#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: false, row_group_size: nil, data_pagesize_limit: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, no_optimization: false, slice_pushdown: true) ⇒ DataFrame
Persists a LazyFrame at the provided path.
#slice(offset, length = nil) ⇒ LazyFrame
Get a slice of this DataFrame.
#sort(by, reverse: false, nulls_last: false, maintain_order: false) ⇒ LazyFrame
Sort the DataFrame.
#std(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their standard deviation value.
#sum ⇒ LazyFrame
Aggregate the columns in the DataFrame to their sum value.
#tail(n = 5) ⇒ LazyFrame
Get the last n rows.
#take_every(n) ⇒ LazyFrame
Take every nth row in the LazyFrame and return as a new LazyFrame.
#to_s ⇒ String
Returns a string representing the LazyFrame.
#unique(maintain_order: true, subset: nil, keep: "first") ⇒ LazyFrame
Drop duplicate rows from this DataFrame.
#unnest(names) ⇒ LazyFrame
Decompose a struct into its fields.
#var(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their variance value.
#width ⇒ Integer
Get the width of the LazyFrame.
#with_column(column) ⇒ LazyFrame
Add or overwrite column in a DataFrame.
#with_columns(exprs) ⇒ LazyFrame
Add or overwrite multiple columns in a DataFrame.
#with_context(other) ⇒ LazyFrame
Add an external context to the computation graph.
#with_row_count(name: "row_nr", offset: 0) ⇒ LazyFrame
Add a column at index 0 that counts the rows.
#write_json(file) ⇒ nil
Write the logical plan of this LazyFrame to a file or string in JSON format.

Constructor Details

#initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ `LazyFrame`

Create a new LazyFrame.

# File 'lib/polars/lazy_frame.rb', line 8

def initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false)
  self._ldf = (
    DataFrame.new(
      data,
      schema: schema,
      schema_overrides: schema_overrides,
      orient: orient,
      infer_schema_length: infer_schema_length,
      nan_to_null: nan_to_null
    )
    .lazy
    ._ldf
  )
end

Class Method Details

.read_json(file) ⇒ `LazyFrame`

Read a logical plan from a JSON file to construct a LazyFrame.

Parameters:

file (String) —
Path to a file or a file-like object.

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 178

def self.read_json(file)
  if Utils.pathlike?(file)
    file = Utils.normalise_filepath(file)
  end

  Utils.wrap_ldf(RbLazyFrame.read_json(file))
end

Instance Method Details

#cache ⇒ `LazyFrame`

Cache the result once the execution of the physical plan hits this node.

Returns:

(LazyFrame)




# File 'lib/polars/lazy_frame.rb', line 698

def cache
  _from_rbldf(_ldf.cache)
end

#cleared ⇒ `LazyFrame`

Create an empty copy of the current LazyFrame.

The copy has an identical schema but no data.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [nil, 2, 3, 4],
    "b" => [0.5, nil, 2.5, 13],
    "c" => [true, true, false, nil],
  }
).lazy
df.cleared.fetch
# =>
# shape: (0, 3)
# ┌─────┬─────┬──────┐
# │ a   ┆ b   ┆ c    │
# │ --- ┆ --- ┆ ---  │
# │ i64 ┆ f64 ┆ bool │
# ╞═════╪═════╪══════╡
# └─────┴─────┴──────┘

Returns:

(LazyFrame)




# File 'lib/polars/lazy_frame.rb', line 725

def cleared
  DataFrame.new(columns: schema).lazy
end

#collect(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, allow_streaming: false, _eager: false) ⇒ `DataFrame`

Collect into a DataFrame.

Note: use #fetch if you want to run your query on the first n rows only. This can be a huge time saver in debugging queries.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
).lazy
df.group_by("a", maintain_order: true).agg(Polars.all.sum).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ a   ┆ 4   ┆ 10  │
# │ b   ┆ 11  ┆ 10  │
# │ c   ┆ 6   ┆ 1   │
# └─────┴─────┴─────┘

Parameters:

type_coercion (Boolean) (defaults to: true) —
Do type coercion optimization.
predicate_pushdown (Boolean) (defaults to: true) —
Do predicate pushdown optimization.
projection_pushdown (Boolean) (defaults to: true) —
Do projection pushdown optimization.
simplify_expression (Boolean) (defaults to: true) —
Run simplify expressions optimization.
string_cache (Boolean) (defaults to: false) —
This argument is deprecated and will be ignored. Please set the string cache globally.
no_optimization (Boolean) (defaults to: false) —
Turn off (certain) optimizations.
slice_pushdown (Boolean) (defaults to: true) —
Slice pushdown optimization.
common_subplan_elimination (Boolean) (defaults to: true) —
Will try to cache branching subplans that occur on self-joins or unions.
allow_streaming (Boolean) (defaults to: false) —
Run parts of the query in a streaming fashion (this is in an alpha state).

Returns:

(DataFrame)

# File 'lib/polars/lazy_frame.rb', line 463

def collect(
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  string_cache: false,
  no_optimization: false,
  slice_pushdown: true,
  common_subplan_elimination: true,
  allow_streaming: false,
  _eager: false
)
  if no_optimization
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
    common_subplan_elimination = false
  end

  if allow_streaming
    common_subplan_elimination = false
  end

  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    allow_streaming,
    _eager
  )
  Utils.wrap_df(ldf.collect)
end
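
The flag handling at the top of #collect can be isolated as a plain-Ruby sketch. The helper name `resolve_collect_flags` is hypothetical and not part of Polars; it only mirrors the branching shown in the source above. It shows why no_optimization is described as turning off "(certain)" optimizations: type coercion and expression simplification are not touched by it.

```ruby
# Hypothetical helper (not a Polars API): mirrors how #collect
# adjusts its optimization flags before building the plan.
def resolve_collect_flags(
  predicate_pushdown: true,
  projection_pushdown: true,
  slice_pushdown: true,
  common_subplan_elimination: true,
  no_optimization: false,
  allow_streaming: false
)
  if no_optimization
    # Only these four flags are switched off.
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
    common_subplan_elimination = false
  end

  # Streaming also disables common subplan elimination.
  common_subplan_elimination = false if allow_streaming

  {
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    slice_pushdown: slice_pushdown,
    common_subplan_elimination: common_subplan_elimination
  }
end

p resolve_collect_flags(no_optimization: true)
p resolve_collect_flags(allow_streaming: true)
```

With allow_streaming: true, the pushdown flags keep their defaults and only common_subplan_elimination changes.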

#columns ⇒ `Array`

Get or set column names.

Examples:

df = (
   Polars::DataFrame.new(
     {
       "foo" => [1, 2, 3],
       "bar" => [6, 7, 8],
       "ham" => ["a", "b", "c"]
     }
   )
   .lazy
   .select(["foo", "bar"])
)
df.columns
# => ["foo", "bar"]

Returns:

(Array)




# File 'lib/polars/lazy_frame.rb', line 204

def columns
  _ldf.columns
end

#describe_optimized_plan(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, common_subplan_elimination: true, allow_streaming: false) ⇒ `String`

Create a string representation of the optimized query plan.

Returns:

(String)

# File 'lib/polars/lazy_frame.rb', line 338

def describe_optimized_plan(
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  common_subplan_elimination: true,
  allow_streaming: false
)
  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    allow_streaming,
    false
  )

  ldf.describe_optimized_plan
end

#describe_plan ⇒ `String`

Create a string representation of the unoptimized query plan.

Returns:

(String)




# File 'lib/polars/lazy_frame.rb', line 331

def describe_plan
  _ldf.describe_plan
end

#drop(columns) ⇒ `LazyFrame`

Remove one or multiple columns from a DataFrame.

Parameters:

columns (Object) —
- Name of the column that should be removed.
- List of column names.

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1724

def drop(columns)
  if columns.is_a?(::String)
    columns = [columns]
  end
  _from_rbldf(_ldf.drop_columns(columns))
end

#drop_nulls(subset: nil) ⇒ `LazyFrame`

Drop rows with null values from this LazyFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, nil, 8],
    "ham" => ["a", "b", "c"]
  }
)
df.lazy.drop_nulls.collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 3   ┆ 8   ┆ c   │
# └─────┴─────┴─────┘

Parameters:

subset (Object) (defaults to: nil) —
Subset of column(s) on which drop_nulls will be applied.

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2310

def drop_nulls(subset: nil)
  if !subset.nil? && !subset.is_a?(::Array)
    subset = [subset]
  end
  _from_rbldf(_ldf.drop_nulls(subset))
end
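
As a rough illustration of the subset behaviour, the following plain-Ruby sketch works on an array of hashes rather than a LazyFrame. The helper `drop_null_rows` is hypothetical; it assumes that a row is dropped when any of the considered columns is nil, and it normalizes a single column name to an array the same way the method above does.

```ruby
# Hypothetical helper (not a Polars API): rows are plain hashes.
def drop_null_rows(rows, subset: nil)
  # Same normalization as #drop_nulls: a single name becomes a one-element array.
  subset = [subset] if !subset.nil? && !subset.is_a?(Array)

  rows.reject do |row|
    columns = subset || row.keys
    columns.any? { |name| row[name].nil? }
  end
end

rows = [
  {"foo" => 1, "bar" => 6,   "ham" => "a"},
  {"foo" => 2, "bar" => nil, "ham" => "b"},
  {"foo" => 3, "bar" => 8,   "ham" => "c"}
]

puts drop_null_rows(rows).length                 # 2
puts drop_null_rows(rows, subset: "foo").length  # 3
```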

#dtypes ⇒ `Array`

Get dtypes of columns in LazyFrame.

Examples:

lf = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
lf.dtypes
# => [Polars::Int64, Polars::Float64, Polars::String]

Returns:

(Array)




# File 'lib/polars/lazy_frame.rb', line 222

def dtypes
  _ldf.dtypes
end

#explode(columns) ⇒ `LazyFrame`

Explode lists to long format.

Examples:

df = Polars::DataFrame.new(
  {
    "letters" => ["a", "a", "b", "c"],
    "numbers" => [[1], [2, 3], [4, 5], [6, 7, 8]],
  }
).lazy
df.explode("numbers").collect
# =>
# shape: (8, 2)
# ┌─────────┬─────────┐
# │ letters ┆ numbers │
# │ ---     ┆ ---     │
# │ str     ┆ i64     │
# ╞═════════╪═════════╡
# │ a       ┆ 1       │
# │ a       ┆ 2       │
# │ a       ┆ 3       │
# │ b       ┆ 4       │
# │ b       ┆ 5       │
# │ c       ┆ 6       │
# │ c       ┆ 7       │
# │ c       ┆ 8       │
# └─────────┴─────────┘

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2258

def explode(columns)
  columns = Utils.selection_to_rbexpr_list(columns)
  _from_rbldf(_ldf.explode(columns))
end
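
The row duplication that explode performs can be sketched in plain Ruby on an array of hashes. `explode_rows` is a hypothetical helper, not a Polars call, and the sketch deliberately ignores empty or nil lists.

```ruby
# Hypothetical helper (not a Polars API): one output row per list element,
# with the other columns repeated.
def explode_rows(rows, column)
  rows.flat_map do |row|
    row[column].map { |value| row.merge(column => value) }
  end
end

rows = [
  {"letters" => "a", "numbers" => [1]},
  {"letters" => "a", "numbers" => [2, 3]},
  {"letters" => "b", "numbers" => [4, 5]},
  {"letters" => "c", "numbers" => [6, 7, 8]}
]

exploded = explode_rows(rows, "numbers")
puts exploded.length                          # 8
p exploded.map { |row| row["letters"] }       # ["a", "a", "a", "b", "b", "c", "c", "c"]
```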

#fetch(n_rows = 500, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, allow_streaming: false) ⇒ `DataFrame`

Collect a small number of rows for debugging purposes.

Fetch is like a #collect operation, but it overwrites the number of rows read by every scan operation. This is a utility that helps debug a query on a smaller number of rows.

Note that the fetch does not guarantee the final number of rows in the DataFrame. Filter, join operations and a lower number of rows available in the scanned file influence the final number of rows.
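
Why the final row count can differ from n_rows is easiest to see in a small plain-Ruby simulation. `fetch_like` is a hypothetical stand-in, assuming the limit is applied where the data is scanned and any filter runs afterwards.

```ruby
# Hypothetical stand-in for the fetch behaviour (not a Polars API):
# the row limit applies at the scan, the predicate runs on what is left.
def fetch_like(source_rows, n_rows, &predicate)
  scanned = source_rows.first(n_rows)
  predicate ? scanned.select(&predicate) : scanned
end

rows = (1..10).to_a

unfiltered = fetch_like(rows, 4)
filtered = fetch_like(rows, 4) { |x| x.even? }
short_source = fetch_like(rows, 500)

puts unfiltered.length    # 4
puts filtered.length      # 2, the filter removed rows after the limit
puts short_source.length  # 10, the source has fewer rows than requested
```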

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
).lazy
df.group_by("a", maintain_order: true).agg(Polars.all.sum).fetch(2)
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ a   ┆ 1   ┆ 6   │
# │ b   ┆ 2   ┆ 5   │
# └─────┴─────┴─────┘

Parameters:

n_rows (Integer) (defaults to: 500) —
Collect n_rows from the data sources.
type_coercion (Boolean) (defaults to: true) —
Run type coercion optimization.
predicate_pushdown (Boolean) (defaults to: true) —
Run predicate pushdown optimization.
projection_pushdown (Boolean) (defaults to: true) —
Run projection pushdown optimization.
simplify_expression (Boolean) (defaults to: true) —
Run simplify expressions optimization.
string_cache (Boolean) (defaults to: false) —
This argument is deprecated and will be ignored. Please set the string cache globally.
no_optimization (Boolean) (defaults to: false) —
Turn off optimizations.
slice_pushdown (Boolean) (defaults to: true) —
Slice pushdown optimization.
common_subplan_elimination (Boolean) (defaults to: true) —
Will try to cache branching subplans that occur on self-joins or unions.
allow_streaming (Boolean) (defaults to: false) —
Run parts of the query in a streaming fashion (this is in an alpha state).

Returns:

(DataFrame)

# File 'lib/polars/lazy_frame.rb', line 643

def fetch(
  n_rows = 500,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  string_cache: false,
  no_optimization: false,
  slice_pushdown: true,
  common_subplan_elimination: true,
  allow_streaming: false
)
  if no_optimization
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
    common_subplan_elimination = false
  end

  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    allow_streaming,
    false
  )
  Utils.wrap_df(ldf.fetch(n_rows))
end

#fill_nan(fill_value) ⇒ `LazyFrame`

Fill floating point NaN values.

Note:

Floating point NaN (Not a Number) values are not missing values! To replace missing values, use #fill_null instead.
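
The distinction between NaN and a missing value also exists in plain Ruby, where Float::NAN and nil are different objects. The two helpers below are hypothetical and operate on ordinary arrays; they only illustrate that filling one kind of value leaves the other untouched.

```ruby
# Hypothetical helpers (not Polars APIs) working on plain arrays.
def fill_nan_values(values, fill_value)
  values.map { |v| v.is_a?(Float) && v.nan? ? fill_value : v }
end

def fill_null_values(values, fill_value)
  values.map { |v| v.nil? ? fill_value : v }
end

values = [1.5, nil, Float::NAN, 4.0]

nan_filled = fill_nan_values(values, 99.0)
null_filled = fill_null_values(values, 0.0)

p nan_filled           # [1.5, nil, 99.0, 4.0]
p null_filled[1]       # 0.0
p null_filled[2].nan?  # true, the NaN is still there
```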

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1.5, 2, Float::NAN, 4],
    "b" => [0.5, 4, Float::NAN, 13],
  }
).lazy
df.fill_nan(99).collect
# =>
# shape: (4, 2)
# ┌──────┬──────┐
# │ a    ┆ b    │
# │ ---  ┆ ---  │
# │ f64  ┆ f64  │
# ╞══════╪══════╡
# │ 1.5  ┆ 0.5  │
# │ 2.0  ┆ 4.0  │
# │ 99.0 ┆ 99.0 │
# │ 4.0  ┆ 13.0 │
# └──────┴──────┘

Parameters:

fill_value (Object) —
Value to fill the NaN values with.

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2033

def fill_nan(fill_value)
  if !fill_value.is_a?(Expr)
    fill_value = Utils.lit(fill_value)
  end
  _from_rbldf(_ldf.fill_nan(fill_value._rbexpr))
end

#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil) ⇒ `LazyFrame`

Fill null values using the specified value or strategy.

Returns:

(LazyFrame)




# File 'lib/polars/lazy_frame.rb', line 1998

def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil)
  select(Polars.all.fill_null(value, strategy: strategy, limit: limit))
end

#filter(predicate) ⇒ `LazyFrame`

Filter the rows in the DataFrame based on a predicate expression.

Examples:

Filter on one condition:

lf = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"]
  }
).lazy
lf.filter(Polars.col("foo") < 3).collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# │ 2   ┆ 7   ┆ b   │
# └─────┴─────┴─────┘

Filter on multiple conditions:

lf.filter((Polars.col("foo") < 3) & (Polars.col("ham") == "a")).collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6   ┆ a   │
# └─────┴─────┴─────┘

Parameters:

predicate (Object) —
Expression that evaluates to a boolean Series.

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 767

def filter(predicate)
  _from_rbldf(
    _ldf.filter(
      Utils.expr_to_lit_or_expr(predicate, str_to_lit: false)._rbexpr
    )
  )
end

#first ⇒ `LazyFrame`

Get the first row of the DataFrame.

Returns:

(LazyFrame)




# File 'lib/polars/lazy_frame.rb', line 1934

def first
  slice(0, 1)
end

#group_by(by, maintain_order: false) ⇒ `LazyGroupBy` Also known as: groupby, group

Start a group by operation.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["a", "b", "a", "b", "b", "c"],
    "b" => [1, 2, 3, 4, 5, 6],
    "c" => [6, 5, 4, 3, 2, 1]
  }
).lazy
df.group_by("a", maintain_order: true).agg(Polars.col("b").sum).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ a   ┆ 4   │
# │ b   ┆ 11  │
# │ c   ┆ 6   │
# └─────┴─────┘

Parameters:

by (Object) —
Column(s) to group by.
maintain_order (Boolean) (defaults to: false) —
Make sure that the order of the groups remains consistent. This is more expensive than a default group by.

Returns:

(LazyGroupBy)

# File 'lib/polars/lazy_frame.rb', line 893

def group_by(by, maintain_order: false)
  rbexprs_by = Utils.selection_to_rbexpr_list(by)
  lgb = _ldf.group_by(rbexprs_by, maintain_order)
  LazyGroupBy.new(lgb)
end
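
For readers who want to check the numbers in the example by hand, the grouping and summing can be reproduced in plain Ruby. `group_sum` is a hypothetical helper; a Ruby Hash keeps insertion order, so the result resembles what maintain_order: true gives, with groups in order of first appearance.

```ruby
# Hypothetical helper (not a Polars API): sum values per key,
# keeping keys in order of first appearance.
def group_sum(keys, values)
  keys.zip(values).each_with_object(Hash.new(0)) do |(key, value), sums|
    sums[key] += value
  end
end

a = ["a", "b", "a", "b", "b", "c"]
b = [1, 2, 3, 4, 5, 6]
c = [6, 5, 4, 3, 2, 1]

p group_sum(a, b).to_a  # [["a", 4], ["b", 11], ["c", 6]]
p group_sum(a, c).to_a  # [["a", 10], ["b", 10], ["c", 1]]
```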

#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: nil, include_boundaries: false, closed: "left", label: "left", by: nil, start_by: "window", check_sorted: true) ⇒ `LazyGroupBy` Also known as: groupby_dynamic

Group based on a time value (or index value of type :i32, :i64).

Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. The time/index window can be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.

A window is defined by:

every: interval of the window
period: length of the window
offset: offset of the window

The every, period and offset arguments are created with the following string language:

1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 day)
1w (1 week)
1mo (1 calendar month)
1y (1 calendar year)
1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
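
To make the string language concrete, here is a minimal plain-Ruby tokenizer for such strings. `parse_duration` and `DURATION_TOKEN` are hypothetical names; Polars parses these strings internally, and this sketch makes no claim about how. It does not handle a leading minus sign.

```ruby
# Hypothetical tokenizer (not a Polars API). Two-letter units come first
# in the alternation so that "mo" and "ms" are not read as "m".
DURATION_TOKEN = /(\d+)(ns|us|ms|mo|s|m|h|d|w|y|i)/

def parse_duration(text)
  tokens = text.scan(DURATION_TOKEN)
  matched_length = tokens.sum { |count, unit| count.length + unit.length }

  # Every character must belong to some "<count><unit>" token.
  if tokens.empty? || matched_length != text.length
    raise ArgumentError, "invalid duration: #{text.inspect}"
  end

  tokens.map { |count, unit| [count.to_i, unit] }
end

p parse_duration("3d12h4m25s")  # [[3, "d"], [12, "h"], [4, "m"], [25, "s"]]
p parse_duration("1mo")         # [[1, "mo"]]
p parse_duration("10i")         # [[10, "i"]]
```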

In case of a group_by_dynamic on an integer column, the windows are defined by:

"1i" # length 1
"10i" # length 10

Examples:

df = Polars::DataFrame.new(
  {
    "time" => Polars.date_range(
      DateTime.new(2021, 12, 16),
      DateTime.new(2021, 12, 16, 3),
      "30m"
    ),
    "n" => 0..6
  }
)
# =>
# shape: (7, 2)
# ┌─────────────────────┬─────┐
# │ time                ┆ n   │
# │ ---                 ┆ --- │
# │ datetime[μs]        ┆ i64 │
# ╞═════════════════════╪═════╡
# │ 2021-12-16 00:00:00 ┆ 0   │
# │ 2021-12-16 00:30:00 ┆ 1   │
# │ 2021-12-16 01:00:00 ┆ 2   │
# │ 2021-12-16 01:30:00 ┆ 3   │
# │ 2021-12-16 02:00:00 ┆ 4   │
# │ 2021-12-16 02:30:00 ┆ 5   │
# │ 2021-12-16 03:00:00 ┆ 6   │
# └─────────────────────┴─────┘

Group by windows of 1 hour starting at 2021-12-16 00:00:00.

df.group_by_dynamic("time", every: "1h", closed: "right").agg(
  [
    Polars.col("time").min.alias("time_min"),
    Polars.col("time").max.alias("time_max")
  ]
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬─────────────────────┬─────────────────────┐
# │ time                ┆ time_min            ┆ time_max            │
# │ ---                 ┆ ---                 ┆ ---                 │
# │ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        │
# ╞═════════════════════╪═════════════════════╪═════════════════════╡
# │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 00:00:00 │
# │ 2021-12-16 00:00:00 ┆ 2021-12-16 00:30:00 ┆ 2021-12-16 01:00:00 │
# │ 2021-12-16 01:00:00 ┆ 2021-12-16 01:30:00 ┆ 2021-12-16 02:00:00 │
# │ 2021-12-16 02:00:00 ┆ 2021-12-16 02:30:00 ┆ 2021-12-16 03:00:00 │
# └─────────────────────┴─────────────────────┴─────────────────────┘

The window boundaries can also be added to the aggregation result.

df.group_by_dynamic(
  "time", every: "1h", include_boundaries: true, closed: "right"
).agg([Polars.col("time").count.alias("time_count")])
# =>
# shape: (4, 4)
# ┌─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
# │ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
# │ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
# │ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
# ╞═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
# │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
# │ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 2          │
# │ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
# │ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
# └─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

When closed="left", the right end of each interval is not included.

df.group_by_dynamic("time", every: "1h", closed: "left").agg(
  [
    Polars.col("time").count.alias("time_count"),
    Polars.col("time").alias("time_agg_list")
  ]
)
# =>
# shape: (4, 3)
# ┌─────────────────────┬────────────┬───────────────────────────────────┐
# │ time                ┆ time_count ┆ time_agg_list                     │
# │ ---                 ┆ ---        ┆ ---                               │
# │ datetime[μs]        ┆ u32        ┆ list[datetime[μs]]                │
# ╞═════════════════════╪════════════╪═══════════════════════════════════╡
# │ 2021-12-16 00:00:00 ┆ 2          ┆ [2021-12-16 00:00:00, 2021-12-16… │
# │ 2021-12-16 01:00:00 ┆ 2          ┆ [2021-12-16 01:00:00, 2021-12-16… │
# │ 2021-12-16 02:00:00 ┆ 2          ┆ [2021-12-16 02:00:00, 2021-12-16… │
# │ 2021-12-16 03:00:00 ┆ 1          ┆ [2021-12-16 03:00:00]             │
# └─────────────────────┴────────────┴───────────────────────────────────┘

When closed="both" the time values at the window boundaries belong to 2 groups.

df.group_by_dynamic("time", every: "1h", closed: "both").agg(
  [Polars.col("time").count.alias("time_count")]
)
# =>
# shape: (5, 2)
# ┌─────────────────────┬────────────┐
# │ time                ┆ time_count │
# │ ---                 ┆ ---        │
# │ datetime[μs]        ┆ u32        │
# ╞═════════════════════╪════════════╡
# │ 2021-12-15 23:00:00 ┆ 1          │
# │ 2021-12-16 00:00:00 ┆ 3          │
# │ 2021-12-16 01:00:00 ┆ 3          │
# │ 2021-12-16 02:00:00 ┆ 3          │
# │ 2021-12-16 03:00:00 ┆ 1          │
# └─────────────────────┴────────────┘
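
The time_count values in the closed examples above can be checked with a plain-Ruby sketch. Times are expressed as minutes since midnight, window starts are assumed to fall on whole hours from 23:00 the previous day, and empty windows are dropped. `in_window?` and `window_counts` are hypothetical helpers, not Polars calls.

```ruby
# Hypothetical helpers (not Polars APIs): interval membership per `closed`.
def in_window?(value, lower, upper, closed)
  case closed
  when "left"  then lower <= value && value < upper
  when "right" then lower < value && value <= upper
  when "both"  then lower <= value && value <= upper
  when "none"  then lower < value && value < upper
  else raise ArgumentError, "unknown closed value: #{closed.inspect}"
  end
end

# Count members of each window and skip windows without any rows.
def window_counts(values, starts, period, closed)
  counts = starts.map do |lower|
    values.count { |v| in_window?(v, lower, lower + period, closed) }
  end
  counts.reject(&:zero?)
end

minutes = (0..180).step(30).to_a  # 00:00, 00:30, ..., 03:00
starts = (-60..180).step(60).to_a # 23:00 (previous day), 00:00, ..., 03:00

p window_counts(minutes, starts, 60, "right")  # [1, 2, 2, 2]
p window_counts(minutes, starts, 60, "left")   # [2, 2, 2, 1]
p window_counts(minutes, starts, 60, "both")   # [1, 3, 3, 3, 1]
```

With closed: "both", the values on a boundary (for example 01:00) are counted in the window that ends there and in the one that starts there.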

Dynamic group bys can also be combined with grouping on normal keys.

df = Polars::DataFrame.new(
  {
    "time" => Polars.date_range(
      DateTime.new(2021, 12, 16),
      DateTime.new(2021, 12, 16, 3),
      "30m"
    ),
    "groups" => ["a", "a", "a", "b", "b", "a", "a"]
  }
)
df.group_by_dynamic(
  "time",
  every: "1h",
  closed: "both",
  by: "groups",
  include_boundaries: true
).agg([Polars.col("time").count.alias("time_count")])
# =>
# shape: (7, 5)
# ┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
# │ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
# │ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
# │ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
# ╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
# │ a      ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
# │ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3          │
# │ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1          │
# │ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
# │ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1          │
# │ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
# │ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1          │
# └────────┴─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

Dynamic group by on an index column.

df = Polars::DataFrame.new(
  {
    "idx" => Polars.arange(0, 6, eager: true),
    "A" => ["A", "A", "B", "B", "B", "C"]
  }
)
df.group_by_dynamic(
  "idx",
  every: "2i",
  period: "3i",
  include_boundaries: true,
  closed: "right"
).agg(Polars.col("A").alias("A_agg_list"))
# =>
# shape: (3, 4)
# ┌─────────────────┬─────────────────┬─────┬─────────────────┐
# │ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list      │
# │ ---             ┆ ---             ┆ --- ┆ ---             │
# │ i64             ┆ i64             ┆ i64 ┆ list[str]       │
# ╞═════════════════╪═════════════════╪═════╪═════════════════╡
# │ 0               ┆ 3               ┆ 0   ┆ ["A", "B", "B"] │
# │ 2               ┆ 5               ┆ 2   ┆ ["B", "B", "C"] │
# │ 4               ┆ 7               ┆ 4   ┆ ["C"]           │
# └─────────────────┴─────────────────┴─────┴─────────────────┘
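
The index example also shows a row landing in more than one group, because period ("3i") is longer than every ("2i"). A plain-Ruby sketch, with the window starts 0, 2 and 4 taken from the _lower_boundary column above and closed: "right" assumed, reproduces the lists. `index_windows` is a hypothetical helper.

```ruby
# Hypothetical helper (not a Polars API): right-closed windows (lower, upper].
def index_windows(idx, values, starts, period)
  starts.map do |lower|
    upper = lower + period
    members = idx.zip(values)
                 .select { |i, _| lower < i && i <= upper }
                 .map { |_, value| value }
    [lower, upper, members]
  end
end

idx = [0, 1, 2, 3, 4, 5]
a = ["A", "A", "B", "B", "B", "C"]

windows = index_windows(idx, a, [0, 2, 4], 3)
windows.each { |window| p window }
# [0, 3, ["A", "B", "B"]]
# [2, 5, ["B", "B", "C"]]
# [4, 7, ["C"]]
```

Index 3 falls into both (0, 3] and (2, 5], which is why its "B" appears in the first two lists.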

Parameters:

index_column (Object) —
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; otherwise the output will not make sense.

In case of a dynamic group by on indices, dtype needs to be one of :i32, :i64. Note that :i32 gets temporarily cast to :i64, so if performance matters use an :i64 column.
every (Object) —
Interval of the window.
period (Object) (defaults to: nil) —
Length of the window. If nil, it is equal to every.
offset (Object) (defaults to: nil) —
Offset of the window. If nil and period is also nil, it is equal to negative every.
truncate (Boolean) (defaults to: nil) —
Truncate the time value to the window lower bound.
include_boundaries (Boolean) (defaults to: false) —
Add the lower and upper bound of the window to the "_lower_boundary" and "_upper_boundary" columns. This will impact performance because it's harder to parallelize.
closed ("right", "left", "both", "none") (defaults to: "left") —
Define whether the temporal window interval is closed or not.
by (Object) (defaults to: nil) —
Also group by this column or these columns.
check_sorted (Boolean) (defaults to: true) —
When the by argument is given, Polars cannot check sortedness from the metadata and has to do a full scan on the index column to verify that the data is sorted. This is expensive. If you are sure the data within the by groups is sorted, you can set this to false. Doing so incorrectly will lead to incorrect output.

Returns:

(LazyGroupBy)

# File 'lib/polars/lazy_frame.rb', line 1247

def group_by_dynamic(
  index_column,
  every:,
  period: nil,
  offset: nil,
  truncate: nil,
  include_boundaries: false,
  closed: "left",
  label: "left",
  by: nil,
  start_by: "window",
  check_sorted: true
)
  if !truncate.nil?
    label = truncate ? "left" : "datapoint"
  end

  index_column = Utils.expr_to_lit_or_expr(index_column, str_to_lit: false)
  if offset.nil?
    offset = period.nil? ? "-#{every}" : "0ns"
  end

  if period.nil?
    period = every
  end

  period = Utils._timedelta_to_pl_duration(period)
  offset = Utils._timedelta_to_pl_duration(offset)
  every = Utils._timedelta_to_pl_duration(every)

  rbexprs_by = by.nil? ? [] : Utils.selection_to_rbexpr_list(by)
  lgb = _ldf.group_by_dynamic(
    index_column._rbexpr,
    every,
    period,
    offset,
    label,
    include_boundaries,
    closed,
    rbexprs_by,
    start_by,
    check_sorted
  )
  LazyGroupBy.new(lgb)
end
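
The defaulting rules in the method body above are order-sensitive: offset is derived before period is filled in. The following plain-Ruby sketch isolates just that part. `resolve_dynamic_window` is a hypothetical helper that copies the branching from the source and skips the duration conversion.

```ruby
# Hypothetical helper (not a Polars API): mirrors the defaulting logic
# of #group_by_dynamic without converting the duration strings.
def resolve_dynamic_window(every:, period: nil, offset: nil, truncate: nil, label: "left")
  # A non-nil truncate overrides label.
  unless truncate.nil?
    label = truncate ? "left" : "datapoint"
  end

  # offset depends on whether period was given, so resolve it first.
  if offset.nil?
    offset = period.nil? ? "-#{every}" : "0ns"
  end

  period = every if period.nil?

  {every: every, period: period, offset: offset, label: label}
end

only_every = resolve_dynamic_window(every: "1h")
with_period = resolve_dynamic_window(every: "2i", period: "3i")

puts only_every[:offset]   # -1h
puts only_every[:period]   # 1h
puts with_period[:offset]  # 0ns
```

The "-1h" default offset is consistent with the first window in the closed: "right" example starting at 2021-12-15 23:00:00.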

#group_by_rolling(index_column:, period:, offset: nil, closed: "right", by: nil, check_sorted: true) ⇒ `LazyFrame` Also known as: groupby_rolling

Create rolling groups based on a time column.

Also works for index values of type :i32 or :i64.

Unlike group_by_dynamic, the windows are determined by the individual values and are not of constant intervals. For constant intervals use group_by_dynamic.

The period and offset arguments are created either from a timedelta, or by using the following string language:

1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 day)
1w (1 week)
1mo (1 calendar month)
1y (1 calendar year)
1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
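As a rough illustration of how a combined string breaks down into units, here is a plain-Ruby sketch. `split_duration` is a hypothetical helper written for this page, not part of Polars, and it ignores the leading minus sign used for negative offsets.

```ruby
# Hypothetical helper, not part of Polars: split a duration string into
# [count, unit] pairs using the units listed above.
# Two-letter units come first so that "ms" and "mo" are not read as "m".
DURATION_UNITS = %w[ns us ms mo s m h d w y i].freeze
DURATION_TOKEN = /(\d+)(#{DURATION_UNITS.join("|")})/

def split_duration(str)
  tokens = str.scan(DURATION_TOKEN)
  # Reject input that contains anything besides valid tokens.
  unless tokens.map(&:join).join == str
    raise ArgumentError, "invalid duration string: #{str.inspect}"
  end
  tokens.map { |count, unit| [count.to_i, unit] }
end

p split_duration("3d12h4m25s") # => [[3, "d"], [12, "h"], [4, "m"], [25, "s"]]
```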

In case of a group_by_rolling on an integer column, the windows are defined by:

"1i" # length 1
"10i" # length 10

Examples:

dates = [
  "2020-01-01 13:45:48",
  "2020-01-01 16:42:13",
  "2020-01-01 16:45:09",
  "2020-01-02 18:12:48",
  "2020-01-03 19:45:32",
  "2020-01-08 23:16:43"
]
df = Polars::LazyFrame.new({"dt" => dates, "a" => [3, 7, 5, 9, 2, 1]}).with_column(
  Polars.col("dt").str.strptime(Polars::Datetime).set_sorted
)
df.group_by_rolling(index_column: "dt", period: "2d").agg(
  [
    Polars.sum("a").alias("sum_a"),
    Polars.min("a").alias("min_a"),
    Polars.max("a").alias("max_a")
  ]
).collect
# =>
# shape: (6, 4)
# ┌─────────────────────┬───────┬───────┬───────┐
# │ dt                  ┆ sum_a ┆ min_a ┆ max_a │
# │ ---                 ┆ ---   ┆ ---   ┆ ---   │
# │ datetime[μs]        ┆ i64   ┆ i64   ┆ i64   │
# ╞═════════════════════╪═══════╪═══════╪═══════╡
# │ 2020-01-01 13:45:48 ┆ 3     ┆ 3     ┆ 3     │
# │ 2020-01-01 16:42:13 ┆ 10    ┆ 3     ┆ 7     │
# │ 2020-01-01 16:45:09 ┆ 15    ┆ 3     ┆ 7     │
# │ 2020-01-02 18:12:48 ┆ 24    ┆ 3     ┆ 9     │
# │ 2020-01-03 19:45:32 ┆ 11    ┆ 2     ┆ 9     │
# │ 2020-01-08 23:16:43 ┆ 1     ┆ 1     ┆ 1     │
# └─────────────────────┴───────┴───────┴───────┘
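The window membership behind the sum_a column above can be reproduced in plain Ruby. This is a sketch of the semantics only, not how Polars computes it; it assumes closed: "right" and the default offset of -period, so each window is (t - period, t], and it treats "2d" as a fixed 48 hours. `rolling_sums` and `ROLLING_ROWS` are names made up for the illustration.

```ruby
# Plain-Ruby illustration, not Polars code: the same timestamps and values
# as in the example above.
ROLLING_ROWS = [
  [Time.utc(2020, 1, 1, 13, 45, 48), 3],
  [Time.utc(2020, 1, 1, 16, 42, 13), 7],
  [Time.utc(2020, 1, 1, 16, 45, 9), 5],
  [Time.utc(2020, 1, 2, 18, 12, 48), 9],
  [Time.utc(2020, 1, 3, 19, 45, 32), 2],
  [Time.utc(2020, 1, 8, 23, 16, 43), 1]
].freeze

def rolling_sums(rows, period_seconds)
  rows.map do |t, _|
    # Each row gets its own window: strictly after t - period, up to and including t.
    window = rows.select { |other_t, _| other_t > t - period_seconds && other_t <= t }
    window.sum { |_, a| a }
  end
end

p rolling_sums(ROLLING_ROWS, 2 * 24 * 60 * 60) # => [3, 10, 15, 24, 11, 1]
```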

Parameters:

index_column (Object) —
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; if it is not, the output will not make sense.

In case of a rolling group by on indices, dtype needs to be one of :i32, :i64. Note that :i32 gets temporarily cast to :i64, so if performance matters use an :i64 column.
period (Object) —
Length of the window.
offset (Object) (defaults to: nil) —
Offset of the window. Default is -period.
closed ("right", "left", "both", "none") (defaults to: "right") —
Define whether the temporal window interval is closed or not.
by (Object) (defaults to: nil) —
Also group by this column/these columns.
check_sorted (Boolean) (defaults to: true) —
When the by argument is given, Polars cannot check sortedness from the metadata and has to do a full scan of the index column to verify that the data is sorted. This is expensive. If you are sure the data within the by groups is sorted, you can set this to false. Doing so incorrectly will lead to incorrect output.

Returns:

(LazyGroupBy)

# File 'lib/polars/lazy_frame.rb', line 991

def group_by_rolling(
  index_column:,
  period:,
  offset: nil,
  closed: "right",
  by: nil,
  check_sorted: true
)
  index_column = Utils.parse_as_expression(index_column)
  if offset.nil?
    offset = "-#{period}"
  end

  rbexprs_by = by.nil? ? [] : Utils.selection_to_rbexpr_list(by)
  period = Utils._timedelta_to_pl_duration(period)
  offset = Utils._timedelta_to_pl_duration(offset)

  lgb = _ldf.group_by_rolling(
    index_column, period, offset, closed, rbexprs_by, check_sorted
  )
  LazyGroupBy.new(lgb)
end

#head(n = 5) ⇒ `LazyFrame`

Note:

Consider using the #fetch operation if you only want to test your query. The #fetch operation will load the first n rows at the scan level, whereas the #head/#limit are applied at the end.

Get the first n rows.

Parameters:

n (Integer) (defaults to: 5) —
Number of rows to return.

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 1910

def head(n = 5)
  slice(0, n)
end

#include?(key) ⇒ `Boolean`

Check if LazyFrame includes key.

Returns:

(Boolean)



# File 'lib/polars/lazy_frame.rb', line 259

def include?(key)
  columns.include?(key)
end

#interpolate ⇒ `LazyFrame`

Interpolate intermediate values. The interpolation method is linear.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, nil, 9, 10],
    "bar" => [6, 7, 9, nil],
    "baz" => [1, nil, nil, 9]
  }
).lazy
df.interpolate.collect
# =>
# shape: (4, 3)
# ┌──────┬──────┬──────────┐
# │ foo  ┆ bar  ┆ baz      │
# │ ---  ┆ ---  ┆ ---      │
# │ f64  ┆ f64  ┆ f64      │
# ╞══════╪══════╪══════════╡
# │ 1.0  ┆ 6.0  ┆ 1.0      │
# │ 5.0  ┆ 7.0  ┆ 3.666667 │
# │ 9.0  ┆ 9.0  ┆ 6.333333 │
# │ 10.0 ┆ null ┆ 9.0      │
# └──────┴──────┴──────────┘
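The per-column arithmetic in the table above can be sketched in plain Ruby. `interpolate_linear` is a hypothetical helper for illustration, not the Polars implementation; it fills nils that lie between two known values and leaves leading or trailing nils untouched, which matches the trailing null in the "bar" column.

```ruby
# Illustrative only: linear interpolation over one column's values.
def interpolate_linear(values)
  result = values.map { |v| v&.to_f }
  known = result.each_index.select { |i| !result[i].nil? }
  # Walk each pair of neighbouring known values and fill the gap between them.
  known.each_cons(2) do |lo, hi|
    step = (result[hi] - result[lo]) / (hi - lo)
    ((lo + 1)...hi).each { |i| result[i] = result[lo] + step * (i - lo) }
  end
  result
end

p interpolate_linear([1, nil, 9, 10]) # => [1.0, 5.0, 9.0, 10.0]
p interpolate_linear([6, 7, 9, nil])  # => [6.0, 7.0, 9.0, nil]
```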

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 2411

def interpolate
  select(Utils.col("*").interpolate)
end

#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", allow_parallel: true, force_parallel: false) ⇒ `LazyFrame`

Add a join operation to the Logical Plan.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
other_df = Polars::DataFrame.new(
  {
    "apple" => ["x", "y", "z"],
    "ham" => ["a", "b", "d"]
  }
).lazy
df.join(other_df, on: "ham").collect
# =>
# shape: (2, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ str ┆ str   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6.0 ┆ a   ┆ x     │
# │ 2   ┆ 7.0 ┆ b   ┆ y     │
# └─────┴─────┴─────┴───────┘

df.join(other_df, on: "ham", how: "outer").collect
# =>
# shape: (4, 5)
# ┌──────┬──────┬──────┬───────┬───────────┐
# │ foo  ┆ bar  ┆ ham  ┆ apple ┆ ham_right │
# │ ---  ┆ ---  ┆ ---  ┆ ---   ┆ ---       │
# │ i64  ┆ f64  ┆ str  ┆ str   ┆ str       │
# ╞══════╪══════╪══════╪═══════╪═══════════╡
# │ 1    ┆ 6.0  ┆ a    ┆ x     ┆ a         │
# │ 2    ┆ 7.0  ┆ b    ┆ y     ┆ b         │
# │ null ┆ null ┆ null ┆ z     ┆ d         │
# │ 3    ┆ 8.0  ┆ c    ┆ null  ┆ null      │
# └──────┴──────┴──────┴───────┴───────────┘

df.join(other_df, on: "ham", how: "left").collect
# =>
# shape: (3, 4)
# ┌─────┬─────┬─────┬───────┐
# │ foo ┆ bar ┆ ham ┆ apple │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ f64 ┆ str ┆ str   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 6.0 ┆ a   ┆ x     │
# │ 2   ┆ 7.0 ┆ b   ┆ y     │
# │ 3   ┆ 8.0 ┆ c   ┆ null  │
# └─────┴─────┴─────┴───────┘

df.join(other_df, on: "ham", how: "semi").collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 6.0 ┆ a   │
# │ 2   ┆ 7.0 ┆ b   │
# └─────┴─────┴─────┘

df.join(other_df, on: "ham", how: "anti").collect
# =>
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# └─────┴─────┴─────┘

Parameters:

other (LazyFrame) —
Lazy DataFrame to join with.
left_on (Object) (defaults to: nil) —
Join column of the left DataFrame.
right_on (Object) (defaults to: nil) —
Join column of the right DataFrame.
on (Object) (defaults to: nil) —
Join column of both DataFrames. If set, left_on and right_on should be nil.
how ("inner", "left", "outer", "semi", "anti", "cross") (defaults to: "inner") —
Join strategy.
suffix (String) (defaults to: "_right") —
Suffix to append to columns with a duplicate name.
allow_parallel (Boolean) (defaults to: true) —
Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
force_parallel (Boolean) (defaults to: false) —
Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1531

def join(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  how: "inner",
  suffix: "_right",
  allow_parallel: true,
  force_parallel: false
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if how == "cross"
    return _from_rbldf(
      _ldf.join(
        other._ldf, [], [], allow_parallel, force_parallel, how, suffix
      )
    )
  end

  if !on.nil?
    rbexprs = Utils.selection_to_rbexpr_list(on)
    rbexprs_left = rbexprs
    rbexprs_right = rbexprs
  elsif !left_on.nil? && !right_on.nil?
    rbexprs_left = Utils.selection_to_rbexpr_list(left_on)
    rbexprs_right = Utils.selection_to_rbexpr_list(right_on)
  else
    raise ArgumentError, "must specify `on` OR `left_on` and `right_on`"
  end

  _from_rbldf(
    self._ldf.join(
      other._ldf,
      rbexprs_left,
      rbexprs_right,
      allow_parallel,
      force_parallel,
      how,
      suffix,
    )
  )
end

#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false) ⇒ `LazyFrame`

Perform an asof join.

This is similar to a left-join except that we match on nearest key rather than equal keys.

Both DataFrames must be sorted by the join_asof key.

For each row in the left DataFrame:

A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.

The default is "backward".
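The two search strategies can be sketched in plain Ruby for a single left key. `asof_match` is a hypothetical helper for illustration, not a Polars method, and it assumes the right keys are already sorted in ascending order.

```ruby
# Illustrative only: pick the matching right key for one left key.
# right_keys must be sorted ascending.
def asof_match(left_key, right_keys, strategy: "backward")
  case strategy
  when "backward"
    # Last right key that is less than or equal to the left key.
    right_keys.select { |k| k <= left_key }.last
  when "forward"
    # First right key that is greater than or equal to the left key.
    right_keys.find { |k| k >= left_key }
  else
    raise ArgumentError, "unknown strategy: #{strategy}"
  end
end

right = [2, 3, 7]
p [1, 5, 10].map { |k| asof_match(k, right) }                      # => [nil, 3, 7]
p [1, 5, 10].map { |k| asof_match(k, right, strategy: "forward") } # => [2, 7, nil]
```

A left key with no match on its side of the search gets nil, much as a left join leaves unmatched rows with null values.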

Parameters:

other (LazyFrame) —
Lazy DataFrame to join with.
left_on (String) (defaults to: nil) —
Join column of the left DataFrame.
right_on (String) (defaults to: nil) —
Join column of the right DataFrame.
on (String) (defaults to: nil) —
Join column of both DataFrames. If set, left_on and right_on should be nil.
by (Object) (defaults to: nil) —
Join on these columns before doing asof join.
by_left (Object) (defaults to: nil) —
Join on these columns before doing asof join.
by_right (Object) (defaults to: nil) —
Join on these columns before doing asof join.
strategy ("backward", "forward") (defaults to: "backward") —
Join strategy.
suffix (String) (defaults to: "_right") —
Suffix to append to columns with a duplicate name.
tolerance (Object) (defaults to: nil) —
Numeric tolerance. When set, a match is made only if the nearest keys are within this distance. For an asof join on columns of dtype "Date", "Datetime", "Duration" or "Time", use the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
allow_parallel (Boolean) (defaults to: true) —
Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
force_parallel (Boolean) (defaults to: false) —
Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1356

def join_asof(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  by_left: nil,
  by_right: nil,
  by: nil,
  strategy: "backward",
  suffix: "_right",
  tolerance: nil,
  allow_parallel: true,
  force_parallel: false
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if on.is_a?(::String)
    left_on = on
    right_on = on
  end

  if left_on.nil? || right_on.nil?
    raise ArgumentError, "You should pass the column to join on as an argument."
  end

  if by_left.is_a?(::String) || by_left.is_a?(Expr)
    by_left_ = [by_left]
  else
    by_left_ = by_left
  end

  if by_right.is_a?(::String) || by_right.is_a?(Expr)
    by_right_ = [by_right]
  else
    by_right_ = by_right
  end

  if by.is_a?(::String)
    by_left_ = [by]
    by_right_ = [by]
  elsif by.is_a?(::Array)
    by_left_ = by
    by_right_ = by
  end

  tolerance_str = nil
  tolerance_num = nil
  if tolerance.is_a?(::String)
    tolerance_str = tolerance
  else
    tolerance_num = tolerance
  end

  _from_rbldf(
    _ldf.join_asof(
      other._ldf,
      Polars.col(left_on)._rbexpr,
      Polars.col(right_on)._rbexpr,
      by_left_,
      by_right_,
      allow_parallel,
      force_parallel,
      suffix,
      strategy,
      tolerance_num,
      tolerance_str
    )
  )
end

#last ⇒ `LazyFrame`

Get the last row of the DataFrame.

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 1927

def last
  tail(1)
end

#lazy ⇒ `LazyFrame`

Return lazy representation, i.e. itself.

Useful for writing code that expects either a DataFrame or LazyFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [nil, 2, 3, 4],
    "b" => [0.5, nil, 2.5, 13],
    "c" => [true, true, false, nil]
  }
)
df.lazy

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 691

def lazy
  self
end

#limit(n = 5) ⇒ `LazyFrame`

Note:

Consider using the #fetch operation if you only want to test your query. The #fetch operation will load the first n rows at the scan level, whereas the #head/#limit are applied at the end.

Get the first n rows.

Alias for #head.

Parameters:

n (Integer) (defaults to: 5) —
Number of rows to return.

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 1895

def limit(n = 5)
  # Pass n through so that limit behaves as an alias for head.
  head(n)
end

#max ⇒ `LazyFrame`

Aggregate the columns in the DataFrame to their maximum value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.max.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 4   ┆ 2   │
# └─────┴─────┘

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 2120

def max
  _from_rbldf(_ldf.max)
end

#mean ⇒ `LazyFrame`

Aggregate the columns in the DataFrame to their mean value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.mean.collect
# =>
# shape: (1, 2)
# ┌─────┬──────┐
# │ a   ┆ b    │
# │ --- ┆ ---  │
# │ f64 ┆ f64  │
# ╞═════╪══════╡
# │ 2.5 ┆ 1.25 │
# └─────┴──────┘

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 2180

def mean
  _from_rbldf(_ldf.mean)
end

#median ⇒ `LazyFrame`

Aggregate the columns in the DataFrame to their median value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.median.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ f64 ┆ f64 │
# ╞═════╪═════╡
# │ 2.5 ┆ 1.0 │
# └─────┴─────┘

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 2200

def median
  _from_rbldf(_ldf.median)
end

#melt(id_vars: nil, value_vars: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ `LazyFrame`

Unpivot a DataFrame from wide to long format.

Optionally leaves identifiers set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are "unpivoted" to the row axis, leaving just two non-identifier columns, 'variable' and 'value'.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["x", "y", "z"],
    "b" => [1, 3, 5],
    "c" => [2, 4, 6]
  }
).lazy
df.melt(id_vars: "a", value_vars: ["b", "c"]).collect
# =>
# shape: (6, 3)
# ┌─────┬──────────┬───────┐
# │ a   ┆ variable ┆ value │
# │ --- ┆ ---      ┆ ---   │
# │ str ┆ str      ┆ i64   │
# ╞═════╪══════════╪═══════╡
# │ x   ┆ b        ┆ 1     │
# │ y   ┆ b        ┆ 3     │
# │ z   ┆ b        ┆ 5     │
# │ x   ┆ c        ┆ 2     │
# │ y   ┆ c        ┆ 4     │
# │ z   ┆ c        ┆ 6     │
# └─────┴──────────┴───────┘
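The reshaping shown above can be sketched in plain Ruby on an array of hashes. `melt_rows` is a hypothetical helper for illustration, not the Polars implementation; the "variable" and "value" defaults mirror the column names in the output above.

```ruby
# Illustrative only: unpivot rows from wide to long format.
def melt_rows(rows, id_vars:, value_vars:, variable_name: "variable", value_name: "value")
  # One block of output rows per value column, in the order given.
  value_vars.flat_map do |var|
    rows.map do |row|
      row.slice(*id_vars).merge(variable_name => var, value_name => row[var])
    end
  end
end

rows = [
  {"a" => "x", "b" => 1, "c" => 2},
  {"a" => "y", "b" => 3, "c" => 4},
  {"a" => "z", "b" => 5, "c" => 6}
]
melt_rows(rows, id_vars: ["a"], value_vars: ["b", "c"]).each { |r| p r }
```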

Parameters:

id_vars (Object) (defaults to: nil) —
Columns to use as identifier variables.
value_vars (Object) (defaults to: nil) —
Columns to use as value variables. If value_vars is empty, all columns that are not in id_vars will be used.
variable_name (String) (defaults to: nil) —
Name to give to the variable column. Defaults to "variable".
value_name (String) (defaults to: nil) —
Name to give to the value column. Defaults to "value".
streamable (Boolean) (defaults to: true) —
Allow this node to run in the streaming engine. If this runs in streaming, the output of the melt operation will not have a stable ordering.

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2365

def melt(id_vars: nil, value_vars: nil, variable_name: nil, value_name: nil, streamable: true)
  if value_vars.is_a?(::String)
    value_vars = [value_vars]
  end
  if id_vars.is_a?(::String)
    id_vars = [id_vars]
  end
  if value_vars.nil?
    value_vars = []
  end
  if id_vars.nil?
    id_vars = []
  end
  _from_rbldf(
    _ldf.melt(id_vars, value_vars, value_name, variable_name, streamable)
  )
end

#min ⇒ `LazyFrame`

Aggregate the columns in the DataFrame to their minimum value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.min.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 1   │
# └─────┴─────┘

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 2140

def min
  _from_rbldf(_ldf.min)
end

#pipe(func, *args, **kwargs, &block) ⇒ `LazyFrame`

Offers a structured way to apply a sequence of user-defined functions (UDFs).

Examples:

cast_str_to_int = lambda do |data, col_name:|
  data.with_column(Polars.col(col_name).cast(:i64))
end

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => ["10", "20", "30", "40"]}).lazy
df.pipe(cast_str_to_int, col_name: "b").collect()
# =>
# shape: (4, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 10  │
# │ 2   ┆ 20  │
# │ 3   ┆ 30  │
# │ 4   ┆ 40  │
# └─────┴─────┘

Parameters:

func (Object) —
Callable; will receive the frame as the first parameter, followed by any given args/kwargs.
args (Object) —
Arguments to pass to the UDF.
kwargs (Object) —
Keyword arguments to pass to the UDF.

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 324

def pipe(func, *args, **kwargs, &block)
  func.call(self, *args, **kwargs, &block)
end
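Because #pipe simply calls the function with the frame as the first argument, several steps can be chained. The sketch below shows that pattern with a hypothetical stand-in class (`TinyFrame`) and two made-up callables, so it runs without Polars; only the body of `pipe` is taken from the source above.

```ruby
# Hypothetical stand-in for a frame; only #pipe mirrors the source above.
class TinyFrame
  attr_reader :values

  def initialize(values)
    @values = values
  end

  def pipe(func, *args, **kwargs, &block)
    func.call(self, *args, **kwargs, &block)
  end
end

# Made-up steps: one takes a positional argument, one a keyword argument.
SCALE = lambda do |frame, factor|
  TinyFrame.new(frame.values.map { |v| v * factor })
end

CLAMP = lambda do |frame, max:|
  TinyFrame.new(frame.values.map { |v| [v, max].min })
end

result = TinyFrame.new([1, 2, 3]).pipe(SCALE, 10).pipe(CLAMP, max: 25)
p result.values # => [10, 20, 25]
```

Each step returns a new frame, so the chain reads top to bottom in the order the steps run.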

#quantile(quantile, interpolation: "nearest") ⇒ `LazyFrame`

Aggregate the columns in the DataFrame to their quantile value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.quantile(0.7).collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ f64 ┆ f64 │
# ╞═════╪═════╡
# │ 3.0 ┆ 1.0 │
# └─────┴─────┘

Parameters:

quantile (Float) —
Quantile between 0.0 and 1.0.
interpolation ("nearest", "higher", "lower", "midpoint", "linear") (defaults to: "nearest") —
Interpolation method.
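To show how the interpolation options differ, here is a plain-Ruby sketch. It assumes the quantile position is computed as quantile * (n - 1) on the sorted values, which reproduces the example above but is an assumption about the definition rather than something this page states. `quantile_sketch` is a hypothetical helper.

```ruby
# Illustrative only; assumes position = q * (n - 1) on the sorted values.
def quantile_sketch(values, q, interpolation: "nearest")
  sorted = values.sort
  pos = q * (sorted.size - 1)
  lo = sorted[pos.floor].to_f
  hi = sorted[pos.ceil].to_f
  case interpolation
  when "nearest" then sorted[pos.round].to_f
  when "lower" then lo
  when "higher" then hi
  when "midpoint" then (lo + hi) / 2
  when "linear" then lo + (hi - lo) * (pos - pos.floor)
  else raise ArgumentError, "unknown interpolation: #{interpolation}"
  end
end

p quantile_sketch([1, 2, 3, 4], 0.7)                            # => 3.0
p quantile_sketch([1, 2, 3, 4], 0.7, interpolation: "higher")   # => 4.0
p quantile_sketch([1, 2, 3, 4], 0.7, interpolation: "midpoint") # => 3.5
```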

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2225

def quantile(quantile, interpolation: "nearest")
  quantile = Utils.expr_to_lit_or_expr(quantile, str_to_lit: false)
  _from_rbldf(_ldf.quantile(quantile._rbexpr, interpolation))
end

#rename(mapping) ⇒ `LazyFrame`

Rename column names.

Parameters:

mapping (Hash) —
Key value pairs that map from old name to new name.

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1737

def rename(mapping)
  existing = mapping.keys
  _new = mapping.values
  _from_rbldf(_ldf.rename(existing, _new))
end

#reverse ⇒ `LazyFrame`

Reverse the DataFrame.

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 1746

def reverse
  _from_rbldf(_ldf.reverse)
end

#schema ⇒ `Hash`

Get the schema.

Examples:

lf = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
lf.schema
# => {"foo"=>Polars::Int64, "bar"=>Polars::Float64, "ham"=>Polars::String}

Returns:

(Hash)



# File 'lib/polars/lazy_frame.rb', line 240

def schema
  _ldf.schema
end

#select(exprs) ⇒ `LazyFrame`

Select columns from this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6, 7, 8],
    "ham" => ["a", "b", "c"],
  }
).lazy
df.select("foo").collect
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# │ 2   │
# │ 3   │
# └─────┘

df.select(["foo", "bar"]).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 6   │
# │ 2   ┆ 7   │
# │ 3   ┆ 8   │
# └─────┴─────┘

df.select(Polars.col("foo") + 1).collect
# =>
# shape: (3, 1)
# ┌─────┐
# │ foo │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 2   │
# │ 3   │
# │ 4   │
# └─────┘

df.select([Polars.col("foo") + 1, Polars.col("bar") + 1]).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ foo ┆ bar │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 7   │
# │ 3   ┆ 8   │
# │ 4   ┆ 9   │
# └─────┴─────┘

df.select(Polars.when(Polars.col("foo") > 2).then(10).otherwise(0)).collect
# =>
# shape: (3, 1)
# ┌─────────┐
# │ literal │
# │ ---     │
# │ i64     │
# ╞═════════╡
# │ 0       │
# │ 0       │
# │ 10      │
# └─────────┘

Parameters:

exprs (Object) —
Column or columns to select.

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 858

def select(exprs)
  exprs = Utils.selection_to_rbexpr_list(exprs)
  _from_rbldf(_ldf.select(exprs))
end

#set_sorted(column, *more_columns, descending: false) ⇒ `LazyFrame`

Indicate that one or multiple columns are sorted.

Parameters:

column (Object) —
Column or columns that are sorted.
more_columns (Object) —
Additional columns that are sorted, specified as positional arguments.
descending (Boolean) (defaults to: false) —
Whether the columns are sorted in descending order.

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2487

def set_sorted(
  column,
  *more_columns,
  descending: false
)
  columns = Utils.selection_to_rbexpr_list(column)
  if more_columns.any?
    columns.concat(Utils.selection_to_rbexpr_list(more_columns))
  end
  with_columns(
    columns.map { |e| Utils.wrap_expr(e).set_sorted(descending: descending) }
  )
end

#shift(n, fill_value: nil) ⇒ `LazyFrame`

Shift the values by a given period.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
).lazy
df.shift(1).collect
# =>
# shape: (3, 2)
# ┌──────┬──────┐
# │ a    ┆ b    │
# │ ---  ┆ ---  │
# │ i64  ┆ i64  │
# ╞══════╪══════╡
# │ null ┆ null │
# │ 1    ┆ 2    │
# │ 3    ┆ 4    │
# └──────┴──────┘

df.shift(-1).collect
# =>
# shape: (3, 2)
# ┌──────┬──────┐
# │ a    ┆ b    │
# │ ---  ┆ ---  │
# │ i64  ┆ i64  │
# ╞══════╪══════╡
# │ 3    ┆ 4    │
# │ 5    ┆ 6    │
# │ null ┆ null │
# └──────┴──────┘

Parameters:

n (Integer) —
Number of places to shift (may be negative).
fill_value (Object) (defaults to: nil) —
Fill the resulting null values with this value.
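The effect of n and fill_value on a single column can be sketched in plain Ruby. `shift_values` is a hypothetical helper for illustration, not the Polars implementation.

```ruby
# Illustrative only: shift one column's values by n places.
def shift_values(values, n, fill_value: nil)
  return values.dup if n.zero?

  # The vacated slots are filled with fill_value (nil by default).
  padding = Array.new([n.abs, values.size].min, fill_value)
  if n > 0
    padding + values.first(values.size - padding.size)
  else
    values.drop(padding.size) + padding
  end
end

p shift_values([1, 3, 5], 1)                 # => [nil, 1, 3]
p shift_values([1, 3, 5], -1)                # => [3, 5, nil]
p shift_values([1, 3, 5], 1, fill_value: 0)  # => [0, 1, 3]
```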

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1792

def shift(n, fill_value: nil)
  if !fill_value.nil?
    fill_value = Utils.parse_as_expression(fill_value, str_as_lit: true)
  end
  n = Utils.parse_as_expression(n)
  _from_rbldf(_ldf.shift(n, fill_value))
end

#shift_and_fill(periods, fill_value) ⇒ `LazyFrame`

Shift the values by a given period and fill the resulting null values.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
).lazy
df.shift_and_fill(1, 0).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 0   ┆ 0   │
# │ 1   ┆ 2   │
# │ 3   ┆ 4   │
# └─────┴─────┘

df.shift_and_fill(-1, 0).collect
# =>
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 3   ┆ 4   │
# │ 5   ┆ 6   │
# │ 0   ┆ 0   │
# └─────┴─────┘

Parameters:

periods (Integer) —
Number of places to shift (may be negative).
fill_value (Object) —
Fill nil values with the result of this expression.

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 1842

def shift_and_fill(periods, fill_value)
  shift(periods, fill_value: fill_value)
end

#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: false, row_group_size: nil, data_pagesize_limit: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, no_optimization: false, slice_pushdown: true) ⇒ `DataFrame`

Persists a LazyFrame at the provided path.

This allows streaming results that are larger than RAM to be written to disk.

Examples:

lf = Polars.scan_csv("/path/to/my_larger_than_ram_file.csv")
lf.sink_parquet("out.parquet")

Parameters:

path (String) —
File path to which the file should be written.
compression ("lz4", "uncompressed", "snappy", "gzip", "lzo", "brotli", "zstd") (defaults to: "zstd") —
Choose "zstd" for good compression performance. Choose "lz4" for fast compression/decompression. Choose "snappy" for more backwards compatibility guarantees when you deal with older parquet readers.
compression_level (Integer) (defaults to: nil) —
The level of compression to use. Higher compression means smaller files on disk.
- "gzip" : min-level: 0, max-level: 10.
- "brotli" : min-level: 0, max-level: 11.
- "zstd" : min-level: 1, max-level: 22.
statistics (Boolean) (defaults to: false) —
Write statistics to the parquet headers. This requires extra compute.
row_group_size (Integer) (defaults to: nil) —
Size of the row groups in number of rows. If nil (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds.
data_pagesize_limit (Integer) (defaults to: nil) —
Size limit of individual data pages. If not set, defaults to 1024 * 1024 bytes.
maintain_order (Boolean) (defaults to: true) —
Maintain the order in which data is processed. Setting this to false will be slightly faster.
type_coercion (Boolean) (defaults to: true) —
Do type coercion optimization.
predicate_pushdown (Boolean) (defaults to: true) —
Do predicate pushdown optimization.
projection_pushdown (Boolean) (defaults to: true) —
Do projection pushdown optimization.
simplify_expression (Boolean) (defaults to: true) —
Run simplify expressions optimization.
no_optimization (Boolean) (defaults to: false) —
Turn off (certain) optimizations.
slice_pushdown (Boolean) (defaults to: true) —
Slice pushdown optimization.

Returns:

(DataFrame)

# File 'lib/polars/lazy_frame.rb', line 548

def sink_parquet(
  path,
  compression: "zstd",
  compression_level: nil,
  statistics: false,
  row_group_size: nil,
  data_pagesize_limit: nil,
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  no_optimization: false,
  slice_pushdown: true
)
  if no_optimization
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
  end

  lf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    false,
    true,
    false
  )
  lf.sink_parquet(
    path,
    compression,
    compression_level,
    statistics,
    row_group_size,
    data_pagesize_limit,
    maintain_order
  )
end

#slice(offset, length = nil) ⇒ `LazyFrame`

Get a slice of this DataFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => ["x", "y", "z"],
    "b" => [1, 3, 5],
    "c" => [2, 4, 6]
  }
).lazy
df.slice(1, 2).collect
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ y   ┆ 3   ┆ 4   │
# │ z   ┆ 5   ┆ 6   │
# └─────┴─────┴─────┘

Parameters:

offset (Integer) —
Start index. Negative indexing is supported.
length (Integer) (defaults to: nil) —
Length of the slice. If set to nil, all rows starting at the offset will be selected.

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1875

def slice(offset, length = nil)
  if length && length < 0
    raise ArgumentError, "Negative slice lengths (#{length}) are invalid for LazyFrame"
  end
  _from_rbldf(_ldf.slice(offset, length))
end

#sort(by, reverse: false, nulls_last: false, maintain_order: false) ⇒ `LazyFrame`

Sort the DataFrame.

Sorting can be done by:

A single column name
An expression
Multiple expressions

Examples:

df = Polars::DataFrame.new(
  {
    "foo" => [1, 2, 3],
    "bar" => [6.0, 7.0, 8.0],
    "ham" => ["a", "b", "c"]
  }
).lazy
df.sort("foo", reverse: true).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ foo ┆ bar ┆ ham │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str │
# ╞═════╪═════╪═════╡
# │ 3   ┆ 8.0 ┆ c   │
# │ 2   ┆ 7.0 ┆ b   │
# │ 1   ┆ 6.0 ┆ a   │
# └─────┴─────┴─────┘

Parameters:

by (Object) —
Column (expressions) to sort by.
reverse (Boolean) (defaults to: false) —
Sort in descending order.
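nulls_last (Boolean) (defaults to: false) —
Place null values last. Can only be used if sorted by a single column.
maintain_order (Boolean) (defaults to: false) —
Whether the order should be maintained if elements are equal.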

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 401

def sort(by, reverse: false, nulls_last: false, maintain_order: false)
  if by.is_a?(::String)
    return _from_rbldf(_ldf.sort(by, reverse, nulls_last, maintain_order))
  end
  if Utils.bool?(reverse)
    reverse = [reverse]
  end

  by = Utils.selection_to_rbexpr_list(by)
  _from_rbldf(_ldf.sort_by_exprs(by, reverse, nulls_last, maintain_order))
end

#std(ddof: 1) ⇒ `LazyFrame`

Aggregate the columns in the DataFrame to their standard deviation value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.std.collect
# =>
# shape: (1, 2)
# ┌──────────┬─────┐
# │ a        ┆ b   │
# │ ---      ┆ --- │
# │ f64      ┆ f64 │
# ╞══════════╪═════╡
# │ 1.290994 ┆ 0.5 │
# └──────────┴─────┘

df.std(ddof: 0).collect
# =>
# shape: (1, 2)
# ┌──────────┬──────────┐
# │ a        ┆ b        │
# │ ---      ┆ ---      │
# │ f64      ┆ f64      │
# ╞══════════╪══════════╡
# │ 1.118034 ┆ 0.433013 │
# └──────────┴──────────┘
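The ddof ("delta degrees of freedom") argument changes the divisor from n to n - ddof. A plain-Ruby sketch of that arithmetic, using a hypothetical `std_sketch` helper, reproduces the numbers in both tables above:

```ruby
# Illustrative only: standard deviation with a configurable ddof.
def std_sketch(values, ddof: 1)
  mean = values.sum.to_f / values.size
  sum_sq = values.sum { |v| (v - mean)**2 }
  # The divisor is n - ddof: ddof 1 is the sample estimate, ddof 0 the population one.
  Math.sqrt(sum_sq / (values.size - ddof))
end

p std_sketch([1, 2, 3, 4]).round(6)          # => 1.290994
p std_sketch([1, 2, 3, 4], ddof: 0).round(6) # => 1.118034
```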

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 2068

def std(ddof: 1)
  _from_rbldf(_ldf.std(ddof))
end

#sum ⇒ `LazyFrame`

Aggregate the columns in the DataFrame to their sum value.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.sum.collect
# =>
# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 10  ┆ 5   │
# └─────┴─────┘

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 2160

def sum
  _from_rbldf(_ldf.sum)
end

#tail(n = 5) ⇒ `LazyFrame`

Get the last n rows.

Parameters:

n (Integer) (defaults to: 5) —
Number of rows.

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 1920

def tail(n = 5)
  _from_rbldf(_ldf.tail(n))
end

#take_every(n) ⇒ `LazyFrame`

Take every nth row in the LazyFrame and return as a new LazyFrame.

Examples:

s = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [5, 6, 7, 8]}).lazy
s.take_every(2).collect
# =>
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 5   │
# │ 3   ┆ 7   │
# └─────┴─────┘

Returns:

(LazyFrame)



# File 'lib/polars/lazy_frame.rb', line 1991

def take_every(n)
  select(Utils.col("*").take_every(n))
end

#to_s ⇒ `String`

Returns a string representing the LazyFrame.

Returns:

(String)

# File 'lib/polars/lazy_frame.rb', line 271

def to_s
  <<~EOS
    naive plan: (run LazyFrame#describe_optimized_plan to see the optimized plan)

    #{describe_plan}
  EOS
end

#unique(maintain_order: true, subset: nil, keep: "first") ⇒ `LazyFrame`

Drop duplicate rows from this DataFrame.

Note that this fails if there is a column of type List in the DataFrame or subset.

Parameters:

maintain_order (Boolean) (defaults to: true) —
Keep the same order as the original DataFrame. This requires more work to compute.
subset (Object) (defaults to: nil) —
Subset to use to compare rows.
keep ("first", "last") (defaults to: "first") —
Which of the duplicate rows to keep.
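How subset and keep interact can be sketched in plain Ruby on an array of hashes. `unique_rows` is a hypothetical helper for illustration, not the Polars implementation, and it always preserves the original row order.

```ruby
# Illustrative only: drop duplicate rows, comparing on subset if given.
def unique_rows(rows, subset: nil, keep: "first")
  key = ->(row) { subset ? row.values_at(*subset) : row }
  case keep
  when "first" then rows.uniq(&key)
  # Reversing before and after keeps the last occurrence of each key.
  when "last" then rows.reverse.uniq(&key).reverse
  else raise ArgumentError, "unknown keep: #{keep}"
  end
end

rows = [
  {"a" => 1, "b" => "x"},
  {"a" => 1, "b" => "y"},
  {"a" => 2, "b" => "z"}
]
p unique_rows(rows, subset: ["a"])
# => [{"a"=>1, "b"=>"x"}, {"a"=>2, "b"=>"z"}]
p unique_rows(rows, subset: ["a"], keep: "last")
# => [{"a"=>1, "b"=>"y"}, {"a"=>2, "b"=>"z"}]
```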

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2277

def unique(maintain_order: true, subset: nil, keep: "first")
  if !subset.nil? && !subset.is_a?(::Array)
    subset = [subset]
  end
  _from_rbldf(_ldf.unique(maintain_order, subset, keep))
end

#unnest(names) ⇒ `LazyFrame`

Decompose a struct into its fields.

The fields will be inserted into the DataFrame on the location of the struct type.

Examples:

df = (
  Polars::DataFrame.new(
    {
      "before" => ["foo", "bar"],
      "t_a" => [1, 2],
      "t_b" => ["a", "b"],
      "t_c" => [true, nil],
      "t_d" => [[1, 2], [3]],
      "after" => ["baz", "womp"]
    }
  )
  .lazy
  .select(
    ["before", Polars.struct(Polars.col("^t_.$")).alias("t_struct"), "after"]
  )
)
df.fetch
# =>
# shape: (2, 3)
# ┌────────┬─────────────────────┬───────┐
# │ before ┆ t_struct            ┆ after │
# │ ---    ┆ ---                 ┆ ---   │
# │ str    ┆ struct[4]           ┆ str   │
# ╞════════╪═════════════════════╪═══════╡
# │ foo    ┆ {1,"a",true,[1, 2]} ┆ baz   │
# │ bar    ┆ {2,"b",null,[3]}    ┆ womp  │
# └────────┴─────────────────────┴───────┘

df.unnest("t_struct").fetch
# =>
# shape: (2, 6)
# ┌────────┬─────┬─────┬──────┬───────────┬───────┐
# │ before ┆ t_a ┆ t_b ┆ t_c  ┆ t_d       ┆ after │
# │ ---    ┆ --- ┆ --- ┆ ---  ┆ ---       ┆ ---   │
# │ str    ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str   │
# ╞════════╪═════╪═════╪══════╪═══════════╪═══════╡
# │ foo    ┆ 1   ┆ a   ┆ true ┆ [1, 2]    ┆ baz   │
# │ bar    ┆ 2   ┆ b   ┆ null ┆ [3]       ┆ womp  │
# └────────┴─────┴─────┴──────┴───────────┴───────┘

Parameters:

names (Object) —
Names of the struct columns to decompose into their fields.

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 2466

def unnest(names)
  if names.is_a?(::String)
    names = [names]
  end
  _from_rbldf(_ldf.unnest(names))
end
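
The placement rule (fields replace the struct at its original position) can be sketched in plain Ruby. This assumes a row is modelled as a Hash and a struct as a nested Hash; `unnest_row` is a hypothetical helper, not part of the library.

```ruby
# Hypothetical helper (not part of Polars): replace each named struct column
# (modelled as a nested Hash) with its fields, at the struct's position.
def unnest_row(row, names)
  names = [names] if names.is_a?(String) # same wrapping as the method above
  row.flat_map { |col, value|
    names.include?(col) ? value.to_a : [[col, value]]
  }.to_h
end

row = {
  "before" => "foo",
  "t_struct" => {"t_a" => 1, "t_b" => "a", "t_c" => true, "t_d" => [1, 2]},
  "after" => "baz"
}
p unnest_row(row, "t_struct").keys
# prints ["before", "t_a", "t_b", "t_c", "t_d", "after"]
```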

#var(ddof: 1) ⇒ `LazyFrame`

Aggregate each column in the LazyFrame to its variance.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4], "b" => [1, 2, 1, 1]}).lazy
df.var.collect
# =>
# shape: (1, 2)
# ┌──────────┬──────┐
# │ a        ┆ b    │
# │ ---      ┆ ---  │
# │ f64      ┆ f64  │
# ╞══════════╪══════╡
# │ 1.666667 ┆ 0.25 │
# └──────────┴──────┘

df.var(ddof: 0).collect
# =>
# shape: (1, 2)
# ┌──────┬────────┐
# │ a    ┆ b      │
# │ ---  ┆ ---    │
# │ f64  ┆ f64    │
# ╞══════╪════════╡
# │ 1.25 ┆ 0.1875 │
# └──────┴────────┘

Returns:

(LazyFrame)




# File 'lib/polars/lazy_frame.rb', line 2100

def var(ddof: 1)
  _from_rbldf(_ldf.var(ddof))
end
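
The `ddof` parameter is the "delta degrees of freedom": the sum of squared deviations is divided by `n - ddof`. The plain-Ruby sketch below reproduces the numbers from the examples above; `variance` is a hypothetical helper written for illustration, not the library's implementation.

```ruby
# Hypothetical helper (not part of Polars): variance with divisor (n - ddof).
def variance(values, ddof: 1)
  n = values.length
  raise ArgumentError, "need more than ddof values" unless n > ddof

  mean = values.sum.to_f / n
  values.sum { |v| (v - mean)**2 } / (n - ddof)
end

a = [1, 2, 3, 4]
b = [1, 2, 1, 1]
puts variance(a).round(6)   # 1.666667 (sample variance, ddof: 1)
puts variance(a, ddof: 0)   # 1.25     (population variance)
puts variance(b)            # 0.25
puts variance(b, ddof: 0)   # 0.1875
```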

#width ⇒ `Integer`

Get the width of the LazyFrame.

Examples:

lf = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [4, 5, 6]}).lazy
lf.width
# => 2

Returns:

(Integer)




# File 'lib/polars/lazy_frame.rb', line 252

def width
  _ldf.width
end

#with_column(column) ⇒ `LazyFrame`

Add or overwrite a column in the LazyFrame.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
).lazy
df.with_column((Polars.col("b") ** 2).alias("b_squared")).collect
# =>
# shape: (3, 3)
# ┌─────┬─────┬───────────┐
# │ a   ┆ b   ┆ b_squared │
# │ --- ┆ --- ┆ ---       │
# │ i64 ┆ i64 ┆ f64       │
# ╞═════╪═════╪═══════════╡
# │ 1   ┆ 2   ┆ 4.0       │
# │ 3   ┆ 4   ┆ 16.0      │
# │ 5   ┆ 6   ┆ 36.0      │
# └─────┴─────┴───────────┘

df.with_column(Polars.col("a") ** 2).collect
# =>
# shape: (3, 2)
# ┌──────┬─────┐
# │ a    ┆ b   │
# │ ---  ┆ --- │
# │ f64  ┆ i64 │
# ╞══════╪═════╡
# │ 1.0  ┆ 2   │
# │ 9.0  ┆ 4   │
# │ 25.0 ┆ 6   │
# └──────┴─────┘

Parameters:

column (Object) —
Expression that evaluates to a column, or a Series to use.

Returns:

(LazyFrame)




# File 'lib/polars/lazy_frame.rb', line 1713

def with_column(column)
  with_columns([column])
end

#with_columns(exprs) ⇒ `LazyFrame`

Add or overwrite multiple columns in the LazyFrame.

Examples:

ldf = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3, 4],
    "b" => [0.5, 4, 10, 13],
    "c" => [true, true, false, true]
  }
).lazy
ldf.with_columns(
  [
    (Polars.col("a") ** 2).alias("a^2"),
    (Polars.col("b") / 2).alias("b/2"),
    (Polars.col("c").is_not).alias("not c")
  ]
).collect
# =>
# shape: (4, 6)
# ┌─────┬──────┬───────┬──────┬──────┬───────┐
# │ a   ┆ b    ┆ c     ┆ a^2  ┆ b/2  ┆ not c │
# │ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---  ┆ ---   │
# │ i64 ┆ f64  ┆ bool  ┆ f64  ┆ f64  ┆ bool  │
# ╞═════╪══════╪═══════╪══════╪══════╪═══════╡
# │ 1   ┆ 0.5  ┆ true  ┆ 1.0  ┆ 0.25 ┆ false │
# │ 2   ┆ 4.0  ┆ true  ┆ 4.0  ┆ 2.0  ┆ false │
# │ 3   ┆ 10.0 ┆ false ┆ 9.0  ┆ 5.0  ┆ true  │
# │ 4   ┆ 13.0 ┆ true  ┆ 16.0 ┆ 6.5  ┆ false │
# └─────┴──────┴───────┴──────┴──────┴───────┘

Parameters:

exprs (Object) —
A single expression, or a list of expressions or Series, that evaluate to columns.

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1611

def with_columns(exprs)
  exprs =
    if exprs.nil?
      []
    elsif exprs.is_a?(Expr)
      [exprs]
    else
      exprs.to_a
    end

  rbexprs = []
  exprs.each do |e|
    case e
    when Expr
      rbexprs << e._rbexpr
    when Series
      rbexprs << Utils.lit(e)._rbexpr
    else
      raise ArgumentError, "Expected an expression, got #{e}"
    end
  end

  _from_rbldf(_ldf.with_columns(rbexprs))
end

#with_context(other) ⇒ `LazyFrame`

Add an external context to the computation graph.

This allows expressions to also access columns from other LazyFrames that are not part of this one.

Examples:

df_a = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => ["a", "c", nil]}).lazy
df_other = Polars::DataFrame.new({"c" => ["foo", "ham"]})
(
  df_a.with_context(df_other.lazy).select(
    [Polars.col("b") + Polars.col("c").first]
  )
).collect
# =>
# shape: (3, 1)
# ┌──────┐
# │ b    │
# │ ---  │
# │ str  │
# ╞══════╡
# │ afoo │
# │ cfoo │
# │ null │
# └──────┘

Parameters:

other (Object) —
LazyFrame, or array of LazyFrames, to add as external context.

Returns:

(LazyFrame)

# File 'lib/polars/lazy_frame.rb', line 1665

def with_context(other)
  if !other.is_a?(::Array)
    other = [other]
  end

  _from_rbldf(_ldf.with_context(other.map(&:_ldf)))
end

#with_row_count(name: "row_nr", offset: 0) ⇒ `LazyFrame`

Note:

This can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.

Add a column at index 0 that counts the rows.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 3, 5],
    "b" => [2, 4, 6]
  }
).lazy
df.with_row_count.collect
# =>
# shape: (3, 3)
# ┌────────┬─────┬─────┐
# │ row_nr ┆ a   ┆ b   │
# │ ---    ┆ --- ┆ --- │
# │ u32    ┆ i64 ┆ i64 │
# ╞════════╪═════╪═════╡
# │ 0      ┆ 1   ┆ 2   │
# │ 1      ┆ 3   ┆ 4   │
# │ 2      ┆ 5   ┆ 6   │
# └────────┴─────┴─────┘

Parameters:

name (String) (defaults to: "row_nr") —
Name of the column to add.
offset (Integer) (defaults to: 0) —
Start the row count at this offset.

Returns:

(LazyFrame)




# File 'lib/polars/lazy_frame.rb', line 1970

def with_row_count(name: "row_nr", offset: 0)
  _from_rbldf(_ldf.with_row_count(name, offset))
end
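
The effect of `name` and `offset` can be sketched in plain Ruby, assuming rows are modelled as Hashes. `add_row_count` is a hypothetical helper, not part of the library; it only illustrates that the counter becomes the first column and starts at `offset`.

```ruby
# Hypothetical helper (not part of Polars): prepend a counter column that
# starts at `offset`, mirroring the name:/offset: parameters described above.
def add_row_count(rows, name: "row_nr", offset: 0)
  rows.each_with_index.map { |row, i| {name => i + offset}.merge(row) }
end

rows = [{"a" => 1, "b" => 2}, {"a" => 3, "b" => 4}, {"a" => 5, "b" => 6}]
counted = add_row_count(rows, offset: 10)
p counted.map { |r| r["row_nr"] } # prints [10, 11, 12]
p counted.first.keys              # prints ["row_nr", "a", "b"]
```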

#write_json(file) ⇒ `nil`

Write the logical plan of this LazyFrame to a file or string in JSON format.

Parameters:

file (String) —
File path to which the result should be written.

Returns:

(nil)

# File 'lib/polars/lazy_frame.rb', line 285

def write_json(file)
  if Utils.pathlike?(file)
    file = Utils.normalise_filepath(file)
  end
  _ldf.write_json(file)
  nil
end
