Module: Polars::LazyFunctions

Included in:: Polars

Defined in:: lib/polars/lazy_functions.rb

Instance Method Summary

#all(name = nil) ⇒ Expr
Select all columns, or apply an AND operation to a single column.
#any(name) ⇒ Expr
Evaluate columnwise or elementwise with a bitwise OR operation.
#arg_sort_by(exprs, reverse: false) ⇒ Expr (also: #argsort_by)
Find the indexes that would sort the columns.
#arg_where(condition, eager: false) ⇒ Expr, Series
Return indices where condition evaluates true.
#avg(column) ⇒ Expr, Float
Get the mean value.
#coalesce(exprs, *more_exprs) ⇒ Expr
Folds the expressions from left to right, keeping the first non-null value.
#col(name) ⇒ Expr
Return an expression representing a column in a DataFrame.
#collect_all(lazy_frames, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, allow_streaming: false) ⇒ Array
Collect multiple LazyFrames at the same time.
#concat_list(exprs) ⇒ Expr
Concat the arrays in a Series dtype List in linear time.
#concat_str(exprs, sep: "") ⇒ Expr
Horizontally concat Utf8 Series in linear time.
#count(column = nil) ⇒ Expr, Integer
Count the number of values in this column/context.
#cov(a, b) ⇒ Expr
Compute the covariance between two columns/expressions.
#cumfold(acc, f, exprs, include_init: false) ⇒ Object
Cumulatively accumulate over multiple columns horizontally/row wise with a left fold.
#cumsum(column) ⇒ Object
Cumulatively sum values in a column/Series, or horizontally across list of columns/expressions.
#duration(weeks: nil, days: nil, hours: nil, minutes: nil, seconds: nil, milliseconds: nil, microseconds: nil, nanoseconds: nil, time_unit: "us") ⇒ Expr
Create polars Duration from distinct time components.
#element ⇒ Expr
Alias for an element being evaluated in an eval expression.
#exclude(columns) ⇒ Object
Exclude certain columns from a wildcard/regex selection.
#first(column = nil) ⇒ Object
Get the first value.
#fold(acc, f, exprs) ⇒ Expr
Accumulate over multiple columns horizontally/row wise with a left fold.
#format(fstring, *args) ⇒ Expr
Format expressions as a string.
#from_epoch(column, unit: "s", eager: false) ⇒ Object
Utility function that parses an epoch timestamp (or Unix time) to Polars Date(time).
#groups(column) ⇒ Object
Syntactic sugar for Polars.col("foo").agg_groups.
#head(column, n = 10) ⇒ Object
Get the first n rows.
#int_range(start, stop, step: 1, eager: false, dtype: nil) ⇒ Expr, Series (also: #arange)
Create a range expression (or Series).
#last(column = nil) ⇒ Object
Get the last value.
#lit(value, dtype: nil, allow_object: nil) ⇒ Expr
Return an expression representing a literal value.
#max(column) ⇒ Expr, Object
Get the maximum value.
#mean(column) ⇒ Expr, Float
Get the mean value.
#median(column) ⇒ Object
Get the median value.
#min(column) ⇒ Expr, Object
Get the minimum value.
#n_unique(column) ⇒ Object
Count unique values.
#pearson_corr(a, b, ddof: 1) ⇒ Expr
Compute the Pearson correlation between two columns.
#quantile(column, quantile, interpolation: "nearest") ⇒ Expr
Syntactic sugar for Polars.col("foo").quantile(...).
#repeat(value, n, dtype: nil, eager: false, name: nil) ⇒ Expr
Repeat a single value n times.
#select(exprs) ⇒ DataFrame
Run polars expressions without a context.
#spearman_rank_corr(a, b, ddof: 1, propagate_nans: false) ⇒ Expr
Compute the Spearman rank correlation between two columns.
#std(column, ddof: 1) ⇒ Object
Get the standard deviation.
#struct(exprs, eager: false) ⇒ Object
Collect several columns into a Series of dtype Struct.
#sum(column) ⇒ Object
Sum values in a column/Series, or horizontally across list of columns/expressions.
#tail(column, n = 10) ⇒ Object
Get the last n rows.
#to_list(name) ⇒ Expr
Aggregate to list.
#var(column, ddof: 1) ⇒ Object
Get the variance.
#when(expr) ⇒ When
Start a "when, then, otherwise" expression.

Instance Method Details

#all(name = nil) ⇒ `Expr`

Do one of two things, depending on whether a name is given:

- With a column name, apply a columnwise or elementwise AND operation to that column.
- With no argument, act as a wildcard column selection.

Examples:

Sum all columns

df = Polars::DataFrame.new(
  {"a" => [1, 2, 3], "b" => ["hello", "foo", "bar"], "c" => [1, 1, 1]}
)
df.select(Polars.all.sum)
# =>
# shape: (1, 3)
# ┌─────┬──────┬─────┐
# │ a   ┆ b    ┆ c   │
# │ --- ┆ ---  ┆ --- │
# │ i64 ┆ str  ┆ i64 │
# ╞═════╪══════╪═════╡
# │ 6   ┆ null ┆ 3   │
# └─────┴──────┴─────┘

Parameters:

name (Object) (defaults to: nil) —
If given, this function applies a bitwise & on the columns.

Returns:

(Expr)

# File 'lib/polars/lazy_functions.rb', line 576

def all(name = nil)
  if name.nil?
    col("*")
  elsif Utils.strlike?(name)
    col(name).all
  else
    raise Todo
  end
end

#any(name) ⇒ `Expr`

Evaluate columnwise or elementwise with a bitwise OR operation.

Returns:

(Expr)

# File 'lib/polars/lazy_functions.rb', line 481

def any(name)
  if Utils.strlike?(name)
    col(name).any
  else
    fold(lit(false), ->(a, b) { a.cast(:bool) | b.cast(:bool) }, name).alias("any")
  end
end

#arg_sort_by(exprs, reverse: false) ⇒ `Expr` Also known as: argsort_by

Find the indexes that would sort the columns.

Argsort by multiple columns. The first column will be used for the ordering. If there are duplicates in the first column, the second column will be used to determine the ordering and so on.

Parameters:

exprs (Object) —
Columns used to determine the ordering.
reverse (Boolean) (defaults to: false) —
Default is ascending.

Returns:

(Expr)

# File 'lib/polars/lazy_functions.rb', line 662

def arg_sort_by(exprs, reverse: false)
  if !exprs.is_a?(::Array)
    exprs = [exprs]
  end
  if reverse == true || reverse == false
    reverse = [reverse] * exprs.length
  end
  exprs = Utils.selection_to_rbexpr_list(exprs)
  Utils.wrap_expr(RbExpr.arg_sort_by(exprs, reverse))
end
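
The tie-breaking rule described above can be modelled in plain Ruby, without Polars: sort the row indexes by an array of keys, so that later keys only matter where earlier ones are equal. The arrays `a` and `b` are made-up data, and this is only a sketch of the ordering semantics, not of how the library computes the result.

```ruby
# Illustrative data; each array stands in for one column.
a = [2, 1, 2, 1]
b = [0, 5, -1, 3]

# Row indexes ordered by `a` first, then by `b` where `a` has duplicates.
order = (0...a.length).sort_by { |i| [a[i], b[i]] }

p order  # => [3, 1, 2, 0]
```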

#arg_where(condition, eager: false) ⇒ `Expr`, `Series`

Return indices where condition evaluates true.

Examples:

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4, 5]})
df.select(
  [
    Polars.arg_where(Polars.col("a") % 2 == 0)
  ]
).to_series
# =>
# shape: (2,)
# Series: 'a' [u32]
# [
#         1
#         3
# ]

Parameters:

condition (Expr) —
Boolean expression to evaluate
eager (Boolean) (defaults to: false) —
Whether to apply this function eagerly (as opposed to lazily).

Returns:

(Expr, Series)

# File 'lib/polars/lazy_functions.rb', line 1048

def arg_where(condition, eager: false)
  if eager
    if !condition.is_a?(Series)
      raise ArgumentError, "expected 'Series' in 'arg_where' if 'eager=True', got #{condition.class.name}"
    end
    condition.to_frame.select(arg_where(Polars.col(condition.name))).to_series
  else
    condition = Utils.expr_to_lit_or_expr(condition, str_to_lit: true)
    Utils.wrap_expr(_arg_where(condition._rbexpr))
  end
end
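
For intuition, the result in the example above corresponds to the following plain-Ruby selection of indexes. This is a model on an ordinary array, assumed only for illustration; it does not use Polars.

```ruby
# Same values as column "a" in the example above.
a = [1, 2, 3, 4, 5]

# Keep the positions whose value satisfies the condition (even numbers here).
indices = a.each_index.select { |i| a[i] % 2 == 0 }

p indices  # => [1, 3]
```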

#avg(column) ⇒ `Expr`, `Float`

Get the mean value.

Returns:

(Expr, Float)



# File 'lib/polars/lazy_functions.rb', line 165

def avg(column)
  mean(column)
end

#coalesce(exprs, *more_exprs) ⇒ `Expr`

Folds the expressions from left to right, keeping the first non-null value.

Examples:

df = Polars::DataFrame.new(
  [
    [nil, 1.0, 1.0],
    [nil, 2.0, 2.0],
    [nil, nil, 3.0],
    [nil, nil, nil]
  ],
  columns: [["a", :f64], ["b", :f64], ["c", :f64]]
)
df.with_column(Polars.coalesce(["a", "b", "c", 99.9]).alias("d"))
# =>
# shape: (4, 4)
# ┌──────┬──────┬──────┬──────┐
# │ a    ┆ b    ┆ c    ┆ d    │
# │ ---  ┆ ---  ┆ ---  ┆ ---  │
# │ f64  ┆ f64  ┆ f64  ┆ f64  │
# ╞══════╪══════╪══════╪══════╡
# │ null ┆ 1.0  ┆ 1.0  ┆ 1.0  │
# │ null ┆ 2.0  ┆ 2.0  ┆ 2.0  │
# │ null ┆ null ┆ 3.0  ┆ 3.0  │
# │ null ┆ null ┆ null ┆ 99.9 │
# └──────┴──────┴──────┴──────┘

Parameters:

exprs (Object) —
Expressions to coalesce.

Returns:

(Expr)

# File 'lib/polars/lazy_functions.rb', line 1090

def coalesce(exprs, *more_exprs)
  exprs = Utils.selection_to_rbexpr_list(exprs)
  if more_exprs.any?
    exprs.concat(Utils.selection_to_rbexpr_list(more_exprs))
  end
  Utils.wrap_expr(_coalesce_exprs(exprs))
end
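
The "first non-null value, left to right" rule can be sketched row by row in plain Ruby. The rows and the 99.9 fallback are copied from the example above; the code is a model of the semantics on ordinary arrays, not the library's implementation.

```ruby
# Rows of (a, b, c) from the example above.
rows = [
  [nil, 1.0, 1.0],
  [nil, 2.0, 2.0],
  [nil, nil, 3.0],
  [nil, nil, nil]
]
fallback = 99.9

# For each row, scan left to right and keep the first non-nil value,
# treating the literal fallback as the last candidate.
d = rows.map { |row| (row + [fallback]).find { |value| !value.nil? } }

p d  # => [1.0, 2.0, 3.0, 99.9]
```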

#col(name) ⇒ `Expr`

Return an expression representing a column in a DataFrame.

Returns:

(Expr)

# File 'lib/polars/lazy_functions.rb', line 6

def col(name)
  if name.is_a?(Series)
    name = name.to_a
  end

  if name.is_a?(Class) && name < DataType
    name = [name]
  end

  if name.is_a?(DataType)
    Utils.wrap_expr(_dtype_cols([name]))
  elsif name.is_a?(::Array)
    if name.length == 0 || Utils.strlike?(name[0])
      name = name.map { |v| v.is_a?(Symbol) ? v.to_s : v }
      Utils.wrap_expr(RbExpr.cols(name))
    elsif Utils.is_polars_dtype(name[0])
      Utils.wrap_expr(_dtype_cols(name))
    else
      raise ArgumentError, "Expected list values to be all `str` or all `DataType`"
    end
  else
    name = name.to_s if name.is_a?(Symbol)
    Utils.wrap_expr(RbExpr.col(name))
  end
end

#collect_all(lazy_frames, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, allow_streaming: false) ⇒ `Array`

Collect multiple LazyFrames at the same time.

This runs all the computation graphs in parallel on the Polars thread pool.

Parameters:

lazy_frames (Array) —
A list of LazyFrames to collect.
type_coercion (Boolean) (defaults to: true) —
Do type coercion optimization.
predicate_pushdown (Boolean) (defaults to: true) —
Do predicate pushdown optimization.
projection_pushdown (Boolean) (defaults to: true) —
Do projection pushdown optimization.
simplify_expression (Boolean) (defaults to: true) —
Run simplify expressions optimization.
string_cache (Boolean) (defaults to: false) —
This argument is deprecated and will be ignored
no_optimization (Boolean) (defaults to: false) —
Turn off optimizations.
slice_pushdown (Boolean) (defaults to: true) —
Slice pushdown optimization.
common_subplan_elimination (Boolean) (defaults to: true) —
Will try to cache branching subplans that occur on self-joins or unions.
allow_streaming (Boolean) (defaults to: false) —
Run parts of the query in a streaming fashion (this is in an alpha state)

Returns:

(Array)

# File 'lib/polars/lazy_functions.rb', line 889

def collect_all(
  lazy_frames,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  string_cache: false,
  no_optimization: false,
  slice_pushdown: true,
  common_subplan_elimination: true,
  allow_streaming: false
)
  if no_optimization
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
    common_subplan_elimination = false
  end

  prepared = []

  lazy_frames.each do |lf|
    ldf = lf._ldf.optimization_toggle(
      type_coercion,
      predicate_pushdown,
      projection_pushdown,
      simplify_expression,
      slice_pushdown,
      common_subplan_elimination,
      allow_streaming,
      false
    )
    prepared << ldf
  end

  out = _collect_all(prepared)

  # wrap the rbdataframes into dataframe
  result = out.map { |rbdf| Utils.wrap_df(rbdf) }

  result
end
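
In the listing above, `no_optimization: true` switches off four of the toggles but leaves `type_coercion` and `simplify_expression` at whatever value was passed. The sketch below reproduces just that flag handling in plain Ruby; `effective_toggles` is a hypothetical helper written for this illustration and is not part of the library.

```ruby
# Hypothetical helper mirroring the flag handling in collect_all.
def effective_toggles(no_optimization: false, **overrides)
  toggles = {
    type_coercion: true,
    predicate_pushdown: true,
    projection_pushdown: true,
    simplify_expression: true,
    slice_pushdown: true,
    common_subplan_elimination: true
  }.merge(overrides)

  if no_optimization
    # Only these four are forced off; the other two keep their values.
    %i[predicate_pushdown projection_pushdown slice_pushdown common_subplan_elimination].each do |key|
      toggles[key] = false
    end
  end

  toggles
end

toggles = effective_toggles(no_optimization: true)
p toggles
```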

#concat_list(exprs) ⇒ `Expr`

Concat the arrays in a Series dtype List in linear time.

Returns:

(Expr)

# File 'lib/polars/lazy_functions.rb', line 858

def concat_list(exprs)
  exprs = Utils.selection_to_rbexpr_list(exprs)
  Utils.wrap_expr(RbExpr.concat_lst(exprs))
end

#concat_str(exprs, sep: "") ⇒ `Expr`

Horizontally concat Utf8 Series in linear time. Non-Utf8 columns are cast to Utf8.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2, 3],
    "b" => ["dogs", "cats", nil],
    "c" => ["play", "swim", "walk"]
  }
)
df.with_columns(
  [
    Polars.concat_str(
      [
        Polars.col("a") * 2,
        Polars.col("b"),
        Polars.col("c")
      ],
      sep: " "
    ).alias("full_sentence")
  ]
)
# =>
# shape: (3, 4)
# ┌─────┬──────┬──────┬───────────────┐
# │ a   ┆ b    ┆ c    ┆ full_sentence │
# │ --- ┆ ---  ┆ ---  ┆ ---           │
# │ i64 ┆ str  ┆ str  ┆ str           │
# ╞═════╪══════╪══════╪═══════════════╡
# │ 1   ┆ dogs ┆ play ┆ 2 dogs play   │
# │ 2   ┆ cats ┆ swim ┆ 4 cats swim   │
# │ 3   ┆ null ┆ walk ┆ null          │
# └─────┴──────┴──────┴───────────────┘

Parameters:

exprs (Object) —
Columns to concat into a Utf8 Series.
sep (String) (defaults to: "") —
String value that will be used to separate the values.

Returns:

(Expr)

# File 'lib/polars/lazy_functions.rb', line 797

def concat_str(exprs, sep: "")
  exprs = Utils.selection_to_rbexpr_list(exprs)
  return Utils.wrap_expr(RbExpr.concat_str(exprs, sep))
end
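
The example output above shows that a null in any input makes the whole concatenated value null. A plain-Ruby model of that row-wise behaviour, using the same data, might look like the sketch below; it works on ordinary arrays and only mirrors what the example output shows.

```ruby
# Columns from the example above.
a = [1, 2, 3]
b = ["dogs", "cats", nil]
c = ["play", "swim", "walk"]
sep = " "

# Build each row's parts; a nil part makes the whole result nil.
full_sentence = a.zip(b, c).map do |x, y, z|
  parts = [x * 2, y, z]
  parts.any?(&:nil?) ? nil : parts.map(&:to_s).join(sep)
end

p full_sentence  # => ["2 dogs play", "4 cats swim", nil]
```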

#count(column = nil) ⇒ `Expr`, `Integer`

Count the number of values in this column/context.

Parameters:

column (String, Series, nil) (defaults to: nil) —
If dtype is:
- Series : count the values in the series.
- String : count the values in this column.
- nil : count the number of values in this context.

Returns:

(Expr, Integer)

# File 'lib/polars/lazy_functions.rb', line 66

def count(column = nil)
  if column.nil?
    return Utils.wrap_expr(RbExpr.count)
  end

  if column.is_a?(Series)
    column.len
  else
    col(column).count
  end
end

#cov(a, b) ⇒ `Expr`

Compute the covariance between two columns/expressions.

Parameters:

a (Object) —
Column name or Expression.
b (Object) —
Column name or Expression.

Returns:

(Expr)

# File 'lib/polars/lazy_functions.rb', line 413

def cov(a, b)
  if Utils.strlike?(a)
    a = col(a)
  end
  if Utils.strlike?(b)
    b = col(b)
  end
  Utils.wrap_expr(RbExpr.cov(a._rbexpr, b._rbexpr))
end

#cumfold(acc, f, exprs, include_init: false) ⇒ `Object`

Note:

If you simply want the first encountered expression as accumulator, consider using cumreduce.

Cumulatively accumulate over multiple columns horizontally/row wise with a left fold.

Every cumulative result is added as a separate field in a Struct column.

Parameters:

acc (Object) —
Accumulator Expression. This is the value that will be initialized when the fold starts. For a sum this could for instance be lit(0).
f (Object) —
Function to apply over the accumulator and the value. Fn(acc, value) -> new_value
exprs (Object) —
Expressions to aggregate over. May also be a wildcard expression.
include_init (Boolean) (defaults to: false) —
Include the initial accumulator state as struct field.

Returns:

(Object)

# File 'lib/polars/lazy_functions.rb', line 465

def cumfold(acc, f, exprs, include_init: false)
  acc = Utils.expr_to_lit_or_expr(acc, str_to_lit: true)
  if exprs.is_a?(Expr)
    exprs = [exprs]
  end

  exprs = Utils.selection_to_rbexpr_list(exprs)
  Utils.wrap_expr(RbExpr.cumfold(acc._rbexpr, f, exprs, include_init))
end

#cumsum(column) ⇒ `Object`

Cumulatively sum values in a column/Series, or horizontally across list of columns/expressions.

Examples:

df = Polars::DataFrame.new(
  {
    "a" => [1, 2],
    "b" => [3, 4],
    "c" => [5, 6]
  }
)
# =>
# shape: (2, 3)
# ┌─────┬─────┬─────┐
# │ a   ┆ b   ┆ c   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 3   ┆ 5   │
# │ 2   ┆ 4   ┆ 6   │
# └─────┴─────┴─────┘

Cumulatively sum a column by name:

df.select(Polars.cumsum("a"))
# =>
# shape: (2, 1)
# ┌─────┐
# │ a   │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# │ 3   │
# └─────┘

Cumulatively sum a list of columns/expressions horizontally:

df.with_column(Polars.cumsum(["a", "c"]))
# =>
# shape: (2, 4)
# ┌─────┬─────┬─────┬───────────┐
# │ a   ┆ b   ┆ c   ┆ cumsum    │
# │ --- ┆ --- ┆ --- ┆ ---       │
# │ i64 ┆ i64 ┆ i64 ┆ struct[2] │
# ╞═════╪═════╪═════╪═══════════╡
# │ 1   ┆ 3   ┆ 5   ┆ {1,6}     │
# │ 2   ┆ 4   ┆ 6   ┆ {2,8}     │
# └─────┴─────┴─────┴───────────┘

Parameters:

column (Object) —
Column(s) to be used in aggregation.

Returns:

(Object)

# File 'lib/polars/lazy_functions.rb', line 349

def cumsum(column)
  if column.is_a?(Series)
    column.cumsum
  elsif Utils.strlike?(column)
    col(column).cumsum
  else
    cumfold(lit(0).cast(:u32), ->(a, b) { a + b }, column).alias("cumsum")
  end
end
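
The horizontal case delegates to cumfold with a zero accumulator and `a + b` as the function, keeping every intermediate result as a struct field. The plain-Ruby sketch below models that for columns "a" and "c" of the example; it uses ordinary arrays and is an illustration of the semantics only.

```ruby
# Columns "a" and "c" from the example above.
columns = { "a" => [1, 2], "c" => [5, 6] }

# Running accumulator, one slot per row, starting at zero.
running = [0, 0]

# After each column, record the accumulator state (the initial state is not kept).
steps = columns.values.map do |column|
  running = running.zip(column).map { |acc, value| acc + value }
end

# One entry per row, holding that row's cumulative results.
structs = steps.transpose

p structs  # => [[1, 6], [2, 8]]
```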

#duration(weeks: nil, days: nil, hours: nil, minutes: nil, seconds: nil, milliseconds: nil, microseconds: nil, nanoseconds: nil, time_unit: "us") ⇒ `Expr`

Create polars Duration from distinct time components.

Examples:

df = Polars::DataFrame.new(
  {
    "datetime" => [DateTime.new(2022, 1, 1), DateTime.new(2022, 1, 2)],
    "add" => [1, 2]
  }
)
df.select(
  [
    (Polars.col("datetime") + Polars.duration(weeks: "add")).alias("add_weeks"),
    (Polars.col("datetime") + Polars.duration(days: "add")).alias("add_days"),
    (Polars.col("datetime") + Polars.duration(seconds: "add")).alias("add_seconds"),
    (Polars.col("datetime") + Polars.duration(milliseconds: "add")).alias(
      "add_milliseconds"
    ),
    (Polars.col("datetime") + Polars.duration(hours: "add")).alias("add_hours")
  ]
)
# =>
# shape: (2, 5)
# ┌─────────────────────┬─────────────────────┬─────────────────────┬─────────────────────────┬─────────────────────┐
# │ add_weeks           ┆ add_days            ┆ add_seconds         ┆ add_milliseconds        ┆ add_hours           │
# │ ---                 ┆ ---                 ┆ ---                 ┆ ---                     ┆ ---                 │
# │ datetime[ns]        ┆ datetime[ns]        ┆ datetime[ns]        ┆ datetime[ns]            ┆ datetime[ns]        │
# ╞═════════════════════╪═════════════════════╪═════════════════════╪═════════════════════════╪═════════════════════╡
# │ 2022-01-08 00:00:00 ┆ 2022-01-02 00:00:00 ┆ 2022-01-01 00:00:01 ┆ 2022-01-01 00:00:00.001 ┆ 2022-01-01 01:00:00 │
# │ 2022-01-16 00:00:00 ┆ 2022-01-04 00:00:00 ┆ 2022-01-02 00:00:02 ┆ 2022-01-02 00:00:00.002 ┆ 2022-01-02 02:00:00 │
# └─────────────────────┴─────────────────────┴─────────────────────┴─────────────────────────┴─────────────────────┘

Returns:

(Expr)

# File 'lib/polars/lazy_functions.rb', line 706

def duration(
  weeks: nil,
  days: nil,
  hours: nil,
  minutes: nil,
  seconds: nil,
  milliseconds: nil,
  microseconds: nil,
  nanoseconds: nil,
  time_unit: "us"
)
  if !weeks.nil?
    weeks = Utils.expr_to_lit_or_expr(weeks, str_to_lit: false)._rbexpr
  end
  if !days.nil?
    days = Utils.expr_to_lit_or_expr(days, str_to_lit: false)._rbexpr
  end
  if !hours.nil?
    hours = Utils.expr_to_lit_or_expr(hours, str_to_lit: false)._rbexpr
  end
  if !minutes.nil?
    minutes = Utils.expr_to_lit_or_expr(minutes, str_to_lit: false)._rbexpr
  end
  if !seconds.nil?
    seconds = Utils.expr_to_lit_or_expr(seconds, str_to_lit: false)._rbexpr
  end
  if !milliseconds.nil?
    milliseconds = Utils.expr_to_lit_or_expr(milliseconds, str_to_lit: false)._rbexpr
  end
  if !microseconds.nil?
    microseconds = Utils.expr_to_lit_or_expr(microseconds, str_to_lit: false)._rbexpr
  end
  if !nanoseconds.nil?
    nanoseconds = Utils.expr_to_lit_or_expr(nanoseconds, str_to_lit: false)._rbexpr
  end

  Utils.wrap_expr(
    _rb_duration(
      weeks,
      days,
      hours,
      minutes,
      seconds,
      milliseconds,
      microseconds,
      nanoseconds,
      time_unit
    )
  )
end

#element ⇒ `Expr`

Alias for an element being evaluated in an eval expression.

Examples:

A horizontal rank computation by taking the elements of a list

df = Polars::DataFrame.new({"a" => [1, 8, 3], "b" => [4, 5, 2]})
df.with_column(
  Polars.concat_list(["a", "b"]).list.eval(Polars.element.rank).alias("rank")
)
# =>
# shape: (3, 3)
# ┌─────┬─────┬────────────┐
# │ a   ┆ b   ┆ rank       │
# │ --- ┆ --- ┆ ---        │
# │ i64 ┆ i64 ┆ list[f64]  │
# ╞═════╪═════╪════════════╡
# │ 1   ┆ 4   ┆ [1.0, 2.0] │
# │ 8   ┆ 5   ┆ [2.0, 1.0] │
# │ 3   ┆ 2   ┆ [2.0, 1.0] │
# └─────┴─────┴────────────┘

Returns:

(Expr)



# File 'lib/polars/lazy_functions.rb', line 52

def element
  col("")
end

#exclude(columns) ⇒ `Object`

Exclude certain columns from a wildcard/regex selection.

Examples:

df = Polars::DataFrame.new(
  {
    "aa" => [1, 2, 3],
    "ba" => ["a", "b", nil],
    "cc" => [nil, 2.5, 1.5]
  }
)
# =>
# shape: (3, 3)
# ┌─────┬──────┬──────┐
# │ aa  ┆ ba   ┆ cc   │
# │ --- ┆ ---  ┆ ---  │
# │ i64 ┆ str  ┆ f64  │
# ╞═════╪══════╪══════╡
# │ 1   ┆ a    ┆ null │
# │ 2   ┆ b    ┆ 2.5  │
# │ 3   ┆ null ┆ 1.5  │
# └─────┴──────┴──────┘

Exclude by column name(s):

df.select(Polars.exclude("ba"))
# =>
# shape: (3, 2)
# ┌─────┬──────┐
# │ aa  ┆ cc   │
# │ --- ┆ ---  │
# │ i64 ┆ f64  │
# ╞═════╪══════╡
# │ 1   ┆ null │
# │ 2   ┆ 2.5  │
# │ 3   ┆ 1.5  │
# └─────┴──────┘

Exclude by regex, e.g. removing all columns whose names end with the letter "a":

df.select(Polars.exclude("^.*a$"))
# =>
# shape: (3, 1)
# ┌──────┐
# │ cc   │
# │ ---  │
# │ f64  │
# ╞══════╡
# │ null │
# │ 2.5  │
# │ 1.5  │
# └──────┘

Parameters:

columns (Object) —
Column(s) to exclude from selection. This can be:
- a column name, or multiple column names
- a regular expression starting with ^ and ending with $
- a dtype or multiple dtypes

Returns:

(Object)



# File 'lib/polars/lazy_functions.rb', line 548

def exclude(columns)
  col("*").exclude(columns)
end

#first(column = nil) ⇒ `Object`

Get the first value.

Returns:

(Object)

# File 'lib/polars/lazy_functions.rb', line 194

def first(column = nil)
  if column.nil?
    return Utils.wrap_expr(RbExpr.first)
  end

  if column.is_a?(Series)
    if column.len > 0
      column[0]
    else
      raise IndexError, "The series is empty, so no first value can be returned."
    end
  else
    col(column).first
  end
end

#fold(acc, f, exprs) ⇒ `Expr`

Accumulate over multiple columns horizontally/row wise with a left fold.

Returns:

(Expr)

# File 'lib/polars/lazy_functions.rb', line 432

def fold(acc, f, exprs)
  acc = Utils.expr_to_lit_or_expr(acc, str_to_lit: true)
  if exprs.is_a?(Expr)
    exprs = [exprs]
  end

  exprs = Utils.selection_to_rbexpr_list(exprs)
  Utils.wrap_expr(RbExpr.fold(acc._rbexpr, f, exprs))
end
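
A left fold over columns starts from the accumulator and applies `f` to it and each column in turn, row by row. The sketch below models this in plain Ruby with made-up data and an addition lambda; the names `f`, `columns` and `acc` are chosen here to echo the signature and are assumptions of the illustration.

```ruby
# Function applied to (accumulator, value); addition gives a horizontal sum.
f = ->(acc, value) { acc + value }

# Illustrative columns, each with three rows.
columns = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]

# Initial accumulator, one slot per row.
acc = [0, 0, 0]

# Fold left to right, combining the state with each column element-wise.
result = columns.reduce(acc) do |state, column|
  state.zip(column).map { |s, v| f.call(s, v) }
end

p result  # => [111, 222, 333]
```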

#format(fstring, *args) ⇒ `Expr`

Format expressions as a string.

Examples:

df = Polars::DataFrame.new(
  {
    "a": ["a", "b", "c"],
    "b": [1, 2, 3]
  }
)
df.select(
  [
    Polars.format("foo_{}_bar_{}", Polars.col("a"), "b").alias("fmt")
  ]
)
# =>
# shape: (3, 1)
# ┌─────────────┐
# │ fmt         │
# │ ---         │
# │ str         │
# ╞═════════════╡
# │ foo_a_bar_1 │
# │ foo_b_bar_2 │
# │ foo_c_bar_3 │
# └─────────────┘

Parameters:

fstring (String) —
A string with placeholders. For example: "hello_{}" or "{}_world".
args (Object) —
Expression(s) that fill the placeholders

Returns:

(Expr)

# File 'lib/polars/lazy_functions.rb', line 835

def format(fstring, *args)
  if fstring.scan("{}").length != args.length
    raise ArgumentError, "number of placeholders should equal the number of arguments"
  end

  exprs = []

  arguments = args.each
  fstring.split(/(\{\})/).each do |s|
    if s == "{}"
      e = Utils.expr_to_lit_or_expr(arguments.next, str_to_lit: false)
      exprs << e
    elsif s.length > 0
      exprs << lit(s)
    end
  end

  concat_str(exprs, sep: "")
end
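
The placeholder handling in the listing relies on two plain Ruby string calls: `scan` to count the placeholders and `split` with a capture group to keep them in the resulting pieces. The sketch below shows both on the example format string and renders a single made-up row (`row` and `keys` are illustrative stand-ins for the column expressions).

```ruby
fstring = "foo_{}_bar_{}"
row = { "a" => "a", "b" => 1 }  # one illustrative row
keys = ["a", "b"]               # stand-ins for the expressions in *args

# Same arity check as in the listing.
raise ArgumentError, "placeholder count mismatch" if fstring.scan("{}").length != keys.length

# The capture group makes split keep each "{}" as its own piece.
pieces = fstring.split(/(\{\})/)

# Replace each placeholder with the next argument's value, in order.
arguments = keys.each
rendered = pieces.map { |s| s == "{}" ? row[arguments.next].to_s : s }.join

p pieces    # => ["foo_", "{}", "_bar_", "{}"]
p rendered  # => "foo_a_bar_1"
```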

#from_epoch(column, unit: "s", eager: false) ⇒ `Object`

Utility function that parses an epoch timestamp (or Unix time) to Polars Date(time).

Depending on the unit provided, this function will return a different dtype:

- unit: "d" returns pl.Date
- unit: "s" returns pl.Datetime["us"]
- unit: "ms" returns pl.Datetime["ms"]
- unit: "us" returns pl.Datetime["us"]
- unit: "ns" returns pl.Datetime["ns"]

Examples:

df = Polars::DataFrame.new({"timestamp" => [1666683077, 1666683099]}).lazy
df.select(Polars.from_epoch(Polars.col("timestamp"), unit: "s")).collect
# =>
# shape: (2, 1)
# ┌─────────────────────┐
# │ timestamp           │
# │ ---                 │
# │ datetime[μs]        │
# ╞═════════════════════╡
# │ 2022-10-25 07:31:17 │
# │ 2022-10-25 07:31:39 │
# └─────────────────────┘

Parameters:

column (Object) —
Series or expression to parse integers to pl.Datetime.
unit (String) (defaults to: "s") —
The unit of the time steps since the epoch.
eager (Boolean) (defaults to: false) —
If eager evaluation is true, a Series is returned instead of an Expr.

Returns:

(Object)

# File 'lib/polars/lazy_functions.rb', line 1129

def from_epoch(column, unit: "s", eager: false)
  if Utils.strlike?(column)
    column = col(column)
  elsif !column.is_a?(Series) && !column.is_a?(Expr)
    column = Series.new(column)
  end

  if unit == "d"
    expr = column.cast(Date)
  elsif unit == "s"
    expr = (column.cast(Int64) * 1_000_000).cast(Datetime.new("us"))
  elsif Utils::DTYPE_TEMPORAL_UNITS.include?(unit)
    expr = column.cast(Datetime.new(unit))
  else
    raise ArgumentError, "'unit' must be one of {{'ns', 'us', 'ms', 's', 'd'}}, got '#{unit}'."
  end

  if eager
    if !column.is_a?(Series)
      raise ArgumentError, "expected Series or Array if eager: true, got #{column.class.name}"
    else
      column.to_frame.select(expr).to_series
    end
  else
    expr
  end
end
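
For `unit: "s"`, the listing multiplies the integer seconds by 1_000_000 before casting to a microsecond datetime. The same arithmetic can be checked in plain Ruby with `Time.at`, using the timestamps from the example; this is a sketch of the unit conversion only and does not involve Polars.

```ruby
# Epoch seconds from the example above.
seconds = [1666683077, 1666683099]

# Scale to microseconds, as the "s" branch does.
micros = seconds.map { |s| s * 1_000_000 }

# Rebuild UTC times from whole seconds plus leftover microseconds.
times = micros.map { |us| Time.at(us / 1_000_000, us % 1_000_000).utc }
formatted = times.map { |t| t.strftime("%Y-%m-%d %H:%M:%S") }

p formatted  # => ["2022-10-25 07:31:17", "2022-10-25 07:31:39"]
```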

#groups(column) ⇒ `Object`

Syntactic sugar for Polars.col("foo").agg_groups.

Returns:

(Object)



# File 'lib/polars/lazy_functions.rb', line 589

def groups(column)
  col(column).agg_groups
end

#head(column, n = 10) ⇒ `Object`

Get the first n rows.

Parameters:

column (Object) —
Column name or Series.
n (Integer) (defaults to: 10) —
Number of rows to return.

Returns:

(Object)

# File 'lib/polars/lazy_functions.rb', line 242

def head(column, n = 10)
  if column.is_a?(Series)
    column.head(n)
  else
    col(column).head(n)
  end
end

#int_range(start, stop, step: 1, eager: false, dtype: nil) ⇒ `Expr`, `Series` Also known as: arange

Create a range expression (or Series).

This can be used in a select, with_column, etc. Be sure that the resulting range size is equal to the length of the DataFrame you are collecting.

Examples:

Polars.arange(0, 3, eager: true)
# =>
# shape: (3,)
# Series: 'arange' [i64]
# [
#         0
#         1
#         2
# ]

Parameters:

start (Integer, Expr, Series) —
Lower bound of range.
stop (Integer, Expr, Series) —
Upper bound of range.
step (Integer) (defaults to: 1) —
Step size of the range.
eager (Boolean) (defaults to: false) —
If eager evaluation is true, a Series is returned instead of an Expr.
dtype (Symbol) (defaults to: nil) —
Apply an explicit integer dtype to the resulting expression (default is Int64).

Returns:

(Expr, Series)

# File 'lib/polars/lazy_functions.rb', line 635

def int_range(start, stop, step: 1, eager: false, dtype: nil)
  start = Utils.parse_as_expression(start)
  stop = Utils.parse_as_expression(stop)
  dtype ||= Int64
  dtype = dtype.to_s if dtype.is_a?(Symbol)
  result = Utils.wrap_expr(RbExpr.int_range(start, stop, step, dtype)).alias("arange")

  if eager
    return select(result).to_series
  end

  result
end

#last(column = nil) ⇒ `Object`

Get the last value.

Depending on the input type this function does different things:

- nil -> expression to take the last column of a context.
- String -> syntactic sugar for Polars.col(..).last
- Series -> take the last value in the Series

Returns:

(Object)

# File 'lib/polars/lazy_functions.rb', line 219

def last(column = nil)
  if column.nil?
    return Utils.wrap_expr(_last)
  end

  if column.is_a?(Series)
    if column.len > 0
      return column[-1]
    else
      raise IndexError, "The series is empty, so no last value can be returned"
    end
  end
  col(column).last
end

#lit(value, dtype: nil, allow_object: nil) ⇒ `Expr`

Return an expression representing a literal value.

Returns:

(Expr)

# File 'lib/polars/lazy_functions.rb', line 269

def lit(value, dtype: nil, allow_object: nil)
  if value.is_a?(::Time) || value.is_a?(::DateTime)
    time_unit = dtype&.time_unit || "ns"
    time_zone = dtype&.time_zone
    e = lit(Utils._datetime_to_pl_timestamp(value, time_unit)).cast(Datetime.new(time_unit))
    if time_zone
      return e.dt.replace_time_zone(time_zone.to_s)
    else
      return e
    end
  elsif value.is_a?(::Date)
    return lit(::Time.utc(value.year, value.month, value.day)).cast(Date)
  elsif value.is_a?(Polars::Series)
    name = value.name
    value = value._s
    e = Utils.wrap_expr(RbExpr.lit(value, allow_object))
    if name == ""
      return e
    end
    return e.alias(name)
  elsif (defined?(Numo::NArray) && value.is_a?(Numo::NArray)) || value.is_a?(::Array)
    return lit(Series.new("", value))
  elsif dtype
    return Utils.wrap_expr(RbExpr.lit(value, allow_object)).cast(dtype)
  end

  Utils.wrap_expr(RbExpr.lit(value, allow_object))
end

#max(column) ⇒ `Expr`, `Object`

Get the maximum value.

Parameters:

column (Object) —
Column(s) to be used in aggregation.

Returns:

(Expr, Object)

# File 'lib/polars/lazy_functions.rb', line 113

def max(column)
  if column.is_a?(Series)
    column.max
  else
    col(column).max
  end
end

#mean(column) ⇒ `Expr`, `Float`

Get the mean value.

Returns:

(Expr, Float)

# File 'lib/polars/lazy_functions.rb', line 154

def mean(column)
  if column.is_a?(Series)
    column.mean
  else
    col(column).mean
  end
end

#median(column) ⇒ `Object`

Get the median value.

Returns:

(Object)

# File 'lib/polars/lazy_functions.rb', line 172

def median(column)
  if column.is_a?(Series)
    column.median
  else
    col(column).median
  end
end

#min(column) ⇒ `Expr`, `Object`

Get the minimum value.

Parameters:

column (Object) —
Column(s) to be used in aggregation.

Returns:

(Expr, Object)

# File 'lib/polars/lazy_functions.rb', line 127

def min(column)
  if column.is_a?(Series)
    column.min
  else
    col(column).min
  end
end

#n_unique(column) ⇒ `Object`

Count unique values.

Returns:

(Object)

# File 'lib/polars/lazy_functions.rb', line 183

def n_unique(column)
  if column.is_a?(Series)
    column.n_unique
  else
    col(column).n_unique
  end
end

#pearson_corr(a, b, ddof: 1) ⇒ `Expr`

Compute the Pearson correlation between two columns.

Parameters:

a (Object) —
Column name or Expression.
b (Object) —
Column name or Expression.
ddof (Integer) (defaults to: 1) —
Delta degrees of freedom

Returns:

(Expr)

# File 'lib/polars/lazy_functions.rb', line 395

def pearson_corr(a, b, ddof: 1)
  if Utils.strlike?(a)
    a = col(a)
  end
  if Utils.strlike?(b)
    b = col(b)
  end
  Utils.wrap_expr(RbExpr.pearson_corr(a._rbexpr, b._rbexpr, ddof))
end

#quantile(column, quantile, interpolation: "nearest") ⇒ `Expr`

Syntactic sugar for Polars.col("foo").quantile(...).

Parameters:

column (String) —
Column name.
quantile (Float) —
Quantile between 0.0 and 1.0.
interpolation ("nearest", "higher", "lower", "midpoint", "linear") (defaults to: "nearest") —
Interpolation method.

Returns:

(Expr)



# File 'lib/polars/lazy_functions.rb', line 603

def quantile(column, quantile, interpolation: "nearest")
  col(column).quantile(quantile, interpolation: interpolation)
end

#repeat(value, n, dtype: nil, eager: false, name: nil) ⇒ `Expr`

Repeat a single value n times.

Parameters:

value (Object) —
Value to repeat.
n (Integer) —
Repeat n times.
eager (Boolean) (defaults to: false) —
Run eagerly and collect into a Series.
name (String) (defaults to: nil) —
Only used in eager mode. As expression, use alias.

Returns:

(Expr)

# File 'lib/polars/lazy_functions.rb', line 1005

def repeat(value, n, dtype: nil, eager: false, name: nil)
  if !name.nil?
    warn "the `name` argument is deprecated. Use the `alias` method instead."
  end

  if n.is_a?(Integer)
    n = lit(n)
  end

  value = Utils.parse_as_expression(value, str_as_lit: true)
  expr = Utils.wrap_expr(RbExpr.repeat(value, n._rbexpr, dtype))
  if !name.nil?
    expr = expr.alias(name)
  end
  if eager
    return select(expr).to_series
  end
  expr
end

#select(exprs) ⇒ `DataFrame`

Run polars expressions without a context.

Returns:

(DataFrame)



# File 'lib/polars/lazy_functions.rb', line 935

def select(exprs)
  DataFrame.new([]).select(exprs)
end

#spearman_rank_corr(a, b, ddof: 1, propagate_nans: false) ⇒ `Expr`

Compute the Spearman rank correlation between two columns.

Missing data will be excluded from the computation.

Parameters:

a (Object) —
Column name or Expression.
b (Object) —
Column name or Expression.
ddof (Integer) (defaults to: 1) —
Delta degrees of freedom.
propagate_nans (Boolean) (defaults to: false) —
If true, any NaN encountered will lead to NaN in the output. Defaults to false, where NaN values are regarded as larger than any finite number and thus receive the highest rank.

Returns:

(Expr)

# File 'lib/polars/lazy_functions.rb', line 375

def spearman_rank_corr(a, b, ddof: 1, propagate_nans: false)
  if Utils.strlike?(a)
    a = col(a)
  end
  if Utils.strlike?(b)
    b = col(b)
  end
  Utils.wrap_expr(RbExpr.spearman_rank_corr(a._rbexpr, b._rbexpr, ddof, propagate_nans))
end
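
As an illustration of what a Spearman rank correlation measures, the following plain-Ruby sketch ranks both inputs (averaging the ranks of ties) and then computes a Pearson correlation on the ranks. This is the textbook definition, not the Polars implementation; `average_ranks`, `pearson_sketch` and `spearman_sketch` are hypothetical helpers, and NaN or missing-value handling is deliberately left out.

```ruby
# Hypothetical helpers; not part of Polars. Assume equal-length arrays of
# finite numbers with no missing values and non-constant input.

# Assign 1-based ranks, giving tied values the average of their positions.
def average_ranks(values)
  order = values.each_with_index.sort_by { |v, _| v }
  ranks = Array.new(values.length)
  i = 0
  while i < order.length
    j = i
    # Extend j to the end of the current run of equal values.
    j += 1 while j + 1 < order.length && order[j + 1][0] == order[i][0]
    avg = (i + j) / 2.0 + 1
    (i..j).each { |k| ranks[order[k][1]] = avg }
    i = j + 1
  end
  ranks
end

# Pearson correlation. The normalising divisor cancels in the ratio,
# so this sketch takes no ddof argument.
def pearson_sketch(x, y)
  mx = x.sum / x.length.to_f
  my = y.sum / y.length.to_f
  cov = x.zip(y).sum { |a, b| (a - mx) * (b - my) }
  sx = Math.sqrt(x.sum { |a| (a - mx)**2 })
  sy = Math.sqrt(y.sum { |b| (b - my)**2 })
  cov / (sx * sy)
end

def spearman_sketch(a, b)
  pearson_sketch(average_ranks(a), average_ranks(b))
end

puts spearman_sketch([1, 2, 3, 4], [1, 4, 9, 100])
```

Because only the ranks matter, any strictly increasing relationship between the two inputs yields a coefficient of 1, even when it is far from linear.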

#std(column, ddof: 1) ⇒ `Object`

Get the standard deviation.

Returns:

(Object)

# File 'lib/polars/lazy_functions.rb', line 88

def std(column, ddof: 1)
  if column.is_a?(Series)
    column.std(ddof: ddof)
  else
    col(column).std(ddof: ddof)
  end
end
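
The ddof keyword is not described for this method. Assuming it carries the usual meaning of delta degrees of freedom (the divisor becomes n - ddof), the plain-Ruby sketch below shows how it changes the result. `var_sketch` and `std_sketch` are hypothetical helpers, not Polars methods.

```ruby
# Hypothetical helpers; not part of Polars. Assume an array of numbers
# with more than ddof elements.
def var_sketch(values, ddof: 1)
  mean = values.sum / values.length.to_f
  squared_dev = values.sum { |v| (v - mean)**2 }
  squared_dev / (values.length - ddof)
end

def std_sketch(values, ddof: 1)
  Math.sqrt(var_sketch(values, ddof: ddof))
end

data = [1, 2, 3, 4]
puts var_sketch(data)          # sample variance (divisor n - 1)
puts var_sketch(data, ddof: 0) # population variance (divisor n)
puts std_sketch(data, ddof: 0)
```

The default of 1 corresponds to the sample estimate; passing 0 gives the population value.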

#struct(exprs, eager: false) ⇒ `Object`

Collect several columns into a Series of dtype Struct.

Examples:

Polars::DataFrame.new(
  {
    "int" => [1, 2],
    "str" => ["a", "b"],
    "bool" => [true, nil],
    "list" => [[1, 2], [3]],
  }
).select([Polars.struct(Polars.all).alias("my_struct")])
# =>
# shape: (2, 1)
# ┌─────────────────────┐
# │ my_struct           │
# │ ---                 │
# │ struct[4]           │
# ╞═════════════════════╡
# │ {1,"a",true,[1, 2]} │
# │ {2,"b",null,[3]}    │
# └─────────────────────┘

Only collect specific columns as a struct:

df = Polars::DataFrame.new(
  {"a" => [1, 2, 3, 4], "b" => ["one", "two", "three", "four"], "c" => [9, 8, 7, 6]}
)
df.with_column(Polars.struct(Polars.col(["a", "b"])).alias("a_and_b"))
# =>
# shape: (4, 4)
# ┌─────┬───────┬─────┬─────────────┐
# │ a   ┆ b     ┆ c   ┆ a_and_b     │
# │ --- ┆ ---   ┆ --- ┆ ---         │
# │ i64 ┆ str   ┆ i64 ┆ struct[2]   │
# ╞═════╪═══════╪═════╪═════════════╡
# │ 1   ┆ one   ┆ 9   ┆ {1,"one"}   │
# │ 2   ┆ two   ┆ 8   ┆ {2,"two"}   │
# │ 3   ┆ three ┆ 7   ┆ {3,"three"} │
# │ 4   ┆ four  ┆ 6   ┆ {4,"four"}  │
# └─────┴───────┴─────┴─────────────┘

Parameters:

exprs (Object) —
Columns/Expressions to collect into a Struct
eager (Boolean) (defaults to: false) —
Evaluate immediately

Returns:

(Object)

# File 'lib/polars/lazy_functions.rb', line 985

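def struct(exprs, eager: false)
  if eager
    # Return early; without `return` the eager result would be discarded
    # and the method would fall through to the lazy branch.
    return Polars.select(struct(exprs, eager: false)).to_series
  end
  exprs = Utils.selection_to_rbexpr_list(exprs)
  Utils.wrap_expr(_as_struct(exprs))
end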

#sum(column) ⇒ `Object`

Sum values in a column/Series, or horizontally across a list of columns/expressions.

Returns:

(Object)

# File 'lib/polars/lazy_functions.rb', line 138

def sum(column)
  if column.is_a?(Series)
    column.sum
  elsif Utils.strlike?(column)
    col(column.to_s).sum
  elsif column.is_a?(::Array)
    exprs = Utils.selection_to_rbexpr_list(column)
    Utils.wrap_expr(_sum_horizontal(exprs))
  else
    fold(lit(0).cast(:u32), ->(a, b) { a + b }, column).alias("sum")
  end
end
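
The difference between the two modes can be shown without Polars. The sketch below uses a plain Hash of arrays as a stand-in for a frame; `vertical_sum` and `horizontal_sum` are hypothetical helpers, and the horizontal version mirrors the `->(a, b) { a + b }` left fold in the source above.

```ruby
# Plain-Ruby stand-ins; these are not Polars objects or methods.

# Vertical: reduce a single column to one value.
def vertical_sum(column)
  column.sum
end

# Horizontal: left fold across columns, producing one result per row.
# Assumes all columns have the same length.
def horizontal_sum(columns)
  columns.reduce do |acc, column|
    acc.zip(column).map { |a, b| a + b }
  end
end

frame = { "a" => [1, 2, 3], "b" => [10, 20, 30] }
puts vertical_sum(frame["a"])
p horizontal_sum(frame.values)
```

A vertical sum collapses rows and returns a scalar, while a horizontal sum collapses columns and keeps the row count.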

#tail(column, n = 10) ⇒ `Object`

Get the last n rows.

Parameters:

column (Object) —
Column name or Series.
n (Integer) (defaults to: 10) —
Number of rows to return.

Returns:

(Object)

# File 'lib/polars/lazy_functions.rb', line 258

def tail(column, n = 10)
  if column.is_a?(Series)
    column.tail(n)
  else
    col(column).tail(n)
  end
end

#to_list(name) ⇒ `Expr`

Aggregate to list.

Returns:

(Expr)



# File 'lib/polars/lazy_functions.rb', line 81

def to_list(name)
  col(name).list
end

#var(column, ddof: 1) ⇒ `Object`

Get the variance.

Returns:

(Object)

# File 'lib/polars/lazy_functions.rb', line 99

def var(column, ddof: 1)
  if column.is_a?(Series)
    column.var(ddof: ddof)
  else
    col(column).var(ddof: ddof)
  end
end

#when(expr) ⇒ `When`

Start a "when, then, otherwise" expression.

Examples:

df = Polars::DataFrame.new({"foo" => [1, 3, 4], "bar" => [3, 4, 0]})
df.with_column(Polars.when(Polars.col("foo") > 2).then(Polars.lit(1)).otherwise(Polars.lit(-1)))
# =>
# shape: (3, 3)
# ┌─────┬─────┬─────────┐
# │ foo ┆ bar ┆ literal │
# │ --- ┆ --- ┆ ---     │
# │ i64 ┆ i64 ┆ i32     │
# ╞═════╪═════╪═════════╡
# │ 1   ┆ 3   ┆ -1      │
# │ 3   ┆ 4   ┆ 1       │
# │ 4   ┆ 0   ┆ 1       │
# └─────┴─────┴─────────┘

Returns:

(When)

# File 'lib/polars/lazy_functions.rb', line 1175

def when(expr)
  expr = Utils.expr_to_lit_or_expr(expr)
  pw = RbExpr.when(expr._rbexpr)
  When.new(pw)
end
