Class: Polars::LazyFrame
- Inherits: Object
  - Object
  - Polars::LazyFrame
- Defined in: lib/polars/lazy_frame.rb
Overview
Representation of a lazy computation graph/query against a DataFrame.
Class Method Summary
-
.deserialize(source) ⇒ LazyFrame
Read a logical plan from a file to construct a LazyFrame.
-
.read_json(file) ⇒ LazyFrame
Read a logical plan from a JSON file to construct a LazyFrame.
Instance Method Summary
-
#bottom_k(k, by:, reverse: false) ⇒ LazyFrame
Return the k smallest rows.
-
#cache ⇒ LazyFrame
Cache the result once the execution of the physical plan hits this node.
-
#cast(dtypes, strict: true) ⇒ LazyFrame
Cast LazyFrame column(s) to the specified dtype(s).
-
#clear(n = 0) ⇒ LazyFrame
(also: #cleared)
Create an empty copy of the current LazyFrame.
-
#collect(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false, _eager: false) ⇒ DataFrame
Collect into a DataFrame.
-
#collect_schema ⇒ Schema
Resolve the schema of this LazyFrame.
-
#columns ⇒ Array
Get column names.
-
#count ⇒ LazyFrame
Return the number of non-null elements for each column.
-
#describe_optimized_plan(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ String
Create a string representation of the optimized query plan.
-
#describe_plan ⇒ String
Create a string representation of the unoptimized query plan.
-
#drop(*columns, strict: true) ⇒ LazyFrame
Remove one or multiple columns from a DataFrame.
-
#drop_nans(subset: nil) ⇒ LazyFrame
Drop all rows that contain one or more NaN values.
-
#drop_nulls(subset: nil) ⇒ LazyFrame
Drop all rows that contain one or more null values.
-
#dtypes ⇒ Array
Get dtypes of columns in LazyFrame.
-
#explode(columns, *more_columns) ⇒ LazyFrame
Explode lists to long format.
-
#fetch(n_rows = 500, **kwargs) ⇒ DataFrame
Collect a small number of rows for debugging purposes.
-
#fill_nan(fill_value) ⇒ LazyFrame
Fill floating point NaN values.
-
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil) ⇒ LazyFrame
Fill null values using the specified value or strategy.
-
#filter(predicate) ⇒ LazyFrame
Filter the rows in the DataFrame based on a predicate expression.
-
#first ⇒ LazyFrame
Get the first row of the DataFrame.
-
#gather_every(n) ⇒ LazyFrame
(also: #take_every)
Take every nth row in the LazyFrame and return as a new LazyFrame.
-
#group_by(*by, maintain_order: false, **named_by) ⇒ LazyGroupBy
(also: #groupby, #group)
Start a group by operation.
-
#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: nil, include_boundaries: false, closed: "left", label: "left", by: nil, start_by: "window") ⇒ LazyGroupBy
(also: #groupby_dynamic)
Group based on a time value (or index value of type :i32, :i64).
-
#head(n = 5) ⇒ LazyFrame
Get the first n rows.
-
#include?(key) ⇒ Boolean
Check if LazyFrame includes key.
-
#initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ LazyFrame
constructor
Create a new LazyFrame.
-
#interpolate ⇒ LazyFrame
Interpolate intermediate values.
-
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", join_nulls: false, allow_parallel: true, force_parallel: false, coalesce: nil, maintain_order: nil) ⇒ LazyFrame
Add a join operation to the Logical Plan.
-
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ LazyFrame
Perform an asof join.
-
#join_where(other, *predicates, suffix: "_right") ⇒ LazyFrame
Perform a join based on one or multiple (in)equality predicates.
-
#last ⇒ LazyFrame
Get the last row of the DataFrame.
-
#lazy ⇒ LazyFrame
Return lazy representation, i.e. itself.
-
#limit(n = 5) ⇒ LazyFrame
Get the first n rows.
-
#max ⇒ LazyFrame
Aggregate the columns in the DataFrame to their maximum value.
-
#mean ⇒ LazyFrame
Aggregate the columns in the DataFrame to their mean value.
-
#median ⇒ LazyFrame
Aggregate the columns in the DataFrame to their median value.
-
#merge_sorted(other, key) ⇒ LazyFrame
Take two sorted DataFrames and merge them by the sorted key.
-
#min ⇒ LazyFrame
Aggregate the columns in the DataFrame to their minimum value.
-
#null_count ⇒ LazyFrame
Aggregate the columns in the LazyFrame as the sum of their null value count.
-
#pipe(func, *args, **kwargs, &block) ⇒ LazyFrame
Offers a structured way to apply a sequence of user-defined functions (UDFs).
-
#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame
Aggregate the columns in the DataFrame to their quantile value.
-
#remove(*predicates, **constraints) ⇒ LazyFrame
Remove rows, dropping those that match the given predicate expression(s).
-
#rename(mapping, strict: true) ⇒ LazyFrame
Rename column names.
-
#reverse ⇒ LazyFrame
Reverse the DataFrame.
-
#rolling(index_column:, period:, offset: nil, closed: "right", by: nil) ⇒ LazyGroupBy
(also: #group_by_rolling, #groupby_rolling)
Create rolling groups based on a time column.
-
#schema ⇒ Hash
Get the schema.
-
#select(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this DataFrame.
-
#select_seq(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this LazyFrame.
-
#serialize(file = nil) ⇒ Object
Serialize the logical plan of this LazyFrame to a file or string.
-
#set_sorted(column, descending: false) ⇒ LazyFrame
Flag a column as sorted.
-
#shift(n, fill_value: nil) ⇒ LazyFrame
Shift the values by a given period.
-
#shift_and_fill(periods, fill_value) ⇒ LazyFrame
Shift the values by a given period and fill the resulting null values.
-
#sink_csv(path, include_bom: false, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, decimal_comma: false, null_value: nil, quote_style: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false, storage_options: nil, retries: 2, sync_on_close: nil, mkdir: false, lazy: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to a CSV file.
-
#sink_ipc(path, compression: "zstd", maintain_order: true, storage_options: nil, retries: 2, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false, sync_on_close: nil, mkdir: false, lazy: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an IPC file.
-
#sink_ndjson(path, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false, storage_options: nil, retries: 2, sync_on_close: nil, mkdir: false, lazy: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an NDJSON file.
-
#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_pagesize_limit: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, no_optimization: false, slice_pushdown: true, storage_options: nil, retries: 2, sync_on_close: nil, mkdir: false, lazy: false) ⇒ DataFrame
Persists a LazyFrame at the provided path.
-
#slice(offset, length = nil) ⇒ LazyFrame
Get a slice of this DataFrame.
-
#sort(by, *more_by, reverse: false, nulls_last: false, maintain_order: false, multithreaded: true) ⇒ LazyFrame
Sort the DataFrame.
-
#sql(query, table_name: "self") ⇒ Expr
Execute a SQL query against the LazyFrame.
-
#std(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their standard deviation value.
-
#sum ⇒ LazyFrame
Aggregate the columns in the DataFrame to their sum value.
-
#tail(n = 5) ⇒ LazyFrame
Get the last n rows.
-
#to_s ⇒ String
Returns a string representing the LazyFrame.
-
#top_k(k, by:, reverse: false) ⇒ LazyFrame
Return the k largest rows.
-
#unique(maintain_order: true, subset: nil, keep: "first") ⇒ LazyFrame
Drop duplicate rows from this DataFrame.
-
#unnest(columns, *more_columns) ⇒ LazyFrame
Decompose a struct into its fields.
-
#unpivot(on, index: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame
(also: #melt)
Unpivot a DataFrame from wide to long format.
-
#update(other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left") ⇒ LazyFrame
Update the values in this LazyFrame with the values in other.
-
#var(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their variance value.
-
#width ⇒ Integer
Get the width of the LazyFrame.
-
#with_column(column) ⇒ LazyFrame
Add or overwrite column in a DataFrame.
-
#with_columns(*exprs, **named_exprs) ⇒ LazyFrame
Add or overwrite multiple columns in a DataFrame.
-
#with_columns_seq(*exprs, **named_exprs) ⇒ LazyFrame
Add columns to this LazyFrame.
-
#with_context(other) ⇒ LazyFrame
Add an external context to the computation graph.
-
#with_row_index(name: "index", offset: 0) ⇒ LazyFrame
(also: #with_row_count)
Add a column at index 0 that counts the rows.
-
#write_json(file) ⇒ nil
Write the logical plan of this LazyFrame to a file or string in JSON format.
Constructor Details
#initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ LazyFrame
Create a new LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 8
def initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false)
  self._ldf = (
    DataFrame.new(
      data,
      schema: schema,
      schema_overrides: schema_overrides,
      orient: orient,
      infer_schema_length: infer_schema_length,
      nan_to_null: nan_to_null
    )
    .lazy
    ._ldf
  )
end
Class Method Details
.deserialize(source) ⇒ LazyFrame
This function uses marshaling if the logical plan contains Ruby UDFs, and as such inherits the security implications. Deserializing can execute arbitrary code, so it should only be attempted on trusted data.
Serialization is not stable across Polars versions: a LazyFrame serialized in one Polars version may not be deserializable in another Polars version.
Read a logical plan from a file to construct a LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 74
def self.deserialize(source)
  raise Todo unless RbLazyFrame.respond_to?(:deserialize_binary)

  if Utils.pathlike?(source)
    source = Utils.normalize_filepath(source)
  end

  deserializer = RbLazyFrame.method(:deserialize_binary)

  _from_rbldf(deserializer.(source))
end
.read_json(file) ⇒ LazyFrame
Read a logical plan from a JSON file to construct a LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 36
def self.read_json(file)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  Utils.wrap_ldf(RbLazyFrame.deserialize_json(file))
end
Instance Method Details
#bottom_k(k, by:, reverse: false) ⇒ LazyFrame
Return the k smallest rows.
Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call #sort after this function if you wish the output to be sorted.
# File 'lib/polars/lazy_frame.rb', line 547
def bottom_k(
  k,
  by:,
  reverse: false
)
  by = Utils.parse_into_list_of_expressions(by)
  reverse = Utils.extend_bool(reverse, by.length, "reverse", "by")
  _from_rbldf(_ldf.bottom_k(k, by, reverse))
end
#cache ⇒ LazyFrame
Cache the result once the execution of the physical plan hits this node.
# File 'lib/polars/lazy_frame.rb', line 1308
def cache
  _from_rbldf(_ldf.cache)
end
#cast(dtypes, strict: true) ⇒ LazyFrame
Cast LazyFrame column(s) to the specified dtype(s).
# File 'lib/polars/lazy_frame.rb', line 1361
def cast(dtypes, strict: true)
  if !dtypes.is_a?(Hash)
    return _from_rbldf(_ldf.cast_all(dtypes, strict))
  end

  cast_map = {}
  dtypes.each do |c, dtype|
    dtype = Utils.parse_into_dtype(dtype)
    cast_map.merge!(
      c.is_a?(::String) ? {c => dtype} : Utils.(self, c).to_h { |x| [x, dtype] }
    )
  end

  _from_rbldf(_ldf.cast(cast_map, strict))
end
#clear(n = 0) ⇒ LazyFrame Also known as: cleared
Create an empty copy of the current LazyFrame.
The copy has an identical schema but no data.
# File 'lib/polars/lazy_frame.rb', line 1413
def clear(n = 0)
  DataFrame.new(schema: schema).clear(n).lazy
end
#collect(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false, _eager: false) ⇒ DataFrame
Collect into a DataFrame.
Note: use #fetch if you want to run your query on the first n rows only. This can be a huge time saver when debugging queries.
# File 'lib/polars/lazy_frame.rb', line 609
def collect(
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  string_cache: false,
  no_optimization: false,
  slice_pushdown: true,
  common_subplan_elimination: true,
  comm_subexpr_elim: true,
  allow_streaming: false,
  _eager: false
)
  if no_optimization
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
    common_subplan_elimination = false
    comm_subexpr_elim = false
  end

  if allow_streaming
    common_subplan_elimination = false
  end

  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    comm_subexpr_elim,
    allow_streaming,
    _eager
  )
  Utils.wrap_df(ldf.collect)
end
#collect_schema ⇒ Schema
Resolve the schema of this LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 679
def collect_schema
  Schema.new(_ldf.collect_schema, check_dtypes: false)
end
#columns ⇒ Array
Get column names.
# File 'lib/polars/lazy_frame.rb', line 104
def columns
  _ldf.collect_schema.keys
end
#count ⇒ LazyFrame
Return the number of non-null elements for each column.
# File 'lib/polars/lazy_frame.rb', line 4515
def count
  _from_rbldf(_ldf.count)
end
#describe_optimized_plan(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ String
Create a string representation of the optimized query plan.
# File 'lib/polars/lazy_frame.rb', line 270
def describe_optimized_plan(
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  common_subplan_elimination: true,
  comm_subexpr_elim: true,
  allow_streaming: false
)
  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    comm_subexpr_elim,
    allow_streaming,
    false
  )

  ldf.describe_optimized_plan
end
#describe_plan ⇒ String
Create a string representation of the unoptimized query plan.
# File 'lib/polars/lazy_frame.rb', line 263
def describe_plan
  _ldf.describe_plan
end
#drop(*columns, strict: true) ⇒ LazyFrame
Remove one or multiple columns from a DataFrame.
# File 'lib/polars/lazy_frame.rb', line 3018
def drop(*columns, strict: true)
  selectors = []
  columns.each do |c|
    if c.is_a?(Enumerable)
      selectors += c
    else
      selectors += [c]
    end
  end

  drop_cols = Utils.parse_list_into_selector(selectors, strict: strict)
  _from_rbldf(_ldf.drop(drop_cols._rbselector))
end
#drop_nans(subset: nil) ⇒ LazyFrame
Drop all rows that contain one or more NaN values.
The original order of the remaining rows is preserved.
# File 'lib/polars/lazy_frame.rb', line 3975
def drop_nans(subset: nil)
  selector_subset = nil
  if !subset.nil?
    selector_subset = Utils.parse_list_into_selector(subset)._rbselector
  end
  _from_rbldf(_ldf.drop_nans(selector_subset))
end
#drop_nulls(subset: nil) ⇒ LazyFrame
Drop all rows that contain one or more null values.
The original order of the remaining rows is preserved.
# File 'lib/polars/lazy_frame.rb', line 4024
def drop_nulls(subset: nil)
  selector_subset = nil
  if !subset.nil?
    selector_subset = Utils.parse_list_into_selector(subset)._rbselector
  end
  _from_rbldf(_ldf.drop_nulls(selector_subset))
end
#dtypes ⇒ Array
Get dtypes of columns in LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 122
def dtypes
  _ldf.collect_schema.values
end
#explode(columns, *more_columns) ⇒ LazyFrame
Explode lists to long format.
# File 'lib/polars/lazy_frame.rb', line 3857
def explode(columns, *more_columns)
  subset = Utils.parse_list_into_selector(columns) | Utils.parse_list_into_selector(
    more_columns
  )
  _from_rbldf(_ldf.explode(subset._rbselector))
end
#fetch(n_rows = 500, **kwargs) ⇒ DataFrame
Collect a small number of rows for debugging purposes.
Fetch is like a #collect operation, but it overwrites the number of rows read by every scan operation. This is a utility that helps debug a query on a smaller number of rows.
Note that the fetch does not guarantee the final number of rows in the DataFrame. Filter, join operations and a lower number of rows available in the scanned file influence the final number of rows.
# File 'lib/polars/lazy_frame.rb', line 1281
def fetch(n_rows = 500, **kwargs)
  head(n_rows).collect(**kwargs)
end
#fill_nan(fill_value) ⇒ LazyFrame
Note that floating point NaN (Not a Number) values are not missing values! To replace missing values, use fill_null instead.
Fill floating point NaN values.
# File 'lib/polars/lazy_frame.rb', line 3606
def fill_nan(fill_value)
  if !fill_value.is_a?(Expr)
    fill_value = F.lit(fill_value)
  end
  _from_rbldf(_ldf.fill_nan(fill_value._rbexpr))
end
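The difference between NaN and a missing value can be illustrated in plain Ruby. This is only a sketch of the idea, not Polars code: NaN is an ordinary Float, while nil stands in for a null. The helper name replace_nan is hypothetical.

```ruby
# Hypothetical helper: replaces NaN but leaves nil (missing) values alone.
def replace_nan(values, fill_value)
  values.map { |v| v.is_a?(Float) && v.nan? ? fill_value : v }
end

p replace_nan([1.0, Float::NAN, nil], 0.0)
# prints [1.0, 0.0, nil]
```

The nil survives because it is not a Float at all, which is why a separate fill_null step is needed for missing values.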
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil) ⇒ LazyFrame
Fill null values using the specified value or strategy.
# File 'lib/polars/lazy_frame.rb', line 3571
def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil)
  select(Polars.all.fill_null(value, strategy: strategy, limit: limit))
end
#filter(predicate) ⇒ LazyFrame
Filter the rows in the DataFrame based on a predicate expression.
# File 'lib/polars/lazy_frame.rb', line 1456
def filter(predicate)
  _from_rbldf(
    _ldf.filter(
      Utils.parse_into_expression(predicate, str_as_lit: false)
    )
  )
end
#first ⇒ LazyFrame
Get the first row of the DataFrame.
# File 'lib/polars/lazy_frame.rb', line 3439
def first
  slice(0, 1)
end
#gather_every(n) ⇒ LazyFrame Also known as: take_every
Take every nth row in the LazyFrame and return as a new LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 3497
def gather_every(n)
  select(F.col("*").gather_every(n))
end
#group_by(*by, maintain_order: false, **named_by) ⇒ LazyGroupBy Also known as: groupby, group
Start a group by operation.
# File 'lib/polars/lazy_frame.rb', line 1750
def group_by(*by, maintain_order: false, **named_by)
  exprs = Utils.parse_into_list_of_expressions(*by, **named_by)
  lgb = _ldf.group_by(exprs, maintain_order)
  LazyGroupBy.new(lgb)
end
#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: nil, include_boundaries: false, closed: "left", label: "left", by: nil, start_by: "window") ⇒ LazyGroupBy Also known as: groupby_dynamic
Group based on a time value (or index value of type :i32, :i64).
Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. The time/index window can be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.
A window is defined by:
- every: interval of the window
- period: length of the window
- offset: offset of the window
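How every, period and offset interact, and why one row can land in several groups, can be sketched in plain Ruby on an integer index. This is a simplified model under stated assumptions (windows start at offset and repeat every `every`, each covering the half-open range from start to start + period); it is not the Polars algorithm and ignores start_by, labels and boundary options. The name dynamic_windows is hypothetical.

```ruby
# Simplified model of dynamic windows on an integer index; not the Polars algorithm.
# Windows start at offset, offset + every, ... and each covers [start, start + period).
def dynamic_windows(index_values, every:, period: every, offset: 0)
  starts = (offset..index_values.max).step(every)
  starts.map do |start|
    members = index_values.select { |v| v >= start && v < start + period }
    [start, members]
  end
end

p dynamic_windows([0, 1, 2, 3, 4, 5], every: 2, period: 4)
# prints [[0, [0, 1, 2, 3]], [2, [2, 3, 4, 5]], [4, [4, 5]]]
```

Because period is larger than every here, consecutive windows overlap and index values 2 and 3 appear in two groups each.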
The every, period and offset arguments are created with the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_dynamic on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
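The string language above can be sketched as a small plain-Ruby tokenizer. This is only an illustration of the notation, not how Polars parses durations: the helper split_duration and the constant DURATION_UNITS are hypothetical, and negative durations are not handled.

```ruby
# Hypothetical helper: splits a duration string into [count, unit] pairs.
# Two-letter units come first so "ms" and "mo" are not read as "m".
DURATION_UNITS = %w[ns us ms mo s m h d w y i].freeze

def split_duration(str)
  pattern = /(\d+)(#{DURATION_UNITS.join("|")})/
  # Reject input with leftover characters the pattern did not consume.
  raise ArgumentError, "invalid duration: #{str}" unless str.gsub(pattern, "").empty?
  str.scan(pattern).map { |count, unit| [count.to_i, unit] }
end

p split_duration("3d12h4m25s")
# prints [[3, "d"], [12, "h"], [4, "m"], [25, "s"]]
```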
# File 'lib/polars/lazy_frame.rb', line 2121
def group_by_dynamic(
  index_column,
  every:,
  period: nil,
  offset: nil,
  truncate: nil,
  include_boundaries: false,
  closed: "left",
  label: "left",
  by: nil,
  start_by: "window"
)
  if !truncate.nil?
    label = truncate ? "left" : "datapoint"
  end

  index_column = Utils.parse_into_expression(index_column, str_as_lit: false)
  if offset.nil?
    offset = period.nil? ? "-#{every}" : "0ns"
  end

  if period.nil?
    period = every
  end

  period = Utils.parse_as_duration_string(period)
  offset = Utils.parse_as_duration_string(offset)
  every = Utils.parse_as_duration_string(every)

  rbexprs_by = by.nil? ? [] : Utils.parse_into_list_of_expressions(by)
  lgb = _ldf.group_by_dynamic(
    index_column,
    every,
    period,
    offset,
    label,
    include_boundaries,
    closed,
    rbexprs_by,
    start_by
  )
  LazyGroupBy.new(lgb)
end
#head(n = 5) ⇒ LazyFrame
# File 'lib/polars/lazy_frame.rb', line 3344
def head(n = 5)
  slice(0, n)
end
#include?(key) ⇒ Boolean
Check if LazyFrame includes key.
# File 'lib/polars/lazy_frame.rb', line 159
def include?(key)
  columns.include?(key)
end
#interpolate ⇒ LazyFrame
Interpolate intermediate values. The interpolation method is linear.
# File 'lib/polars/lazy_frame.rb', line 4133
def interpolate
  select(F.col("*").interpolate)
end
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", join_nulls: false, allow_parallel: true, force_parallel: false, coalesce: nil, maintain_order: nil) ⇒ LazyFrame
Add a join operation to the Logical Plan.
# File 'lib/polars/lazy_frame.rb', line 2638
def join(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  how: "inner",
  suffix: "_right",
  validate: "m:m",
  join_nulls: false,
  allow_parallel: true,
  force_parallel: false,
  coalesce: nil,
  maintain_order: nil
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if maintain_order.nil?
    maintain_order = "none"
  end

  if how == "outer"
    how = "full"
  elsif how == "cross"
    return _from_rbldf(
      _ldf.join(
        other._ldf,
        [],
        [],
        allow_parallel,
        join_nulls,
        force_parallel,
        how,
        suffix,
        validate,
        maintain_order,
        coalesce
      )
    )
  end

  if !on.nil?
    rbexprs = Utils.parse_into_list_of_expressions(on)
    rbexprs_left = rbexprs
    rbexprs_right = rbexprs
  elsif !left_on.nil? && !right_on.nil?
    rbexprs_left = Utils.parse_into_list_of_expressions(left_on)
    rbexprs_right = Utils.parse_into_list_of_expressions(right_on)
  else
    raise ArgumentError, "must specify `on` OR `left_on` and `right_on`"
  end

  _from_rbldf(
    self._ldf.join(
      other._ldf,
      rbexprs_left,
      rbexprs_right,
      allow_parallel,
      force_parallel,
      join_nulls,
      how,
      suffix,
      validate,
      maintain_order,
      coalesce
    )
  )
end
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ LazyFrame
Perform an asof join.
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the join_asof key.
For each row in the left DataFrame:
- A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
- A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.
The default is "backward".
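The two search strategies can be sketched in plain Ruby on sorted key arrays. This models only the matching rule described above, not the Polars implementation; asof_match is a hypothetical name, and nil marks a left key with no match.

```ruby
# Plain-Ruby model of the two search strategies on sorted keys; not the Polars implementation.
def asof_match(left_keys, right_keys, strategy: "backward")
  left_keys.map do |key|
    case strategy
    when "backward"
      # Last right key that is <= the left key.
      right_keys.select { |r| r <= key }.last
    when "forward"
      # First right key that is >= the left key.
      right_keys.find { |r| r >= key }
    else
      raise ArgumentError, "unsupported strategy: #{strategy}"
    end
  end
end

p asof_match([1, 5, 10], [2, 5, 8])
# prints [nil, 5, 8]
p asof_match([1, 5, 10], [2, 5, 8], strategy: "forward")
# prints [2, 5, nil]
```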
# File 'lib/polars/lazy_frame.rb', line 2425
def join_asof(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  by_left: nil,
  by_right: nil,
  by: nil,
  strategy: "backward",
  suffix: "_right",
  tolerance: nil,
  allow_parallel: true,
  force_parallel: false,
  coalesce: true,
  allow_exact_matches: true,
  check_sortedness: true
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if on.is_a?(::String)
    left_on = on
    right_on = on
  end

  if left_on.nil? || right_on.nil?
    raise ArgumentError, "You should pass the column to join on as an argument."
  end

  if by_left.is_a?(::String) || by_left.is_a?(Expr)
    by_left_ = [by_left]
  else
    by_left_ = by_left
  end

  if by_right.is_a?(::String) || by_right.is_a?(Expr)
    by_right_ = [by_right]
  else
    by_right_ = by_right
  end

  if by.is_a?(::String)
    by_left_ = [by]
    by_right_ = [by]
  elsif by.is_a?(::Array)
    by_left_ = by
    by_right_ = by
  end

  tolerance_str = nil
  tolerance_num = nil
  if tolerance.is_a?(::String)
    tolerance_str = tolerance
  else
    tolerance_num = tolerance
  end

  _from_rbldf(
    _ldf.join_asof(
      other._ldf,
      Polars.col(left_on)._rbexpr,
      Polars.col(right_on)._rbexpr,
      by_left_,
      by_right_,
      allow_parallel,
      force_parallel,
      suffix,
      strategy,
      tolerance_num,
      tolerance_str,
      coalesce,
      allow_exact_matches,
      check_sortedness
    )
  )
end
#join_where(other, *predicates, suffix: "_right") ⇒ LazyFrame
The row order of the input DataFrames is not preserved.
This functionality is experimental. It may be changed at any point without it being considered a breaking change.
Perform a join based on one or multiple (in)equality predicates.
This performs an inner join, so only rows where all predicates are true are included in the result, and a row from either DataFrame may be included multiple times in the result.
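The inner-join behavior on inequality predicates can be modeled in plain Ruby as a filtered cross product. This is a sketch of the semantics only, not the Polars implementation; join_where_model is a hypothetical name, rows are plain hashes with distinct keys, and predicates are lambdas taking a left and a right row.

```ruby
# Plain-Ruby model of an inequality join; not the Polars implementation.
# Every left/right pair is tested, and a pair is kept only if all predicates hold.
def join_where_model(left_rows, right_rows, *predicates)
  left_rows.product(right_rows)
           .select { |l, r| predicates.all? { |pred| pred.call(l, r) } }
           .map { |l, r| l.merge(r) }
end

left = [{ a: 1 }, { a: 5 }]
right = [{ b: 2 }, { b: 4 }]
joined = join_where_model(left, right, ->(l, r) { l[:a] > r[:b] })
puts joined.length
# prints 2
```

The left row with a equal to 5 matches both right rows and so appears twice, while the row with a equal to 1 matches nothing and is dropped.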
# File 'lib/polars/lazy_frame.rb', line 2787
def join_where(
  other,
  *predicates,
  suffix: "_right"
)
  Utils.require_same_type(self, other)

  rbexprs = Utils.parse_into_list_of_expressions(*predicates)

  _from_rbldf(
    _ldf.join_where(
      other._ldf,
      rbexprs,
      suffix
    )
  )
end
#last ⇒ LazyFrame
Get the last row of the DataFrame.
# File 'lib/polars/lazy_frame.rb', line 3414
def last
  tail(1)
end
#lazy ⇒ LazyFrame
Return lazy representation, i.e. itself.
Useful for writing code that expects either a DataFrame or LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 1301
def lazy
  self
end
#limit(n = 5) ⇒ LazyFrame
# File 'lib/polars/lazy_frame.rb', line 3294
def limit(n = 5)
  head(n)
end
#max ⇒ LazyFrame
Aggregate the columns in the DataFrame to their maximum value.
# File 'lib/polars/lazy_frame.rb', line 3693
def max
  _from_rbldf(_ldf.max)
end
#mean ⇒ LazyFrame
Aggregate the columns in the DataFrame to their mean value.
# File 'lib/polars/lazy_frame.rb', line 3753
def mean
  _from_rbldf(_ldf.mean)
end
#median ⇒ LazyFrame
Aggregate the columns in the DataFrame to their median value.
# File 'lib/polars/lazy_frame.rb', line 3773
def median
  _from_rbldf(_ldf.median)
end
#merge_sorted(other, key) ⇒ LazyFrame
Take two sorted DataFrames and merge them by the sorted key.
The output of this operation will also be sorted. It is the caller's responsibility to ensure that both frames are sorted by that key; otherwise the output will not make sense.
The schemas of both LazyFrames must be equal.
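Why the inputs must already be sorted can be seen from a plain-Ruby two-way merge. This is a sketch of the general technique, not the Polars implementation; merge_sorted_model is a hypothetical name and rows are plain hashes.

```ruby
# Plain-Ruby two-way merge of rows already sorted by key; not the Polars implementation.
def merge_sorted_model(left, right, key)
  merged = []
  i = j = 0
  while i < left.length && j < right.length
    if left[i][key] <= right[j][key]
      merged << left[i]
      i += 1
    else
      merged << right[j]
      j += 1
    end
  end
  # One side is exhausted; append whatever remains of the other.
  merged + left[i..] + right[j..]
end

sorted = merge_sorted_model([{ k: 1 }, { k: 4 }], [{ k: 2 }, { k: 3 }], :k)
p sorted.map { |row| row[:k] }
# prints [1, 2, 3, 4]
```

The merge only ever compares the current heads of both inputs, so an unsorted input such as keys 3, 1 merged with key 2 yields 2, 3, 1 rather than a sorted result.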
# File 'lib/polars/lazy_frame.rb', line 4235
def merge_sorted(other, key)
  _from_rbldf(_ldf.merge_sorted(other._ldf, key))
end
#min ⇒ LazyFrame
Aggregate the columns in the DataFrame to their minimum value.
# File 'lib/polars/lazy_frame.rb', line 3713
def min
  _from_rbldf(_ldf.min)
end
#null_count ⇒ LazyFrame
Aggregate the columns in the LazyFrame as the sum of their null value count.
# File 'lib/polars/lazy_frame.rb', line 3799
def null_count
  _from_rbldf(_ldf.null_count)
end
#pipe(func, *args, **kwargs, &block) ⇒ LazyFrame
Offers a structured way to apply a sequence of user-defined functions (UDFs).
# File 'lib/polars/lazy_frame.rb', line 256
def pipe(func, *args, **kwargs, &block)
  func.call(self, *args, **kwargs, &block)
end
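The chaining style that #pipe enables can be sketched with a stand-in class in plain Ruby. Only the body of pipe follows the source listing for #pipe; the Pipeline class, its rows attribute and the two lambdas are hypothetical and exist purely for illustration.

```ruby
# Hypothetical stand-in class; only the pipe body follows the #pipe source listing.
class Pipeline
  attr_reader :rows

  def initialize(rows)
    @rows = rows
  end

  def pipe(func, *args, **kwargs, &block)
    func.call(self, *args, **kwargs, &block)
  end
end

# Each step takes the frame as its first argument and returns a new frame.
keep_above = ->(frame, min:) { Pipeline.new(frame.rows.select { |v| v > min }) }
double = ->(frame) { Pipeline.new(frame.rows.map { |v| v * 2 }) }

result = Pipeline.new([1, 2, 3, 4]).pipe(keep_above, min: 2).pipe(double)
p result.rows
# prints [6, 8]
```

Since every step returns an object that also responds to pipe, the calls read top to bottom in execution order instead of nesting inside each other.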
#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame
Aggregate the columns in the DataFrame to their quantile value.
# File 'lib/polars/lazy_frame.rb', line 3824
def quantile(quantile, interpolation: "nearest")
  quantile = Utils.parse_into_expression(quantile, str_as_lit: false)
  _from_rbldf(_ldf.quantile(quantile, interpolation))
end
#remove(*predicates, **constraints) ⇒ LazyFrame
Remove rows, dropping those that match the given predicate expression(s).
The original order of the remaining rows is preserved.
Rows where the filter predicate does not evaluate to true are retained (this includes rows where the predicate evaluates as null).
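The null handling can be illustrated with a plain-Ruby model in which nil stands in for a null predicate result. This is a sketch of the stated rule, not the Polars implementation; remove_model is a hypothetical name.

```ruby
# Plain-Ruby model of the null handling; not the Polars implementation.
# A row is dropped only when the predicate is exactly true; false and nil both keep it.
def remove_model(rows, &predicate)
  rows.reject { |row| predicate.call(row) == true }
end

rows = [{ "x" => 1 }, { "x" => nil }, { "x" => 5 }]
# A nil input yields a nil predicate result here, standing in for null.
kept = remove_model(rows) { |row| row["x"].nil? ? nil : row["x"] > 2 }
p kept.map { |row| row["x"] }
# prints [1, nil]
```

Only the row with x equal to 5 is removed. The row with a nil value is kept because its predicate result is neither true nor false.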
# File 'lib/polars/lazy_frame.rb', line 1577
def remove(
  *predicates,
  **constraints
)
  if constraints.empty?
    # early-exit conditions (exclude/include all rows)
    if predicates.empty? || (predicates.length == 1 && predicates[0].is_a?(TrueClass))
      return clear
    end
    if predicates.length == 1 && predicates[0].is_a?(FalseClass)
      return dup
    end
  end

  _filter(
    predicates: predicates,
    constraints: constraints,
    invert: true
  )
end
#rename(mapping, strict: true) ⇒ LazyFrame
Rename column names.
# File 'lib/polars/lazy_frame.rb', line 3077
def rename(mapping, strict: true)
  if mapping.respond_to?(:call)
    select(F.all.name.map(&mapping))
  else
    existing = mapping.keys
    _new = mapping.values
    _from_rbldf(_ldf.rename(existing, _new, strict))
  end
end
#reverse ⇒ LazyFrame
Reverse the DataFrame.
# File 'lib/polars/lazy_frame.rb', line 3110
def reverse
  _from_rbldf(_ldf.reverse)
end
#rolling(index_column:, period:, offset: nil, closed: "right", by: nil) ⇒ LazyGroupBy Also known as: group_by_rolling, groupby_rolling
Create rolling groups based on a time column.
Also works for index values of type :i32 or :i64.
Unlike group_by_dynamic, the windows are determined by the individual values and are not of constant intervals. For constant intervals use group_by_dynamic.
The period and offset arguments are created either from a timedelta, or by using the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_rolling on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
# File 'lib/polars/lazy_frame.rb', line 1842 def rolling( index_column:, period:, offset: nil, closed: "right", by: nil ) index_column = Utils.parse_into_expression(index_column) if offset.nil? offset = Utils.negate_duration_string(Utils.parse_as_duration_string(period)) end rbexprs_by = ( !by.nil? ? Utils.parse_into_list_of_expressions(by) : [] ) period = Utils.parse_as_duration_string(period) offset = Utils.parse_as_duration_string(offset) lgb = _ldf.rolling(index_column, period, offset, closed, rbexprs_by) LazyGroupBy.new(lgb) end |
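To make the duration string language concrete, here is a small plain-Ruby sketch that splits a string such as "3d12h4m25s" into (count, unit) pairs. It is not the parser Polars uses; parse_duration and the constants are hypothetical, and negative durations are deliberately not handled.

```ruby
# Units from the list above; longer tokens come first so "ms" and "mo" win over "m".
DURATION_UNITS = %w[ns us ms mo s m h d w y i].freeze
DURATION_PART = /(\d+)(#{DURATION_UNITS.join("|")})/

# Split a duration string into [count, unit] pairs, rejecting anything else.
def parse_duration(str)
  unless str.match?(/\A(?:#{DURATION_PART.source})+\z/)
    raise ArgumentError, "invalid duration: #{str.inspect}"
  end
  str.scan(DURATION_PART).map { |count, unit| [count.to_i, unit] }
end

puts parse_duration("3d12h4m25s").inspect
puts parse_duration("10i").inspect
```

As the method body above shows, when offset is nil it defaults to the negated period, so each window ends at the row's own index value under the default closed: "right".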
#schema ⇒ Hash
Get the schema.
# File 'lib/polars/lazy_frame.rb', line 140 def schema _ldf.collect_schema end |
#select(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this DataFrame.
# File 'lib/polars/lazy_frame.rb', line 1686 def select(*exprs, **named_exprs) structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0" rbexprs = Utils.parse_into_list_of_expressions( *exprs, **named_exprs, __structify: structify ) _from_rbldf(_ldf.select(rbexprs)) end |
#select_seq(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this LazyFrame.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
# File 'lib/polars/lazy_frame.rb', line 1709 def select_seq(*exprs, **named_exprs) structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", 0).to_i != 0 rbexprs = Utils.parse_into_list_of_expressions( *exprs, **named_exprs, __structify: structify ) _from_rbldf(_ldf.select_seq(rbexprs)) end |
#serialize(file = nil) ⇒ Object
Serialization is not stable across Polars versions: a LazyFrame serialized in one Polars version may not be deserializable in another Polars version.
Serialize the logical plan of this LazyFrame to a file or string.
# File 'lib/polars/lazy_frame.rb', line 218 def serialize(file = nil) raise Todo unless _ldf.respond_to?(:serialize_binary) serializer = _ldf.method(:serialize_binary) Utils.serialize_polars_object(serializer, file) end |
#set_sorted(column, descending: false) ⇒ LazyFrame
This can lead to incorrect results if the data is NOT sorted! Use with care!
Flag a column as sorted.
This can speed up future operations.
# File 'lib/polars/lazy_frame.rb', line 4252 def set_sorted( column, descending: false ) if !Utils.strlike?(column) msg = "expected a 'str' for argument 'column' in 'set_sorted'" raise TypeError, msg end with_columns(F.col(column).set_sorted(descending: descending)) end |
#shift(n, fill_value: nil) ⇒ LazyFrame
Shift the values by a given period.
# File 'lib/polars/lazy_frame.rb', line 3156 def shift(n, fill_value: nil) if !fill_value.nil? fill_value = Utils.parse_into_expression(fill_value, str_as_lit: true) end n = Utils.parse_into_expression(n) _from_rbldf(_ldf.shift(n, fill_value)) end |
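The effect of shifting with and without fill_value can be sketched on a plain Ruby array, with nil standing in for null. This illustrates the intended semantics only and is not how Polars implements it; shift_values is a hypothetical helper.

```ruby
# Shift values by n positions; vacated slots are filled with fill_value (nil by default).
def shift_values(values, n, fill_value: nil)
  return values.dup if n.zero?

  # Never pad with more slots than there are values.
  pad = Array.new([n.abs, values.length].min, fill_value)
  kept = values.length - pad.length
  if n > 0
    pad + values.first(kept)   # positive n moves values toward the end
  else
    values.last(kept) + pad    # negative n moves values toward the start
  end
end

puts shift_values([1, 2, 3, 4], 1).inspect
puts shift_values([1, 2, 3, 4], -2, fill_value: 0).inspect
```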
#shift_and_fill(periods, fill_value) ⇒ LazyFrame
Shift the values by a given period and fill the resulting null values.
# File 'lib/polars/lazy_frame.rb', line 3206 def shift_and_fill(periods, fill_value) shift(periods, fill_value: fill_value) end |
#sink_csv(path, include_bom: false, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, decimal_comma: false, null_value: nil, quote_style: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false, storage_options: nil, retries: 2, sync_on_close: nil, mkdir: false, lazy: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to a CSV file.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 1039 def sink_csv( path, include_bom: false, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, decimal_comma: false, null_value: nil, quote_style: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false, storage_options: nil, retries: 2, sync_on_close: nil, mkdir: false, lazy: false ) Utils._check_arg_is_1byte("separator", separator, false) Utils._check_arg_is_1byte("quote_char", quote_char, false) lf = _set_sink_optimizations( type_coercion: type_coercion, predicate_pushdown: predicate_pushdown, projection_pushdown: projection_pushdown, simplify_expression: simplify_expression, slice_pushdown: slice_pushdown, no_optimization: no_optimization ) if storage_options&.any? storage_options = storage_options.to_a else storage_options = nil end sink_options = { "sync_on_close" => sync_on_close || "none", "maintain_order" => maintain_order, "mkdir" => mkdir } lf = lf.sink_csv( path, include_bom, include_header, separator.ord, line_terminator, quote_char.ord, batch_size, datetime_format, date_format, time_format, float_scientific, float_precision, decimal_comma, null_value, quote_style, storage_options, retries, sink_options ) lf = LazyFrame._from_rbldf(lf) if !lazy lf.collect return nil end lf end |
#sink_ipc(path, compression: "zstd", maintain_order: true, storage_options: nil, retries: 2, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false, sync_on_close: nil, mkdir: false, lazy: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an IPC file.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 891 def sink_ipc( path, compression: "zstd", maintain_order: true, storage_options: nil, retries: 2, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false, sync_on_close: nil, mkdir: false, lazy: false ) lf = _set_sink_optimizations( type_coercion: type_coercion, predicate_pushdown: predicate_pushdown, projection_pushdown: projection_pushdown, simplify_expression: simplify_expression, slice_pushdown: slice_pushdown, no_optimization: no_optimization ) if storage_options&.any? storage_options = storage_options.to_a else storage_options = nil end sink_options = { "sync_on_close" => sync_on_close || "none", "maintain_order" => maintain_order, "mkdir" => mkdir } lf = lf.sink_ipc( path, compression, storage_options, retries, sink_options ) lf = LazyFrame._from_rbldf(lf) if !lazy lf.collect return nil end lf end |
#sink_ndjson(path, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false, storage_options: nil, retries: 2, sync_on_close: nil, mkdir: false, lazy: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an NDJSON file.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 1173 def sink_ndjson( path, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false, storage_options: nil, retries: 2, sync_on_close: nil, mkdir: false, lazy: false ) lf = _set_sink_optimizations( type_coercion: type_coercion, predicate_pushdown: predicate_pushdown, projection_pushdown: projection_pushdown, simplify_expression: simplify_expression, slice_pushdown: slice_pushdown, no_optimization: no_optimization ) if storage_options&.any? storage_options = storage_options.to_a else storage_options = nil end sink_options = { "sync_on_close" => sync_on_close || "none", "maintain_order" => maintain_order, "mkdir" => mkdir } lf = lf.sink_json(path, storage_options, retries, sink_options) lf = LazyFrame._from_rbldf(lf) if !lazy lf.collect return nil end lf end |
#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_pagesize_limit: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, no_optimization: false, slice_pushdown: true, storage_options: nil, retries: 2, sync_on_close: nil, mkdir: false, lazy: false) ⇒ DataFrame
Persists a LazyFrame at the provided path.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 757 def sink_parquet( path, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_pagesize_limit: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, no_optimization: false, slice_pushdown: true, storage_options: nil, retries: 2, sync_on_close: nil, mkdir: false, lazy: false ) lf = _set_sink_optimizations( type_coercion: type_coercion, predicate_pushdown: predicate_pushdown, projection_pushdown: projection_pushdown, simplify_expression: simplify_expression, slice_pushdown: slice_pushdown, no_optimization: no_optimization ) if statistics == true statistics = { min: true, max: true, distinct_count: false, null_count: true } elsif statistics == false statistics = {} elsif statistics == "full" statistics = { min: true, max: true, distinct_count: true, null_count: true } end if storage_options&.any? storage_options = storage_options.to_a else storage_options = nil end sink_options = { "sync_on_close" => sync_on_close || "none", "maintain_order" => maintain_order, "mkdir" => mkdir } lf = lf.sink_parquet( path, compression, compression_level, statistics, row_group_size, data_pagesize_limit, storage_options, retries, sink_options ) lf = LazyFrame._from_rbldf(lf) if !lazy lf.collect return nil end lf end |
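The statistics argument of sink_parquet accepts true, false, "full", or an explicit hash, and the method body normalizes the first three into a hash. That normalization can be isolated as a plain-Ruby sketch; normalize_statistics is a hypothetical helper name, while the hash contents are copied from the method body above.

```ruby
# Expand the shorthand values of `statistics` into the explicit hash form.
def normalize_statistics(statistics)
  case statistics
  when true
    { min: true, max: true, distinct_count: false, null_count: true }
  when false
    {}
  when "full"
    { min: true, max: true, distinct_count: true, null_count: true }
  else
    statistics # already an explicit hash; passed through unchanged
  end
end

puts normalize_statistics(true).inspect
puts normalize_statistics("full").inspect
```

The only difference between true and "full" is distinct_count, which is left off by default.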
#slice(offset, length = nil) ⇒ LazyFrame
Get a slice of this DataFrame.
# File 'lib/polars/lazy_frame.rb', line 3239 def slice(offset, length = nil) if length && length < 0 raise ArgumentError, "Negative slice lengths (#{length}) are invalid for LazyFrame" end _from_rbldf(_ldf.slice(offset, length)) end |
#sort(by, *more_by, reverse: false, nulls_last: false, maintain_order: false, multithreaded: true) ⇒ LazyFrame
Sort the DataFrame.
Sorting can be done by:
- A single column name
- An expression
- Multiple expressions
# File 'lib/polars/lazy_frame.rb', line 343 def sort(by, *more_by, reverse: false, nulls_last: false, maintain_order: false, multithreaded: true) if by.is_a?(::String) && more_by.empty? return _from_rbldf( _ldf.sort( by, reverse, nulls_last, maintain_order, multithreaded ) ) end by = Utils.parse_into_list_of_expressions(by, *more_by) reverse = Utils.extend_bool(reverse, by.length, "reverse", "by") nulls_last = Utils.extend_bool(nulls_last, by.length, "nulls_last", "by") _from_rbldf( _ldf.sort_by_exprs( by, reverse, nulls_last, maintain_order, multithreaded ) ) end |
#sql(query, table_name: "self") ⇒ Expr
This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.
- The calling frame is automatically registered as a table in the SQL context under the name "self". If you want access to the DataFrames and LazyFrames found in the current globals, use the top-level Polars.sql.
- More control over registration and execution behaviour is available by using the SQLContext object.
Execute a SQL query against the LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 422 def sql(query, table_name: "self") ctx = Polars::SQLContext.new name = table_name || "self" ctx.register(name, self) ctx.execute(query) end |
#std(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their standard deviation value.
# File 'lib/polars/lazy_frame.rb', line 3641 def std(ddof: 1) _from_rbldf(_ldf.std(ddof)) end |
#sum ⇒ LazyFrame
Aggregate the columns in the DataFrame to their sum value.
# File 'lib/polars/lazy_frame.rb', line 3733 def sum _from_rbldf(_ldf.sum) end |
#tail(n = 5) ⇒ LazyFrame
Get the last n rows.
# File 'lib/polars/lazy_frame.rb', line 3389 def tail(n = 5) _from_rbldf(_ldf.tail(n)) end |
#to_s ⇒ String
Returns a string representing the LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 171 def to_s <<~EOS naive plan: (run LazyFrame#describe_optimized_plan to see the optimized plan) #{describe_plan} EOS end |
#top_k(k, by:, reverse: false) ⇒ LazyFrame
Return the k largest rows.
Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort after this function if you wish the output to be sorted.
# File 'lib/polars/lazy_frame.rb', line 483 def top_k( k, by:, reverse: false ) by = Utils.parse_into_list_of_expressions(by) reverse = Utils.extend_bool(reverse, by.length, "reverse", "by") _from_rbldf(_ldf.top_k(k, by, reverse)) end |
#unique(maintain_order: true, subset: nil, keep: "first") ⇒ LazyFrame
Drop duplicate rows from this DataFrame.
Note that this fails if there is a column of type List in the DataFrame or subset.
# File 'lib/polars/lazy_frame.rb', line 3925 def unique(maintain_order: true, subset: nil, keep: "first") selector_subset = nil if !subset.nil? selector_subset = Utils.parse_list_into_selector(subset)._rbselector end _from_rbldf(_ldf.unique(maintain_order, selector_subset, keep)) end |
#unnest(columns, *more_columns) ⇒ LazyFrame
Decompose a struct into its fields.
The fields will be inserted into the DataFrame at the location of the struct type.
# File 'lib/polars/lazy_frame.rb', line 4190 def unnest(columns, *more_columns) subset = Utils.parse_list_into_selector(columns) | Utils.parse_list_into_selector( more_columns ) _from_rbldf(_ldf.unnest(subset._rbselector)) end |
#unpivot(on, index: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame Also known as: melt
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are "unpivoted" to the row axis leaving just two non-identifier columns, 'variable' and 'value'.
# File 'lib/polars/lazy_frame.rb', line 4080 def unpivot( on, index: nil, variable_name: nil, value_name: nil, streamable: true ) if !streamable warn "The `streamable` parameter for `LazyFrame.unpivot` is deprecated" end selector_on = on.nil? ? Selectors.empty : Utils.parse_list_into_selector(on) selector_index = index.nil? ? Selectors.empty : Utils.parse_list_into_selector(index) _from_rbldf( _ldf.unpivot( selector_on._rbselector, selector_index._rbselector, value_name, variable_name ) ) end |
#update(other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left") ⇒ LazyFrame
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
This is syntactic sugar for a left/inner join that preserves the order of the left DataFrame by default, with an optional coalesce when include_nulls: false.
Update the values in this LazyFrame with the values in other.
# File 'lib/polars/lazy_frame.rb', line 4373 def update( other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left" ) Utils.require_same_type(self, other) if ["outer", "outer_coalesce"].include?(how) how = "full" end if !["left", "inner", "full"].include?(how) msg = "`how` must be one of {{'left', 'inner', 'full'}}; found #{how.inspect}" raise ArgumentError, msg end slf = self row_index_used = false if on.nil? if left_on.nil? && right_on.nil? # no keys provided--use row index row_index_used = true row_index_name = "__POLARS_ROW_INDEX" slf = slf.with_row_index(name: row_index_name) other = other.with_row_index(name: row_index_name) left_on = right_on = [row_index_name] else # one of left or right is missing, raise error if left_on.nil? msg = "missing join columns for left frame" raise ArgumentError, msg end if right_on.nil? msg = "missing join columns for right frame" raise ArgumentError, msg end end else # move on into left/right_on to simplify logic left_on = right_on = on end if left_on.is_a?(::String) left_on = [left_on] end if right_on.is_a?(::String) right_on = [right_on] end left_schema = slf.collect_schema left_on.each do |name| if !left_schema.include?(name) msg = "left join column #{name.inspect} not found" raise ArgumentError, msg end end right_schema = other.collect_schema right_on.each do |name| if !right_schema.include?(name) msg = "right join column #{name.inspect} not found" raise ArgumentError, msg end end # no need to join if *only* join columns are in other (inner/left update only) if how != "full" && right_schema.length == right_on.length if row_index_used return slf.drop(row_index_name) end return slf end # only use non-idx right columns present in left frame right_other = Set.new(right_schema.to_h.keys).intersection(left_schema.to_h.keys) - Set.new(right_on) # When include_nulls is True, we need to distinguish records after the join that # were originally null in the right frame, as opposed to records 
that were null # because the key was missing from the right frame. # Add a validity column to track whether row was matched or not. if include_nulls validity = ["__POLARS_VALIDITY"] other = other.with_columns(F.lit(true).alias(validity[0])) else validity = [] end tmp_name = "__POLARS_RIGHT" drop_columns = right_other.map { |name| "#{name}#{tmp_name}" } + validity result = ( slf.join( other.select(*right_on, *right_other, *validity), left_on: left_on, right_on: right_on, how: how, suffix: tmp_name, coalesce: true, maintain_order: maintain_order ) .with_columns( right_other.map do |name| ( if include_nulls # use left value only when right value failed to join F.when(F.col(validity).is_null) .then(F.col(name)) .otherwise(F.col("#{name}#{tmp_name}")) else F.coalesce(["#{name}#{tmp_name}", F.col(name)]) end ).alias(name) end ) .drop(drop_columns) ) if row_index_used result = result.drop(row_index_name) end _from_rbldf(result._ldf) end |
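The per-cell rule that update applies after the join, as read from the when/then/otherwise and coalesce branches above, can be sketched in plain Ruby with nil standing in for null. updated_value is a hypothetical helper, and matched models whether the validity column was set for that row.

```ruby
# Decide the resulting value of one cell after joining left and right.
# matched: true when the join key was found in the right frame.
def updated_value(left, right, matched:, include_nulls:)
  if include_nulls
    # Keep the left value only when the row failed to join;
    # a matched right value wins even if it is nil.
    matched ? right : left
  else
    # Coalesce: the right value wins unless it is nil.
    right.nil? ? left : right
  end
end

puts updated_value(1, nil, matched: true, include_nulls: false).inspect
puts updated_value(1, nil, matched: true, include_nulls: true).inspect
```

This is why the validity column is needed: without it, a null that came from the right frame could not be told apart from a null produced by a missing key.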
#var(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their variance value.
# File 'lib/polars/lazy_frame.rb', line 3673 def var(ddof: 1) _from_rbldf(_ldf.var(ddof)) end |
#width ⇒ Integer
Get the width of the LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 152 def width _ldf.collect_schema.length end |
#with_column(column) ⇒ LazyFrame
Add or overwrite a column in a DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2955 def with_column(column) with_columns([column]) end |
#with_columns(*exprs, **named_exprs) ⇒ LazyFrame
Add or overwrite multiple columns in a DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2842 def with_columns(*exprs, **named_exprs) structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0" rbexprs = Utils.parse_into_list_of_expressions(*exprs, **named_exprs, __structify: structify) _from_rbldf(_ldf.with_columns(rbexprs)) end |
#with_columns_seq(*exprs, **named_exprs) ⇒ LazyFrame
Add columns to this LazyFrame.
Added columns will replace existing columns with the same name.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
# File 'lib/polars/lazy_frame.rb', line 2866 def with_columns_seq( *exprs, **named_exprs ) structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", 0).to_i != 0 rbexprs = Utils.parse_into_list_of_expressions( *exprs, **named_exprs, __structify: structify ) _from_rbldf(_ldf.with_columns_seq(rbexprs)) end |
#with_context(other) ⇒ LazyFrame
Add an external context to the computation graph.
This allows expressions to also access columns from DataFrames that are not part of this one.
# File 'lib/polars/lazy_frame.rb', line 2907 def with_context(other) if !other.is_a?(::Array) other = [other] end _from_rbldf(_ldf.with_context(other.map(&:_ldf))) end |
#with_row_index(name: "index", offset: 0) ⇒ LazyFrame Also known as: with_row_count
This can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.
Add a column at index 0 that counts the rows.
# File 'lib/polars/lazy_frame.rb', line 3475 def with_row_index(name: "index", offset: 0) _from_rbldf(_ldf.with_row_index(name, offset)) end |
#write_json(file) ⇒ nil
Write the logical plan of this LazyFrame to a file or string in JSON format.
# File 'lib/polars/lazy_frame.rb', line 185 def write_json(file) if Utils.pathlike?(file) file = Utils.normalize_filepath(file) end _ldf.write_json(file) nil end |