Class: Polars::LazyFrame
- Inherits: Object
- Defined in: lib/polars/lazy_frame.rb
Overview
Representation of a Lazy computation graph/query against a DataFrame.
Class Method Summary collapse
-
.read_json(file) ⇒ LazyFrame
Read a logical plan from a JSON file to construct a LazyFrame.
Instance Method Summary collapse
-
#cache ⇒ LazyFrame
Cache the result once the execution of the physical plan hits this node.
-
#clear(n = 0) ⇒ LazyFrame
(also: #cleared)
Create an empty copy of the current LazyFrame.
-
#collect(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false, _eager: false) ⇒ DataFrame
Collect into a DataFrame.
-
#columns ⇒ Array
Get column names.
-
#describe_optimized_plan(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ String
Create a string representation of the optimized query plan.
-
#describe_plan ⇒ String
Create a string representation of the unoptimized query plan.
-
#drop(columns) ⇒ LazyFrame
Remove one or multiple columns from a DataFrame.
-
#drop_nulls(subset: nil) ⇒ LazyFrame
Drop rows with null values from this LazyFrame.
-
#dtypes ⇒ Array
Get dtypes of columns in LazyFrame.
-
#explode(columns) ⇒ LazyFrame
Explode lists to long format.
-
#fetch(n_rows = 500, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ DataFrame
Collect a small number of rows for debugging purposes.
-
#fill_nan(fill_value) ⇒ LazyFrame
Fill floating point NaN values.
-
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil) ⇒ LazyFrame
Fill null values using the specified value or strategy.
-
#filter(predicate) ⇒ LazyFrame
Filter the rows in the DataFrame based on a predicate expression.
-
#first ⇒ LazyFrame
Get the first row of the DataFrame.
-
#group_by(by, maintain_order: false) ⇒ LazyGroupBy
(also: #groupby, #group)
Start a group by operation.
-
#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: nil, include_boundaries: false, closed: "left", label: "left", by: nil, start_by: "window", check_sorted: true) ⇒ LazyGroupBy
(also: #groupby_dynamic)
Group based on a time value (or index value of type :i32, :i64).
-
#head(n = 5) ⇒ LazyFrame
Get the first n rows.
-
#include?(key) ⇒ Boolean
Check if LazyFrame includes key.
-
#initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ LazyFrame
constructor
Create a new LazyFrame.
-
#interpolate ⇒ LazyFrame
Interpolate intermediate values.
-
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", join_nulls: false, allow_parallel: true, force_parallel: false) ⇒ LazyFrame
Add a join operation to the Logical Plan.
-
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false) ⇒ LazyFrame
Perform an asof join.
-
#last ⇒ LazyFrame
Get the last row of the DataFrame.
-
#lazy ⇒ LazyFrame
Return lazy representation, i.e. itself.
-
#limit(n = 5) ⇒ LazyFrame
Get the first n rows.
-
#max ⇒ LazyFrame
Aggregate the columns in the DataFrame to their maximum value.
-
#mean ⇒ LazyFrame
Aggregate the columns in the DataFrame to their mean value.
-
#median ⇒ LazyFrame
Aggregate the columns in the DataFrame to their median value.
-
#melt(id_vars: nil, value_vars: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame
Unpivot a DataFrame from wide to long format.
-
#merge_sorted(other, key) ⇒ LazyFrame
Take two sorted DataFrames and merge them by the sorted key.
-
#min ⇒ LazyFrame
Aggregate the columns in the DataFrame to their minimum value.
-
#pipe(func, *args, **kwargs, &block) ⇒ LazyFrame
Offers a structured way to apply a sequence of user-defined functions (UDFs).
-
#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame
Aggregate the columns in the DataFrame to their quantile value.
-
#rename(mapping) ⇒ LazyFrame
Rename column names.
-
#reverse ⇒ LazyFrame
Reverse the DataFrame.
-
#rolling(index_column:, period:, offset: nil, closed: "right", by: nil, check_sorted: true) ⇒ LazyFrame
(also: #group_by_rolling, #groupby_rolling)
Create rolling groups based on a time column.
-
#schema ⇒ Hash
Get the schema.
-
#select(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this DataFrame.
-
#set_sorted(column, *more_columns, descending: false) ⇒ LazyFrame
Indicate that one or multiple columns are sorted.
-
#shift(n, fill_value: nil) ⇒ LazyFrame
Shift the values by a given period.
-
#shift_and_fill(periods, fill_value) ⇒ LazyFrame
Shift the values by a given period and fill the resulting null values.
-
#sink_csv(path, include_bom: false, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_precision: nil, null_value: nil, quote_style: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to a CSV file.
-
#sink_ipc(path, compression: "zstd", maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an IPC file.
-
#sink_ndjson(path, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an NDJSON file.
-
#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: false, row_group_size: nil, data_pagesize_limit: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, no_optimization: false, slice_pushdown: true) ⇒ DataFrame
Evaluate the query in streaming mode and write to a Parquet file.
-
#slice(offset, length = nil) ⇒ LazyFrame
Get a slice of this DataFrame.
-
#sort(by, reverse: false, nulls_last: false, maintain_order: false) ⇒ LazyFrame
Sort the DataFrame.
-
#std(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their standard deviation value.
-
#sum ⇒ LazyFrame
Aggregate the columns in the DataFrame to their sum value.
-
#tail(n = 5) ⇒ LazyFrame
Get the last n rows.
-
#take_every(n) ⇒ LazyFrame
Take every nth row in the LazyFrame and return as a new LazyFrame.
-
#to_s ⇒ String
Returns a string representing the LazyFrame.
-
#unique(maintain_order: true, subset: nil, keep: "first") ⇒ LazyFrame
Drop duplicate rows from this DataFrame.
-
#unnest(names) ⇒ LazyFrame
Decompose a struct into its fields.
-
#var(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their variance value.
-
#width ⇒ Integer
Get the width of the LazyFrame.
-
#with_column(column) ⇒ LazyFrame
Add or overwrite column in a DataFrame.
-
#with_columns(*exprs, **named_exprs) ⇒ LazyFrame
Add or overwrite multiple columns in a DataFrame.
-
#with_context(other) ⇒ LazyFrame
Add an external context to the computation graph.
-
#with_row_index(name: "row_nr", offset: 0) ⇒ LazyFrame
(also: #with_row_count)
Add a column at index 0 that counts the rows.
-
#write_json(file) ⇒ nil
Write the logical plan of this LazyFrame to a file or string in JSON format.
Constructor Details
#initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ LazyFrame
Create a new LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 8
def initialize(data = nil, schema: nil, schema_overrides: nil, orient: nil, infer_schema_length: 100, nan_to_null: false)
  self._ldf = (
    DataFrame.new(
      data,
      schema: schema,
      schema_overrides: schema_overrides,
      orient: orient,
      infer_schema_length: infer_schema_length,
      nan_to_null: nan_to_null
    )
    .lazy
    ._ldf
  )
end
Class Method Details
.read_json(file) ⇒ LazyFrame
Read a logical plan from a JSON file to construct a LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 178
def self.read_json(file)
  if Utils.pathlike?(file)
    file = Utils.normalise_filepath(file)
  end
  Utils.wrap_ldf(RbLazyFrame.read_json(file))
end
Instance Method Details
#cache ⇒ LazyFrame
Cache the result once the execution of the physical plan hits this node.
# File 'lib/polars/lazy_frame.rb', line 959
def cache
  _from_rbldf(_ldf.cache)
end
#clear(n = 0) ⇒ LazyFrame Also known as: cleared
Create an empty copy of the current LazyFrame.
The copy has an identical schema but no data.
# File 'lib/polars/lazy_frame.rb', line 1003
def clear(n = 0)
  DataFrame.new(columns: schema).clear(n).lazy
end
#collect(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false, _eager: false) ⇒ DataFrame
Collect into a DataFrame.
Note: use #fetch if you want to run your query on only the first n rows. This can be a huge time saver when debugging queries.
# File 'lib/polars/lazy_frame.rb', line 465
def collect(
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  string_cache: false,
  no_optimization: false,
  slice_pushdown: true,
  common_subplan_elimination: true,
  comm_subexpr_elim: true,
  allow_streaming: false,
  _eager: false
)
  if no_optimization
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
    common_subplan_elimination = false
    comm_subexpr_elim = false
  end

  if allow_streaming
    common_subplan_elimination = false
  end

  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    comm_subexpr_elim,
    allow_streaming,
    _eager
  )
  Utils.wrap_df(ldf.collect)
end
#columns ⇒ Array
Get or set column names.
# File 'lib/polars/lazy_frame.rb', line 204
def columns
  _ldf.columns
end
#describe_optimized_plan(type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ String
Create a string representation of the optimized query plan.
# File 'lib/polars/lazy_frame.rb', line 338
def describe_optimized_plan(
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  common_subplan_elimination: true,
  comm_subexpr_elim: true,
  allow_streaming: false
)
  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    comm_subexpr_elim,
    allow_streaming,
    false
  )
  ldf.describe_optimized_plan
end
#describe_plan ⇒ String
Create a string representation of the unoptimized query plan.
# File 'lib/polars/lazy_frame.rb', line 331
def describe_plan
  _ldf.describe_plan
end
#drop(columns) ⇒ LazyFrame
Remove one or multiple columns from a DataFrame.
# File 'lib/polars/lazy_frame.rb', line 1999
def drop(columns)
  if columns.is_a?(::String)
    columns = [columns]
  end
  _from_rbldf(_ldf.drop(columns))
end
#drop_nulls(subset: nil) ⇒ LazyFrame
Drop rows with null values from this LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 2586
def drop_nulls(subset: nil)
  if !subset.nil? && !subset.is_a?(::Array)
    subset = [subset]
  end
  _from_rbldf(_ldf.drop_nulls(subset))
end
#dtypes ⇒ Array
Get dtypes of columns in LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 222
def dtypes
  _ldf.dtypes
end
#explode(columns) ⇒ LazyFrame
Explode lists to long format.
# File 'lib/polars/lazy_frame.rb', line 2534
def explode(columns)
  columns = Utils.selection_to_rbexpr_list(columns)
  _from_rbldf(_ldf.explode(columns))
end
#fetch(n_rows = 500, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, string_cache: false, no_optimization: false, slice_pushdown: true, common_subplan_elimination: true, comm_subexpr_elim: true, allow_streaming: false) ⇒ DataFrame
Collect a small number of rows for debugging purposes.
Fetch is like a #collect operation, but it overwrites the number of rows read by every scan operation. This is a utility that helps debug a query on a smaller number of rows.
Note that fetch does not guarantee the final number of rows in the DataFrame. Filters, join operations, and fewer rows being available in the scanned file all influence the final number of rows.
# File 'lib/polars/lazy_frame.rb', line 902
def fetch(
  n_rows = 500,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  string_cache: false,
  no_optimization: false,
  slice_pushdown: true,
  common_subplan_elimination: true,
  comm_subexpr_elim: true,
  allow_streaming: false
)
  if no_optimization
    predicate_pushdown = false
    projection_pushdown = false
    slice_pushdown = false
    common_subplan_elimination = false
  end

  ldf = _ldf.optimization_toggle(
    type_coercion,
    predicate_pushdown,
    projection_pushdown,
    simplify_expression,
    slice_pushdown,
    common_subplan_elimination,
    comm_subexpr_elim,
    allow_streaming,
    false
  )
  Utils.wrap_df(ldf.fetch(n_rows))
end
#fill_nan(fill_value) ⇒ LazyFrame
Note that floating point NaN (Not a Number) values are not missing values! To replace missing values, use fill_null instead.
Fill floating point NaN values.
# File 'lib/polars/lazy_frame.rb', line 2309
def fill_nan(fill_value)
  if !fill_value.is_a?(Expr)
    fill_value = Utils.lit(fill_value)
  end
  _from_rbldf(_ldf.fill_nan(fill_value._rbexpr))
end
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil) ⇒ LazyFrame
Fill null values using the specified value or strategy.
# File 'lib/polars/lazy_frame.rb', line 2274
def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: nil)
  select(Polars.all.fill_null(value, strategy: strategy, limit: limit))
end
#filter(predicate) ⇒ LazyFrame
Filter the rows in the DataFrame based on a predicate expression.
# File 'lib/polars/lazy_frame.rb', line 1046
def filter(predicate)
  _from_rbldf(
    _ldf.filter(
      Utils.expr_to_lit_or_expr(predicate, str_to_lit: false)._rbexpr
    )
  )
end
#first ⇒ LazyFrame
Get the first row of the DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2209
def first
  slice(0, 1)
end
#group_by(by, maintain_order: false) ⇒ LazyGroupBy Also known as: groupby, group
Start a group by operation.
# File 'lib/polars/lazy_frame.rb', line 1181
def group_by(by, maintain_order: false)
  rbexprs_by = Utils.selection_to_rbexpr_list(by)
  lgb = _ldf.group_by(rbexprs_by, maintain_order)
  LazyGroupBy.new(lgb)
end
#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: nil, include_boundaries: false, closed: "left", label: "left", by: nil, start_by: "window", check_sorted: true) ⇒ LazyGroupBy Also known as: groupby_dynamic
Group based on a time value (or index value of type :i32, :i64).
Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.
A window is defined by:
- every: interval of the window
- period: length of the window
- offset: offset of the window
The every, period and offset arguments are created with the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_dynamic on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
# File 'lib/polars/lazy_frame.rb', line 1536
def group_by_dynamic(
  index_column,
  every:,
  period: nil,
  offset: nil,
  truncate: nil,
  include_boundaries: false,
  closed: "left",
  label: "left",
  by: nil,
  start_by: "window",
  check_sorted: true
)
  if !truncate.nil?
    label = truncate ? "left" : "datapoint"
  end

  index_column = Utils.expr_to_lit_or_expr(index_column, str_to_lit: false)
  if offset.nil?
    offset = period.nil? ? "-#{every}" : "0ns"
  end

  if period.nil?
    period = every
  end

  period = Utils._timedelta_to_pl_duration(period)
  offset = Utils._timedelta_to_pl_duration(offset)
  every = Utils._timedelta_to_pl_duration(every)

  rbexprs_by = by.nil? ? [] : Utils.selection_to_rbexpr_list(by)
  lgb = _ldf.group_by_dynamic(
    index_column._rbexpr,
    every,
    period,
    offset,
    label,
    include_boundaries,
    closed,
    rbexprs_by,
    start_by,
    check_sorted
  )
  LazyGroupBy.new(lgb)
end
#head(n = 5) ⇒ LazyFrame
# File 'lib/polars/lazy_frame.rb', line 2185
def head(n = 5)
  slice(0, n)
end
#include?(key) ⇒ Boolean
Check if LazyFrame includes key.
# File 'lib/polars/lazy_frame.rb', line 259
def include?(key)
  columns.include?(key)
end
#interpolate ⇒ LazyFrame
Interpolate intermediate values. The interpolation method is linear.
# File 'lib/polars/lazy_frame.rb', line 2687
def interpolate
  select(Utils.col("*").interpolate)
end
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", join_nulls: false, allow_parallel: true, force_parallel: false) ⇒ LazyFrame
Add a join operation to the Logical Plan.
# File 'lib/polars/lazy_frame.rb', line 1822
def join(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  how: "inner",
  suffix: "_right",
  join_nulls: false,
  allow_parallel: true,
  force_parallel: false
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if how == "cross"
    return _from_rbldf(
      _ldf.join(
        other._ldf, [], [], allow_parallel, force_parallel, how, suffix
      )
    )
  end

  if !on.nil?
    rbexprs = Utils.selection_to_rbexpr_list(on)
    rbexprs_left = rbexprs
    rbexprs_right = rbexprs
  elsif !left_on.nil? && !right_on.nil?
    rbexprs_left = Utils.selection_to_rbexpr_list(left_on)
    rbexprs_right = Utils.selection_to_rbexpr_list(right_on)
  else
    raise ArgumentError, "must specify `on` OR `left_on` and `right_on`"
  end

  _from_rbldf(
    self._ldf.join(
      other._ldf,
      rbexprs_left,
      rbexprs_right,
      allow_parallel,
      force_parallel,
      join_nulls,
      how,
      suffix
    )
  )
end
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false) ⇒ LazyFrame
Perform an asof join.
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the join_asof key.
For each row in the left DataFrame:
- A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
- A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.
The default is "backward".
# File 'lib/polars/lazy_frame.rb', line 1645
def join_asof(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  by_left: nil,
  by_right: nil,
  by: nil,
  strategy: "backward",
  suffix: "_right",
  tolerance: nil,
  allow_parallel: true,
  force_parallel: false
)
  if !other.is_a?(LazyFrame)
    raise ArgumentError, "Expected a `LazyFrame` as join table, got #{other.class.name}"
  end

  if on.is_a?(::String)
    left_on = on
    right_on = on
  end

  if left_on.nil? || right_on.nil?
    raise ArgumentError, "You should pass the column to join on as an argument."
  end

  if by_left.is_a?(::String) || by_left.is_a?(Expr)
    by_left_ = [by_left]
  else
    by_left_ = by_left
  end

  if by_right.is_a?(::String) || by_right.is_a?(Expr)
    by_right_ = [by_right]
  else
    by_right_ = by_right
  end

  if by.is_a?(::String)
    by_left_ = [by]
    by_right_ = [by]
  elsif by.is_a?(::Array)
    by_left_ = by
    by_right_ = by
  end

  tolerance_str = nil
  tolerance_num = nil
  if tolerance.is_a?(::String)
    tolerance_str = tolerance
  else
    tolerance_num = tolerance
  end

  _from_rbldf(
    _ldf.join_asof(
      other._ldf,
      Polars.col(left_on)._rbexpr,
      Polars.col(right_on)._rbexpr,
      by_left_,
      by_right_,
      allow_parallel,
      force_parallel,
      suffix,
      strategy,
      tolerance_num,
      tolerance_str
    )
  )
end
#last ⇒ LazyFrame
Get the last row of the DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2202
def last
  tail(1)
end
#lazy ⇒ LazyFrame
Return lazy representation, i.e. itself.
Useful for writing code that expects either a DataFrame or LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 952
def lazy
  self
end
#limit(n = 5) ⇒ LazyFrame
# File 'lib/polars/lazy_frame.rb', line 2170
def limit(n = 5)
  head(n)
end
#max ⇒ LazyFrame
Aggregate the columns in the DataFrame to their maximum value.
# File 'lib/polars/lazy_frame.rb', line 2396
def max
  _from_rbldf(_ldf.max)
end
#mean ⇒ LazyFrame
Aggregate the columns in the DataFrame to their mean value.
# File 'lib/polars/lazy_frame.rb', line 2456
def mean
  _from_rbldf(_ldf.mean)
end
#median ⇒ LazyFrame
Aggregate the columns in the DataFrame to their median value.
# File 'lib/polars/lazy_frame.rb', line 2476
def median
  _from_rbldf(_ldf.median)
end
#melt(id_vars: nil, value_vars: nil, variable_name: nil, value_name: nil, streamable: true) ⇒ LazyFrame
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are "unpivoted" to the row axis, leaving just two non-identifier columns, 'variable' and 'value'.
# File 'lib/polars/lazy_frame.rb', line 2641
def melt(id_vars: nil, value_vars: nil, variable_name: nil, value_name: nil, streamable: true)
  if value_vars.is_a?(::String)
    value_vars = [value_vars]
  end
  if id_vars.is_a?(::String)
    id_vars = [id_vars]
  end
  if value_vars.nil?
    value_vars = []
  end
  if id_vars.nil?
    id_vars = []
  end
  _from_rbldf(
    _ldf.melt(id_vars, value_vars, value_name, variable_name, streamable)
  )
end
#merge_sorted(other, key) ⇒ LazyFrame
Take two sorted DataFrames and merge them by the sorted key.
The output of this operation will also be sorted. It is the caller's responsibility to ensure that the frames are sorted by that key; otherwise the output will not make sense.
The schemas of both LazyFrames must be equal.
# File 'lib/polars/lazy_frame.rb', line 2787
def merge_sorted(other, key)
  _from_rbldf(_ldf.merge_sorted(other._ldf, key))
end
#min ⇒ LazyFrame
Aggregate the columns in the DataFrame to their minimum value.
# File 'lib/polars/lazy_frame.rb', line 2416
def min
  _from_rbldf(_ldf.min)
end
#pipe(func, *args, **kwargs, &block) ⇒ LazyFrame
Offers a structured way to apply a sequence of user-defined functions (UDFs).
# File 'lib/polars/lazy_frame.rb', line 324
def pipe(func, *args, **kwargs, &block)
  func.call(self, *args, **kwargs, &block)
end
#quantile(quantile, interpolation: "nearest") ⇒ LazyFrame
Aggregate the columns in the DataFrame to their quantile value.
# File 'lib/polars/lazy_frame.rb', line 2501
def quantile(quantile, interpolation: "nearest")
  quantile = Utils.expr_to_lit_or_expr(quantile, str_to_lit: false)
  _from_rbldf(_ldf.quantile(quantile._rbexpr, interpolation))
end
#rename(mapping) ⇒ LazyFrame
Rename column names.
# File 'lib/polars/lazy_frame.rb', line 2012
def rename(mapping)
  existing = mapping.keys
  _new = mapping.values
  _from_rbldf(_ldf.rename(existing, _new))
end
#reverse ⇒ LazyFrame
Reverse the DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2021
def reverse
  _from_rbldf(_ldf.reverse)
end
#rolling(index_column:, period:, offset: nil, closed: "right", by: nil, check_sorted: true) ⇒ LazyFrame Also known as: group_by_rolling, groupby_rolling
Create rolling groups based on a time column.
Also works for index values of type :i32 or :i64.
Different from a dynamic_group_by, the windows are determined by the individual values and are not of constant intervals. For constant intervals use group_by_dynamic.
The period and offset arguments are created either from a timedelta, or by using the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_rolling on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
# File 'lib/polars/lazy_frame.rb', line 1279
def rolling(
  index_column:,
  period:,
  offset: nil,
  closed: "right",
  by: nil,
  check_sorted: true
)
  index_column = Utils.parse_as_expression(index_column)
  if offset.nil?
    offset = "-#{period}"
  end

  rbexprs_by = by.nil? ? [] : Utils.selection_to_rbexpr_list(by)
  period = Utils._timedelta_to_pl_duration(period)
  offset = Utils._timedelta_to_pl_duration(offset)

  lgb = _ldf.rolling(
    index_column,
    period,
    offset,
    closed,
    rbexprs_by,
    check_sorted
  )
  LazyGroupBy.new(lgb)
end
#schema ⇒ Hash
Get the schema.
# File 'lib/polars/lazy_frame.rb', line 240
def schema
  _ldf.schema
end
#select(*exprs, **named_exprs) ⇒ LazyFrame
Select columns from this DataFrame.
# File 'lib/polars/lazy_frame.rb', line 1142
def select(*exprs, **named_exprs)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0"

  rbexprs = Utils.parse_as_list_of_expressions(
    *exprs, **named_exprs, __structify: structify
  )
  _from_rbldf(_ldf.select(rbexprs))
end
#set_sorted(column, *more_columns, descending: false) ⇒ LazyFrame
Indicate that one or multiple columns are sorted.
# File 'lib/polars/lazy_frame.rb', line 2801
def set_sorted(
  column,
  *more_columns,
  descending: false
)
  columns = Utils.selection_to_rbexpr_list(column)
  if more_columns.any?
    columns.concat(Utils.selection_to_rbexpr_list(more_columns))
  end
  with_columns(
    columns.map { |e| Utils.wrap_expr(e).set_sorted(descending: descending) }
  )
end
#shift(n, fill_value: nil) ⇒ LazyFrame
Shift the values by a given period.
# File 'lib/polars/lazy_frame.rb', line 2067
def shift(n, fill_value: nil)
  if !fill_value.nil?
    fill_value = Utils.parse_as_expression(fill_value, str_as_lit: true)
  end
  n = Utils.parse_as_expression(n)
  _from_rbldf(_ldf.shift(n, fill_value))
end
#shift_and_fill(periods, fill_value) ⇒ LazyFrame
Shift the values by a given period and fill the resulting null values.
# File 'lib/polars/lazy_frame.rb', line 2117
def shift_and_fill(periods, fill_value)
  shift(periods, fill_value: fill_value)
end
#sink_csv(path, include_bom: false, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_precision: nil, null_value: nil, quote_style: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to a CSV file.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 720
def sink_csv(
  path,
  include_bom: false,
  include_header: true,
  separator: ",",
  line_terminator: "\n",
  quote_char: '"',
  batch_size: 1024,
  datetime_format: nil,
  date_format: nil,
  time_format: nil,
  float_precision: nil,
  null_value: nil,
  quote_style: nil,
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  no_optimization: false
)
  Utils._check_arg_is_1byte("separator", separator, false)
  Utils._check_arg_is_1byte("quote_char", quote_char, false)

  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  lf.sink_csv(
    path,
    include_bom,
    include_header,
    separator.ord,
    line_terminator,
    quote_char.ord,
    batch_size,
    datetime_format,
    date_format,
    time_format,
    float_precision,
    null_value,
    quote_style,
    maintain_order
  )
end
#sink_ipc(path, compression: "zstd", maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an IPC file.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 618
def sink_ipc(
  path,
  compression: "zstd",
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  no_optimization: false
)
  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  lf.sink_ipc(
    path,
    compression,
    maintain_order
  )
end
#sink_ndjson(path, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, slice_pushdown: true, no_optimization: false) ⇒ DataFrame
Evaluate the query in streaming mode and write to an NDJSON file.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 799

def sink_ndjson(
  path,
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  slice_pushdown: true,
  no_optimization: false
)
  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  lf.sink_json(path, maintain_order)
end
#sink_parquet(path, compression: "zstd", compression_level: nil, statistics: false, row_group_size: nil, data_pagesize_limit: nil, maintain_order: true, type_coercion: true, predicate_pushdown: true, projection_pushdown: true, simplify_expression: true, no_optimization: false, slice_pushdown: true) ⇒ DataFrame
Persist a LazyFrame at the provided path.
This allows streaming results that are larger than RAM to be written to disk.
# File 'lib/polars/lazy_frame.rb', line 553

def sink_parquet(
  path,
  compression: "zstd",
  compression_level: nil,
  statistics: false,
  row_group_size: nil,
  data_pagesize_limit: nil,
  maintain_order: true,
  type_coercion: true,
  predicate_pushdown: true,
  projection_pushdown: true,
  simplify_expression: true,
  no_optimization: false,
  slice_pushdown: true
)
  lf = _set_sink_optimizations(
    type_coercion: type_coercion,
    predicate_pushdown: predicate_pushdown,
    projection_pushdown: projection_pushdown,
    simplify_expression: simplify_expression,
    slice_pushdown: slice_pushdown,
    no_optimization: no_optimization
  )

  lf.sink_parquet(
    path,
    compression,
    compression_level,
    statistics,
    row_group_size,
    data_pagesize_limit,
    maintain_order
  )
end
#slice(offset, length = nil) ⇒ LazyFrame
Get a slice of this DataFrame.
# File 'lib/polars/lazy_frame.rb', line 2150

def slice(offset, length = nil)
  if length && length < 0
    raise ArgumentError, "Negative slice lengths (#{length}) are invalid for LazyFrame"
  end
  _from_rbldf(_ldf.slice(offset, length))
end
#sort(by, reverse: false, nulls_last: false, maintain_order: false) ⇒ LazyFrame
Sort the DataFrame.
Sorting can be done by:
- A single column name
- An expression
- Multiple expressions
# File 'lib/polars/lazy_frame.rb', line 403

def sort(by, reverse: false, nulls_last: false, maintain_order: false)
  if by.is_a?(::String)
    return _from_rbldf(_ldf.sort(by, reverse, nulls_last, maintain_order))
  end
  if Utils.bool?(reverse)
    reverse = [reverse]
  end

  by = Utils.selection_to_rbexpr_list(by)
  _from_rbldf(_ldf.sort_by_exprs(by, reverse, nulls_last, maintain_order))
end
#std(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their standard deviation value.
# File 'lib/polars/lazy_frame.rb', line 2344

def std(ddof: 1)
  _from_rbldf(_ldf.std(ddof))
end
#sum ⇒ LazyFrame
Aggregate the columns in the DataFrame to their sum value.
# File 'lib/polars/lazy_frame.rb', line 2436

def sum
  _from_rbldf(_ldf.sum)
end
#tail(n = 5) ⇒ LazyFrame
Get the last n rows.

# File 'lib/polars/lazy_frame.rb', line 2195

def tail(n = 5)
  _from_rbldf(_ldf.tail(n))
end
#take_every(n) ⇒ LazyFrame
Take every nth row in the LazyFrame and return as a new LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 2267

def take_every(n)
  select(Utils.col("*").take_every(n))
end
#to_s ⇒ String
Returns a string representing the LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 271

def to_s
  <<~EOS
    naive plan: (run LazyFrame#describe_optimized_plan to see the optimized plan)

    #{describe_plan}
  EOS
end
#unique(maintain_order: true, subset: nil, keep: "first") ⇒ LazyFrame
Drop duplicate rows from this DataFrame.
Note that this fails if there is a column of type List in the DataFrame or subset.

# File 'lib/polars/lazy_frame.rb', line 2553

def unique(maintain_order: true, subset: nil, keep: "first")
  if !subset.nil? && !subset.is_a?(::Array)
    subset = [subset]
  end
  _from_rbldf(_ldf.unique(maintain_order, subset, keep))
end
#unnest(names) ⇒ LazyFrame
Decompose a struct into its fields.
The fields will be inserted into the DataFrame at the location of the struct type.

# File 'lib/polars/lazy_frame.rb', line 2742

def unnest(names)
  if names.is_a?(::String)
    names = [names]
  end
  _from_rbldf(_ldf.unnest(names))
end
#var(ddof: 1) ⇒ LazyFrame
Aggregate the columns in the DataFrame to their variance value.
# File 'lib/polars/lazy_frame.rb', line 2376

def var(ddof: 1)
  _from_rbldf(_ldf.var(ddof))
end
#width ⇒ Integer
Get the width of the LazyFrame.
# File 'lib/polars/lazy_frame.rb', line 252

def width
  _ldf.width
end
#with_column(column) ⇒ LazyFrame
Add or overwrite column in a DataFrame.
# File 'lib/polars/lazy_frame.rb', line 1988

def with_column(column)
  with_columns([column])
end
#with_columns(*exprs, **named_exprs) ⇒ LazyFrame
Add or overwrite multiple columns in a DataFrame.
# File 'lib/polars/lazy_frame.rb', line 1904

def with_columns(*exprs, **named_exprs)
  structify = ENV.fetch("POLARS_AUTO_STRUCTIFY", "0") != "0"

  rbexprs = Utils.parse_as_list_of_expressions(*exprs, **named_exprs, __structify: structify)

  _from_rbldf(_ldf.with_columns(rbexprs))
end
#with_context(other) ⇒ LazyFrame
Add an external context to the computation graph.
This allows expressions to also access columns from DataFrames that are not part of this one.
# File 'lib/polars/lazy_frame.rb', line 1940

def with_context(other)
  if !other.is_a?(::Array)
    other = [other]
  end

  _from_rbldf(_ldf.with_context(other.map(&:_ldf)))
end
#with_row_index(name: "row_nr", offset: 0) ⇒ LazyFrame Also known as: with_row_count
Add a column at index 0 that counts the rows.
Note: this can have a negative effect on query performance; it may, for instance, block the predicate pushdown optimization.

# File 'lib/polars/lazy_frame.rb', line 2245

def with_row_index(name: "row_nr", offset: 0)
  _from_rbldf(_ldf.with_row_index(name, offset))
end
#write_json(file) ⇒ nil
Write the logical plan of this LazyFrame to a file or string in JSON format.
# File 'lib/polars/lazy_frame.rb', line 285

def write_json(file)
  if Utils.pathlike?(file)
    file = Utils.normalise_filepath(file)
  end
  _ldf.write_json(file)
  nil
end