Class: Polars::DataFrame
Inherits: Object
Defined in: lib/polars/data_frame.rb
Overview
Two-dimensional data structure representing data as a table with rows and columns.
Class Method Summary
-
.deserialize(source) ⇒ DataFrame
Read a serialized DataFrame from a file.
Instance Method Summary
-
#!=(other) ⇒ DataFrame
Not equal.
-
#%(other) ⇒ DataFrame
Returns the modulo.
-
#*(other) ⇒ DataFrame
Performs multiplication.
-
#+(other) ⇒ DataFrame
Performs addition.
-
#-(other) ⇒ DataFrame
Performs subtraction.
-
#/(other) ⇒ DataFrame
Performs division.
-
#<(other) ⇒ DataFrame
Less than.
-
#<=(other) ⇒ DataFrame
Less than or equal.
-
#==(other) ⇒ DataFrame
Equal.
-
#>(other) ⇒ DataFrame
Greater than.
-
#>=(other) ⇒ DataFrame
Greater than or equal.
-
#[](*key) ⇒ Object
Returns subset of the DataFrame.
-
#[]=(*key, value) ⇒ Object
Set item.
-
#bottom_k(k, by:, reverse: false) ⇒ DataFrame
Return the k smallest rows.
-
#cast(dtypes, strict: true) ⇒ DataFrame
Cast DataFrame column(s) to the specified dtype(s).
-
#clear(n = 0) ⇒ DataFrame
Create an empty copy of the current DataFrame.
-
#collect_schema ⇒ Schema
Get an ordered mapping of column names to their data type.
-
#columns ⇒ Array
Get column names.
-
#columns=(columns) ⇒ Object
Change the column names of the DataFrame.
-
#delete(name) ⇒ Series
Drop in place if exists.
-
#describe(percentiles: [0.25, 0.5, 0.75], interpolation: "nearest") ⇒ DataFrame
Summary statistics for a DataFrame.
-
#drop(*columns, strict: true) ⇒ DataFrame
Remove column from DataFrame and return as new.
-
#drop_in_place(name) ⇒ Series
Drop in place.
-
#drop_nans(subset: nil) ⇒ DataFrame
Drop all rows that contain one or more NaN values.
-
#drop_nulls(subset: nil) ⇒ DataFrame
Drop all rows that contain one or more null values.
-
#dtypes ⇒ Array
Get dtypes of columns in DataFrame.
-
#each(&block) ⇒ Object
Returns an enumerator.
-
#each_row(named: true, buffer_size: 500, &block) ⇒ Object
Returns an iterator over the DataFrame of rows of Ruby-native values.
-
#equals(other, null_equal: true) ⇒ Boolean
Check if DataFrame is equal to other.
-
#estimated_size(unit = "b") ⇒ Numeric
Return an estimation of the total (heap) allocated size of the DataFrame.
-
#explode(columns, *more_columns) ⇒ DataFrame
Explode DataFrame to long format by exploding a column with Lists.
-
#extend(other) ⇒ DataFrame
Extend the memory backed by this DataFrame with the values from other.
-
#fill_nan(value) ⇒ DataFrame
Fill floating point NaN values by an Expression evaluation.
-
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ DataFrame
Fill null values using the specified value or strategy.
-
#filter(*predicates, **constraints) ⇒ DataFrame
Filter the rows in the DataFrame based on a predicate expression.
-
#flags ⇒ Hash
Get flags that are set on the columns of this DataFrame.
-
#fold ⇒ Series
Apply a horizontal reduction on a DataFrame.
-
#gather_every(n, offset = 0) ⇒ DataFrame
Take every nth row in the DataFrame and return as a new DataFrame.
-
#get_column(name, default: NO_DEFAULT) ⇒ Series
Get a single column by name.
-
#get_column_index(name) ⇒ Integer
Find the index of a column by name.
-
#get_columns ⇒ Array
Get the DataFrame as a Array of Series.
-
#glimpse(max_items_per_column: 10, max_colname_length: 50, return_type: nil) ⇒ Object
Return a dense preview of the DataFrame.
-
#group_by(by, maintain_order: false, **named_by) ⇒ GroupBy
Start a group by operation.
-
#group_by_dynamic(index_column, every:, period: nil, offset: nil, include_boundaries: false, closed: "left", label: "left", group_by: nil, start_by: "window") ⇒ DataFrame
Group based on a time value (or index value of type Int32, Int64).
-
#hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil) ⇒ Series
Hash and combine the rows in this DataFrame.
-
#head(n = 5) ⇒ DataFrame
Get the first n rows.
-
#height ⇒ Integer
(also: #count, #length, #size)
Get the height of the DataFrame.
-
#hstack(columns, in_place: false) ⇒ DataFrame
Return a new DataFrame grown horizontally by stacking multiple Series to it.
-
#include?(name) ⇒ Boolean
Check if DataFrame includes column.
-
#initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: N_INFER_DEFAULT, nan_to_null: false) ⇒ DataFrame
constructor
Create a new DataFrame.
-
#insert_column(index, column) ⇒ DataFrame
Insert a Series at a certain column index.
-
#interpolate ⇒ DataFrame
Interpolate intermediate values.
-
#is_duplicated ⇒ Series
Get a mask of all duplicated rows in this DataFrame.
-
#is_empty ⇒ Boolean
(also: #empty?)
Check if the dataframe is empty.
-
#is_unique ⇒ Series
Get a mask of all unique rows in this DataFrame.
-
#item(row = nil, column = nil) ⇒ Object
Return the DataFrame as a scalar, or return the element at the given row/column.
-
#iter_columns ⇒ Object
Returns an iterator over the columns of this DataFrame.
-
#iter_rows(named: false, buffer_size: 512, &block) ⇒ Object
Returns an iterator over the DataFrame of rows of Ruby-native values.
-
#iter_slices(n_rows: 10_000) ⇒ Object
Returns a non-copying iterator of slices over the underlying DataFrame.
-
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", nulls_equal: false, coalesce: nil, maintain_order: nil) ⇒ DataFrame
Join in SQL-like fashion.
-
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ DataFrame
Perform an asof join.
-
#join_where(other, *predicates, suffix: "_right") ⇒ DataFrame
Perform a join based on one or multiple (in)equality predicates.
-
#lazy ⇒ LazyFrame
Start a lazy query from this point.
-
#limit(n = 5) ⇒ DataFrame
Get the first n rows.
-
#map_rows(return_dtype: nil, inference_size: 256, &function) ⇒ Object
Apply a custom/user-defined function (UDF) over the rows of the DataFrame.
-
#max ⇒ DataFrame
Aggregate the columns of this DataFrame to their maximum value.
-
#max_horizontal ⇒ Series
Get the maximum value horizontally across columns.
-
#mean ⇒ DataFrame
Aggregate the columns of this DataFrame to their mean value.
-
#mean_horizontal(ignore_nulls: true) ⇒ Series
Take the mean of all values horizontally across columns.
-
#median ⇒ DataFrame
Aggregate the columns of this DataFrame to their median value.
-
#merge_sorted(other, key) ⇒ DataFrame
Take two sorted DataFrames and merge them by the sorted key.
-
#min ⇒ DataFrame
Aggregate the columns of this DataFrame to their minimum value.
-
#min_horizontal ⇒ Series
Get the minimum value horizontally across columns.
-
#n_chunks(strategy: "first") ⇒ Object
Get number of chunks used by the ChunkedArrays of this DataFrame.
-
#n_unique(subset: nil) ⇒ DataFrame
Return the number of unique rows, or the number of unique row-subsets.
-
#null_count ⇒ DataFrame
Create a new DataFrame that shows the null counts per column.
-
#partition_by(by, *more_by, maintain_order: true, include_key: true, as_dict: false) ⇒ Object
Split into multiple DataFrames partitioned by groups.
-
#pipe(function, *args, **kwargs, &block) ⇒ Object
Offers a structured way to apply a sequence of user-defined functions (UDFs).
-
#pivot(on, index: nil, values: nil, aggregate_function: nil, maintain_order: true, sort_columns: false, separator: "_") ⇒ DataFrame
Create a spreadsheet-style pivot table as a DataFrame.
-
#plot(x = nil, y = nil, type: nil, group: nil, stacked: nil) ⇒ Object
Plot data.
-
#product ⇒ DataFrame
Aggregate the columns of this DataFrame to their product values.
-
#quantile(quantile, interpolation: "nearest") ⇒ DataFrame
Aggregate the columns of this DataFrame to their quantile value.
-
#rechunk ⇒ DataFrame
Rechunk the data in this DataFrame to a contiguous allocation; this makes sure all subsequent operations have optimal and predictable performance.
-
#remove(*predicates, **constraints) ⇒ DataFrame
Remove rows, dropping those that match the given predicate expression(s).
-
#rename(mapping, strict: true) ⇒ DataFrame
Rename column names.
-
#replace_column(index, column) ⇒ DataFrame
Replace a column at an index location.
-
#reverse ⇒ DataFrame
Reverse the DataFrame.
-
#rolling(index_column:, period:, offset: nil, closed: "right", group_by: nil) ⇒ RollingGroupBy
Create rolling groups based on a time column.
-
#row(index = nil, by_predicate: nil, named: false) ⇒ Object
Get a row as tuple, either by index or by predicate.
-
#rows(named: false) ⇒ Array
Convert columnar data to rows as Ruby arrays.
-
#rows_by_key(key, named: false, include_key: false, unique: false) ⇒ Hash
Convert columnar data to rows as Ruby arrays in a hash keyed by some column.
-
#sample(n: nil, fraction: nil, with_replacement: false, shuffle: false, seed: nil) ⇒ DataFrame
Sample from this DataFrame.
-
#schema ⇒ Hash
Get the schema.
-
#select(*exprs, **named_exprs) ⇒ DataFrame
Select columns from this DataFrame.
-
#select_seq(*exprs, **named_exprs) ⇒ DataFrame
Select columns from this DataFrame.
-
#serialize(file = nil) ⇒ Object
Serialize this DataFrame to a file or string.
-
#set_sorted(column, descending: false) ⇒ DataFrame
Flag a column as sorted.
-
#shape ⇒ Array
Get the shape of the DataFrame.
-
#shift(n = 1, fill_value: nil) ⇒ DataFrame
Shift values by the given period.
-
#shrink_to_fit(in_place: false) ⇒ DataFrame
Shrink DataFrame memory usage.
-
#slice(offset, length = nil) ⇒ DataFrame
Get a slice of this DataFrame.
-
#sort(by, *more_by, descending: false, nulls_last: false, multithreaded: true, maintain_order: false) ⇒ DataFrame
Sort the dataframe by the given columns.
-
#sort!(by, descending: false, nulls_last: false) ⇒ DataFrame
Sort the DataFrame by column in-place.
-
#sql(query, table_name: "self") ⇒ DataFrame
Execute a SQL query against the DataFrame.
-
#std(ddof: 1) ⇒ DataFrame
Aggregate the columns of this DataFrame to their standard deviation value.
-
#sum ⇒ DataFrame
Aggregate the columns of this DataFrame to their sum value.
-
#sum_horizontal(ignore_nulls: true) ⇒ Series
Sum all values horizontally across columns.
-
#tail(n = 5) ⇒ DataFrame
Get the last n rows.
-
#to_a ⇒ Array
Returns an array representing the DataFrame.
-
#to_csv(**options) ⇒ String
Write to comma-separated values (CSV) string.
-
#to_dummies(columns: nil, separator: "_", drop_first: false, drop_nulls: false) ⇒ DataFrame
Get one hot encoded dummy variables.
-
#to_h(as_series: true) ⇒ Hash
Convert DataFrame to a hash mapping column name to values.
-
#to_hashes ⇒ Array
Convert every row to a hash.
-
#to_numo ⇒ Numo::NArray
Convert DataFrame to a 2D Numo array.
-
#to_s ⇒ String
(also: #inspect)
Returns a string representing the DataFrame.
-
#to_series(index = 0) ⇒ Series
Select column as Series at index location.
-
#to_struct(name = "") ⇒ Series
Convert a DataFrame to a Series of type Struct.
-
#top_k(k, by:, reverse: false) ⇒ DataFrame
Return the k largest rows.
-
#transpose(include_header: false, header_name: "column", column_names: nil) ⇒ DataFrame
Transpose a DataFrame over the diagonal.
-
#unique(maintain_order: false, subset: nil, keep: "any") ⇒ DataFrame
Drop duplicate rows from this DataFrame.
-
#unnest(columns, *more_columns, separator: nil) ⇒ DataFrame
Decompose a struct into its fields.
-
#unpivot(on = nil, index: nil, variable_name: nil, value_name: nil) ⇒ DataFrame
Unpivot a DataFrame from wide to long format.
-
#unstack(step:, how: "vertical", columns: nil, fill_values: nil) ⇒ DataFrame
Unstack a long table to a wide form without doing an aggregation.
-
#update(other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left") ⇒ DataFrame
Update the values in this DataFrame with the values in other.
-
#upsample(time_column:, every:, group_by: nil, maintain_order: false) ⇒ DataFrame
Upsample a DataFrame at a regular frequency.
-
#var(ddof: 1) ⇒ DataFrame
Aggregate the columns of this DataFrame to their variance value.
-
#vstack(other, in_place: false) ⇒ DataFrame
Grow this DataFrame vertically by stacking a DataFrame to it.
-
#width ⇒ Integer
Get the width of the DataFrame.
-
#with_columns(*exprs, **named_exprs) ⇒ DataFrame
Add columns to this DataFrame.
-
#with_columns_seq(*exprs, **named_exprs) ⇒ DataFrame
Add columns to this DataFrame.
-
#with_row_index(name: "index", offset: 0) ⇒ DataFrame
Add a column at index 0 that counts the rows.
-
#write_avro(file, compression = "uncompressed", name: "") ⇒ nil
Write to Apache Avro file.
-
#write_csv(file = nil, include_bom: false, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, decimal_comma: false, null_value: nil, quote_style: nil, storage_options: nil, credential_provider: "auto", retries: 2) ⇒ String?
Write to comma-separated values (CSV) file.
-
#write_database(table_name, connection = nil, if_table_exists: "fail") ⇒ Integer
Write the data in a Polars DataFrame to a database.
-
#write_delta(target, mode: "error", storage_options: nil, delta_write_options: nil, delta_merge_options: nil) ⇒ nil
Write DataFrame as delta table.
-
#write_iceberg(target, mode:) ⇒ nil
Write DataFrame to an Iceberg table.
-
#write_ipc(file, compression: "uncompressed", compat_level: nil, storage_options: nil, credential_provider: "auto", retries: 2) ⇒ nil
Write to Arrow IPC binary stream or Feather file.
-
#write_ipc_stream(file, compression: "uncompressed", compat_level: nil) ⇒ Object
Write to Arrow IPC record batch stream.
-
#write_json(file = nil) ⇒ nil
Serialize to JSON representation.
-
#write_ndjson(file = nil) ⇒ nil
Serialize to newline delimited JSON representation.
-
#write_parquet(file, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_page_size: nil, partition_by: nil, partition_chunk_size_bytes: 4_294_967_296, storage_options: nil, credential_provider: "auto", retries: 2, metadata: nil, mkdir: false) ⇒ nil
Write to Apache Parquet file.
Constructor Details
#initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: N_INFER_DEFAULT, nan_to_null: false) ⇒ DataFrame
Create a new DataFrame.
# File 'lib/polars/data_frame.rb', line 48

def initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: N_INFER_DEFAULT, nan_to_null: false)
  if defined?(ActiveRecord) && (data.is_a?(ActiveRecord::Relation) || data.is_a?(ActiveRecord::Result))
    raise ArgumentError, "Use read_database instead"
  end

  if data.nil?
    self._df = Utils.hash_to_rbdf({}, schema: schema, schema_overrides: schema_overrides)
  elsif data.is_a?(Hash)
    data = data.transform_keys { |v| v.is_a?(Symbol) ? v.to_s : v }
    self._df = Utils.hash_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict, nan_to_null: nan_to_null)
  elsif data.is_a?(::Array)
    self._df = Utils.sequence_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict, orient: orient, infer_schema_length: infer_schema_length)
  elsif data.is_a?(Series)
    self._df = Utils.series_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict)
  elsif data.respond_to?(:arrow_c_stream)
    # This uses the fact that RbSeries.from_arrow_c_stream will create a
    # struct-typed Series. Then we unpack that to a DataFrame.
    tmp_col_name = ""
    s = Utils.wrap_s(RbSeries.from_arrow_c_stream(data))
    self._df = s.to_frame(tmp_col_name).unnest(tmp_col_name)._df
  else
    raise ArgumentError, "DataFrame constructor called with unsupported type; got #{data.class.name}"
  end
end
Class Method Details
.deserialize(source) ⇒ DataFrame
Serialization is not stable across Polars versions: a DataFrame serialized in one Polars version may not be deserializable in another Polars version.
Read a serialized DataFrame from a file.
# File 'lib/polars/data_frame.rb', line 100

def self.deserialize(source)
  if Utils.pathlike?(source)
    source = Utils.normalize_filepath(source)
  end

  deserializer = RbDataFrame.method(:deserialize_binary)

  _from_rbdf(deserializer.(source))
end
Instance Method Details
#!=(other) ⇒ DataFrame
Not equal.
# File 'lib/polars/data_frame.rb', line 299

def !=(other)
  _comp(other, "neq")
end
#%(other) ⇒ DataFrame
Returns the modulo.
# File 'lib/polars/data_frame.rb', line 382

def %(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.rem_df(other._df))
  end
  other = _prepare_other_arg(other)
  _from_rbdf(_df.rem(other._s))
end
#*(other) ⇒ DataFrame
Performs multiplication.
# File 'lib/polars/data_frame.rb', line 334

def *(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.mul_df(other._df))
  end
  other = _prepare_other_arg(other)
  _from_rbdf(_df.mul(other._s))
end
#+(other) ⇒ DataFrame
Performs addition.
# File 'lib/polars/data_frame.rb', line 358

def +(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.add_df(other._df))
  end
  other = _prepare_other_arg(other)
  _from_rbdf(_df.add(other._s))
end
#-(other) ⇒ DataFrame
Performs subtraction.
# File 'lib/polars/data_frame.rb', line 370

def -(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.sub_df(other._df))
  end
  other = _prepare_other_arg(other)
  _from_rbdf(_df.sub(other._s))
end
#/(other) ⇒ DataFrame
Performs division.
# File 'lib/polars/data_frame.rb', line 346

def /(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.div_df(other._df))
  end
  other = _prepare_other_arg(other)
  _from_rbdf(_df.div(other._s))
end
#<(other) ⇒ DataFrame
Less than.
# File 'lib/polars/data_frame.rb', line 313

def <(other)
  _comp(other, "lt")
end
#<=(other) ⇒ DataFrame
Less than or equal.
# File 'lib/polars/data_frame.rb', line 327

def <=(other)
  _comp(other, "lt_eq")
end
#==(other) ⇒ DataFrame
Equal.
# File 'lib/polars/data_frame.rb', line 292

def ==(other)
  _comp(other, "eq")
end
#>(other) ⇒ DataFrame
Greater than.
# File 'lib/polars/data_frame.rb', line 306

def >(other)
  _comp(other, "gt")
end
#>=(other) ⇒ DataFrame
Greater than or equal.
# File 'lib/polars/data_frame.rb', line 320

def >=(other)
  _comp(other, "gt_eq")
end
#[](*key) ⇒ Object
Returns subset of the DataFrame.
# File 'lib/polars/data_frame.rb', line 540

def [](*key)
  get_df_item_by_key(self, key)
end
#[]=(*key, value) ⇒ Object
Set item.
# File 'lib/polars/data_frame.rb', line 593

def []=(*key, value)
  if key.empty? || key.length > 2
    raise ArgumentError, "wrong number of arguments (given #{key.length + 1}, expected 2..3)"
  end

  if key.length == 1 && Utils.strlike?(key[0])
    key = key[0]
    if value.is_a?(::Array) || (defined?(Numo::NArray) && value.is_a?(Numo::NArray))
      value = Series.new(value)
    elsif !value.is_a?(Series)
      value = Polars.lit(value)
    end
    self._df = with_columns(value.alias(key.to_s))._df
  # df[["C", "D"]]
  elsif key.length == 1 && key[0].is_a?(::Array)
    key = key[0]
    if !value.is_a?(::Array) || !value.all? { |v| v.is_a?(::Array) }
      msg = "can only set multiple columns with 2D matrix"
      raise ArgumentError, msg
    end
    if value.any? { |v| v.size != key.length }
      msg = "matrix columns should be equal to list used to determine column names"
      raise ArgumentError, msg
    end

    columns = []
    key.each_with_index do |name, i|
      columns << Series.new(name, value.map { |v| v[i] })
    end
    self._df = with_columns(columns)._df
  # df[a, b]
  else
    row_selection, col_selection = key

    if (row_selection.is_a?(Series) && row_selection.dtype == Boolean) || Utils.is_bool_sequence(row_selection)
      msg = (
        "not allowed to set DataFrame by boolean mask in the row position" +
        "\n\nConsider using `DataFrame.with_columns`."
      )
      raise TypeError, msg
    end

    # get series column selection
    if Utils.strlike?(col_selection)
      s = self[col_selection]
    elsif col_selection.is_a?(Integer)
      s = self[0.., col_selection]
    else
      msg = "unexpected column selection #{col_selection.inspect}"
      raise TypeError, msg
    end

    # dispatch to []= of Series to do modification
    s[row_selection] = value

    # now find the location to place series
    # df[idx]
    if col_selection.is_a?(Integer)
      replace_column(col_selection, s)
    # df["foo"]
    elsif Utils.strlike?(col_selection)
      _replace(col_selection.to_s, s)
    end
  end
end
#bottom_k(k, by:, reverse: false) ⇒ DataFrame
Return the k smallest rows.
Non-null elements are always preferred over null elements, regardless of
the value of reverse. The output is not guaranteed to be in any
particular order; call sort after this function if you wish the
output to be sorted.
# File 'lib/polars/data_frame.rb', line 2404

def bottom_k(
  k,
  by:,
  reverse: false
)
  lazy
    .bottom_k(k, by: by, reverse: reverse)
    .collect(
      optimizations: QueryOptFlags.new(
        projection_pushdown: false,
        predicate_pushdown: false,
        comm_subplan_elim: false,
        slice_pushdown: true
      )
    )
end
#cast(dtypes, strict: true) ⇒ DataFrame
Cast DataFrame column(s) to the specified dtype(s).
# File 'lib/polars/data_frame.rb', line 4022

def cast(dtypes, strict: true)
  lazy.cast(dtypes, strict: strict).collect(optimizations: QueryOptFlags._eager)
end
#clear(n = 0) ⇒ DataFrame
Create an empty copy of the current DataFrame.
Returns a DataFrame with identical schema but no data.
# File 'lib/polars/data_frame.rb', line 4062

def clear(n = 0)
  if n == 0
    _from_rbdf(_df.clear)
  elsif n > 0 || len > 0
    self.class.new(
      schema.to_h { |nm, tp| [nm, Series.new(nm, [], dtype: tp).extend_constant(nil, n)] }
    )
  else
    clone
  end
end
#collect_schema ⇒ Schema
This method is included to facilitate writing code that is generic for both DataFrame and LazyFrame.
Get an ordered mapping of column names to their data type.
# File 'lib/polars/data_frame.rb', line 703

def collect_schema
  Schema.new(columns.zip(dtypes), check_dtypes: false)
end
#columns ⇒ Array
Get column names.
# File 'lib/polars/data_frame.rb', line 209

def columns
  _df.columns
end
#columns=(columns) ⇒ Object
Change the column names of the DataFrame.
# File 'lib/polars/data_frame.rb', line 242

def columns=(columns)
  _df.set_column_names(columns)
end
#delete(name) ⇒ Series
Drop a column in place if it exists.
# File 'lib/polars/data_frame.rb', line 3969

def delete(name)
  drop_in_place(name) if include?(name)
end
#describe(percentiles: [0.25, 0.5, 0.75], interpolation: "nearest") ⇒ DataFrame
Summary statistics for a DataFrame.
# File 'lib/polars/data_frame.rb', line 2044

def describe(
  percentiles: [0.25, 0.5, 0.75],
  interpolation: "nearest"
)
  if columns.empty?
    msg = "cannot describe a DataFrame that has no columns"
    raise TypeError, msg
  end

  lazy.describe(
    percentiles: percentiles, interpolation: interpolation
  )
end
#drop(*columns, strict: true) ⇒ DataFrame
Remove column from DataFrame and return as new.
# File 'lib/polars/data_frame.rb', line 3909

def drop(*columns, strict: true)
  lazy.drop(*columns, strict: strict).collect(optimizations: QueryOptFlags._eager)
end
#drop_in_place(name) ⇒ Series
Drop in place.
# File 'lib/polars/data_frame.rb', line 3937

def drop_in_place(name)
  Utils.wrap_s(_df.drop_in_place(name))
end
#drop_nans(subset: nil) ⇒ DataFrame
Drop all rows that contain one or more NaN values.
The original order of the remaining rows is preserved.
# File 'lib/polars/data_frame.rb', line 2623

def drop_nans(subset: nil)
  lazy.drop_nans(subset: subset).collect(optimizations: QueryOptFlags._eager)
end
#drop_nulls(subset: nil) ⇒ DataFrame
Drop all rows that contain one or more null values.
The original order of the remaining rows is preserved.
# File 'lib/polars/data_frame.rb', line 2668

def drop_nulls(subset: nil)
  lazy.drop_nulls(subset: subset).collect(optimizations: QueryOptFlags._eager)
end
#dtypes ⇒ Array
Get dtypes of columns in DataFrame. Dtypes can also be found in column headers when printing the DataFrame.
# File 'lib/polars/data_frame.rb', line 260

def dtypes
  _df.dtypes
end
#each(&block) ⇒ Object
Returns an enumerator.
# File 'lib/polars/data_frame.rb', line 416

def each(&block)
  get_columns.each(&block)
end
#each_row(named: true, buffer_size: 500, &block) ⇒ Object
Returns an iterator over the DataFrame of rows of Ruby-native values.
# File 'lib/polars/data_frame.rb', line 6062

def each_row(named: true, buffer_size: 500, &block)
  iter_rows(named: named, buffer_size: buffer_size, &block)
end
#equals(other, null_equal: true) ⇒ Boolean
Check if DataFrame is equal to other.
# File 'lib/polars/data_frame.rb', line 2449

def equals(other, null_equal: true)
  _df.equals(other._df, null_equal)
end
#estimated_size(unit = "b") ⇒ Numeric
Return an estimation of the total (heap) allocated size of the DataFrame.
Estimated size is given in the specified unit (bytes by default).
This estimation is the sum of the size of its buffers and validity bitmaps, including nested arrays. Multiple arrays may share buffers and bitmaps, so the size of two arrays is not necessarily the sum of the sizes computed by this function. In particular, a StructArray's size is an upper bound.
When an array is sliced, its allocated size remains constant because the buffers are unchanged. However, this function will yield a smaller number, because it returns the visible size of the buffer, not its total capacity.
FFI buffers are included in this estimation.
# File 'lib/polars/data_frame.rb', line 1537

def estimated_size(unit = "b")
  sz = _df.estimated_size
  Utils.scale_bytes(sz, to: unit)
end
#explode(columns, *more_columns) ⇒ DataFrame
Explode DataFrame to long format by exploding a column with Lists.
# File 'lib/polars/data_frame.rb', line 4329

def explode(columns, *more_columns)
  lazy.explode(columns, *more_columns).collect(optimizations: QueryOptFlags._eager)
end
#extend(other) ⇒ DataFrame
Extend the memory backed by this DataFrame with the values from other.
Different from vstack, which adds the chunks from other to the chunks of this
DataFrame, extend appends the data from other to the underlying memory
locations and thus may cause a reallocation.
If this does not cause a reallocation, the resulting data structure will not have any extra chunks and thus will yield faster queries.
Prefer extend over vstack when you want to run a query after a single append,
for instance during online operations where you add n rows and rerun a query.
Prefer vstack over extend when you want to append many times before running a
query, for instance when you read in multiple files and want to store them in a
single DataFrame. In the latter case, finish the sequence of vstack
operations with a rechunk.
# File 'lib/polars/data_frame.rb', line 3846

def extend(other)
  _df.extend(other._df)
  self
end
#fill_nan(value) ⇒ DataFrame
Note that floating point NaNs (Not a Number) are not missing values!
To replace missing values, use fill_null.
Fill floating point NaN values by an Expression evaluation.
# File 'lib/polars/data_frame.rb', line 4292

def fill_nan(value)
  lazy.fill_nan(value).collect(optimizations: QueryOptFlags._eager)
end
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ DataFrame
Fill null values using the specified value or strategy.
# File 'lib/polars/data_frame.rb', line 4252

def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true)
  _from_rbdf(
    lazy
      .fill_null(value, strategy: strategy, limit: limit, matches_supertype: matches_supertype)
      .collect(optimizations: QueryOptFlags._eager)
      ._df
  )
end
#filter(*predicates, **constraints) ⇒ DataFrame
Filter the rows in the DataFrame based on a predicate expression.
# File 'lib/polars/data_frame.rb', line 1763

def filter(*predicates, **constraints)
  lazy.filter(*predicates, **constraints).collect(optimizations: QueryOptFlags._eager)
end
#flags ⇒ Hash
Get flags that are set on the columns of this DataFrame.
# File 'lib/polars/data_frame.rb', line 267

def flags
  columns.to_h { |name| [name, self[name].flags] }
end
#fold ⇒ Series
Apply a horizontal reduction on a DataFrame.
This can be used to effectively determine aggregations at the row level, and can be applied to any DataType that can be supercast (cast to a similar parent type).
Some examples of the supercast rules when applying an arithmetic operation on two DataTypes:
- i8 + str = str
- f32 + i64 = f32
- f32 + f64 = f64
# File 'lib/polars/data_frame.rb', line 5792

def fold
  acc = to_series(0)
  1.upto(width - 1) do |i|
    acc = yield(acc, to_series(i))
  end
  acc
end
#gather_every(n, offset = 0) ⇒ DataFrame
Take every nth row in the DataFrame and return as a new DataFrame.
# File 'lib/polars/data_frame.rb', line 6183

def gather_every(n, offset = 0)
  select(F.col("*").gather_every(n, offset))
end
#get_column(name, default: NO_DEFAULT) ⇒ Series
Get a single column by name.
# File 'lib/polars/data_frame.rb', line 4166

def get_column(name, default: NO_DEFAULT)
  Utils.wrap_s(_df.get_column(name.to_s))
rescue ColumnNotFoundError
  raise if default.eql?(NO_DEFAULT)
  default
end
#get_column_index(name) ⇒ Integer
Find the index of a column by name.
# File 'lib/polars/data_frame.rb', line 2071

def get_column_index(name)
  _df.get_column_index(name)
end
#get_columns ⇒ Array
Get the DataFrame as a Array of Series.
# File 'lib/polars/data_frame.rb', line 4130

def get_columns
  _df.get_columns.map { |s| Utils.wrap_s(s) }
end
#glimpse(max_items_per_column: 10, max_colname_length: 50, return_type: nil) ⇒ Object
Return a dense preview of the DataFrame.
The formatting shows one line per column so that wide dataframes display cleanly. Each line shows the column name, the data type, and the first few values.
# File 'lib/polars/data_frame.rb', line 1934

def glimpse(
  max_items_per_column: 10,
  max_colname_length: 50,
  return_type: nil
)
  if return_type.nil?
    return_frame = false
  else
    return_frame = return_type == "frame"
    if !return_frame && !["self", "string"].include?(return_type)
      msg = "invalid `return_type`; found #{return_type.inspect}, expected one of 'string', 'frame', 'self', or nil"
      raise ArgumentError, msg
    end
  end

  # always print at most this number of values (mainly ensures that
  # we do not cast long arrays to strings, which would be slow)
  max_n_values = [max_items_per_column, height].min
  schema = self.schema

  _column_to_row_output = lambda do |col_name, dtype|
    fn = schema[col_name] == String ? :inspect : :to_s
    values = self[0...max_n_values, col_name].to_a
    if col_name.length > max_colname_length
      col_name = col_name[0...(max_colname_length - 1)] + "…"
    end
    dtype_str = Plr.dtype_str_repr(dtype)
    if !return_frame
      dtype_str = "<#{dtype_str}>"
    end
    [col_name, dtype_str, values.map { |v| !v.nil? ? v.send(fn) : nil }]
  end

  data = self.schema.map { |s, dtype| _column_to_row_output.(s, dtype) }

  # output one row per column
  if return_frame
    DataFrame.new(
      data,
      orient: "row",
      schema: {"column" => String, "dtype" => String, "values" => List.new(String)}
    )
  else
    raise Todo
  end
end
#group_by(by, maintain_order: false, **named_by) ⇒ GroupBy
Start a group by operation.
# File 'lib/polars/data_frame.rb', line 2778

def group_by(by, maintain_order: false, **named_by)
  named_by.each do |_, value|
    if !(value.is_a?(::String) || value.is_a?(Expr) || value.is_a?(Series))
      msg = "Expected Polars expression or object convertible to one, got #{value.class.name}."
      raise TypeError, msg
    end
  end
  GroupBy.new(
    self,
    by,
    **named_by,
    maintain_order: maintain_order
  )
end
#group_by_dynamic(index_column, every:, period: nil, offset: nil, include_boundaries: false, closed: "left", label: "left", group_by: nil, start_by: "window") ⇒ DataFrame
Group based on a time value (or index value of type Int32, Int64).
Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.
A window is defined by:
- every: interval of the window
- period: length of the window
- offset: offset of the window
The every, period and offset arguments are created with
the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_dynamic on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
# File 'lib/polars/data_frame.rb', line 3138

def group_by_dynamic(
  index_column,
  every:,
  period: nil,
  offset: nil,
  include_boundaries: false,
  closed: "left",
  label: "left",
  group_by: nil,
  start_by: "window"
)
  DynamicGroupBy.new(
    self,
    index_column,
    every,
    period,
    offset,
    include_boundaries,
    closed,
    label,
    group_by,
    start_by
  )
end
#hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil) ⇒ Series
Hash and combine the rows in this DataFrame.
The hash value is of type UInt64.
# File 'lib/polars/data_frame.rb', line 6219

def hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil)
  k0 = seed
  k1 = seed_1.nil? ? seed : seed_1
  k2 = seed_2.nil? ? seed : seed_2
  k3 = seed_3.nil? ? seed : seed_3
  Utils.wrap_s(_df.hash_rows(k0, k1, k2, k3))
end
#head(n = 5) ⇒ DataFrame
Get the first n rows.
# File 'lib/polars/data_frame.rb', line 2546

def head(n = 5)
  _from_rbdf(_df.head(n))
end
#height ⇒ Integer Also known as: count, length, size
Get the height of the DataFrame.
# File 'lib/polars/data_frame.rb', line 176

def height
  _df.height
end
#hstack(columns, in_place: false) ⇒ DataFrame
Return a new DataFrame grown horizontally by stacking multiple Series to it.
# File 'lib/polars/data_frame.rb', line 3748

def hstack(columns, in_place: false)
  if !columns.is_a?(::Array)
    columns = columns.get_columns
  end
  if in_place
    _df.hstack_mut(columns.map(&:_s))
    self
  else
    _from_rbdf(_df.hstack(columns.map(&:_s)))
  end
end
#include?(name) ⇒ Boolean
Check if DataFrame includes column.
# File 'lib/polars/data_frame.rb', line 409

def include?(name)
  columns.include?(name)
end
#insert_column(index, column) ⇒ DataFrame
Insert a Series at a certain column index. This operation is in place.
# File 'lib/polars/data_frame.rb', line 1713

def insert_column(index, column)
  if index < 0
    index = width + index
  end
  _df.insert_column(index, column._s)
  self
end
#interpolate ⇒ DataFrame
Interpolate intermediate values. The interpolation method is linear.
# File 'lib/polars/data_frame.rb', line 6252

def interpolate
  select(F.col("*").interpolate)
end
#is_duplicated ⇒ Series
Get a mask of all duplicated rows in this DataFrame.
# File 'lib/polars/data_frame.rb', line 4775

def is_duplicated
  Utils.wrap_s(_df.is_duplicated)
end
#is_empty ⇒ Boolean Also known as: empty?
Check if the dataframe is empty.
# File 'lib/polars/data_frame.rb', line 6266

def is_empty
  height == 0
end
#is_unique ⇒ Series
Get a mask of all unique rows in this DataFrame.
# File 'lib/polars/data_frame.rb', line 4800

def is_unique
  Utils.wrap_s(_df.is_unique)
end
#item(row = nil, column = nil) ⇒ Object
Return the DataFrame as a scalar, or return the element at the given row/column.
If row/col are not provided, this is equivalent to df[0, 0], with a check that the shape is (1, 1). With row/col, this is equivalent to df[row, col].
# File 'lib/polars/data_frame.rb', line 732

def item(row = nil, column = nil)
  if row.nil? && column.nil?
    if shape != [1, 1]
      msg = (
        "can only call `.item()` if the dataframe is of shape (1, 1)," +
        " or if explicit row/col values are provided;" +
        " frame has shape #{shape.inspect}"
      )
      raise ArgumentError, msg
    end
    return _df.to_series(0).get_index(0)
  elsif row.nil? || column.nil?
    msg = "cannot call `.item()` with only one of `row` or `column`"
    raise ArgumentError, msg
  end

  s =
    if column.is_a?(Integer)
      _df.to_series(column)
    else
      _df.get_column(column)
    end
  s.get_index_signed(row)
end
#iter_columns ⇒ Object
Returns an iterator over the columns of this DataFrame.
Consider whether you can use all instead. If you can, it will be more efficient.
# File 'lib/polars/data_frame.rb', line 6112

def iter_columns
  return to_enum(:iter_columns) unless block_given?

  _df.get_columns.each do |s|
    yield Utils.wrap_s(s)
  end
end
#iter_rows(named: false, buffer_size: 512, &block) ⇒ Object
Returns an iterator over the DataFrame of rows of Ruby-native values.
# File 'lib/polars/data_frame.rb', line 6015

def iter_rows(named: false, buffer_size: 512, &block)
  return to_enum(:iter_rows, named: named, buffer_size: buffer_size) unless block_given?

  # load into the local namespace for a modest performance boost in the hot loops
  columns = self.columns

  # note: buffering rows results in a 2-4x speedup over individual calls
  # to ".row(i)", so it should only be disabled in extremely specific cases.
  if buffer_size
    offset = 0
    while offset < height
      zerocopy_slice = slice(offset, buffer_size)
      rows_chunk = zerocopy_slice.rows(named: false)
      if named
        rows_chunk.each do |row|
          yield columns.zip(row).to_h
        end
      else
        rows_chunk.each(&block)
      end
      offset += buffer_size
    end
  elsif named
    height.times do |i|
      yield columns.zip(row(i)).to_h
    end
  else
    height.times do |i|
      yield row(i)
    end
  end
end
#iter_slices(n_rows: 10_000) ⇒ Object
Returns a non-copying iterator of slices over the underlying DataFrame.
# File 'lib/polars/data_frame.rb', line 6140

def iter_slices(n_rows: 10_000)
  return to_enum(:iter_slices, n_rows: n_rows) unless block_given?

  offset = 0
  while offset < height
    yield slice(offset, n_rows)
    offset += n_rows
  end
end
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", nulls_equal: false, coalesce: nil, maintain_order: nil) ⇒ DataFrame
Join in SQL-like fashion.
# File 'lib/polars/data_frame.rb', line 3526

def join(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  how: "inner",
  suffix: "_right",
  validate: "m:m",
  nulls_equal: false,
  coalesce: nil,
  maintain_order: nil
)
  lazy
    .join(
      other.lazy,
      left_on: left_on,
      right_on: right_on,
      on: on,
      how: how,
      suffix: suffix,
      validate: validate,
      nulls_equal: nulls_equal,
      coalesce: coalesce,
      maintain_order: maintain_order
    )
    .collect(optimizations: QueryOptFlags._eager)
end
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ DataFrame
Perform an asof join.
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the asof_join key.
For each row in the left DataFrame:
- A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
- A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.
The default is "backward".
# File 'lib/polars/data_frame.rb', line 3360

def join_asof(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  by_left: nil,
  by_right: nil,
  by: nil,
  strategy: "backward",
  suffix: "_right",
  tolerance: nil,
  allow_parallel: true,
  force_parallel: false,
  coalesce: true,
  allow_exact_matches: true,
  check_sortedness: true
)
  lazy
    .join_asof(
      other.lazy,
      left_on: left_on,
      right_on: right_on,
      on: on,
      by_left: by_left,
      by_right: by_right,
      by: by,
      strategy: strategy,
      suffix: suffix,
      tolerance: tolerance,
      allow_parallel: allow_parallel,
      force_parallel: force_parallel,
      coalesce: coalesce,
      allow_exact_matches: allow_exact_matches,
      check_sortedness: check_sortedness
    )
    .collect(optimizations: QueryOptFlags._eager)
end
#join_where(other, *predicates, suffix: "_right") ⇒ DataFrame
Perform a join based on one or multiple (in)equality predicates.
This performs an inner join, so only rows where all predicates are true are included in the result, and a row from either DataFrame may be included multiple times in the result.
The row order of the input DataFrames is not preserved.
This functionality is experimental. It may be changed at any point without it being considered a breaking change.
# File 'lib/polars/data_frame.rb', line 3633

def join_where(
  other,
  *predicates,
  suffix: "_right"
)
  Utils.require_same_type(self, other)

  lazy
    .join_where(
      other.lazy,
      *predicates,
      suffix: suffix
    )
    .collect(optimizations: QueryOptFlags._eager)
end
#lazy ⇒ LazyFrame
Start a lazy query from this point.
# File 'lib/polars/data_frame.rb', line 4817

def lazy
  wrap_ldf(_df.lazy)
end
#limit(n = 5) ⇒ DataFrame
Get the first n rows.
Alias for #head.
# File 'lib/polars/data_frame.rb', line 2515

def limit(n = 5)
  head(n)
end
#map_rows(return_dtype: nil, inference_size: 256, &function) ⇒ Object
Apply a custom/user-defined function (UDF) over the rows of the DataFrame.
The UDF will receive each row as an array of values: udf(row).
The frame-level apply cannot track column names (as the UDF is a black-box that may arbitrarily drop, rearrange, transform, or add new columns); if you want to apply a UDF such that column names are preserved, you should use the expression-level apply syntax instead.
Implementing logic using a Ruby function is almost always significantly slower and more memory intensive than implementing the same logic using the native expression API because:
- The native expression engine runs in Rust; UDFs run in Ruby.
- Use of Ruby UDFs forces the DataFrame to be materialized in memory.
- Polars-native expressions can be parallelised (UDFs cannot).
- Polars-native expressions can be logically optimised (UDFs cannot).
Wherever possible you should strongly prefer the native expression API to achieve the best performance.
# File 'lib/polars/data_frame.rb', line 3709

def map_rows(return_dtype: nil, inference_size: 256, &function)
  out, is_df = _df.map_rows(function, return_dtype, inference_size)
  if is_df
    _from_rbdf(out)
  else
    _from_rbdf(Utils.wrap_s(out).to_frame._df)
  end
end
#max ⇒ DataFrame
Aggregate the columns of this DataFrame to their maximum value.
# File 'lib/polars/data_frame.rb', line 5122

def max
  lazy.max.collect(optimizations: QueryOptFlags._eager)
end
#max_horizontal ⇒ Series
Get the maximum value horizontally across columns.
# File 'lib/polars/data_frame.rb', line 5146

def max_horizontal
  select(max: F.max_horizontal(F.all)).to_series
end
#mean ⇒ DataFrame
Aggregate the columns of this DataFrame to their mean value.
# File 'lib/polars/data_frame.rb', line 5278

def mean
  lazy.mean.collect(optimizations: QueryOptFlags._eager)
end
#mean_horizontal(ignore_nulls: true) ⇒ Series
Take the mean of all values horizontally across columns.
# File 'lib/polars/data_frame.rb', line 5306

def mean_horizontal(ignore_nulls: true)
  select(
    mean: F.mean_horizontal(F.all, ignore_nulls: ignore_nulls)
  ).to_series
end
#median ⇒ DataFrame
Aggregate the columns of this DataFrame to their median value.
# File 'lib/polars/data_frame.rb', line 5416

def median
  lazy.median.collect(optimizations: QueryOptFlags._eager)
end
#merge_sorted(other, key) ⇒ DataFrame
Take two sorted DataFrames and merge them by the sorted key.
The output of this operation will also be sorted. It is the caller's responsibility to ensure that the frames are sorted by that key; otherwise the output will not make sense.
The schemas of both DataFrames must be equal.
# File 'lib/polars/data_frame.rb', line 6383

def merge_sorted(other, key)
  lazy.merge_sorted(other.lazy, key).collect(optimizations: QueryOptFlags._eager)
end
#min ⇒ DataFrame
Aggregate the columns of this DataFrame to their minimum value.
# File 'lib/polars/data_frame.rb', line 5172

def min
  lazy.min.collect(optimizations: QueryOptFlags._eager)
end
#min_horizontal ⇒ Series
Get the minimum value horizontally across columns.
# File 'lib/polars/data_frame.rb', line 5196

def min_horizontal
  select(min: F.min_horizontal(F.all)).to_series
end
#n_chunks(strategy: "first") ⇒ Object
Get number of chunks used by the ChunkedArrays of this DataFrame.
# File 'lib/polars/data_frame.rb', line 5090

def n_chunks(strategy: "first")
  if strategy == "first"
    _df.n_chunks
  elsif strategy == "all"
    get_columns.map(&:n_chunks)
  else
    raise ArgumentError, "Strategy: '#{strategy}' not understood. Choose one of 'first', 'all'"
  end
end
#n_unique(subset: nil) ⇒ DataFrame
Return the number of unique rows, or the number of unique row-subsets.
# File 'lib/polars/data_frame.rb', line 5595

def n_unique(subset: nil)
  if subset.is_a?(::String)
    subset = [Polars.col(subset)]
  elsif subset.is_a?(Expr)
    subset = [subset]
  end

  if subset.is_a?(::Array) && subset.length == 1
    expr = Utils.wrap_expr(Utils.parse_into_expression(subset[0], str_as_lit: false))
  else
    struct_fields = subset.nil? ? Polars.all : subset
    expr = Polars.struct(struct_fields)
  end

  df = lazy.select(expr.n_unique).collect
  df.is_empty ? 0 : df.row(0)[0]
end
#null_count ⇒ DataFrame
Create a new DataFrame that shows the null counts per column.
# File 'lib/polars/data_frame.rb', line 5645

def null_count
  _from_rbdf(_df.null_count)
end
#partition_by(by, *more_by, maintain_order: true, include_key: true, as_dict: false) ⇒ Object
Split into multiple DataFrames partitioned by groups.
# File 'lib/polars/data_frame.rb', line 4685

def partition_by(by, *more_by, maintain_order: true, include_key: true, as_dict: false)
  by_parsed = Utils._expand_selectors(self, by, *more_by)

  partitions = _df.partition_by(by_parsed, maintain_order, include_key).map { |df| _from_rbdf(df) }

  if as_dict
    if include_key
      names = partitions.map { |p| p.select(by_parsed).row(0) }
    else
      if !maintain_order
        msg = "cannot use `partition_by` with `maintain_order: false, include_key: false, as_dict: true`"
        raise ArgumentError, msg
      end
      names = select(by_parsed).unique(maintain_order: true).rows
    end

    return names.zip(partitions).to_h
  end

  partitions
end
#pipe(function, *args, **kwargs, &block) ⇒ Object
It is recommended to use LazyFrame when piping operations, in order to fully take advantage of query optimization and parallelization. See #lazy.
Offers a structured way to apply a sequence of user-defined functions (UDFs).
# File 'lib/polars/data_frame.rb', line 2708

def pipe(function, *args, **kwargs, &block)
  function.(self, *args, **kwargs, &block)
end
#pivot(on, index: nil, values: nil, aggregate_function: nil, maintain_order: true, sort_columns: false, separator: "_") ⇒ DataFrame
Create a spreadsheet-style pivot table as a DataFrame.
# File 'lib/polars/data_frame.rb', line 4373

def pivot(
  on,
  index: nil,
  values: nil,
  aggregate_function: nil,
  maintain_order: true,
  sort_columns: false,
  separator: "_"
)
  index = Utils._expand_selectors(self, index)
  on = Utils._expand_selectors(self, on)
  if !values.nil?
    values = Utils._expand_selectors(self, values)
  end

  if aggregate_function.is_a?(::String)
    case aggregate_function
    when "first"
      aggregate_expr = F.element.first._rbexpr
    when "sum"
      aggregate_expr = F.element.sum._rbexpr
    when "max"
      aggregate_expr = F.element.max._rbexpr
    when "min"
      aggregate_expr = F.element.min._rbexpr
    when "mean"
      aggregate_expr = F.element.mean._rbexpr
    when "median"
      aggregate_expr = F.element.median._rbexpr
    when "last"
      aggregate_expr = F.element.last._rbexpr
    when "len"
      aggregate_expr = F.len._rbexpr
    when "count"
      warn "`aggregate_function: \"count\"` input for `pivot` is deprecated. Use `aggregate_function: \"len\"` instead."
      aggregate_expr = F.len._rbexpr
    else
      raise ArgumentError, "Argument aggregate fn: '#{aggregate_function}' was not expected."
    end
  elsif aggregate_function.nil?
    aggregate_expr = nil
  else
    aggregate_expr = aggregate_function._rbexpr
  end

  _from_rbdf(
    _df.pivot_expr(
      on,
      index,
      values,
      maintain_order,
      sort_columns,
      aggregate_expr,
      separator
    )
  )
end
#plot(x = nil, y = nil, type: nil, group: nil, stacked: nil) ⇒ Object
Plot data.
# File 'lib/polars/data_frame.rb', line 120

def plot(x = nil, y = nil, type: nil, group: nil, stacked: nil)
  plot = DataFramePlot.new(self)
  return plot if x.nil? && y.nil?
  raise ArgumentError, "Must specify columns" if x.nil? || y.nil?

  type ||=
    begin
      if self[x].dtype.numeric? && self[y].dtype.numeric?
        "scatter"
      elsif self[x].dtype == String && self[y].dtype.numeric?
        "column"
      elsif (self[x].dtype == Date || self[x].dtype == Datetime) && self[y].dtype.numeric?
        "line"
      else
        raise "Cannot determine type. Use the type option."
      end
    end

  case type
  when "line"
    plot.line(x, y, color: group)
  when "area"
    plot.area(x, y, color: group)
  when "pie"
    raise ArgumentError, "Cannot use group option with pie chart" unless group.nil?
    plot.pie(x, y)
  when "column"
    plot.column(x, y, color: group, stacked: stacked)
  when "bar"
    plot.bar(x, y, color: group, stacked: stacked)
  when "scatter"
    plot.scatter(x, y, color: group)
  else
    raise ArgumentError, "Invalid type: #{type}"
  end
end
#product ⇒ DataFrame
Aggregate the columns of this DataFrame to their product values.
# File 'lib/polars/data_frame.rb', line 5442

def product
  select(Polars.all.product)
end
#quantile(quantile, interpolation: "nearest") ⇒ DataFrame
Aggregate the columns of this DataFrame to their quantile value.
# File 'lib/polars/data_frame.rb', line 5473

def quantile(quantile, interpolation: "nearest")
  lazy.quantile(quantile, interpolation: interpolation).collect(optimizations: QueryOptFlags._eager)
end
#rechunk ⇒ DataFrame
Rechunk the data in this DataFrame to a contiguous allocation. This will make sure all subsequent operations have optimal and predictable performance.
# File 'lib/polars/data_frame.rb', line 5619

def rechunk
  _from_rbdf(_df.rechunk)
end
#remove(*predicates, **constraints) ⇒ DataFrame
Remove rows, dropping those that match the given predicate expression(s).
The original order of the remaining rows is preserved.
Rows where the filter predicate does not evaluate to true are retained
(this includes rows where the predicate evaluates as null).
# File 'lib/polars/data_frame.rb', line 1878

def remove(
  *predicates,
  **constraints
)
  lazy
    .remove(*predicates, **constraints)
    .collect(optimizations: QueryOptFlags._eager)
end
#rename(mapping, strict: true) ⇒ DataFrame
Rename column names.
# File 'lib/polars/data_frame.rb', line 1662

def rename(mapping, strict: true)
  lazy.rename(mapping, strict: strict).collect(optimizations: QueryOptFlags._eager)
end
#replace_column(index, column) ⇒ DataFrame
Replace a column at an index location.
# File 'lib/polars/data_frame.rb', line 2105

def replace_column(index, column)
  if index < 0
    index = width + index
  end
  _df.replace_column(index, column._s)
  self
end
#reverse ⇒ DataFrame
Reverse the DataFrame.
# File 'lib/polars/data_frame.rb', line 1627

def reverse
  select(Polars.col("*").reverse)
end
#rolling(index_column:, period:, offset: nil, closed: "right", group_by: nil) ⇒ RollingGroupBy
Create rolling groups based on a time column.
Unlike group_by_dynamic, the windows are determined by the individual values and are not of constant intervals. For constant intervals use group_by_dynamic.
The period and offset arguments are created by using the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a rolling group by on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
# File 'lib/polars/data_frame.rb', line 2875

def rolling(
  index_column:,
  period:,
  offset: nil,
  closed: "right",
  group_by: nil
)
  RollingGroupBy.new(self, index_column, period, offset, closed, group_by)
end
#row(index = nil, by_predicate: nil, named: false) ⇒ Object
Get a row as an array of values, either by index or by predicate.
The index and by_predicate params are mutually exclusive. Additionally, to ensure clarity, the by_predicate parameter must be supplied by keyword. When using by_predicate it is an error condition if anything other than one row is returned: more than one row raises TooManyRowsReturned, and zero rows raise NoRowsReturned (both inherit from RowsException).
# File 'lib/polars/data_frame.rb', line 5840

def row(index = nil, by_predicate: nil, named: false)
  if !index.nil? && !by_predicate.nil?
    raise ArgumentError, "Cannot set both 'index' and 'by_predicate'; mutually exclusive"
  elsif index.is_a?(Expr)
    raise TypeError, "Expressions should be passed to the 'by_predicate' param"
  end

  if !index.nil?
    row = _df.row_tuple(index)
    if named
      columns.zip(row).to_h
    else
      row
    end
  elsif !by_predicate.nil?
    if !by_predicate.is_a?(Expr)
      raise TypeError, "Expected by_predicate to be an expression; found #{by_predicate.class.name}"
    end
    rows = filter(by_predicate).rows
    n_rows = rows.length
    if n_rows > 1
      raise TooManyRowsReturned, "Predicate #{by_predicate} returned #{n_rows} rows"
    elsif n_rows == 0
      raise NoRowsReturned, "Predicate #{by_predicate} returned no rows"
    end
    row = rows[0]
    if named
      columns.zip(row).to_h
    else
      row
    end
  else
    raise ArgumentError, "One of 'index' or 'by_predicate' must be set"
  end
end
#rows(named: false) ⇒ Array
Convert columnar data to rows as Ruby arrays.
# File 'lib/polars/data_frame.rb', line 5897

def rows(named: false)
  if named
    columns = self.columns
    _df.row_tuples.map do |v|
      columns.zip(v).to_h
    end
  else
    _df.row_tuples
  end
end
#rows_by_key(key, named: false, include_key: false, unique: false) ⇒ Hash
Convert columnar data to rows as Ruby arrays in a hash keyed by some column.
This method is like rows, but instead of returning rows in a flat list, rows
are grouped by the values in the key column(s) and returned as a hash.
Note that this method should not be used in place of native operations, due to the high cost of materializing all frame data out into a hash; it should be used only when you need to move the values out into a Ruby data structure or other object that cannot operate directly with Polars/Arrow.
# File 'lib/polars/data_frame.rb', line 5964

def rows_by_key(key, named: false, include_key: false, unique: false)
  key = Utils._expand_selectors(self, key)

  keys = key.size == 1 ? get_column(key[0]) : select(key).iter_rows

  if include_key
    values = self
  else
    data_cols = schema.names - key
    values = select(data_cols)
  end

  zipped = keys.each.zip(values.iter_rows(named: named))

  # if unique, we expect to write just one entry per key; otherwise, we're
  # returning a list of rows for each key, so append into a hash of arrays.
  if unique
    zipped.to_h
  else
    zipped.each_with_object({}) { |(key, data), h| (h[key] ||= []) << data }
  end
end
#sample(n: nil, fraction: nil, with_replacement: false, shuffle: false, seed: nil) ⇒ DataFrame
Sample from this DataFrame.
# File 'lib/polars/data_frame.rb', line 5685

def sample(
  n: nil,
  fraction: nil,
  with_replacement: false,
  shuffle: false,
  seed: nil
)
  if !n.nil? && !fraction.nil?
    raise ArgumentError, "cannot specify both `n` and `fraction`"
  end

  if n.nil? && !fraction.nil?
    fraction = Series.new("fraction", [fraction]) unless fraction.is_a?(Series)

    return _from_rbdf(
      _df.sample_frac(fraction._s, with_replacement, shuffle, seed)
    )
  end

  if n.nil?
    n = 1
  end

  n = Series.new("", [n]) unless n.is_a?(Series)

  _from_rbdf(_df.sample_n(n._s, with_replacement, shuffle, seed))
end
#schema ⇒ Hash
Get the schema.
# File 'lib/polars/data_frame.rb', line 285

def schema
  Schema.new(columns.zip(dtypes).to_h)
end
#select(*exprs, **named_exprs) ⇒ DataFrame
Select columns from this DataFrame.
# File 'lib/polars/data_frame.rb', line 4909

def select(*exprs, **named_exprs)
  lazy.select(*exprs, **named_exprs).collect(optimizations: QueryOptFlags._eager)
end
#select_seq(*exprs, **named_exprs) ⇒ DataFrame
Select columns from this DataFrame.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
# File 'lib/polars/data_frame.rb', line 4927

def select_seq(*exprs, **named_exprs)
  lazy
    .select_seq(*exprs, **named_exprs)
    .collect(optimizations: QueryOptFlags._eager)
end
#serialize(file = nil) ⇒ Object
Serialization is not stable across Polars versions: a DataFrame serialized in one Polars version may not be deserializable in another Polars version.
Serialize this DataFrame to a file or string.
# File 'lib/polars/data_frame.rb', line 870

def serialize(file = nil)
  serializer = _df.method(:serialize_binary)
  Utils.serialize_polars_object(serializer, file)
end
#set_sorted(column, descending: false) ⇒ DataFrame
This can lead to incorrect results if the data is NOT sorted! Use with care!
Flag a column as sorted.
This can speed up future operations.
# File 'lib/polars/data_frame.rb', line 6400

def set_sorted(
  column,
  descending: false
)
  lazy
    .set_sorted(column, descending: descending)
    .collect(optimizations: QueryOptFlags._eager)
end
#shape ⇒ Array
Get the shape of the DataFrame.
# File 'lib/polars/data_frame.rb', line 164

def shape
  _df.shape
end
#shift(n = 1, fill_value: nil) ⇒ DataFrame
Shift values by the given period.
# File 'lib/polars/data_frame.rb', line 4750

def shift(n = 1, fill_value: nil)
  lazy.shift(n, fill_value: fill_value).collect(optimizations: QueryOptFlags._eager)
end
#shrink_to_fit(in_place: false) ⇒ DataFrame
Shrink DataFrame memory usage.
Shrinks to fit the exact capacity needed to hold the data.
# File 'lib/polars/data_frame.rb', line 6155

def shrink_to_fit(in_place: false)
  if in_place
    _df.shrink_to_fit
    self
  else
    df = clone
    df._df.shrink_to_fit
    df
  end
end
#slice(offset, length = nil) ⇒ DataFrame
Get a slice of this DataFrame.
# File 'lib/polars/data_frame.rb', line 2482

def slice(offset, length = nil)
  if !length.nil? && length < 0
    length = height - offset + length
  end
  _from_rbdf(_df.slice(offset, length))
end
#sort(by, *more_by, descending: false, nulls_last: false, multithreaded: true, maintain_order: false) ⇒ DataFrame
Sort the dataframe by the given columns.
# File 'lib/polars/data_frame.rb', line 2170

def sort(
  by,
  *more_by,
  descending: false,
  nulls_last: false,
  multithreaded: true,
  maintain_order: false
)
  lazy
    .sort(
      by,
      *more_by,
      descending: descending,
      nulls_last: nulls_last,
      multithreaded: multithreaded,
      maintain_order: maintain_order
    )
    .collect(optimizations: QueryOptFlags._eager)
end
#sort!(by, descending: false, nulls_last: false) ⇒ DataFrame
Sort the DataFrame by column in-place.
# File 'lib/polars/data_frame.rb', line 2200

def sort!(by, descending: false, nulls_last: false)
  self._df = sort(by, descending: descending, nulls_last: nulls_last)._df
end
#sql(query, table_name: "self") ⇒ DataFrame
This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.
- The calling frame is automatically registered as a table in the SQL context under the name "self". If you want access to other DataFrames and LazyFrames, use the top-level Polars.sql method.
- More control over registration and execution behaviour is available by using the SQLContext object.
- The SQL query executes in lazy mode before being collected and returned as a DataFrame.
Execute a SQL query against the DataFrame.
# File 'lib/polars/data_frame.rb', line 2272

def sql(query, table_name: "self")
  ctx = SQLContext.new(eager: true)
  name = table_name || "self"
  ctx.register(name, self)
  ctx.execute(query)
end
#std(ddof: 1) ⇒ DataFrame
Aggregate the columns of this DataFrame to their standard deviation value.
# File 'lib/polars/data_frame.rb', line 5349

def std(ddof: 1)
  lazy.std(ddof: ddof).collect(optimizations: QueryOptFlags._eager)
end
#sum ⇒ DataFrame
Aggregate the columns of this DataFrame to their sum value.
# File 'lib/polars/data_frame.rb', line 5222

def sum
  lazy.sum.collect(optimizations: QueryOptFlags._eager)
end
#sum_horizontal(ignore_nulls: true) ⇒ Series
Sum all values horizontally across columns.
# File 'lib/polars/data_frame.rb', line 5250

def sum_horizontal(ignore_nulls: true)
  select(
    sum: F.sum_horizontal(F.all, ignore_nulls: ignore_nulls)
  ).to_series
end
#tail(n = 5) ⇒ DataFrame
Get the last n rows.
# File 'lib/polars/data_frame.rb', line 2577

def tail(n = 5)
  _from_rbdf(_df.tail(n))
end
#to_a ⇒ Array
Returns an array representing the DataFrame.

# File 'lib/polars/data_frame.rb', line 402

def to_a
  rows(named: true)
end
#to_csv(**options) ⇒ String
Write to comma-separated values (CSV) string.
# File 'lib/polars/data_frame.rb', line 1081

def to_csv(**)
  write_csv(**)
end
#to_dummies(columns: nil, separator: "_", drop_first: false, drop_nulls: false) ⇒ DataFrame
Get one hot encoded dummy variables.
# File 'lib/polars/data_frame.rb', line 5510

def to_dummies(columns: nil, separator: "_", drop_first: false, drop_nulls: false)
  if columns.is_a?(::String)
    columns = [columns]
  end
  _from_rbdf(_df.to_dummies(columns, separator, drop_first, drop_nulls))
end
#to_h(as_series: true) ⇒ Hash
Convert DataFrame to a hash mapping column name to values.
# File 'lib/polars/data_frame.rb', line 763

def to_h(as_series: true)
  if as_series
    get_columns.to_h { |s| [s.name, s] }
  else
    get_columns.to_h { |s| [s.name, s.to_a] }
  end
end
#to_hashes ⇒ Array
Convert every row to a hash.
# File 'lib/polars/data_frame.rb', line 780

def to_hashes
  rows(named: true)
end
#to_numo ⇒ Numo::NArray
Convert DataFrame to a 2D Numo array.
This operation clones data.
# File 'lib/polars/data_frame.rb', line 796

def to_numo
  out = _df.to_numo
  if out.nil?
    Numo::NArray.vstack(width.times.map { |i| to_series(i).to_numo }).transpose
  else
    out
  end
end
#to_s ⇒ String
Also known as: inspect
Returns a string representing the DataFrame.
# File 'lib/polars/data_frame.rb', line 394

def to_s
  _df.to_s
end
#to_series(index = 0) ⇒ Series
Select column as Series at index location.
# File 'lib/polars/data_frame.rb', line 831

def to_series(index = 0)
  if index < 0
    index = columns.length + index
  end
  Utils.wrap_s(_df.to_series(index))
end
#to_struct(name = "") ⇒ Series
Convert a DataFrame to a Series of type Struct.
# File 'lib/polars/data_frame.rb', line 6296

def to_struct(name = "")
  Utils.wrap_s(_df.to_struct(name))
end
#top_k(k, by:, reverse: false) ⇒ DataFrame
Return the k largest rows.
Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort after this function if you wish the output to be sorted.
# File 'lib/polars/data_frame.rb', line 2333

def top_k(
  k,
  by:,
  reverse: false
)
  lazy
    .top_k(k, by: by, reverse: reverse)
    .collect(
      optimizations: QueryOptFlags.new(
        projection_pushdown: false,
        predicate_pushdown: false,
        comm_subplan_elim: false,
        slice_pushdown: true
      )
    )
end
#transpose(include_header: false, header_name: "column", column_names: nil) ⇒ DataFrame
This is a very expensive operation. Perhaps you can do it differently.
Transpose a DataFrame over the diagonal.
# File 'lib/polars/data_frame.rb', line 1599

def transpose(include_header: false, header_name: "column", column_names: nil)
  keep_names_as = include_header ? header_name : nil
  _from_rbdf(_df.transpose(keep_names_as, column_names))
end
#unique(maintain_order: false, subset: nil, keep: "any") ⇒ DataFrame
Note that this fails if there is a column of type List in the DataFrame or
subset.
Drop duplicate rows from this DataFrame.
# File 'lib/polars/data_frame.rb', line 5555

def unique(maintain_order: false, subset: nil, keep: "any")
  self._from_rbdf(
    lazy
      .unique(maintain_order: maintain_order, subset: subset, keep: keep)
      .collect(optimizations: QueryOptFlags._eager)
      ._df
  )
end
#unnest(columns, *more_columns, separator: nil) ⇒ DataFrame
Decompose a struct into its fields.
The fields will be inserted into the DataFrame on the location of the
struct type.
# File 'lib/polars/data_frame.rb', line 6337

def unnest(columns, *more_columns, separator: nil)
  lazy.unnest(columns, *more_columns, separator: separator).collect(optimizations: QueryOptFlags._eager)
end
#unpivot(on = nil, index: nil, variable_name: nil, value_name: nil) ⇒ DataFrame
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are "unpivoted" to the row axis leaving just two non-identifier columns, 'variable' and 'value'.
# File 'lib/polars/data_frame.rb', line 4475

def unpivot(on = nil, index: nil, variable_name: nil, value_name: nil)
  on = on.nil? ? [] : Utils._expand_selectors(self, on)
  index = index.nil? ? [] : Utils._expand_selectors(self, index)
  _from_rbdf(_df.unpivot(on, index, value_name, variable_name))
end
#unstack(step:, how: "vertical", columns: nil, fill_values: nil) ⇒ DataFrame
This functionality is experimental and may be subject to changes without it being considered a breaking change.
Unstack a long table to a wide form without doing an aggregation.
This can be much faster than a pivot, because it can skip the grouping phase.
# File 'lib/polars/data_frame.rb', line 4553

def unstack(step:, how: "vertical", columns: nil, fill_values: nil)
  if !columns.nil?
    df = select(columns)
  else
    df = self
  end

  height = df.height
  if how == "vertical"
    n_rows = step
    n_cols = (height / n_rows.to_f).ceil
  else
    n_cols = step
    n_rows = (height / n_cols.to_f).ceil
  end

  n_fill = n_cols * n_rows - height

  if n_fill > 0
    if !fill_values.is_a?(::Array)
      fill_values = [fill_values] * df.width
    end

    df = df.select(
      df.get_columns.zip(fill_values).map do |s, next_fill|
        s.extend_constant(next_fill, n_fill)
      end
    )
  end

  if how == "horizontal"
    df = (
      df.with_columns(
        (Polars.arange(0, n_cols * n_rows, eager: true) % n_cols).alias(
          "__sort_order"
        )
      )
      .sort("__sort_order")
      .drop("__sort_order")
    )
  end

  zfill_val = Math.log10(n_cols).floor + 1
  slices = df.get_columns.flat_map do |s|
    n_cols.times.map do |slice_nbr|
      s.slice(slice_nbr * n_rows, n_rows).alias("%s_%0#{zfill_val}d" % [s.name, slice_nbr])
    end
  end

  _from_rbdf(DataFrame.new(slices)._df)
end
#update(other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left") ⇒ DataFrame
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
This is syntactic sugar for a left/inner join that preserves the order
of the left DataFrame by default, with an optional coalesce when
include_nulls: false.
Update the values in this DataFrame with the values in other.
# File 'lib/polars/data_frame.rb', line 6517

def update(
  other,
  on: nil,
  how: "left",
  left_on: nil,
  right_on: nil,
  include_nulls: false,
  maintain_order: "left"
)
  Utils.require_same_type(self, other)
  lazy
    .update(
      other.lazy,
      on: on,
      how: how,
      left_on: left_on,
      right_on: right_on,
      include_nulls: include_nulls,
      maintain_order: maintain_order
    )
    .collect(optimizations: QueryOptFlags._eager)
end
#upsample(time_column:, every:, group_by: nil, maintain_order: false) ⇒ DataFrame
Upsample a DataFrame at a regular frequency.
The every argument is created with the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
# File 'lib/polars/data_frame.rb', line 3226

def upsample(
  time_column:,
  every:,
  group_by: nil,
  maintain_order: false
)
  if group_by.nil?
    group_by = []
  end
  if group_by.is_a?(::String)
    group_by = [group_by]
  end

  every = Utils.parse_as_duration_string(every)

  _from_rbdf(
    _df.upsample(group_by, time_column, every, maintain_order)
  )
end
#var(ddof: 1) ⇒ DataFrame
Aggregate the columns of this DataFrame to their variance value.
# File 'lib/polars/data_frame.rb', line 5390

def var(ddof: 1)
  lazy.var(ddof: ddof).collect(optimizations: QueryOptFlags._eager)
end
#vstack(other, in_place: false) ⇒ DataFrame
Grow this DataFrame vertically by stacking a DataFrame to it.
# File 'lib/polars/data_frame.rb', line 3797

def vstack(other, in_place: false)
  if in_place
    _df.vstack_mut(other._df)
    self
  else
    _from_rbdf(_df.vstack(other._df))
  end
end
#width ⇒ Integer
Get the width of the DataFrame.
# File 'lib/polars/data_frame.rb', line 191

def width
  _df.width
end
#with_columns(*exprs, **named_exprs) ⇒ DataFrame
Add columns to this DataFrame.
Added columns will replace existing columns with the same name.
# File 'lib/polars/data_frame.rb', line 5041

def with_columns(*exprs, **named_exprs)
  lazy.with_columns(*exprs, **named_exprs).collect(optimizations: QueryOptFlags._eager)
end
#with_columns_seq(*exprs, **named_exprs) ⇒ DataFrame
Add columns to this DataFrame.
Added columns will replace existing columns with the same name.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
# File 'lib/polars/data_frame.rb', line 5061

def with_columns_seq(
  *exprs,
  **named_exprs
)
  lazy
    .with_columns_seq(*exprs, **named_exprs)
    .collect(optimizations: QueryOptFlags._eager)
end
#with_row_index(name: "index", offset: 0) ⇒ DataFrame
Add a column at index 0 that counts the rows.
# File 'lib/polars/data_frame.rb', line 2740

def with_row_index(name: "index", offset: 0)
  _from_rbdf(_df.with_row_index(name, offset))
end
#write_avro(file, compression = "uncompressed", name: "") ⇒ nil
Write to Apache Avro file.
# File 'lib/polars/data_frame.rb', line 1095

def write_avro(file, compression = "uncompressed", name: "")
  if compression.nil?
    compression = "uncompressed"
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  if name.nil?
    name = ""
  end

  _df.write_avro(file, compression, name)
end
#write_csv(file = nil, include_bom: false, include_header: true, separator: ",", line_terminator: "\n", quote_char: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_scientific: nil, float_precision: nil, decimal_comma: false, null_value: nil, quote_style: nil, storage_options: nil, credential_provider: "auto", retries: 2) ⇒ String?
Write to comma-separated values (CSV) file.
# File 'lib/polars/data_frame.rb', line 999

def write_csv(
  file = nil,
  include_bom: false,
  include_header: true,
  separator: ",",
  line_terminator: "\n",
  quote_char: '"',
  batch_size: 1024,
  datetime_format: nil,
  date_format: nil,
  time_format: nil,
  float_scientific: nil,
  float_precision: nil,
  decimal_comma: false,
  null_value: nil,
  quote_style: nil,
  storage_options: nil,
  credential_provider: "auto",
  retries: 2
)
  Utils._check_arg_is_1byte("separator", separator, false)
  Utils._check_arg_is_1byte("quote_char", quote_char, true)
  if null_value == ""
    null_value = nil
  end

  if file.nil?
    buffer = StringIO.new
    buffer.set_encoding(Encoding::BINARY)
    lazy.sink_csv(
      buffer,
      include_bom: include_bom,
      include_header: include_header,
      separator: separator,
      line_terminator: line_terminator,
      quote_char: quote_char,
      batch_size: batch_size,
      datetime_format: datetime_format,
      date_format: date_format,
      time_format: time_format,
      float_scientific: float_scientific,
      float_precision: float_precision,
      decimal_comma: decimal_comma,
      null_value: null_value,
      quote_style: quote_style,
      storage_options: storage_options,
      credential_provider: credential_provider,
      retries: retries
    )
    return buffer.string.force_encoding(Encoding::UTF_8)
  end

  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  lazy.sink_csv(
    file,
    include_bom: include_bom,
    include_header: include_header,
    separator: separator,
    line_terminator: line_terminator,
    quote_char: quote_char,
    batch_size: batch_size,
    datetime_format: datetime_format,
    date_format: date_format,
    time_format: time_format,
    float_scientific: float_scientific,
    float_precision: float_precision,
    decimal_comma: decimal_comma,
    null_value: null_value,
    quote_style: quote_style,
    storage_options: storage_options,
    credential_provider: credential_provider,
    retries: retries
  )
  nil
end
#write_database(table_name, connection = nil, if_table_exists: "fail") ⇒ Integer
This functionality is experimental. It may be changed at any point without it being considered a breaking change.
Write the data in a Polars DataFrame to a database.
# File 'lib/polars/data_frame.rb', line 1300

def write_database(table_name, connection = nil, if_table_exists: "fail")
  if !defined?(ActiveRecord)
    raise Error, "Active Record not available"
  elsif ActiveRecord::VERSION::MAJOR < 7
    raise Error, "Requires Active Record 7+"
  end

  valid_write_modes = ["append", "replace", "fail"]
  if !valid_write_modes.include?(if_table_exists)
    msg = "write_database `if_table_exists` must be one of #{valid_write_modes.inspect}, got #{if_table_exists.inspect}"
    raise ArgumentError, msg
  end

  with_connection(connection) do |connection|
    table_exists = connection.table_exists?(table_name)
    if table_exists && if_table_exists == "fail"
      raise ArgumentError, "Table already exists"
    end
    create_table = !table_exists || if_table_exists == "replace"

    maybe_transaction(connection, create_table) do
      if create_table
        mysql = connection.adapter_name.match?(/mysql|trilogy/i)
        force = if_table_exists == "replace"
        connection.create_table(table_name, id: false, force: force) do |t|
          schema.each do |c, dtype|
            options = {}
            column_type =
              case dtype
              when Binary
                :binary
              when Boolean
                :boolean
              when Date
                :date
              when Datetime
                :datetime
              when Decimal
                if mysql
                  options[:precision] = dtype.precision || 65
                  options[:scale] = dtype.scale || 30
                end
                :decimal
              when Float32
                options[:limit] = 24
                :float
              when Float64
                options[:limit] = 53
                :float
              when Int8
                options[:limit] = 1
                :integer
              when Int16
                options[:limit] = 2
                :integer
              when Int32
                options[:limit] = 4
                :integer
              when Int64
                options[:limit] = 8
                :integer
              when UInt8
                if mysql
                  options[:limit] = 1
                  options[:unsigned] = true
                else
                  options[:limit] = 2
                end
                :integer
              when UInt16
                if mysql
                  options[:limit] = 2
                  options[:unsigned] = true
                else
                  options[:limit] = 4
                end
                :integer
              when UInt32
                if mysql
                  options[:limit] = 4
                  options[:unsigned] = true
                else
                  options[:limit] = 8
                end
                :integer
              when UInt64
                if mysql
                  options[:limit] = 8
                  options[:unsigned] = true
                  :integer
                else
                  options[:precision] = 20
                  options[:scale] = 0
                  :decimal
                end
              when String
                :text
              when Time
                :time
              else
                raise ArgumentError, "column type not supported yet: #{dtype}"
              end
            t.column c, column_type, **options
          end
        end
      end

      quoted_table = connection.quote_table_name(table_name)
      quoted_columns = columns.map { |c| connection.quote_column_name(c) }
      rows = cast({Polars::UInt64 => Polars::String}).rows(named: false).map { |row| "(#{row.map { |v| connection.quote(v) }.join(", ")})" }
      connection.exec_update("INSERT INTO #{quoted_table} (#{quoted_columns.join(", ")}) VALUES #{rows.join(", ")}")
    end
  end
end
#write_delta(target, mode: "error", storage_options: nil, delta_write_options: nil, delta_merge_options: nil) ⇒ nil
Write DataFrame as delta table.
# File 'lib/polars/data_frame.rb', line 1463

def write_delta(
  target,
  mode: "error",
  storage_options: nil,
  delta_write_options: nil,
  delta_merge_options: nil
)
  Polars.send(:_check_if_delta_available)

  if Utils.pathlike?(target)
    target = Polars.send(:_resolve_delta_lake_uri, target.to_s, strict: false)
  end

  data = self

  if mode == "merge"
    if delta_merge_options.nil?
      msg = "You need to pass delta_merge_options with at least a given predicate for `MERGE` to work."
      raise ArgumentError, msg
    end
    if target.is_a?(::String)
      dt = DeltaLake::Table.new(target, storage_options: storage_options)
    else
      dt = target
    end

    predicate = delta_merge_options.delete(:predicate)
    dt.merge(data, predicate, **delta_merge_options)
  else
    delta_write_options ||= {}

    DeltaLake.write(
      target,
      data,
      mode: mode,
      storage_options: storage_options,
      **delta_write_options
    )
  end
end
#write_iceberg(target, mode:) ⇒ nil
This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.
Write DataFrame to an Iceberg table.
# File 'lib/polars/data_frame.rb', line 1430

def write_iceberg(target, mode:)
  require "iceberg"

  table =
    if target.is_a?(Iceberg::Table)
      target
    else
      raise Todo
    end

  data = self

  if mode == "append"
    table.append(data)
  else
    raise Todo
  end
end
#write_ipc(file, compression: "uncompressed", compat_level: nil, storage_options: nil, credential_provider: "auto", retries: 2) ⇒ nil
Write to Arrow IPC binary stream or Feather file.
# File 'lib/polars/data_frame.rb', line 1139

def write_ipc(
  file,
  compression: "uncompressed",
  compat_level: nil,
  storage_options: nil,
  credential_provider: "auto",
  retries: 2
)
  return_bytes = file.nil?
  target = nil
  if file.nil?
    target = StringIO.new
    target.set_encoding(Encoding::BINARY)
  else
    target = file
  end

  lazy.sink_ipc(
    target,
    compression: compression,
    compat_level: compat_level,
    storage_options: storage_options,
    credential_provider: credential_provider,
    retries: retries
  )
  return_bytes ? target.string : nil
end
#write_ipc_stream(file, compression: "uncompressed", compat_level: nil) ⇒ Object
Write to Arrow IPC record batch stream.
See "Streaming format" in https://arrow.apache.org/docs/python/ipc.html.
# File 'lib/polars/data_frame.rb', line 1191

def write_ipc_stream(
  file,
  compression: "uncompressed",
  compat_level: nil
)
  return_bytes = file.nil?
  if return_bytes
    file = StringIO.new
    file.set_encoding(Encoding::BINARY)
  elsif Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  if compat_level.nil?
    compat_level = true
  end

  if compression.nil?
    compression = "uncompressed"
  end

  _df.write_ipc_stream(file, compression, compat_level)
  return_bytes ? file.string : nil
end
#write_json(file = nil) ⇒ nil
Serialize to JSON representation.
# File 'lib/polars/data_frame.rb', line 892

def write_json(file = nil)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  to_string_io = !file.nil? && file.is_a?(StringIO)
  if file.nil? || to_string_io
    buf = StringIO.new
    buf.set_encoding(Encoding::BINARY)
    _df.write_json(buf)
    json_bytes = buf.string

    json_str = json_bytes.force_encoding(Encoding::UTF_8)
    if to_string_io
      file.write(json_str)
    else
      return json_str
    end
  else
    _df.write_json(file)
  end
  nil
end
#write_ndjson(file = nil) ⇒ nil
Serialize to newline delimited JSON representation.
# File 'lib/polars/data_frame.rb', line 931

def write_ndjson(file = nil)
  should_return_buffer = false
  target = nil
  if file.nil?
    target = StringIO.new
    target.set_encoding(Encoding::BINARY)
    should_return_buffer = true
  elsif Utils.pathlike?(file)
    target = Utils.normalize_filepath(file)
  else
    target = file
  end

  lazy.sink_ndjson(
    target
  )

  if should_return_buffer
    return target.string.force_encoding(Encoding::UTF_8)
  end

  nil
end
#write_parquet(file, compression: "zstd", compression_level: nil, statistics: true, row_group_size: nil, data_page_size: nil, partition_by: nil, partition_chunk_size_bytes: 4_294_967_296, storage_options: nil, credential_provider: "auto", retries: 2, metadata: nil, mkdir: false) ⇒ nil
Write to Apache Parquet file.
# File 'lib/polars/data_frame.rb', line 1240

def write_parquet(
  file,
  compression: "zstd",
  compression_level: nil,
  statistics: true,
  row_group_size: nil,
  data_page_size: nil,
  partition_by: nil,
  partition_chunk_size_bytes: 4_294_967_296,
  storage_options: nil,
  credential_provider: "auto",
  retries: 2,
  metadata: nil,
  mkdir: false
)
  if compression.nil?
    compression = "uncompressed"
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  target = file
  if !partition_by.nil?
    raise Todo
  end

  lazy.sink_parquet(
    target,
    compression: compression,
    compression_level: compression_level,
    statistics: statistics,
    row_group_size: row_group_size,
    data_page_size: data_page_size,
    storage_options: storage_options,
    credential_provider: credential_provider,
    retries: retries,
    metadata: metadata,
    mkdir: mkdir
  )
end