Class: Polars::DataFrame
- Inherits: Object
- Includes: Plot
- Defined in: lib/polars/data_frame.rb
Overview
Two-dimensional data structure representing data as a table with rows and columns.
Class Method Summary
-
.deserialize(source) ⇒ DataFrame
Read a serialized DataFrame from a file.
Instance Method Summary
-
#!=(other) ⇒ DataFrame
Not equal.
-
#%(other) ⇒ DataFrame
Returns the modulo.
-
#*(other) ⇒ DataFrame
Performs multiplication.
-
#+(other) ⇒ DataFrame
Performs addition.
-
#-(other) ⇒ DataFrame
Performs subtraction.
-
#/(other) ⇒ DataFrame
Performs division.
-
#<(other) ⇒ DataFrame
Less than.
-
#<=(other) ⇒ DataFrame
Less than or equal.
-
#==(other) ⇒ DataFrame
Equal.
-
#>(other) ⇒ DataFrame
Greater than.
-
#>=(other) ⇒ DataFrame
Greater than or equal.
-
#[](*args) ⇒ Object
Returns a subset of the DataFrame.
-
#[]=(*key, value) ⇒ Object
Set item.
-
#bottom_k(k, by:, reverse: false) ⇒ DataFrame
Return the k smallest rows.
-
#cast(dtypes, strict: true) ⇒ DataFrame
Cast DataFrame column(s) to the specified dtype(s).
-
#clear(n = 0) ⇒ DataFrame
(also: #cleared)
Create an empty copy of the current DataFrame.
-
#collect_schema ⇒ Schema
Get an ordered mapping of column names to their data type.
-
#columns ⇒ Array
Get column names.
-
#columns=(columns) ⇒ Object
Change the column names of the DataFrame.
-
#delete(name) ⇒ Series
Drop a column in place if it exists.
-
#describe ⇒ DataFrame
Summary statistics for a DataFrame.
-
#drop(*columns) ⇒ DataFrame
Remove columns from the DataFrame and return as a new DataFrame.
-
#drop_in_place(name) ⇒ Series
Drop a column in place and return it.
-
#drop_nans(subset: nil) ⇒ DataFrame
Drop all rows that contain one or more NaN values.
-
#drop_nulls(subset: nil) ⇒ DataFrame
Drop all rows that contain one or more null values.
-
#dtypes ⇒ Array
Get dtypes of columns in DataFrame.
-
#each(&block) ⇒ Object
Returns an enumerator.
-
#each_row(named: true, buffer_size: 500, &block) ⇒ Object
Returns an iterator over the rows of the DataFrame as Ruby-native values.
-
#equals(other, null_equal: true) ⇒ Boolean
(also: #frame_equal)
Check if DataFrame is equal to other.
-
#estimated_size(unit = "b") ⇒ Numeric
Return an estimation of the total (heap) allocated size of the DataFrame.
-
#explode(columns) ⇒ DataFrame
Explode DataFrame to long format by exploding a column with Lists.
-
#extend(other) ⇒ DataFrame
Extend the memory backed by this DataFrame with the values from other.
-
#fill_nan(fill_value) ⇒ DataFrame
Fill floating point NaN values by an Expression evaluation.
-
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ DataFrame
Fill null values using the specified value or strategy.
-
#filter(predicate) ⇒ DataFrame
Filter the rows in the DataFrame based on a predicate expression.
-
#flags ⇒ Hash
Get flags that are set on the columns of this DataFrame.
-
#fold ⇒ Series
Apply a horizontal reduction on a DataFrame.
-
#gather_every(n, offset = 0) ⇒ DataFrame
(also: #take_every)
Take every nth row in the DataFrame and return as a new DataFrame.
-
#get_column(name) ⇒ Series
Get a single column as Series by name.
-
#get_column_index(name) ⇒ Integer
(also: #find_idx_by_name)
Find the index of a column by name.
-
#get_columns ⇒ Array
Get the DataFrame as a Array of Series.
-
#group_by(by, maintain_order: false) ⇒ GroupBy
(also: #groupby, #group)
Start a group by operation.
-
#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: true, include_boundaries: false, closed: "left", by: nil, start_by: "window") ⇒ DataFrame
(also: #groupby_dynamic)
Group based on a time value (or index value of type :i32 or :i64).
-
#hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil) ⇒ Series
Hash and combine the rows in this DataFrame.
-
#head(n = 5) ⇒ DataFrame
Get the first n rows.
-
#height ⇒ Integer
(also: #count, #length, #size)
Get the height of the DataFrame.
-
#hstack(columns, in_place: false) ⇒ DataFrame
Return a new DataFrame grown horizontally by stacking multiple Series to it.
-
#include?(name) ⇒ Boolean
Check if DataFrame includes column.
-
#initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ DataFrame
constructor
Create a new DataFrame.
-
#insert_column(index, series) ⇒ DataFrame
(also: #insert_at_idx)
Insert a Series at a certain column index.
-
#interpolate ⇒ DataFrame
Interpolate intermediate values.
-
#is_duplicated ⇒ Series
Get a mask of all duplicated rows in this DataFrame.
-
#is_empty ⇒ Boolean
(also: #empty?)
Check if the dataframe is empty.
-
#is_unique ⇒ Series
Get a mask of all unique rows in this DataFrame.
-
#item ⇒ Object
Return the dataframe as a scalar.
-
#iter_columns ⇒ Object
Returns an iterator over the columns of this DataFrame.
-
#iter_rows(named: false, buffer_size: 500, &block) ⇒ Object
Returns an iterator over the rows of the DataFrame as Ruby-native values.
-
#iter_slices(n_rows: 10_000) ⇒ Object
Returns a non-copying iterator of slices over the underlying DataFrame.
-
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", join_nulls: false, coalesce: nil, maintain_order: nil) ⇒ DataFrame
Join in SQL-like fashion.
-
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ DataFrame
Perform an asof join.
-
#join_where(other, *predicates, suffix: "_right") ⇒ DataFrame
Perform a join based on one or multiple (in)equality predicates.
-
#lazy ⇒ LazyFrame
Start a lazy query from this point.
-
#limit(n = 5) ⇒ DataFrame
Get the first n rows.
-
#map_rows(return_dtype: nil, inference_size: 256, &f) ⇒ Object
(also: #apply)
Apply a custom/user-defined function (UDF) over the rows of the DataFrame.
-
#max ⇒ DataFrame
Aggregate the columns of this DataFrame to their maximum value.
-
#max_horizontal ⇒ Series
Get the maximum value horizontally across columns.
-
#mean ⇒ DataFrame
Aggregate the columns of this DataFrame to their mean value.
-
#mean_horizontal(ignore_nulls: true) ⇒ Series
Take the mean of all values horizontally across columns.
-
#median ⇒ DataFrame
Aggregate the columns of this DataFrame to their median value.
-
#merge_sorted(other, key) ⇒ DataFrame
Take two sorted DataFrames and merge them by the sorted key.
-
#min ⇒ DataFrame
Aggregate the columns of this DataFrame to their minimum value.
-
#min_horizontal ⇒ Series
Get the minimum value horizontally across columns.
-
#n_chunks(strategy: "first") ⇒ Object
Get number of chunks used by the ChunkedArrays of this DataFrame.
-
#n_unique(subset: nil) ⇒ DataFrame
Return the number of unique rows, or the number of unique row-subsets.
-
#null_count ⇒ DataFrame
Create a new DataFrame that shows the null counts per column.
-
#partition_by(groups, maintain_order: true, include_key: true, as_dict: false) ⇒ Object
Split into multiple DataFrames partitioned by groups.
-
#pipe(func, *args, **kwargs, &block) ⇒ Object
Offers a structured way to apply a sequence of user-defined functions (UDFs).
-
#pivot(on, index: nil, values: nil, aggregate_function: nil, maintain_order: true, sort_columns: false, separator: "_") ⇒ DataFrame
Create a spreadsheet-style pivot table as a DataFrame.
-
#plot(x = nil, y = nil, type: nil, group: nil, stacked: nil) ⇒ Vega::LiteChart
included from Plot
Plot data.
-
#product ⇒ DataFrame
Aggregate the columns of this DataFrame to their product values.
-
#quantile(quantile, interpolation: "nearest") ⇒ DataFrame
Aggregate the columns of this DataFrame to their quantile value.
-
#rechunk ⇒ DataFrame
Rechunk the data in this DataFrame to a contiguous allocation, so that all subsequent operations have optimal and predictable performance.
-
#remove(*predicates, **constraints) ⇒ DataFrame
Remove rows, dropping those that match the given predicate expression(s).
-
#rename(mapping, strict: true) ⇒ DataFrame
Rename column names.
-
#replace(column, new_col) ⇒ DataFrame
Replace a column by a new Series.
-
#replace_column(index, series) ⇒ DataFrame
(also: #replace_at_idx)
Replace a column at an index location.
-
#reverse ⇒ DataFrame
Reverse the DataFrame.
-
#rolling(index_column:, period:, offset: nil, closed: "right", by: nil) ⇒ RollingGroupBy
(also: #groupby_rolling, #group_by_rolling)
Create rolling groups based on a time column.
-
#row(index = nil, by_predicate: nil, named: false) ⇒ Object
Get a row as tuple, either by index or by predicate.
-
#rows(named: false) ⇒ Array
Convert columnar data to rows as Ruby arrays.
-
#rows_by_key(key, named: false, include_key: false, unique: false) ⇒ Hash
Convert columnar data to rows as Ruby arrays in a hash keyed by some column.
-
#sample(n: nil, frac: nil, with_replacement: false, shuffle: false, seed: nil) ⇒ DataFrame
Sample from this DataFrame.
-
#schema ⇒ Hash
Get the schema.
-
#select(*exprs, **named_exprs) ⇒ DataFrame
Select columns from this DataFrame.
-
#select_seq(*exprs, **named_exprs) ⇒ DataFrame
Select columns from this DataFrame.
-
#serialize(file = nil) ⇒ Object
Serialize this DataFrame to a file or string.
-
#set_sorted(column, descending: false) ⇒ DataFrame
Flag a column as sorted.
-
#shape ⇒ Array
Get the shape of the DataFrame.
-
#shift(n, fill_value: nil) ⇒ DataFrame
Shift values by the given period.
-
#shift_and_fill(periods, fill_value) ⇒ DataFrame
Shift the values by a given period and fill the resulting null values.
-
#shrink_to_fit(in_place: false) ⇒ DataFrame
Shrink DataFrame memory usage.
-
#slice(offset, length = nil) ⇒ DataFrame
Get a slice of this DataFrame.
-
#sort(by, reverse: false, nulls_last: false) ⇒ DataFrame
Sort the DataFrame by column.
-
#sort!(by, reverse: false, nulls_last: false) ⇒ DataFrame
Sort the DataFrame by column in-place.
-
#sql(query, table_name: "self") ⇒ DataFrame
Execute a SQL query against the DataFrame.
-
#std(ddof: 1) ⇒ DataFrame
Aggregate the columns of this DataFrame to their standard deviation value.
-
#sum ⇒ DataFrame
Aggregate the columns of this DataFrame to their sum value.
-
#sum_horizontal(ignore_nulls: true) ⇒ Series
Sum all values horizontally across columns.
-
#tail(n = 5) ⇒ DataFrame
Get the last n rows.
-
#to_a ⇒ Array
Returns an array representing the DataFrame.
-
#to_csv(**options) ⇒ String
Write to comma-separated values (CSV) string.
-
#to_dummies(columns: nil, separator: "_", drop_first: false, drop_nulls: false) ⇒ DataFrame
Get one hot encoded dummy variables.
-
#to_h(as_series: true) ⇒ Hash
Convert DataFrame to a hash mapping column name to values.
-
#to_hashes ⇒ Array
Convert every row to a hash.
-
#to_numo ⇒ Numo::NArray
Convert DataFrame to a 2D Numo array.
-
#to_s ⇒ String
(also: #inspect)
Returns a string representing the DataFrame.
-
#to_series(index = 0) ⇒ Series
Select column as Series at index location.
-
#to_struct(name) ⇒ Series
Convert a DataFrame to a Series of type Struct.
-
#top_k(k, by:, reverse: false) ⇒ DataFrame
Return the k largest rows.
-
#transpose(include_header: false, header_name: "column", column_names: nil) ⇒ DataFrame
Transpose a DataFrame over the diagonal.
-
#unique(maintain_order: true, subset: nil, keep: "first") ⇒ DataFrame
Drop duplicate rows from this DataFrame.
-
#unnest(names) ⇒ DataFrame
Decompose a struct into its fields.
-
#unpivot(on, index: nil, variable_name: nil, value_name: nil) ⇒ DataFrame
(also: #melt)
Unpivot a DataFrame from wide to long format.
-
#unstack(step:, how: "vertical", columns: nil, fill_values: nil) ⇒ DataFrame
Unstack a long table to a wide form without doing an aggregation.
-
#update(other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left") ⇒ DataFrame
Update the values in this DataFrame with the values in other.
-
#upsample(time_column:, every:, by: nil, maintain_order: false) ⇒ DataFrame
Upsample a DataFrame at a regular frequency.
-
#var(ddof: 1) ⇒ DataFrame
Aggregate the columns of this DataFrame to their variance value.
-
#vstack(df, in_place: false) ⇒ DataFrame
Grow this DataFrame vertically by stacking a DataFrame to it.
-
#width ⇒ Integer
Get the width of the DataFrame.
-
#with_column(column) ⇒ DataFrame
Return a new DataFrame with the column added or replaced.
-
#with_columns(*exprs, **named_exprs) ⇒ DataFrame
Add columns to this DataFrame.
-
#with_columns_seq(*exprs, **named_exprs) ⇒ DataFrame
Add columns to this DataFrame.
-
#with_row_index(name: "index", offset: 0) ⇒ DataFrame
(also: #with_row_count)
Add a column at index 0 that counts the rows.
-
#write_avro(file, compression = "uncompressed", name: "") ⇒ nil
Write to Apache Avro file.
-
#write_csv(file = nil, include_header: true, sep: ",", quote: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_precision: nil, null_value: nil) ⇒ String?
Write to comma-separated values (CSV) file.
-
#write_database(table_name, connection = nil, if_table_exists: "fail") ⇒ Integer
Write the data in a Polars DataFrame to a database.
-
#write_delta(target, mode: "error", storage_options: nil, delta_write_options: nil, delta_merge_options: nil) ⇒ nil
Write DataFrame as delta table.
-
#write_iceberg(target, mode:) ⇒ nil
Write DataFrame to an Iceberg table.
-
#write_ipc(file, compression: "uncompressed", compat_level: nil, storage_options: nil, retries: 2) ⇒ nil
Write to Arrow IPC binary stream or Feather file.
-
#write_ipc_stream(file, compression: "uncompressed", compat_level: nil) ⇒ Object
Write to Arrow IPC record batch stream.
-
#write_json(file = nil) ⇒ nil
Serialize to JSON representation.
-
#write_ndjson(file = nil) ⇒ nil
Serialize to newline delimited JSON representation.
-
#write_parquet(file, compression: "zstd", compression_level: nil, statistics: false, row_group_size: nil, data_page_size: nil) ⇒ nil
Write to Apache Parquet file.
Constructor Details
#initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ DataFrame
Create a new DataFrame.
# File 'lib/polars/data_frame.rb', line 50
def initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: 100, nan_to_null: false)
  if defined?(ActiveRecord) && (data.is_a?(ActiveRecord::Relation) || data.is_a?(ActiveRecord::Result))
    raise ArgumentError, "Use read_database instead"
  end

  if data.nil?
    self._df = self.class.hash_to_rbdf({}, schema: schema, schema_overrides: schema_overrides)
  elsif data.is_a?(Hash)
    data = data.transform_keys { |v| v.is_a?(Symbol) ? v.to_s : v }
    self._df = self.class.hash_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict, nan_to_null: nan_to_null)
  elsif data.is_a?(::Array)
    self._df = self.class.sequence_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict, orient: orient, infer_schema_length: infer_schema_length)
  elsif data.is_a?(Series)
    self._df = self.class.series_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict)
  elsif data.respond_to?(:arrow_c_stream)
    # This uses the fact that RbSeries.from_arrow_c_stream will create a
    # struct-typed Series. Then we unpack that to a DataFrame.
    tmp_col_name = ""
    s = Utils.wrap_s(RbSeries.from_arrow_c_stream(data))
    self._df = s.to_frame(tmp_col_name).unnest(tmp_col_name)._df
  else
    raise ArgumentError, "DataFrame constructor called with unsupported type; got #{data.class.name}"
  end
end
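A brief usage sketch (column names and values here are illustrative, not from the source); each call exercises one of the constructor branches above:

df = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => ["x", "y", "z"]})  # Hash of columns
df = Polars::DataFrame.new([{"a" => 1}, {"a" => 2}])                    # Array of row hashes
df = Polars::DataFrame.new([[1, "x"], [2, "y"]], schema: ["a", "b"], orient: "row")  # Array of rows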
Class Method Details
.deserialize(source) ⇒ DataFrame
Serialization is not stable across Polars versions: a DataFrame serialized in one Polars version may not be deserializable in another Polars version.
Read a serialized DataFrame from a file.
# File 'lib/polars/data_frame.rb', line 102
def self.deserialize(source)
  if Utils.pathlike?(source)
    source = Utils.normalize_filepath(source)
  end

  deserializer = RbDataFrame.method(:deserialize_binary)

  _from_rbdf(deserializer.(source))
end
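A minimal round-trip sketch, subject to the same-version caveat above (the file name is illustrative):

df = Polars::DataFrame.new({"a" => [1, 2, 3]})
df.serialize("df.bin")                   # write the binary representation
Polars::DataFrame.deserialize("df.bin")  # read it back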
Instance Method Details
#!=(other) ⇒ DataFrame
Not equal.
# File 'lib/polars/data_frame.rb', line 262
def !=(other)
  _comp(other, "neq")
end
#%(other) ⇒ DataFrame
Returns the modulo.
# File 'lib/polars/data_frame.rb', line 345
def %(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.rem_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.rem(other._s))
end
#*(other) ⇒ DataFrame
Performs multiplication.
# File 'lib/polars/data_frame.rb', line 297
def *(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.mul_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.mul(other._s))
end
#+(other) ⇒ DataFrame
Performs addition.
# File 'lib/polars/data_frame.rb', line 321
def +(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.add_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.add(other._s))
end
#-(other) ⇒ DataFrame
Performs subtraction.
# File 'lib/polars/data_frame.rb', line 333
def -(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.sub_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.sub(other._s))
end
#/(other) ⇒ DataFrame
Performs division.
# File 'lib/polars/data_frame.rb', line 309
def /(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.div_df(other._df))
  end

  other = _prepare_other_arg(other)
  _from_rbdf(_df.div(other._s))
end
#<(other) ⇒ DataFrame
Less than.
# File 'lib/polars/data_frame.rb', line 276
def <(other)
  _comp(other, "lt")
end
#<=(other) ⇒ DataFrame
Less than or equal.
# File 'lib/polars/data_frame.rb', line 290
def <=(other)
  _comp(other, "lt_eq")
end
#==(other) ⇒ DataFrame
Equal.
# File 'lib/polars/data_frame.rb', line 255
def ==(other)
  _comp(other, "eq")
end
#>(other) ⇒ DataFrame
Greater than.
# File 'lib/polars/data_frame.rb', line 269
def >(other)
  _comp(other, "gt")
end
#>=(other) ⇒ DataFrame
Greater than or equal.
# File 'lib/polars/data_frame.rb', line 283
def >=(other)
  _comp(other, "gt_eq")
end
#[](*args) ⇒ Object
Returns a subset of the DataFrame.
# File 'lib/polars/data_frame.rb', line 386
def [](*args)
  if args.size == 2
    row_selection, col_selection = args

    # df[.., unknown]
    if row_selection.is_a?(Range)
      # multiple slices
      # df[.., ..]
      if col_selection.is_a?(Range)
        raise Todo
      end
    end

    # df[2, ..] (select row as df)
    if row_selection.is_a?(Integer)
      if col_selection.is_a?(::Array)
        df = self[0.., col_selection]
        return df.slice(row_selection, 1)
      end
      # df[2, "a"]
      if col_selection.is_a?(::String) || col_selection.is_a?(Symbol)
        return self[col_selection][row_selection]
      end
    end

    # column selection can be "a" and ["a", "b"]
    if col_selection.is_a?(::String) || col_selection.is_a?(Symbol)
      col_selection = [col_selection]
    end

    # df[.., 1]
    if col_selection.is_a?(Integer)
      series = to_series(col_selection)
      return series[row_selection]
    end

    if col_selection.is_a?(::Array)
      # df[.., [1, 2]]
      if Utils.is_int_sequence(col_selection)
        series_list = col_selection.map { |i| to_series(i) }
        df = self.class.new(series_list)
        return df[row_selection]
      end
    end

    df = self[col_selection]
    return df[row_selection]
  elsif args.size == 1
    item = args[0]

    # select single column
    # df["foo"]
    if item.is_a?(::String) || item.is_a?(Symbol)
      return Utils.wrap_s(_df.get_column(item.to_s))
    end

    # df[idx]
    if item.is_a?(Integer)
      return slice(_pos_idx(item, 0), 1)
    end

    # df[..]
    if item.is_a?(Range)
      return Slice.new(self).apply(item)
    end

    if item.is_a?(::Array) && item.all? { |v| Utils.strlike?(v) }
      # select multiple columns
      # df[["foo", "bar"]]
      return _from_rbdf(_df.select(item.map(&:to_s)))
    end

    if Utils.is_int_sequence(item)
      item = Series.new("", item)
    end

    if item.is_a?(Series)
      dtype = item.dtype
      if dtype == String
        return _from_rbdf(_df.select(item))
      elsif dtype == UInt32
        return _from_rbdf(_df.take_with_series(item._s))
      elsif [UInt8, UInt16, UInt64, Int8, Int16, Int32, Int64].include?(dtype)
        return _from_rbdf(
          _df.take_with_series(_pos_idxs(item, 0)._s)
        )
      end
    end
  end

  # Ruby-specific
  if item.is_a?(Expr) || item.is_a?(Series)
    return filter(item)
  end

  raise ArgumentError, "Cannot get item of type: #{item.class.name}"
end
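Illustrative subscript forms, each handled by one of the branches above (names and data are arbitrary):

df = Polars::DataFrame.new({"foo" => [1, 2, 3], "bar" => [6, 7, 8]})
df["foo"]           # single column as a Series
df[["foo", "bar"]]  # multiple columns as a DataFrame
df[1]               # second row as a 1-row DataFrame
df[1, "bar"]        # single element
df[0..1, "foo"]     # range of rows from one column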
#[]=(*key, value) ⇒ Object
Set item.
# File 'lib/polars/data_frame.rb', line 488
def []=(*key, value)
  if key.length == 1
    key = key.first
  elsif key.length != 2
    raise ArgumentError, "wrong number of arguments (given #{key.length + 1}, expected 2..3)"
  end

  if Utils.strlike?(key)
    if value.is_a?(::Array) || (defined?(Numo::NArray) && value.is_a?(Numo::NArray))
      value = Series.new(value)
    elsif !value.is_a?(Series)
      value = Polars.lit(value)
    end
    self._df = with_column(value.alias(key.to_s))._df
  elsif key.is_a?(::Array)
    row_selection, col_selection = key

    if Utils.strlike?(col_selection)
      s = self[col_selection]
    elsif col_selection.is_a?(Integer)
      raise Todo
    else
      raise ArgumentError, "column selection not understood: #{col_selection}"
    end

    s[row_selection] = value

    if col_selection.is_a?(Integer)
      replace_column(col_selection, s)
    elsif Utils.strlike?(col_selection)
      replace(col_selection, s)
    end
  else
    raise Todo
  end
end
#bottom_k(k, by:, reverse: false) ⇒ DataFrame
Return the k smallest rows.
Non-null elements are always preferred over null elements, regardless of
the value of reverse. The output is not guaranteed to be in any
particular order; call sort after this function if you wish the
output to be sorted.
# File 'lib/polars/data_frame.rb', line 2083
def bottom_k(
  k,
  by:,
  reverse: false
)
  lazy
    .bottom_k(k, by: by, reverse: reverse)
    .collect(
      # optimizations=QueryOptFlags(
      #   projection_pushdown=False,
      #   predicate_pushdown=False,
      #   comm_subplan_elim=False,
      #   slice_pushdown=True,
      # )
    )
end
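A short sketch (data illustrative):

df = Polars::DataFrame.new({"a" => [4, 2, 5, 1, 3]})
df.bottom_k(2, by: "a")  # the two rows with the smallest "a" values, order unspecified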
#cast(dtypes, strict: true) ⇒ DataFrame
Cast DataFrame column(s) to the specified dtype(s).
# File 'lib/polars/data_frame.rb', line 3770
def cast(dtypes, strict: true)
  lazy.cast(dtypes, strict: strict).collect(_eager: true)
end
#clear(n = 0) ⇒ DataFrame Also known as: cleared
Create an empty copy of the current DataFrame, or a null-filled copy with n rows.
The returned DataFrame has an identical schema.
# File 'lib/polars/data_frame.rb', line 3810
def clear(n = 0)
  if n == 0
    _from_rbdf(_df.clear)
  elsif n > 0 || len > 0
    self.class.new(
      schema.to_h { |nm, tp| [nm, Series.new(nm, [], dtype: tp).extend_constant(nil, n)] }
    )
  else
    clone
  end
end
#collect_schema ⇒ Schema
This method is included to facilitate writing code that is generic for both DataFrame and LazyFrame.
Get an ordered mapping of column names to their data type.
# File 'lib/polars/data_frame.rb', line 565
def collect_schema
  Schema.new(columns.zip(dtypes), check_dtypes: false)
end
#columns ⇒ Array
Get column names.
# File 'lib/polars/data_frame.rb', line 172
def columns
  _df.columns
end
#columns=(columns) ⇒ Object
Change the column names of the DataFrame.
# File 'lib/polars/data_frame.rb', line 205
def columns=(columns)
  _df.set_column_names(columns)
end
#delete(name) ⇒ Series
Drop a column in place, if it exists, and return it as a Series.
# File 'lib/polars/data_frame.rb', line 3717
def delete(name)
  drop_in_place(name) if include?(name)
end
#describe ⇒ DataFrame
Summary statistics for a DataFrame.
# File 'lib/polars/data_frame.rb', line 1718
def describe
  describe_cast = lambda do |stat|
    columns = []
    self.columns.each_with_index do |s, i|
      if self[s].is_numeric || self[s].is_boolean
        columns << stat[0.., i].cast(:f64)
      else
        # for dates, strings, etc, we cast to string so that all
        # statistics can be shown
        columns << stat[0.., i].cast(:str)
      end
    end
    self.class.new(columns)
  end

  summary = _from_rbdf(
    Polars.concat(
      [
        describe_cast.(
          self.class.new(columns.to_h { |c| [c, [height]] })
        ),
        describe_cast.(null_count),
        describe_cast.(mean),
        describe_cast.(std),
        describe_cast.(min),
        describe_cast.(max),
        describe_cast.(median)
      ]
    )._df
  )
  summary.insert_column(
    0,
    Polars::Series.new(
      "describe",
      ["count", "null_count", "mean", "std", "min", "max", "median"],
    )
  )
  summary
end
#drop(*columns) ⇒ DataFrame
Remove columns from the DataFrame and return as a new DataFrame.
# File 'lib/polars/data_frame.rb', line 3657
def drop(*columns)
  lazy.drop(*columns).collect(_eager: true)
end
#drop_in_place(name) ⇒ Series
Drop a column in place and return it.
# File 'lib/polars/data_frame.rb', line 3685
def drop_in_place(name)
  Utils.wrap_s(_df.drop_in_place(name))
end
#drop_nans(subset: nil) ⇒ DataFrame
Drop all rows that contain one or more NaN values.
The original order of the remaining rows is preserved.
# File 'lib/polars/data_frame.rb', line 2332
def drop_nans(subset: nil)
  lazy.drop_nans(subset: subset).collect(_eager: true)
end
#drop_nulls(subset: nil) ⇒ DataFrame
Drop all rows that contain one or more null values.
The original order of the remaining rows is preserved.
# File 'lib/polars/data_frame.rb', line 2377
def drop_nulls(subset: nil)
  lazy.drop_nulls(subset: subset).collect(_eager: true)
end
#dtypes ⇒ Array
Get dtypes of columns in DataFrame. Dtypes can also be found in column headers when printing the DataFrame.
# File 'lib/polars/data_frame.rb', line 223
def dtypes
  _df.dtypes
end
#each(&block) ⇒ Object
Returns an enumerator.
# File 'lib/polars/data_frame.rb', line 379
def each(&block)
  get_columns.each(&block)
end
#each_row(named: true, buffer_size: 500, &block) ⇒ Object
Returns an iterator over the rows of the DataFrame as Ruby-native values.
# File 'lib/polars/data_frame.rb', line 5818
def each_row(named: true, buffer_size: 500, &block)
  iter_rows(named: named, buffer_size: buffer_size, &block)
end
#equals(other, null_equal: true) ⇒ Boolean Also known as: frame_equal
Check if DataFrame is equal to other.
# File 'lib/polars/data_frame.rb', line 2128
def equals(other, null_equal: true)
  _df.equals(other._df, null_equal)
end
#estimated_size(unit = "b") ⇒ Numeric
Return an estimation of the total (heap) allocated size of the DataFrame.
Estimated size is given in the specified unit (bytes by default).
This estimation is the sum of the size of its buffers and validity, including nested arrays. Multiple arrays may share buffers and bitmaps, so the size of two arrays is not necessarily the sum of the sizes computed by this function. In particular, a StructArray's size is an upper bound.
When an array is sliced, its allocated size remains constant because the buffer is unchanged. However, this function will yield a smaller number, because it returns the visible size of the buffer, not its total capacity.
FFI buffers are included in this estimation.
# File 'lib/polars/data_frame.rb', line 1341
def estimated_size(unit = "b")
  sz = _df.estimated_size
  Utils.scale_bytes(sz, to: unit)
end
#explode(columns) ⇒ DataFrame
Explode DataFrame to long format by exploding a column with Lists.
# File 'lib/polars/data_frame.rb', line 4059
def explode(columns)
  lazy.explode(columns).collect(no_optimization: true)
end
#extend(other) ⇒ DataFrame
Extend the memory backed by this DataFrame with the values from other.
Different from vstack, which adds the chunks from other to the chunks of this
DataFrame, extend appends the data from other to the underlying memory
locations and thus may cause a reallocation.
If this does not cause a reallocation, the resulting data structure will not have any extra chunks and thus will yield faster queries.
Prefer extend over vstack when you want to do a query after a single append.
For instance during online operations where you add n rows and rerun a query.
Prefer vstack over extend when you want to append many times before doing a
query. For instance, when you read in multiple files and want to store them in a
single DataFrame. In the latter case, finish the sequence of vstack
operations with a rechunk.
# File 'lib/polars/data_frame.rb', line 3597
def extend(other)
  _df.extend(other._df)
  self
end
#fill_nan(fill_value) ⇒ DataFrame
Note that floating point NaNs (Not a Number) are not missing values!
To replace missing values, use fill_null.
Fill floating point NaN values by an Expression evaluation.
# File 'lib/polars/data_frame.rb', line 4024
def fill_nan(fill_value)
  lazy.fill_nan(fill_value).collect(no_optimization: true)
end
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ DataFrame
Fill null values using the specified value or strategy.
# File 'lib/polars/data_frame.rb', line 3984
def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true)
  _from_rbdf(
    lazy
      .fill_null(value, strategy: strategy, limit: limit, matches_supertype: matches_supertype)
      .collect(no_optimization: true)
      ._df
  )
end
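A sketch of the value and strategy forms (data illustrative):

df = Polars::DataFrame.new({"a" => [1, nil, 3]})
df.fill_null(99)                   # replace nulls with a literal
df.fill_null(strategy: "forward")  # propagate the last non-null value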
#filter(predicate) ⇒ DataFrame
Filter the rows in the DataFrame based on a predicate expression.
# File 'lib/polars/data_frame.rb', line 1564
def filter(predicate)
  lazy.filter(predicate).collect
end
#flags ⇒ Hash
Get flags that are set on the columns of this DataFrame.
# File 'lib/polars/data_frame.rb', line 230
def flags
  columns.to_h { |name| [name, self[name].flags] }
end
#fold ⇒ Series
Apply a horizontal reduction on a DataFrame.
This can be used to effectively determine aggregations on a row level, and can be applied to any DataType that can be supercast (cast to a similar parent type).
Examples of the supercast rules when applying an arithmetic operation on two DataTypes:
- i8 + str = str
- f32 + i64 = f32
- f32 + f64 = f64
# File 'lib/polars/data_frame.rb', line 5548
def fold
  acc = to_series(0)

  1.upto(width - 1) do |i|
    acc = yield(acc, to_series(i))
  end
  acc
end
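For example, a horizontal sum across numeric columns (data illustrative; the block receives the running accumulator and the next column):

df = Polars::DataFrame.new({"a" => [1, 2], "b" => [10, 20]})
df.fold { |acc, s| acc + s }  # Series of [11, 22]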
#gather_every(n, offset = 0) ⇒ DataFrame Also known as: take_every
Take every nth row in the DataFrame and return as a new DataFrame.
# File 'lib/polars/data_frame.rb', line 5939
def gather_every(n, offset = 0)
  select(F.col("*").gather_every(n, offset))
end
#get_column(name) ⇒ Series
Get a single column as Series by name.
# File 'lib/polars/data_frame.rb', line 3901
def get_column(name)
  self[name]
end
#get_column_index(name) ⇒ Integer Also known as: find_idx_by_name
Find the index of a column by name.
# File 'lib/polars/data_frame.rb', line 1771
def get_column_index(name)
  _df.get_column_index(name)
end
#get_columns ⇒ Array
Get the DataFrame as a Array of Series.
# File 'lib/polars/data_frame.rb', line 3879
def get_columns
  _df.get_columns.map { |s| Utils.wrap_s(s) }
end
#group_by(by, maintain_order: false) ⇒ GroupBy Also known as: groupby, group
Start a group by operation.
# File 'lib/polars/data_frame.rb', line 2485
def group_by(by, maintain_order: false)
  if !Utils.bool?(maintain_order)
    raise TypeError, "invalid input for group_by arg `maintain_order`: #{maintain_order}."
  end
  GroupBy.new(
    self,
    by,
    maintain_order: maintain_order
  )
end
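A minimal sketch (names illustrative):

df = Polars::DataFrame.new({"g" => ["x", "x", "y"], "v" => [1, 2, 3]})
df.group_by("g", maintain_order: true).agg(Polars.col("v").sum)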
#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: true, include_boundaries: false, closed: "left", by: nil, start_by: "window") ⇒ DataFrame Also known as: groupby_dynamic
Group based on a time value (or index value of type :i32, :i64).
Time windows are calculated and rows are assigned to windows. Different from a normal group by is that a row can be member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.
A window is defined by:
- every: interval of the window
- period: length of the window
- offset: offset of the window
The every, period and offset arguments are created with
the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_dynamic on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
# File 'lib/polars/data_frame.rb', line 2841
def group_by_dynamic(
  index_column,
  every:,
  period: nil,
  offset: nil,
  truncate: true,
  include_boundaries: false,
  closed: "left",
  by: nil,
  start_by: "window"
)
  DynamicGroupBy.new(
    self,
    index_column,
    every,
    period,
    offset,
    truncate,
    include_boundaries,
    closed,
    by,
    start_by
  )
end
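A sketch using an integer index column and "2i" windows (data illustrative; a sorted datetime column with durations such as "1h" works the same way):

df = Polars::DataFrame.new({"idx" => [0, 1, 2, 3, 4, 5], "n" => [1, 2, 3, 4, 5, 6]})
df.group_by_dynamic("idx", every: "2i").agg(Polars.col("n").sum)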
#hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil) ⇒ Series
Hash and combine the rows in this DataFrame.
The hash value is of type :u64.
# File 'lib/polars/data_frame.rb', line 5976
def hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil)
  k0 = seed
  k1 = seed_1.nil? ? seed : seed_1
  k2 = seed_2.nil? ? seed : seed_2
  k3 = seed_3.nil? ? seed : seed_3
  Utils.wrap_s(_df.hash_rows(k0, k1, k2, k3))
end
#head(n = 5) ⇒ DataFrame
Get the first n rows.
# File 'lib/polars/data_frame.rb', line 2255
def head(n = 5)
  _from_rbdf(_df.head(n))
end
#height ⇒ Integer Also known as: count, length, size
Get the height of the DataFrame.
# File 'lib/polars/data_frame.rb', line 139
def height
  _df.height
end
#hstack(columns, in_place: false) ⇒ DataFrame
Return a new DataFrame grown horizontally by stacking multiple Series to it.
# File 'lib/polars/data_frame.rb', line 3499
def hstack(columns, in_place: false)
  if !columns.is_a?(::Array)
    columns = columns.get_columns
  end
  if in_place
    _df.hstack_mut(columns.map(&:_s))
    self
  else
    _from_rbdf(_df.hstack(columns.map(&:_s)))
  end
end
#include?(name) ⇒ Boolean
Check if DataFrame includes column.
# File 'lib/polars/data_frame.rb', line 372
def include?(name)
  columns.include?(name)
end
#insert_column(index, series) ⇒ DataFrame Also known as: insert_at_idx
Insert a Series at a certain column index. This operation is in place.
# File 'lib/polars/data_frame.rb', line 1517
def insert_column(index, series)
  if index < 0
    index = columns.length + index
  end
  _df.insert_column(index, series._s)
  self
end
#interpolate ⇒ DataFrame
Interpolate intermediate values. The interpolation method is linear.
# File 'lib/polars/data_frame.rb', line 6009
def interpolate
  select(F.col("*").interpolate)
end
#is_duplicated ⇒ Series
Get a mask of all duplicated rows in this DataFrame.
# File 'lib/polars/data_frame.rb', line 4541
def is_duplicated
  Utils.wrap_s(_df.is_duplicated)
end
#is_empty ⇒ Boolean Also known as: empty?
Check if the dataframe is empty.
# File 'lib/polars/data_frame.rb', line 6023
def is_empty
  height == 0
end
#is_unique ⇒ Series
Get a mask of all unique rows in this DataFrame.
# File 'lib/polars/data_frame.rb', line 4566
def is_unique
  Utils.wrap_s(_df.is_unique)
end
#item ⇒ Object
Return the dataframe as a scalar.
Equivalent to df[0,0], with a check that the shape is (1,1).
# File 'lib/polars/data_frame.rb', line 580
def item
  if shape != [1, 1]
    raise ArgumentError, "Can only call .item if the dataframe is of shape (1,1), dataframe is of shape #{shape}"
  end
  self[0, 0]
end
#iter_columns ⇒ Object
Consider whether you can use all instead.
If you can, it will be more efficient.
Returns an iterator over the columns of this DataFrame.
# File 'lib/polars/data_frame.rb', line 5868
def iter_columns
  return to_enum(:iter_columns) unless block_given?

  _df.get_columns.each do |s|
    yield Utils.wrap_s(s)
  end
end
#iter_rows(named: false, buffer_size: 500, &block) ⇒ Object
Returns an iterator over the rows of the DataFrame as Ruby-native values.
# File 'lib/polars/data_frame.rb', line 5771
def iter_rows(named: false, buffer_size: 500, &block)
  return to_enum(:iter_rows, named: named, buffer_size: buffer_size) unless block_given?

  # load into the local namespace for a modest performance boost in the hot loops
  columns = self.columns

  # note: buffering rows results in a 2-4x speedup over individual calls
  # to ".row(i)", so it should only be disabled in extremely specific cases.
  if buffer_size
    offset = 0
    while offset < height
      zerocopy_slice = slice(offset, buffer_size)
      rows_chunk = zerocopy_slice.rows(named: false)
      if named
        rows_chunk.each do |row|
          yield columns.zip(row).to_h
        end
      else
        rows_chunk.each(&block)
      end

      offset += buffer_size
    end
  elsif named
    height.times do |i|
      yield columns.zip(row(i)).to_h
    end
  else
    height.times do |i|
      yield row(i)
    end
  end
end
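Typical consumption (data illustrative):

df = Polars::DataFrame.new({"a" => [1, 2], "b" => ["x", "y"]})
df.iter_rows { |row| p row }               # [1, "x"], then [2, "y"]
df.iter_rows(named: true) { |row| p row }  # {"a" => 1, "b" => "x"}, ...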
#iter_slices(n_rows: 10_000) ⇒ Object
Returns a non-copying iterator of slices over the underlying DataFrame.
# File 'lib/polars/data_frame.rb', line 5896
def iter_slices(n_rows: 10_000)
  return to_enum(:iter_slices, n_rows: n_rows) unless block_given?

  offset = 0
  while offset < height
    yield slice(offset, n_rows)
    offset += n_rows
  end
end
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", join_nulls: false, coalesce: nil, maintain_order: nil) ⇒ DataFrame
Join in SQL-like fashion.
# File 'lib/polars/data_frame.rb', line 3230
def join(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  how: "inner",
  suffix: "_right",
  validate: "m:m",
  join_nulls: false,
  coalesce: nil,
  maintain_order: nil
)
  lazy
    .join(
      other.lazy,
      left_on: left_on,
      right_on: right_on,
      on: on,
      how: how,
      suffix: suffix,
      validate: validate,
      join_nulls: join_nulls,
      coalesce: coalesce,
      maintain_order: maintain_order
    )
    .collect(no_optimization: true)
end
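A minimal sketch (frames illustrative):

df = Polars::DataFrame.new({"key" => [1, 2, 3], "a" => ["x", "y", "z"]})
other = Polars::DataFrame.new({"key" => [2, 3, 4], "b" => [20, 30, 40]})
df.join(other, on: "key")               # inner join on matching keys
df.join(other, on: "key", how: "left")  # keep all rows of df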
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ DataFrame
Perform an asof join.
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the asof_join key.
For each row in the left DataFrame:
- A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
- A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.
The default is "backward".
# File 'lib/polars/data_frame.rb', line 3064
def join_asof(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  by_left: nil,
  by_right: nil,
  by: nil,
  strategy: "backward",
  suffix: "_right",
  tolerance: nil,
  allow_parallel: true,
  force_parallel: false,
  coalesce: true,
  allow_exact_matches: true,
  check_sortedness: true
)
  lazy
    .join_asof(
      other.lazy,
      left_on: left_on,
      right_on: right_on,
      on: on,
      by_left: by_left,
      by_right: by_right,
      by: by,
      strategy: strategy,
      suffix: suffix,
      tolerance: tolerance,
      allow_parallel: allow_parallel,
      force_parallel: force_parallel,
      coalesce: coalesce,
      allow_exact_matches: allow_exact_matches,
      check_sortedness: check_sortedness
    )
    .collect(no_optimization: true)
end
#join_where(other, *predicates, suffix: "_right") ⇒ DataFrame
The row order of the input DataFrames is not preserved.
This functionality is experimental. It may be changed at any point without it being considered a breaking change.
Perform a join based on one or multiple (in)equality predicates.
This performs an inner join, so only rows where all predicates are true are included in the result, and a row from either DataFrame may be included multiple times in the result.
# File 'lib/polars/data_frame.rb', line 3337
def join_where(
  other,
  *predicates,
  suffix: "_right"
)
  Utils.require_same_type(self, other)
  lazy
    .join_where(
      other.lazy,
      *predicates,
      suffix: suffix
    )
    .collect(_eager: true)
end
#lazy ⇒ LazyFrame
Start a lazy query from this point.
# File 'lib/polars/data_frame.rb', line 4573
def lazy
  wrap_ldf(_df.lazy)
end
#limit(n = 5) ⇒ DataFrame
Get the first n rows.
Alias for #head.
# File 'lib/polars/data_frame.rb', line 2224
def limit(n = 5)
  head(n)
end
#map_rows(return_dtype: nil, inference_size: 256, &f) ⇒ Object Also known as: apply
The frame-level apply cannot track column names (as the UDF is a black-box
that may arbitrarily drop, rearrange, transform, or add new columns); if you
want to apply a UDF such that column names are preserved, you should use the
expression-level apply syntax instead.
Apply a custom/user-defined function (UDF) over the rows of the DataFrame.
The UDF will receive each row as a tuple of values: udf(row).
Implementing logic using a Ruby function is almost always significantly slower and more memory intensive than implementing the same logic using the native expression API because:
- The native expression engine runs in Rust; UDFs run in Ruby.
- Use of Ruby UDFs forces the DataFrame to be materialized in memory.
- Polars-native expressions can be parallelised (UDFs cannot).
- Polars-native expressions can be logically optimised (UDFs cannot).
Wherever possible you should strongly prefer the native expression API to achieve the best performance.
# File 'lib/polars/data_frame.rb', line 3413
def map_rows(return_dtype: nil, inference_size: 256, &f)
  out, is_df = _df.map_rows(f, return_dtype, inference_size)
  if is_df
    _from_rbdf(out)
  else
    _from_rbdf(Utils.wrap_s(out).to_frame._df)
  end
end
#max ⇒ DataFrame
Aggregate the columns of this DataFrame to their maximum value.
# File 'lib/polars/data_frame.rb', line 4878
def max
  lazy.max.collect(_eager: true)
end
#max_horizontal ⇒ Series
Get the maximum value horizontally across columns.
# File 'lib/polars/data_frame.rb', line 4902
def max_horizontal
  select(max: F.max_horizontal(F.all)).to_series
end
#mean ⇒ DataFrame
Aggregate the columns of this DataFrame to their mean value.
# File 'lib/polars/data_frame.rb', line 5034
def mean
  lazy.mean.collect(_eager: true)
end
#mean_horizontal(ignore_nulls: true) ⇒ Series
Take the mean of all values horizontally across columns.
# File 'lib/polars/data_frame.rb', line 5062
def mean_horizontal(ignore_nulls: true)
  select(
    mean: F.mean_horizontal(F.all, ignore_nulls: ignore_nulls)
  ).to_series
end
#median ⇒ DataFrame
Aggregate the columns of this DataFrame to their median value.
# File 'lib/polars/data_frame.rb', line 5172
def median
  lazy.median.collect(_eager: true)
end
#merge_sorted(other, key) ⇒ DataFrame
Take two sorted DataFrames and merge them by the sorted key.
The output of this operation will also be sorted. It is the caller's responsibility to ensure that the frames are sorted by that key; otherwise the output will not make sense.
The schemas of both DataFrames must be equal.
# File 'lib/polars/data_frame.rb', line 6138
def merge_sorted(other, key)
  lazy.merge_sorted(other.lazy, key).collect(_eager: true)
end
#min ⇒ DataFrame
Aggregate the columns of this DataFrame to their minimum value.
# File 'lib/polars/data_frame.rb', line 4928
def min
  lazy.min.collect(_eager: true)
end
#min_horizontal ⇒ Series
Get the minimum value horizontally across columns.
# File 'lib/polars/data_frame.rb', line 4952
def min_horizontal
  select(min: F.min_horizontal(F.all)).to_series
end
#n_chunks(strategy: "first") ⇒ Object
Get number of chunks used by the ChunkedArrays of this DataFrame.
# File 'lib/polars/data_frame.rb', line 4846
def n_chunks(strategy: "first")
  if strategy == "first"
    _df.n_chunks
  elsif strategy == "all"
    get_columns.map(&:n_chunks)
  else
    raise ArgumentError, "Strategy: '#{strategy}' not understood. Choose one of ['first', 'all']"
  end
end
#n_unique(subset: nil) ⇒ DataFrame
Return the number of unique rows, or the number of unique row-subsets.
# File 'lib/polars/data_frame.rb', line 5351
def n_unique(subset: nil)
  if subset.is_a?(StringIO)
    subset = [Polars.col(subset)]
  elsif subset.is_a?(Expr)
    subset = [subset]
  end

  if subset.is_a?(::Array) && subset.length == 1
    expr = Utils.wrap_expr(Utils.parse_into_expression(subset[0], str_as_lit: false))
  else
    struct_fields = subset.nil? ? Polars.all : subset
    expr = Polars.struct(struct_fields)
  end

  df = lazy.select(expr.n_unique).collect
  df.is_empty ? 0 : df.row(0)[0]
end
#null_count ⇒ DataFrame
Create a new DataFrame that shows the null counts per column.
# File 'lib/polars/data_frame.rb', line 5401
def null_count
  _from_rbdf(_df.null_count)
end
#partition_by(groups, maintain_order: true, include_key: true, as_dict: false) ⇒ Object
Split into multiple DataFrames partitioned by groups.
# File 'lib/polars/data_frame.rb', line 4414
def partition_by(groups, maintain_order: true, include_key: true, as_dict: false)
  if groups.is_a?(::String)
    groups = [groups]
  elsif !groups.is_a?(::Array)
    groups = Array(groups)
  end

  if as_dict
    out = {}
    if groups.length == 1
      _df.partition_by(groups, maintain_order, include_key).each do |df|
        df = _from_rbdf(df)
        out[df[groups][0, 0]] = df
      end
    else
      _df.partition_by(groups, maintain_order, include_key).each do |df|
        df = _from_rbdf(df)
        out[df[groups].row(0)] = df
      end
    end
    out
  else
    _df.partition_by(groups, maintain_order, include_key).map { |df| _from_rbdf(df) }
  end
end
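For instance (data illustrative):

df = Polars::DataFrame.new({"g" => ["a", "a", "b"], "v" => [1, 2, 3]})
df.partition_by("g")                 # array of one DataFrame per group
df.partition_by("g", as_dict: true)  # hash keyed by group value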
#pipe(func, *args, **kwargs, &block) ⇒ Object
It is recommended to use LazyFrame when piping operations, in order to fully take advantage of query optimization and parallelization. See #lazy.
Offers a structured way to apply a sequence of user-defined functions (UDFs).
# File 'lib/polars/data_frame.rb', line 2417
def pipe(func, *args, **kwargs, &block)
  func.call(self, *args, **kwargs, &block)
end
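A sketch with a hypothetical helper lambda (the helper is illustrative, not part of the API):

add_double = ->(df, col) { df.with_column((Polars.col(col) * 2).alias("#{col}_doubled")) }
df = Polars::DataFrame.new({"a" => [1, 2, 3]})
df.pipe(add_double, "a")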
#pivot(on, index: nil, values: nil, aggregate_function: nil, maintain_order: true, sort_columns: false, separator: "_") ⇒ DataFrame
Create a spreadsheet-style pivot table as a DataFrame.
# File 'lib/polars/data_frame.rb', line 4103
def pivot(
  on,
  index: nil,
  values: nil,
  aggregate_function: nil,
  maintain_order: true,
  sort_columns: false,
  separator: "_"
)
  index = Utils.(self, index)
  on = Utils.(self, on)
  if !values.nil?
    values = Utils.(self, values)
  end

  if aggregate_function.is_a?(::String)
    case aggregate_function
    when "first"
      aggregate_expr = F.element.first._rbexpr
    when "sum"
      aggregate_expr = F.element.sum._rbexpr
    when "max"
      aggregate_expr = F.element.max._rbexpr
    when "min"
      aggregate_expr = F.element.min._rbexpr
    when "mean"
      aggregate_expr = F.element.mean._rbexpr
    when "median"
      aggregate_expr = F.element.median._rbexpr
    when "last"
      aggregate_expr = F.element.last._rbexpr
    when "len"
      aggregate_expr = F.len._rbexpr
    when "count"
      warn "`aggregate_function: \"count\"` input for `pivot` is deprecated. Use `aggregate_function: \"len\"` instead."
      aggregate_expr = F.len._rbexpr
    else
      raise ArgumentError, "Argument aggregate fn: '#{aggregate_function}' was not expected."
    end
  elsif aggregate_function.nil?
    aggregate_expr = nil
  else
    aggregate_expr = aggregate_function._rbexpr
  end

  _from_rbdf(
    _df.pivot_expr(
      on,
      index,
      values,
      maintain_order,
      sort_columns,
      aggregate_expr,
      separator
    )
  )
end
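A small sketch (columns illustrative):

df = Polars::DataFrame.new({
  "ix" => [1, 1, 2, 2],
  "col" => ["a", "b", "a", "b"],
  "val" => [10, 20, 30, 40]
})
df.pivot("col", index: "ix", values: "val", aggregate_function: "first")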
#plot(x = nil, y = nil, type: nil, group: nil, stacked: nil) ⇒ Vega::LiteChart Originally defined in module Plot
Plot data.
#product ⇒ DataFrame
Aggregate the columns of this DataFrame to their product values.
# File 'lib/polars/data_frame.rb', line 5198
def product
  select(Polars.all.product)
end
#quantile(quantile, interpolation: "nearest") ⇒ DataFrame
Aggregate the columns of this DataFrame to their quantile value.
# File 'lib/polars/data_frame.rb', line 5229
def quantile(quantile, interpolation: "nearest")
  lazy.quantile(quantile, interpolation: interpolation).collect(_eager: true)
end
#rechunk ⇒ DataFrame
Rechunk the data in this DataFrame to a contiguous allocation. This will make sure all subsequent operations have optimal and predictable performance.
# File 'lib/polars/data_frame.rb', line 5375
def rechunk
  _from_rbdf(_df.rechunk)
end
#remove(*predicates, **constraints) ⇒ DataFrame
Remove rows, dropping those that match the given predicate expression(s).
The original order of the remaining rows is preserved.
Rows where the predicate does not evaluate to true are retained
(this includes rows where the predicate evaluates as null).
# File 'lib/polars/data_frame.rb', line 1679
def remove(
  *predicates,
  **constraints
)
  lazy
    .remove(*predicates, **constraints)
    .collect(_eager: true)
end
#rename(mapping, strict: true) ⇒ DataFrame
Rename column names.
# File 'lib/polars/data_frame.rb', line 1466
def rename(mapping, strict: true)
  lazy.rename(mapping, strict: strict).collect(no_optimization: true)
end
#replace(column, new_col) ⇒ DataFrame
Replace a column by a new Series.
# File 'lib/polars/data_frame.rb', line 2157
def replace(column, new_col)
  _df.replace(column.to_s, new_col._s)
  self
end
#replace_column(index, series) ⇒ DataFrame Also known as: replace_at_idx
Replace a column at an index location.
# File 'lib/polars/data_frame.rb', line 1806
def replace_column(index, series)
  if index < 0
    index = columns.length + index
  end
  _df.replace_column(index, series._s)
  self
end
#reverse ⇒ DataFrame
Reverse the DataFrame.
# File 'lib/polars/data_frame.rb', line 1431
def reverse
  select(Polars.col("*").reverse)
end
#rolling(index_column:, period:, offset: nil, closed: "right", by: nil) ⇒ RollingGroupBy Also known as: groupby_rolling, group_by_rolling
Create rolling groups based on a time column.
Also works for index values of type :i32 or :i64.
Different from a dynamic group by, the windows are determined by the
individual values and are not of constant intervals. For constant intervals, use
group_by_dynamic.
The period and offset arguments are created either from a timedelta, or
by using the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_rolling on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
# File 'lib/polars/data_frame.rb', line 2582
def rolling(
  index_column:,
  period:,
  offset: nil,
  closed: "right",
  by: nil
)
  RollingGroupBy.new(self, index_column, period, offset, closed, by)
end
#row(index = nil, by_predicate: nil, named: false) ⇒ Object
The index and by_predicate params are mutually exclusive. Additionally,
to ensure clarity, the by_predicate parameter must be supplied by keyword.
When using by_predicate it is an error condition if anything other than
one row is returned; more than one row raises TooManyRowsReturned, and
zero rows will raise NoRowsReturned (both inherit from RowsException).
Get a row as tuple, either by index or by predicate.
# File 'lib/polars/data_frame.rb', line 5596
def row(index = nil, by_predicate: nil, named: false)
  if !index.nil? && !by_predicate.nil?
    raise ArgumentError, "Cannot set both 'index' and 'by_predicate'; mutually exclusive"
  elsif index.is_a?(Expr)
    raise TypeError, "Expressions should be passed to the 'by_predicate' param"
  end

  if !index.nil?
    row = _df.row_tuple(index)
    if named
      columns.zip(row).to_h
    else
      row
    end
  elsif !by_predicate.nil?
    if !by_predicate.is_a?(Expr)
      raise TypeError, "Expected by_predicate to be an expression; found #{by_predicate.class.name}"
    end
    rows = filter(by_predicate).rows
    n_rows = rows.length
    if n_rows > 1
      raise TooManyRowsReturned, "Predicate #{by_predicate} returned #{n_rows} rows"
    elsif n_rows == 0
      raise NoRowsReturned, "Predicate #{by_predicate} returned no rows"
    end
    row = rows[0]
    if named
      columns.zip(row).to_h
    else
      row
    end
  else
    raise ArgumentError, "One of 'index' or 'by_predicate' must be set"
  end
end
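Both access modes (data illustrative):

df = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => ["x", "y", "z"]})
df.row(0)                                   # [1, "x"]
df.row(by_predicate: Polars.col("a") == 2)  # [2, "y"]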
#rows(named: false) ⇒ Array
Convert columnar data to rows as Ruby arrays.
# File 'lib/polars/data_frame.rb', line 5653
def rows(named: false)
  if named
    columns = self.columns
    _df.row_tuples.map do |v|
      columns.zip(v).to_h
    end
  else
    _df.row_tuples
  end
end
#rows_by_key(key, named: false, include_key: false, unique: false) ⇒ Hash
Convert columnar data to rows as Ruby arrays in a hash keyed by some column.
This method is like rows, but instead of returning rows in a flat list, rows
are grouped by the values in the key column(s) and returned as a hash.
Note that this method should not be used in place of native operations, due to the high cost of materializing all frame data out into a hash; it should be used only when you need to move the values out into a Ruby data structure or other object that cannot operate directly with Polars/Arrow.
# File 'lib/polars/data_frame.rb', line 5720
def rows_by_key(key, named: false, include_key: false, unique: false)
  key = Utils.(self, key)

  keys = key.size == 1 ? get_column(key[0]) : select(key).iter_rows

  if include_key
    values = self
  else
    data_cols = schema.keys - key
    values = select(data_cols)
  end

  zipped = keys.each.zip(values.iter_rows(named: named))

  # if unique, we expect to write just one entry per key; otherwise, we're
  # returning a list of rows for each key, so append into a hash of arrays.
  if unique
    zipped.to_h
  else
    zipped.each_with_object({}) { |(key, data), h| (h[key] ||= []) << data }
  end
end
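For example (data illustrative):

df = Polars::DataFrame.new({"g" => ["a", "a", "b"], "v" => [1, 2, 3]})
df.rows_by_key("g")                # {"a" => [[1], [2]], "b" => [[3]]}
df.rows_by_key("g", unique: true)  # one row per key; later rows overwrite earlier ones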
#sample(n: nil, frac: nil, with_replacement: false, shuffle: false, seed: nil) ⇒ DataFrame
Sample from this DataFrame.
# File 'lib/polars/data_frame.rb', line 5441
def sample(
  n: nil,
  frac: nil,
  with_replacement: false,
  shuffle: false,
  seed: nil
)
  if !n.nil? && !frac.nil?
    raise ArgumentError, "cannot specify both `n` and `frac`"
  end

  if n.nil? && !frac.nil?
    frac = Series.new("frac", [frac]) unless frac.is_a?(Series)

    return _from_rbdf(
      _df.sample_frac(frac._s, with_replacement, shuffle, seed)
    )
  end

  if n.nil?
    n = 1
  end

  n = Series.new("", [n]) unless n.is_a?(Series)

  _from_rbdf(_df.sample_n(n._s, with_replacement, shuffle, seed))
end
#schema ⇒ Hash
Get the schema.
# File 'lib/polars/data_frame.rb', line 248
def schema
  columns.zip(dtypes).to_h
end
#select(*exprs, **named_exprs) ⇒ DataFrame
Select columns from this DataFrame.
# File 'lib/polars/data_frame.rb', line 4665

def select(*exprs, **named_exprs)
  lazy.select(*exprs, **named_exprs).collect(_eager: true)
end
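A minimal sketch (hypothetical data; named arguments become column aliases):

df = Polars::DataFrame.new({"a" => [1, 2, 3]})
df.select("a")                           # a single column by name
df.select(Polars.col("a") * 2)           # an expression
df.select(doubled: Polars.col("a") * 2)  # expression aliased to "doubled"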
#select_seq(*exprs, **named_exprs) ⇒ DataFrame
Select columns from this DataFrame.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
# File 'lib/polars/data_frame.rb', line 4683

def select_seq(*exprs, **named_exprs)
  lazy
    .select_seq(*exprs, **named_exprs)
    .collect(_eager: true)
end
#serialize(file = nil) ⇒ Object
Serialization is not stable across Polars versions: a DataFrame serialized in one Polars version may not be deserializable in another Polars version.
Serialize this DataFrame to a file or string.
# File 'lib/polars/data_frame.rb', line 699

def serialize(file = nil)
  serializer = _df.method(:serialize_binary)
  Utils.serialize_polars_object(serializer, file)
end
#set_sorted(column, descending: false) ⇒ DataFrame
This can lead to incorrect results if the data is NOT sorted! Use with care!
Flag a column as sorted.
This can speed up future operations.
# File 'lib/polars/data_frame.rb', line 6155

def set_sorted(
  column,
  descending: false
)
  lazy
    .set_sorted(column, descending: descending)
    .collect(no_optimization: true)
end
#shape ⇒ Array
Get the shape of the DataFrame.
# File 'lib/polars/data_frame.rb', line 127

def shape
  _df.shape
end
#shift(n, fill_value: nil) ⇒ DataFrame
Shift values by the given period.
# File 'lib/polars/data_frame.rb', line 4483

def shift(n, fill_value: nil)
  lazy.shift(n, fill_value: fill_value).collect(_eager: true)
end
#shift_and_fill(periods, fill_value) ⇒ DataFrame
Shift the values by a given period and fill the resulting null values.
# File 'lib/polars/data_frame.rb', line 4516

def shift_and_fill(periods, fill_value)
  shift(periods, fill_value: fill_value)
end
#shrink_to_fit(in_place: false) ⇒ DataFrame
Shrink DataFrame memory usage.
Shrinks to fit the exact capacity needed to hold the data.
# File 'lib/polars/data_frame.rb', line 5911

def shrink_to_fit(in_place: false)
  if in_place
    _df.shrink_to_fit
    self
  else
    df = clone
    df._df.shrink_to_fit
    df
  end
end
#slice(offset, length = nil) ⇒ DataFrame
Get a slice of this DataFrame.
# File 'lib/polars/data_frame.rb', line 2191

def slice(offset, length = nil)
  if !length.nil? && length < 0
    length = height - offset + length
  end
  _from_rbdf(_df.slice(offset, length))
end
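A sketch of the negative-length behavior visible in the source above (hypothetical frame of height 4):

df = Polars::DataFrame.new({"a" => [1, 2, 3, 4]})
df.slice(1, 2)   # rows at offsets 1..2          => a: [2, 3]
df.slice(1, -1)  # length becomes 4 - 1 - 1 = 2  => a: [2, 3]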
#sort(by, reverse: false, nulls_last: false) ⇒ DataFrame
Sort the DataFrame by column.
# File 'lib/polars/data_frame.rb', line 1863

def sort(by, reverse: false, nulls_last: false)
  lazy
    .sort(by, reverse: reverse, nulls_last: nulls_last)
    .collect(no_optimization: true)
end
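A minimal sketch (hypothetical data; by may also be an expression or an array of columns):

df = Polars::DataFrame.new({"a" => [2, 1, 3], "b" => ["y", "x", "z"]})
df.sort("a")                                  # ascending by "a"
df.sort("a", reverse: true, nulls_last: true) # descending, nulls at the end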
#sort!(by, reverse: false, nulls_last: false) ⇒ DataFrame
Sort the DataFrame by column in-place.
# File 'lib/polars/data_frame.rb', line 1879

def sort!(by, reverse: false, nulls_last: false)
  self._df = sort(by, reverse: reverse, nulls_last: nulls_last)._df
end
#sql(query, table_name: "self") ⇒ DataFrame
This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.
- The calling frame is automatically registered as a table in the SQL context under the name "self". If you want access to the DataFrames and LazyFrames found in the current scope, use the top-level pl.sql function.
- More control over registration and execution behaviour is available by using the SQLContext object.
- The SQL query executes in lazy mode before being collected and returned as a DataFrame.
Execute a SQL query against the DataFrame.
# File 'lib/polars/data_frame.rb', line 1951

def sql(query, table_name: "self")
  ctx = SQLContext.new(eager_execution: true)
  name = table_name || "self"
  ctx.register(name, self)
  ctx.execute(query)
end
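A minimal sketch (hypothetical frame; per the notes above, the calling frame is queryable as "self"):

df = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => ["x", "y", "z"]})
df.sql("SELECT a, b FROM self WHERE a > 1")  # => 2-row DataFrame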
#std(ddof: 1) ⇒ DataFrame
Aggregate the columns of this DataFrame to their standard deviation value.
# File 'lib/polars/data_frame.rb', line 5105

def std(ddof: 1)
  lazy.std(ddof: ddof).collect(_eager: true)
end
#sum ⇒ DataFrame
Aggregate the columns of this DataFrame to their sum value.
# File 'lib/polars/data_frame.rb', line 4978

def sum
  lazy.sum.collect(_eager: true)
end
#sum_horizontal(ignore_nulls: true) ⇒ Series
Sum all values horizontally across columns.
# File 'lib/polars/data_frame.rb', line 5006

def sum_horizontal(ignore_nulls: true)
  select(
    sum: F.sum_horizontal(F.all, ignore_nulls: ignore_nulls)
  ).to_series
end
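A minimal sketch (hypothetical numeric frame; as the source above shows, the result is a single Series named "sum"):

df = Polars::DataFrame.new({"a" => [1, 2], "b" => [10, 20]})
df.sum_horizontal  # => Series "sum" with values [11, 22]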
#tail(n = 5) ⇒ DataFrame
Get the last n rows.
# File 'lib/polars/data_frame.rb', line 2286

def tail(n = 5)
  _from_rbdf(_df.tail(n))
end
#to_a ⇒ Array
Returns an array representing the DataFrame.
# File 'lib/polars/data_frame.rb', line 365

def to_a
  rows(named: true)
end
#to_csv(**options) ⇒ String
Write to comma-separated values (CSV) string.
# File 'lib/polars/data_frame.rb', line 887

def to_csv(**options)
  write_csv(**options)
end
#to_dummies(columns: nil, separator: "_", drop_first: false, drop_nulls: false) ⇒ DataFrame
Get one hot encoded dummy variables.
# File 'lib/polars/data_frame.rb', line 5266

def to_dummies(columns: nil, separator: "_", drop_first: false, drop_nulls: false)
  if columns.is_a?(::String)
    columns = [columns]
  end
  _from_rbdf(_df.to_dummies(columns, separator, drop_first, drop_nulls))
end
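A minimal sketch (hypothetical data; dummy column names combine the source column name and value via separator):

df = Polars::DataFrame.new({"color" => ["red", "blue", "red"]})
df.to_dummies                                   # 0/1 columns such as color_blue, color_red
df.to_dummies(columns: "color", drop_first: true)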
#to_h(as_series: true) ⇒ Hash
Convert DataFrame to a hash mapping column name to values.
# File 'lib/polars/data_frame.rb', line 592

def to_h(as_series: true)
  if as_series
    get_columns.to_h { |s| [s.name, s] }
  else
    get_columns.to_h { |s| [s.name, s.to_a] }
  end
end
#to_hashes ⇒ Array
Convert every row to a hash.
# File 'lib/polars/data_frame.rb', line 609

def to_hashes
  rows(named: true)
end
#to_numo ⇒ Numo::NArray
Convert DataFrame to a 2D Numo array.
This operation clones data.
# File 'lib/polars/data_frame.rb', line 625

def to_numo
  out = _df.to_numo
  if out.nil?
    Numo::NArray.vstack(width.times.map { |i| to_series(i).to_numo }).transpose
  else
    out
  end
end
#to_s ⇒ String Also known as: inspect
Returns a string representing the DataFrame.
# File 'lib/polars/data_frame.rb', line 357

def to_s
  _df.to_s
end
#to_series(index = 0) ⇒ Series
Select column as Series at index location.
# File 'lib/polars/data_frame.rb', line 660

def to_series(index = 0)
  if index < 0
    index = columns.length + index
  end
  Utils.wrap_s(_df.select_at_idx(index))
end
#to_struct(name) ⇒ Series
Convert a DataFrame to a Series of type Struct.
# File 'lib/polars/data_frame.rb', line 6053

def to_struct(name)
  Utils.wrap_s(_df.to_struct(name))
end
#top_k(k, by:, reverse: false) ⇒ DataFrame
Return the k largest rows.
Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort after this function if you want the output sorted.
# File 'lib/polars/data_frame.rb', line 2012

def top_k(
  k,
  by:,
  reverse: false
)
  lazy
    .top_k(k, by: by, reverse: reverse)
    .collect(
      # optimizations=QueryOptFlags(
      #   projection_pushdown=False,
      #   predicate_pushdown=False,
      #   comm_subplan_elim=False,
      #   slice_pushdown=True
      # )
    )
end
#transpose(include_header: false, header_name: "column", column_names: nil) ⇒ DataFrame
This is a very expensive operation. Perhaps you can do it differently.
Transpose a DataFrame over the diagonal.
# File 'lib/polars/data_frame.rb', line 1403

def transpose(include_header: false, header_name: "column", column_names: nil)
  keep_names_as = include_header ? header_name : nil
  _from_rbdf(_df.transpose(keep_names_as, column_names))
end
#unique(maintain_order: true, subset: nil, keep: "first") ⇒ DataFrame
Note that this fails if there is a column of type List in the DataFrame or subset.
Drop duplicate rows from this DataFrame.
# File 'lib/polars/data_frame.rb', line 5311

def unique(maintain_order: true, subset: nil, keep: "first")
  self._from_rbdf(
    lazy
      .unique(maintain_order: maintain_order, subset: subset, keep: keep)
      .collect(no_optimization: true)
      ._df
  )
end
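A minimal sketch (hypothetical data; subset restricts which columns define a duplicate):

df = Polars::DataFrame.new({"a" => [1, 1, 2], "b" => ["x", "y", "z"]})
df.unique                             # all three rows are distinct, so unchanged
df.unique(subset: "a", keep: "last")  # keeps the rows with b: ["y", "z"]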
#unnest(names) ⇒ DataFrame
Decompose a struct into its fields.
The fields will be inserted into the DataFrame at the location of the struct column.
# File 'lib/polars/data_frame.rb', line 6089

def unnest(names)
  if names.is_a?(::String)
    names = [names]
  end
  _from_rbdf(_df.unnest(names))
end
#unpivot(on, index: nil, variable_name: nil, value_name: nil) ⇒ DataFrame Also known as: melt
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are "unpivoted" to the row axis leaving just two non-identifier columns, 'variable' and 'value'.
# File 'lib/polars/data_frame.rb', line 4205

def unpivot(on, index: nil, variable_name: nil, value_name: nil)
  on = on.nil? ? [] : Utils._expand_selectors(self, on)
  index = index.nil? ? [] : Utils._expand_selectors(self, index)

  _from_rbdf(_df.unpivot(on, index, value_name, variable_name))
end
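A minimal sketch (hypothetical wide frame; "variable" and "value" are the default output column names):

df = Polars::DataFrame.new({"id" => [1, 2], "x" => [3, 4], "y" => [5, 6]})
df.unpivot(["x", "y"], index: "id")
# => shape (4, 3) with columns id, variable, value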
#unstack(step:, how: "vertical", columns: nil, fill_values: nil) ⇒ DataFrame
This functionality is experimental and may be subject to changes without it being considered a breaking change.
Unstack a long table to a wide form without doing an aggregation.
This can be much faster than a pivot, because it can skip the grouping phase.
# File 'lib/polars/data_frame.rb', line 4284

def unstack(step:, how: "vertical", columns: nil, fill_values: nil)
  if !columns.nil?
    df = select(columns)
  else
    df = self
  end

  height = df.height
  if how == "vertical"
    n_rows = step
    n_cols = (height / n_rows.to_f).ceil
  else
    n_cols = step
    n_rows = (height / n_cols.to_f).ceil
  end

  n_fill = n_cols * n_rows - height

  if n_fill > 0
    if !fill_values.is_a?(::Array)
      fill_values = [fill_values] * df.width
    end

    df = df.select(
      df.get_columns.zip(fill_values).map do |s, next_fill|
        s.extend_constant(next_fill, n_fill)
      end
    )
  end

  if how == "horizontal"
    df = (
      df.with_column(
        (Polars.arange(0, n_cols * n_rows, eager: true) % n_cols).alias(
          "__sort_order"
        )
      )
      .sort("__sort_order")
      .drop("__sort_order")
    )
  end

  zfill_val = Math.log10(n_cols).floor + 1
  slices =
    df.get_columns.flat_map do |s|
      n_cols.times.map do |slice_nbr|
        s.slice(slice_nbr * n_rows, n_rows).alias("%s_%0#{zfill_val}d" % [s.name, slice_nbr])
      end
    end

  _from_rbdf(DataFrame.new(slices)._df)
end
#update(other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left") ⇒ DataFrame
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
This is syntactic sugar for a left/inner join that preserves the order of the left DataFrame by default, with an optional coalesce when include_nulls: false.
Update the values in this DataFrame with the values in other.
# File 'lib/polars/data_frame.rb', line 6272

def update(
  other,
  on: nil,
  how: "left",
  left_on: nil,
  right_on: nil,
  include_nulls: false,
  maintain_order: "left"
)
  Utils.require_same_type(self, other)

  lazy
    .update(
      other.lazy,
      on: on,
      how: how,
      left_on: left_on,
      right_on: right_on,
      include_nulls: include_nulls,
      maintain_order: maintain_order
    )
    .collect(_eager: true)
end
#upsample(time_column:, every:, by: nil, maintain_order: false) ⇒ DataFrame
Upsample a DataFrame at a regular frequency.
The every argument is specified with the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" (3 days, 12 hours, 4 minutes, and 25 seconds)
# File 'lib/polars/data_frame.rb', line 2930

def upsample(
  time_column:,
  every:,
  by: nil,
  maintain_order: false
)
  if by.nil?
    by = []
  end
  if by.is_a?(::String)
    by = [by]
  end

  every = Utils.parse_as_duration_string(every)

  _from_rbdf(
    _df.upsample(by, time_column, every, maintain_order)
  )
end
#var(ddof: 1) ⇒ DataFrame
Aggregate the columns of this DataFrame to their variance value.
# File 'lib/polars/data_frame.rb', line 5146

def var(ddof: 1)
  lazy.var(ddof: ddof).collect(_eager: true)
end
#vstack(df, in_place: false) ⇒ DataFrame
Grow this DataFrame vertically by stacking a DataFrame to it.
# File 'lib/polars/data_frame.rb', line 3548

def vstack(df, in_place: false)
  if in_place
    _df.vstack_mut(df._df)
    self
  else
    _from_rbdf(_df.vstack(df._df))
  end
end
#width ⇒ Integer
Get the width of the DataFrame.
# File 'lib/polars/data_frame.rb', line 154

def width
  _df.width
end
#with_column(column) ⇒ DataFrame
Return a new DataFrame with the column added or replaced.
# File 'lib/polars/data_frame.rb', line 3463

def with_column(column)
  lazy
    .with_column(column)
    .collect(no_optimization: true, string_cache: false)
end
#with_columns(*exprs, **named_exprs) ⇒ DataFrame
Add columns to this DataFrame.
Added columns will replace existing columns with the same name.
# File 'lib/polars/data_frame.rb', line 4797

def with_columns(*exprs, **named_exprs)
  lazy.with_columns(*exprs, **named_exprs).collect(_eager: true)
end
#with_columns_seq(*exprs, **named_exprs) ⇒ DataFrame
Add columns to this DataFrame.
Added columns will replace existing columns with the same name.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
# File 'lib/polars/data_frame.rb', line 4817

def with_columns_seq(
  *exprs,
  **named_exprs
)
  lazy
    .with_columns_seq(*exprs, **named_exprs)
    .collect(_eager: true)
end
#with_row_index(name: "index", offset: 0) ⇒ DataFrame Also known as: with_row_count
Add a column at index 0 that counts the rows.
# File 'lib/polars/data_frame.rb', line 2449

def with_row_index(name: "index", offset: 0)
  _from_rbdf(_df.with_row_index(name, offset))
end
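A minimal sketch (hypothetical frame):

df = Polars::DataFrame.new({"a" => ["x", "y"]})
df.with_row_index                             # adds "index" column: 0, 1
df.with_row_index(name: "row_nr", offset: 1)  # adds "row_nr" column: 1, 2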
#write_avro(file, compression = "uncompressed", name: "") ⇒ nil
Write to Apache Avro file.
# File 'lib/polars/data_frame.rb', line 901

def write_avro(file, compression = "uncompressed", name: "")
  if compression.nil?
    compression = "uncompressed"
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  if name.nil?
    name = ""
  end

  _df.write_avro(file, compression, name)
end
#write_csv(file = nil, include_header: true, sep: ",", quote: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_precision: nil, null_value: nil) ⇒ String?
Write to comma-separated values (CSV) file.
# File 'lib/polars/data_frame.rb', line 827

def write_csv(
  file = nil,
  include_header: true,
  sep: ",",
  quote: '"',
  batch_size: 1024,
  datetime_format: nil,
  date_format: nil,
  time_format: nil,
  float_precision: nil,
  null_value: nil
)
  if sep.length > 1
    raise ArgumentError, "only single byte separator is allowed"
  elsif quote.length > 1
    raise ArgumentError, "only single byte quote char is allowed"
  elsif null_value == ""
    null_value = nil
  end

  if file.nil?
    buffer = StringIO.new
    buffer.set_encoding(Encoding::BINARY)
    _df.write_csv(
      buffer,
      include_header,
      sep.ord,
      quote.ord,
      batch_size,
      datetime_format,
      date_format,
      time_format,
      float_precision,
      null_value
    )
    return buffer.string.force_encoding(Encoding::UTF_8)
  end

  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  _df.write_csv(
    file,
    include_header,
    sep.ord,
    quote.ord,
    batch_size,
    datetime_format,
    date_format,
    time_format,
    float_precision,
    null_value
  )
  nil
end
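A minimal sketch (hypothetical path; as the source above shows, omitting the file argument returns the CSV as a String):

df = Polars::DataFrame.new({"a" => [1, 2], "b" => ["x", "y"]})
csv_string = df.write_csv          # => "a,b\n1,x\n2,y\n"
df.write_csv("out.csv", sep: ";")  # writes to disk, returns nil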
#write_database(table_name, connection = nil, if_table_exists: "fail") ⇒ Integer
This functionality is experimental. It may be changed at any point without it being considered a breaking change.
Write the data in a Polars DataFrame to a database.
# File 'lib/polars/data_frame.rb', line 1104

def write_database(table_name, connection = nil, if_table_exists: "fail")
  if !defined?(ActiveRecord)
    raise Error, "Active Record not available"
  elsif ActiveRecord::VERSION::MAJOR < 7
    raise Error, "Requires Active Record 7+"
  end

  valid_write_modes = ["append", "replace", "fail"]
  if !valid_write_modes.include?(if_table_exists)
    msg = "write_database `if_table_exists` must be one of #{valid_write_modes.inspect}, got #{if_table_exists.inspect}"
    raise ArgumentError, msg
  end

  with_connection(connection) do |connection|
    table_exists = connection.table_exists?(table_name)
    if table_exists && if_table_exists == "fail"
      raise ArgumentError, "Table already exists"
    end
    create_table = !table_exists || if_table_exists == "replace"

    maybe_transaction(connection, create_table) do
      if create_table
        mysql = connection.adapter_name.match?(/mysql|trilogy/i)
        force = if_table_exists == "replace"
        connection.create_table(table_name, id: false, force: force) do |t|
          schema.each do |c, dtype|
            options = {}
            column_type =
              case dtype
              when Binary
                :binary
              when Boolean
                :boolean
              when Date
                :date
              when Datetime
                :datetime
              when Decimal
                if mysql
                  options[:precision] = dtype.precision || 65
                  options[:scale] = dtype.scale || 30
                end
                :decimal
              when Float32
                options[:limit] = 24
                :float
              when Float64
                options[:limit] = 53
                :float
              when Int8
                options[:limit] = 1
                :integer
              when Int16
                options[:limit] = 2
                :integer
              when Int32
                options[:limit] = 4
                :integer
              when Int64
                options[:limit] = 8
                :integer
              when UInt8
                if mysql
                  options[:limit] = 1
                  options[:unsigned] = true
                else
                  options[:limit] = 2
                end
                :integer
              when UInt16
                if mysql
                  options[:limit] = 2
                  options[:unsigned] = true
                else
                  options[:limit] = 4
                end
                :integer
              when UInt32
                if mysql
                  options[:limit] = 4
                  options[:unsigned] = true
                else
                  options[:limit] = 8
                end
                :integer
              when UInt64
                if mysql
                  options[:limit] = 8
                  options[:unsigned] = true
                  :integer
                else
                  options[:precision] = 20
                  options[:scale] = 0
                  :decimal
                end
              when String
                :text
              when Time
                :time
              else
                raise ArgumentError, "column type not supported yet: #{dtype}"
              end
            t.column c, column_type, **options
          end
        end
      end

      quoted_table = connection.quote_table_name(table_name)
      quoted_columns = columns.map { |c| connection.quote_column_name(c) }
      rows = cast({Polars::UInt64 => Polars::String}).rows(named: false).map { |row| "(#{row.map { |v| connection.quote(v) }.join(", ")})" }
      connection.exec_update("INSERT INTO #{quoted_table} (#{quoted_columns.join(", ")}) VALUES #{rows.join(", ")}")
    end
  end
end
#write_delta(target, mode: "error", storage_options: nil, delta_write_options: nil, delta_merge_options: nil) ⇒ nil
Write DataFrame as a Delta table.
# File 'lib/polars/data_frame.rb', line 1267

def write_delta(
  target,
  mode: "error",
  storage_options: nil,
  delta_write_options: nil,
  delta_merge_options: nil
)
  Polars.send(:_check_if_delta_available)

  if Utils.pathlike?(target)
    target = Polars.send(:_resolve_delta_lake_uri, target.to_s, strict: false)
  end

  data = self

  if mode == "merge"
    if delta_merge_options.nil?
      msg = "You need to pass delta_merge_options with at least a given predicate for `MERGE` to work."
      raise ArgumentError, msg
    end
    if target.is_a?(::String)
      dt = DeltaLake::Table.new(target, storage_options: storage_options)
    else
      dt = target
    end

    predicate = delta_merge_options.delete(:predicate)
    dt.merge(data, predicate, **delta_merge_options)
  else
    delta_write_options ||= {}

    DeltaLake.write(
      target,
      data,
      mode: mode,
      storage_options: storage_options,
      **delta_write_options
    )
  end
end
#write_iceberg(target, mode:) ⇒ nil
This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.
Write DataFrame to an Iceberg table.
# File 'lib/polars/data_frame.rb', line 1234

def write_iceberg(target, mode:)
  require "iceberg"

  table =
    if target.is_a?(Iceberg::Table)
      target
    else
      raise Todo
    end

  data = self

  if mode == "append"
    table.append(data)
  else
    raise Todo
  end
end
#write_ipc(file, compression: "uncompressed", compat_level: nil, storage_options: nil, retries: 2) ⇒ nil
Write to Arrow IPC binary stream or Feather file.
# File 'lib/polars/data_frame.rb', line 941

def write_ipc(
  file,
  compression: "uncompressed",
  compat_level: nil,
  storage_options: nil,
  retries: 2
)
  return_bytes = file.nil?
  if return_bytes
    file = StringIO.new
    file.set_encoding(Encoding::BINARY)
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  if compat_level.nil?
    compat_level = true
  end

  if compression.nil?
    compression = "uncompressed"
  end

  if storage_options&.any?
    storage_options = storage_options.to_a
  else
    storage_options = nil
  end

  _df.write_ipc(file, compression, compat_level, storage_options, retries)
  return_bytes ? file.string : nil
end
#write_ipc_stream(file, compression: "uncompressed", compat_level: nil) ⇒ Object
Write to Arrow IPC record batch stream.
See "Streaming format" in https://arrow.apache.org/docs/python/ipc.html.
# File 'lib/polars/data_frame.rb', line 999

def write_ipc_stream(
  file,
  compression: "uncompressed",
  compat_level: nil
)
  return_bytes = file.nil?
  if return_bytes
    file = StringIO.new
    file.set_encoding(Encoding::BINARY)
  elsif Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  if compat_level.nil?
    compat_level = true
  end

  if compression.nil?
    compression = "uncompressed"
  end

  _df.write_ipc_stream(file, compression, compat_level)
  return_bytes ? file.string : nil
end
#write_json(file = nil) ⇒ nil
Serialize to JSON representation.
# File 'lib/polars/data_frame.rb', line 721

def write_json(file = nil)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  to_string_io = !file.nil? && file.is_a?(StringIO)
  if file.nil? || to_string_io
    buf = StringIO.new
    buf.set_encoding(Encoding::BINARY)
    _df.write_json(buf)
    json_bytes = buf.string

    json_str = json_bytes.force_encoding(Encoding::UTF_8)
    if to_string_io
      file.write(json_str)
    else
      return json_str
    end
  else
    _df.write_json(file)
  end
  nil
end
#write_ndjson(file = nil) ⇒ nil
Serialize to newline delimited JSON representation.
# File 'lib/polars/data_frame.rb', line 760

def write_ndjson(file = nil)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  to_string_io = !file.nil? && file.is_a?(StringIO)
  if file.nil? || to_string_io
    buf = StringIO.new
    buf.set_encoding(Encoding::BINARY)
    _df.write_ndjson(buf)
    json_bytes = buf.string

    json_str = json_bytes.force_encoding(Encoding::UTF_8)
    if to_string_io
      file.write(json_str)
    else
      return json_str
    end
  else
    _df.write_ndjson(file)
  end
  nil
end
#write_parquet(file, compression: "zstd", compression_level: nil, statistics: false, row_group_size: nil, data_page_size: nil) ⇒ nil
Write to Apache Parquet file.
# File 'lib/polars/data_frame.rb', line 1048

def write_parquet(
  file,
  compression: "zstd",
  compression_level: nil,
  statistics: false,
  row_group_size: nil,
  data_page_size: nil
)
  if compression.nil?
    compression = "uncompressed"
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  if statistics == true
    statistics = {
      min: true,
      max: true,
      distinct_count: false,
      null_count: true
    }
  elsif statistics == false
    statistics = {}
  elsif statistics == "full"
    statistics = {
      min: true,
      max: true,
      distinct_count: true,
      null_count: true
    }
  end

  _df.write_parquet(
    file, compression, compression_level, statistics, row_group_size, data_page_size
  )
end
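A minimal sketch (hypothetical path; as the source above shows, statistics: "full" additionally enables distinct_count):

df = Polars::DataFrame.new({"a" => [1, 2, 3]})
df.write_parquet("out.parquet")
df.write_parquet("out.parquet", compression: "zstd", statistics: "full")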