Class: Polars::DataFrame
- Inherits: Object
- Includes: Plot
- Defined in: lib/polars/data_frame.rb
Overview
Two-dimensional data structure representing data as a table with rows and columns.
Instance Method Summary collapse
-
#!=(other) ⇒ DataFrame
Not equal.
-
#%(other) ⇒ DataFrame
Returns the modulo.
-
#*(other) ⇒ DataFrame
Performs multiplication.
-
#+(other) ⇒ DataFrame
Performs addition.
-
#-(other) ⇒ DataFrame
Performs subtraction.
-
#/(other) ⇒ DataFrame
Performs division.
-
#<(other) ⇒ DataFrame
Less than.
-
#<=(other) ⇒ DataFrame
Less than or equal.
-
#==(other) ⇒ DataFrame
Equal.
-
#>(other) ⇒ DataFrame
Greater than.
-
#>=(other) ⇒ DataFrame
Greater than or equal.
-
#[](*args) ⇒ Object
Returns subset of the DataFrame.
-
#[]=(*key, value) ⇒ Object
Set item.
-
#bottom_k(k, by:, reverse: false) ⇒ DataFrame
Return the k smallest rows.
-
#cast(dtypes, strict: true) ⇒ DataFrame
Cast DataFrame column(s) to the specified dtype(s).
-
#clear(n = 0) ⇒ DataFrame
(also: #cleared)
Create an empty copy of the current DataFrame.
-
#collect_schema ⇒ Schema
Get an ordered mapping of column names to their data type.
-
#columns ⇒ Array
Get column names.
-
#columns=(columns) ⇒ Object
Change the column names of the DataFrame.
-
#delete(name) ⇒ Series
Drop in place if exists.
-
#describe ⇒ DataFrame
Summary statistics for a DataFrame.
-
#drop(*columns) ⇒ DataFrame
Remove column from DataFrame and return as new.
-
#drop_in_place(name) ⇒ Series
Drop in place.
-
#drop_nans(subset: nil) ⇒ DataFrame
Drop all rows that contain one or more NaN values.
-
#drop_nulls(subset: nil) ⇒ DataFrame
Drop all rows that contain one or more null values.
-
#dtypes ⇒ Array
Get dtypes of columns in DataFrame.
-
#each(&block) ⇒ Object
Returns an enumerator.
-
#each_row(named: true, buffer_size: 500, &block) ⇒ Object
Returns an iterator over the DataFrame of rows of Ruby-native values.
-
#equals(other, null_equal: true) ⇒ Boolean
(also: #frame_equal)
Check if DataFrame is equal to other.
-
#estimated_size(unit = "b") ⇒ Numeric
Return an estimation of the total (heap) allocated size of the DataFrame.
-
#explode(columns) ⇒ DataFrame
Explode DataFrame to long format by exploding a column with Lists.
-
#extend(other) ⇒ DataFrame
Extend the memory backed by this DataFrame with the values from other.
-
#fill_nan(fill_value) ⇒ DataFrame
Fill floating point NaN values by an Expression evaluation.
-
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ DataFrame
Fill null values using the specified value or strategy.
-
#filter(predicate) ⇒ DataFrame
Filter the rows in the DataFrame based on a predicate expression.
-
#flags ⇒ Hash
Get flags that are set on the columns of this DataFrame.
-
#fold ⇒ Series
Apply a horizontal reduction on a DataFrame.
-
#gather_every(n, offset = 0) ⇒ DataFrame
(also: #take_every)
Take every nth row in the DataFrame and return as a new DataFrame.
-
#get_column(name) ⇒ Series
Get a single column as Series by name.
-
#get_column_index(name) ⇒ Integer
(also: #find_idx_by_name)
Find the index of a column by name.
-
#get_columns ⇒ Array
Get the DataFrame as a Array of Series.
-
#group_by(by, maintain_order: false) ⇒ GroupBy
(also: #groupby, #group)
Start a group by operation.
-
#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: true, include_boundaries: false, closed: "left", by: nil, start_by: "window") ⇒ DataFrame
(also: #groupby_dynamic)
Group based on a time value (or index value of type :i32, :i64).
-
#hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil) ⇒ Series
Hash and combine the rows in this DataFrame.
-
#head(n = 5) ⇒ DataFrame
Get the first n rows.
-
#height ⇒ Integer
(also: #count, #length, #size)
Get the height of the DataFrame.
-
#hstack(columns, in_place: false) ⇒ DataFrame
Return a new DataFrame grown horizontally by stacking multiple Series to it.
-
#include?(name) ⇒ Boolean
Check if DataFrame includes column.
-
#initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ DataFrame
constructor
Create a new DataFrame.
-
#insert_column(index, series) ⇒ DataFrame
(also: #insert_at_idx)
Insert a Series at a certain column index.
-
#interpolate ⇒ DataFrame
Interpolate intermediate values.
-
#is_duplicated ⇒ Series
Get a mask of all duplicated rows in this DataFrame.
-
#is_empty ⇒ Boolean
(also: #empty?)
Check if the dataframe is empty.
-
#is_unique ⇒ Series
Get a mask of all unique rows in this DataFrame.
-
#item ⇒ Object
Return the dataframe as a scalar.
-
#iter_columns ⇒ Object
Returns an iterator over the columns of this DataFrame.
-
#iter_rows(named: false, buffer_size: 500, &block) ⇒ Object
Returns an iterator over the DataFrame of rows of Ruby-native values.
-
#iter_slices(n_rows: 10_000) ⇒ Object
Returns a non-copying iterator of slices over the underlying DataFrame.
-
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", join_nulls: false, coalesce: nil, maintain_order: nil) ⇒ DataFrame
Join in SQL-like fashion.
-
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ DataFrame
Perform an asof join.
-
#join_where(other, *predicates, suffix: "_right") ⇒ DataFrame
Perform a join based on one or multiple (in)equality predicates.
-
#lazy ⇒ LazyFrame
Start a lazy query from this point.
-
#limit(n = 5) ⇒ DataFrame
Get the first n rows.
-
#map_rows(return_dtype: nil, inference_size: 256, &f) ⇒ Object
(also: #apply)
Apply a custom/user-defined function (UDF) over the rows of the DataFrame.
-
#max ⇒ DataFrame
Aggregate the columns of this DataFrame to their maximum value.
-
#max_horizontal ⇒ Series
Get the maximum value horizontally across columns.
-
#mean ⇒ DataFrame
Aggregate the columns of this DataFrame to their mean value.
-
#mean_horizontal(ignore_nulls: true) ⇒ Series
Take the mean of all values horizontally across columns.
-
#median ⇒ DataFrame
Aggregate the columns of this DataFrame to their median value.
-
#merge_sorted(other, key) ⇒ DataFrame
Take two sorted DataFrames and merge them by the sorted key.
-
#min ⇒ DataFrame
Aggregate the columns of this DataFrame to their minimum value.
-
#min_horizontal ⇒ Series
Get the minimum value horizontally across columns.
-
#n_chunks(strategy: "first") ⇒ Object
Get number of chunks used by the ChunkedArrays of this DataFrame.
-
#n_unique(subset: nil) ⇒ DataFrame
Return the number of unique rows, or the number of unique row-subsets.
-
#null_count ⇒ DataFrame
Create a new DataFrame that shows the null counts per column.
-
#partition_by(groups, maintain_order: true, include_key: true, as_dict: false) ⇒ Object
Split into multiple DataFrames partitioned by groups.
-
#pipe(func, *args, **kwargs, &block) ⇒ Object
Offers a structured way to apply a sequence of user-defined functions (UDFs).
-
#pivot(on, index: nil, values: nil, aggregate_function: nil, maintain_order: true, sort_columns: false, separator: "_") ⇒ DataFrame
Create a spreadsheet-style pivot table as a DataFrame.
-
#plot(x = nil, y = nil, type: nil, group: nil, stacked: nil) ⇒ Vega::LiteChart
included
from Plot
Plot data.
-
#product ⇒ DataFrame
Aggregate the columns of this DataFrame to their product values.
-
#quantile(quantile, interpolation: "nearest") ⇒ DataFrame
Aggregate the columns of this DataFrame to their quantile value.
-
#rechunk ⇒ DataFrame
Rechunk the data in this DataFrame to a contiguous allocation; this will make sure all subsequent operations have optimal and predictable performance.
-
#remove(*predicates, **constraints) ⇒ DataFrame
Remove rows, dropping those that match the given predicate expression(s).
-
#rename(mapping, strict: true) ⇒ DataFrame
Rename column names.
-
#replace(column, new_col) ⇒ DataFrame
Replace a column by a new Series.
-
#replace_column(index, series) ⇒ DataFrame
(also: #replace_at_idx)
Replace a column at an index location.
-
#reverse ⇒ DataFrame
Reverse the DataFrame.
-
#rolling(index_column:, period:, offset: nil, closed: "right", by: nil) ⇒ RollingGroupBy
(also: #groupby_rolling, #group_by_rolling)
Create rolling groups based on a time column.
-
#row(index = nil, by_predicate: nil, named: false) ⇒ Object
Get a row as tuple, either by index or by predicate.
-
#rows(named: false) ⇒ Array
Convert columnar data to rows as Ruby arrays.
-
#rows_by_key(key, named: false, include_key: false, unique: false) ⇒ Hash
Convert columnar data to rows as Ruby arrays in a hash keyed by some column.
-
#sample(n: nil, frac: nil, with_replacement: false, shuffle: false, seed: nil) ⇒ DataFrame
Sample from this DataFrame.
-
#schema ⇒ Hash
Get the schema.
-
#select(*exprs, **named_exprs) ⇒ DataFrame
Select columns from this DataFrame.
-
#select_seq(*exprs, **named_exprs) ⇒ DataFrame
Select columns from this DataFrame.
-
#set_sorted(column, descending: false) ⇒ DataFrame
Flag a column as sorted.
-
#shape ⇒ Array
Get the shape of the DataFrame.
-
#shift(n, fill_value: nil) ⇒ DataFrame
Shift values by the given period.
-
#shift_and_fill(periods, fill_value) ⇒ DataFrame
Shift the values by a given period and fill the resulting null values.
-
#shrink_to_fit(in_place: false) ⇒ DataFrame
Shrink DataFrame memory usage.
-
#slice(offset, length = nil) ⇒ DataFrame
Get a slice of this DataFrame.
-
#sort(by, reverse: false, nulls_last: false) ⇒ DataFrame
Sort the DataFrame by column.
-
#sort!(by, reverse: false, nulls_last: false) ⇒ DataFrame
Sort the DataFrame by column in-place.
-
#sql(query, table_name: "self") ⇒ DataFrame
Execute a SQL query against the DataFrame.
-
#std(ddof: 1) ⇒ DataFrame
Aggregate the columns of this DataFrame to their standard deviation value.
-
#sum ⇒ DataFrame
Aggregate the columns of this DataFrame to their sum value.
-
#sum_horizontal(ignore_nulls: true) ⇒ Series
Sum all values horizontally across columns.
-
#tail(n = 5) ⇒ DataFrame
Get the last n rows.
-
#to_a ⇒ Array
Returns an array representing the DataFrame.
-
#to_csv(**options) ⇒ String
Write to comma-separated values (CSV) string.
-
#to_dummies(columns: nil, separator: "_", drop_first: false, drop_nulls: false) ⇒ DataFrame
Get one hot encoded dummy variables.
-
#to_h(as_series: true) ⇒ Hash
Convert DataFrame to a hash mapping column name to values.
-
#to_hashes ⇒ Array
Convert every row to a hash.
-
#to_numo ⇒ Numo::NArray
Convert DataFrame to a 2D Numo array.
-
#to_s ⇒ String
(also: #inspect)
Returns a string representing the DataFrame.
-
#to_series(index = 0) ⇒ Series
Select column as Series at index location.
-
#to_struct(name) ⇒ Series
Convert a DataFrame to a Series of type Struct.
-
#top_k(k, by:, reverse: false) ⇒ DataFrame
Return the k largest rows.
-
#transpose(include_header: false, header_name: "column", column_names: nil) ⇒ DataFrame
Transpose a DataFrame over the diagonal.
-
#unique(maintain_order: true, subset: nil, keep: "first") ⇒ DataFrame
Drop duplicate rows from this DataFrame.
-
#unnest(names) ⇒ DataFrame
Decompose a struct into its fields.
-
#unpivot(on, index: nil, variable_name: nil, value_name: nil) ⇒ DataFrame
(also: #melt)
Unpivot a DataFrame from wide to long format.
-
#unstack(step:, how: "vertical", columns: nil, fill_values: nil) ⇒ DataFrame
Unstack a long table to a wide form without doing an aggregation.
-
#update(other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left") ⇒ DataFrame
Update the values in this DataFrame with the values in other.
-
#upsample(time_column:, every:, by: nil, maintain_order: false) ⇒ DataFrame
Upsample a DataFrame at a regular frequency.
-
#var(ddof: 1) ⇒ DataFrame
Aggregate the columns of this DataFrame to their variance value.
-
#vstack(df, in_place: false) ⇒ DataFrame
Grow this DataFrame vertically by stacking a DataFrame to it.
-
#width ⇒ Integer
Get the width of the DataFrame.
-
#with_column(column) ⇒ DataFrame
Return a new DataFrame with the column added or replaced.
-
#with_columns(*exprs, **named_exprs) ⇒ DataFrame
Add columns to this DataFrame.
-
#with_columns_seq(*exprs, **named_exprs) ⇒ DataFrame
Add columns to this DataFrame.
-
#with_row_index(name: "index", offset: 0) ⇒ DataFrame
(also: #with_row_count)
Add a column at index 0 that counts the rows.
-
#write_avro(file, compression = "uncompressed", name: "") ⇒ nil
Write to Apache Avro file.
-
#write_csv(file = nil, include_header: true, sep: ",", quote: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_precision: nil, null_value: nil) ⇒ String?
Write to comma-separated values (CSV) file.
-
#write_database(table_name, connection = nil, if_table_exists: "fail") ⇒ Integer
Write the data in a Polars DataFrame to a database.
-
#write_delta(target, mode: "error", storage_options: nil, delta_write_options: nil, delta_merge_options: nil) ⇒ nil
Write DataFrame as delta table.
-
#write_ipc(file, compression: "uncompressed", compat_level: nil, storage_options: nil, retries: 2) ⇒ nil
Write to Arrow IPC binary stream or Feather file.
-
#write_ipc_stream(file, compression: "uncompressed", compat_level: nil) ⇒ Object
Write to Arrow IPC record batch stream.
-
#write_json(file = nil) ⇒ nil
Serialize to JSON representation.
-
#write_ndjson(file = nil) ⇒ nil
Serialize to newline delimited JSON representation.
-
#write_parquet(file, compression: "zstd", compression_level: nil, statistics: false, row_group_size: nil, data_page_size: nil) ⇒ nil
Write to Apache Parquet file.
Constructor Details
#initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: 100, nan_to_null: false) ⇒ DataFrame
Create a new DataFrame.
# File 'lib/polars/data_frame.rb', line 50

def initialize(data = nil, schema: nil, schema_overrides: nil, strict: true, orient: nil, infer_schema_length: 100, nan_to_null: false)
  if defined?(ActiveRecord) && (data.is_a?(ActiveRecord::Relation) || data.is_a?(ActiveRecord::Result))
    raise ArgumentError, "Use read_database instead"
  end

  if data.nil?
    self._df = self.class.hash_to_rbdf({}, schema: schema, schema_overrides: schema_overrides)
  elsif data.is_a?(Hash)
    data = data.transform_keys { |v| v.is_a?(Symbol) ? v.to_s : v }
    self._df = self.class.hash_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict, nan_to_null: nan_to_null)
  elsif data.is_a?(::Array)
    self._df = self.class.sequence_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict, orient: orient, infer_schema_length: infer_schema_length)
  elsif data.is_a?(Series)
    self._df = self.class.series_to_rbdf(data, schema: schema, schema_overrides: schema_overrides, strict: strict)
  elsif data.respond_to?(:arrow_c_stream)
    # This uses the fact that RbSeries.from_arrow_c_stream will create a
    # struct-typed Series. Then we unpack that to a DataFrame.
    tmp_col_name = ""
    s = Utils.wrap_s(RbSeries.from_arrow_c_stream(data))
    self._df = s.to_frame(tmp_col_name).unnest(tmp_col_name)._df
  else
    raise ArgumentError, "DataFrame constructor called with unsupported type; got #{data.class.name}"
  end
end
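A minimal usage sketch based on the constructor signature above; the column names and data are illustrative, not from the library:

df = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => ["x", "y", "z"]})
# row-oriented input with an explicit schema
df = Polars::DataFrame.new([[1, "x"], [2, "y"]], schema: ["a", "b"], orient: "row")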
Instance Method Details
#!=(other) ⇒ DataFrame
Not equal.
# File 'lib/polars/data_frame.rb', line 225

def !=(other)
  _comp(other, "neq")
end
#%(other) ⇒ DataFrame
Returns the modulo.
# File 'lib/polars/data_frame.rb', line 308

def %(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.rem_df(other._df))
  end
  other = _prepare_other_arg(other)
  _from_rbdf(_df.rem(other._s))
end
#*(other) ⇒ DataFrame
Performs multiplication.
# File 'lib/polars/data_frame.rb', line 260

def *(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.mul_df(other._df))
  end
  other = _prepare_other_arg(other)
  _from_rbdf(_df.mul(other._s))
end
#+(other) ⇒ DataFrame
Performs addition.
# File 'lib/polars/data_frame.rb', line 284

def +(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.add_df(other._df))
  end
  other = _prepare_other_arg(other)
  _from_rbdf(_df.add(other._s))
end
#-(other) ⇒ DataFrame
Performs subtraction.
# File 'lib/polars/data_frame.rb', line 296

def -(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.sub_df(other._df))
  end
  other = _prepare_other_arg(other)
  _from_rbdf(_df.sub(other._s))
end
#/(other) ⇒ DataFrame
Performs division.
# File 'lib/polars/data_frame.rb', line 272

def /(other)
  if other.is_a?(DataFrame)
    return _from_rbdf(_df.div_df(other._df))
  end
  other = _prepare_other_arg(other)
  _from_rbdf(_df.div(other._s))
end
#<(other) ⇒ DataFrame
Less than.
# File 'lib/polars/data_frame.rb', line 239

def <(other)
  _comp(other, "lt")
end
#<=(other) ⇒ DataFrame
Less than or equal.
# File 'lib/polars/data_frame.rb', line 253

def <=(other)
  _comp(other, "lt_eq")
end
#==(other) ⇒ DataFrame
Equal.
# File 'lib/polars/data_frame.rb', line 218

def ==(other)
  _comp(other, "eq")
end
#>(other) ⇒ DataFrame
Greater than.
# File 'lib/polars/data_frame.rb', line 232

def >(other)
  _comp(other, "gt")
end
#>=(other) ⇒ DataFrame
Greater than or equal.
# File 'lib/polars/data_frame.rb', line 246

def >=(other)
  _comp(other, "gt_eq")
end
#[](*args) ⇒ Object
Returns subset of the DataFrame.
# File 'lib/polars/data_frame.rb', line 349

def [](*args)
  if args.size == 2
    row_selection, col_selection = args

    # df[.., unknown]
    if row_selection.is_a?(Range)
      # multiple slices
      # df[.., ..]
      if col_selection.is_a?(Range)
        raise Todo
      end
    end

    # df[2, ..] (select row as df)
    if row_selection.is_a?(Integer)
      if col_selection.is_a?(::Array)
        df = self[0.., col_selection]
        return df.slice(row_selection, 1)
      end
      # df[2, "a"]
      if col_selection.is_a?(::String) || col_selection.is_a?(Symbol)
        return self[col_selection][row_selection]
      end
    end

    # column selection can be "a" and ["a", "b"]
    if col_selection.is_a?(::String) || col_selection.is_a?(Symbol)
      col_selection = [col_selection]
    end

    # df[.., 1]
    if col_selection.is_a?(Integer)
      series = to_series(col_selection)
      return series[row_selection]
    end

    if col_selection.is_a?(::Array)
      # df[.., [1, 2]]
      if Utils.is_int_sequence(col_selection)
        series_list = col_selection.map { |i| to_series(i) }
        df = self.class.new(series_list)
        return df[row_selection]
      end
    end

    df = self[col_selection]
    return df[row_selection]
  elsif args.size == 1
    item = args[0]

    # select single column
    # df["foo"]
    if item.is_a?(::String) || item.is_a?(Symbol)
      return Utils.wrap_s(_df.get_column(item.to_s))
    end

    # df[idx]
    if item.is_a?(Integer)
      return slice(_pos_idx(item, 0), 1)
    end

    # df[..]
    if item.is_a?(Range)
      return Slice.new(self).apply(item)
    end

    if item.is_a?(::Array) && item.all? { |v| Utils.strlike?(v) }
      # select multiple columns
      # df[["foo", "bar"]]
      return _from_rbdf(_df.select(item.map(&:to_s)))
    end

    if Utils.is_int_sequence(item)
      item = Series.new("", item)
    end

    if item.is_a?(Series)
      dtype = item.dtype
      if dtype == String
        return _from_rbdf(_df.select(item))
      elsif dtype == UInt32
        return _from_rbdf(_df.take_with_series(item._s))
      elsif [UInt8, UInt16, UInt64, Int8, Int16, Int32, Int64].include?(dtype)
        return _from_rbdf(
          _df.take_with_series(_pos_idxs(item, 0)._s)
        )
      end
    end
  end

  # Ruby-specific
  if item.is_a?(Expr) || item.is_a?(Series)
    return filter(item)
  end

  raise ArgumentError, "Cannot get item of type: #{item.class.name}"
end
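Based on the branches above, a few illustrative lookups (df and its columns are hypothetical):

df["a"]          # single column as a Series
df[["a", "b"]]   # multiple columns as a DataFrame
df[0]            # first row as a one-row DataFrame
df[0, "a"]       # scalar at row 0 of column "a"
df[1..3, "a"]    # a range of rows from one column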
#[]=(*key, value) ⇒ Object
Set item.
# File 'lib/polars/data_frame.rb', line 451

def []=(*key, value)
  if key.length == 1
    key = key.first
  elsif key.length != 2
    raise ArgumentError, "wrong number of arguments (given #{key.length + 1}, expected 2..3)"
  end

  if Utils.strlike?(key)
    if value.is_a?(::Array) || (defined?(Numo::NArray) && value.is_a?(Numo::NArray))
      value = Series.new(value)
    elsif !value.is_a?(Series)
      value = Polars.lit(value)
    end
    self._df = with_column(value.alias(key.to_s))._df
  elsif key.is_a?(::Array)
    row_selection, col_selection = key

    if Utils.strlike?(col_selection)
      s = self[col_selection]
    elsif col_selection.is_a?(Integer)
      raise Todo
    else
      raise ArgumentError, "column selection not understood: #{col_selection}"
    end

    s[row_selection] = value

    if col_selection.is_a?(Integer)
      replace_column(col_selection, s)
    elsif Utils.strlike?(col_selection)
      replace(col_selection, s)
    end
  else
    raise Todo
  end
end
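For example (hypothetical three-row frame), adding or overwriting a column by name:

df["c"] = [7, 8, 9]                            # from an Array
df["d"] = Polars::Series.new([1.0, 2.0, 3.0])  # from a Series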
#bottom_k(k, by:, reverse: false) ⇒ DataFrame
Return the k smallest rows.
Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort after this function if you want the output to be sorted.
# File 'lib/polars/data_frame.rb', line 1981

def bottom_k(
  k,
  by:,
  reverse: false
)
  lazy
    .bottom_k(k, by: by, reverse: reverse)
    .collect(
      # optimizations=QueryOptFlags(
      #   projection_pushdown=False,
      #   predicate_pushdown=False,
      #   comm_subplan_elim=False,
      #   slice_pushdown=True,
      # )
    )
end
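A small illustrative example (data is hypothetical):

df = Polars::DataFrame.new({"a" => [4, 1, 3, 2]})
df.bottom_k(2, by: "a")   # rows with a = 1 and a = 2, in no guaranteed order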
#cast(dtypes, strict: true) ⇒ DataFrame
Cast DataFrame column(s) to the specified dtype(s).
# File 'lib/polars/data_frame.rb', line 3668

def cast(dtypes, strict: true)
  lazy.cast(dtypes, strict: strict).collect(_eager: true)
end
#clear(n = 0) ⇒ DataFrame Also known as: cleared
Create an empty copy of the current DataFrame.
Returns a DataFrame with identical schema but no data.
# File 'lib/polars/data_frame.rb', line 3708

def clear(n = 0)
  if n == 0
    _from_rbdf(_df.clear)
  elsif n > 0 || len > 0
    self.class.new(
      schema.to_h { |nm, tp| [nm, Series.new(nm, [], dtype: tp).extend_constant(nil, n)] }
    )
  else
    clone
  end
end
#collect_schema ⇒ Schema
This method is included to facilitate writing code that is generic for both DataFrame and LazyFrame.
Get an ordered mapping of column names to their data type.
# File 'lib/polars/data_frame.rb', line 528

def collect_schema
  Schema.new(columns.zip(dtypes), check_dtypes: false)
end
#columns ⇒ Array
Get column names.
# File 'lib/polars/data_frame.rb', line 135

def columns
  _df.columns
end
#columns=(columns) ⇒ Object
Change the column names of the DataFrame.
# File 'lib/polars/data_frame.rb', line 168

def columns=(columns)
  _df.set_column_names(columns)
end
#delete(name) ⇒ Series
Drop in place if exists.
# File 'lib/polars/data_frame.rb', line 3615

def delete(name)
  drop_in_place(name) if include?(name)
end
#describe ⇒ DataFrame
Summary statistics for a DataFrame.
# File 'lib/polars/data_frame.rb', line 1616

def describe
  describe_cast = lambda do |stat|
    columns = []
    self.columns.each_with_index do |s, i|
      if self[s].is_numeric || self[s].is_boolean
        columns << stat[0.., i].cast(:f64)
      else
        # for dates, strings, etc, we cast to string so that all
        # statistics can be shown
        columns << stat[0.., i].cast(:str)
      end
    end
    self.class.new(columns)
  end

  summary = _from_rbdf(
    Polars.concat(
      [
        describe_cast.(
          self.class.new(columns.to_h { |c| [c, [height]] })
        ),
        describe_cast.(null_count),
        describe_cast.(mean),
        describe_cast.(std),
        describe_cast.(min),
        describe_cast.(max),
        describe_cast.(median)
      ]
    )._df
  )
  summary.insert_column(
    0,
    Polars::Series.new(
      "describe",
      ["count", "null_count", "mean", "std", "min", "max", "median"],
    )
  )
  summary
end
#drop(*columns) ⇒ DataFrame
Remove column from DataFrame and return as new.
# File 'lib/polars/data_frame.rb', line 3555

def drop(*columns)
  lazy.drop(*columns).collect(_eager: true)
end
#drop_in_place(name) ⇒ Series
Drop in place.
# File 'lib/polars/data_frame.rb', line 3583

def drop_in_place(name)
  Utils.wrap_s(_df.drop_in_place(name))
end
#drop_nans(subset: nil) ⇒ DataFrame
Drop all rows that contain one or more NaN values.
The original order of the remaining rows is preserved.
# File 'lib/polars/data_frame.rb', line 2230

def drop_nans(subset: nil)
  lazy.drop_nans(subset: subset).collect(_eager: true)
end
#drop_nulls(subset: nil) ⇒ DataFrame
Drop all rows that contain one or more null values.
The original order of the remaining rows is preserved.
# File 'lib/polars/data_frame.rb', line 2275

def drop_nulls(subset: nil)
  lazy.drop_nulls(subset: subset).collect(_eager: true)
end
#dtypes ⇒ Array
Get dtypes of columns in DataFrame. Dtypes can also be found in column headers when printing the DataFrame.
# File 'lib/polars/data_frame.rb', line 186

def dtypes
  _df.dtypes
end
#each(&block) ⇒ Object
Returns an enumerator.
# File 'lib/polars/data_frame.rb', line 342

def each(&block)
  get_columns.each(&block)
end
#each_row(named: true, buffer_size: 500, &block) ⇒ Object
Returns an iterator over the DataFrame of rows of Ruby-native values.
# File 'lib/polars/data_frame.rb', line 5716

def each_row(named: true, buffer_size: 500, &block)
  iter_rows(named: named, buffer_size: buffer_size, &block)
end
#equals(other, null_equal: true) ⇒ Boolean Also known as: frame_equal
Check if DataFrame is equal to other.
# File 'lib/polars/data_frame.rb', line 2026

def equals(other, null_equal: true)
  _df.equals(other._df, null_equal)
end
#estimated_size(unit = "b") ⇒ Numeric
Return an estimation of the total (heap) allocated size of the DataFrame.
Estimated size is given in the specified unit (bytes by default).
This estimation is the sum of the size of its buffers, validity, including nested arrays. Multiple arrays may share buffers and bitmaps. Therefore, the size of 2 arrays is not the sum of the sizes computed from this function. In particular, StructArray's size is an upper bound.
When an array is sliced, its allocated size remains constant because the buffer is unchanged. However, this function will yield a smaller number. This is because this function returns the visible size of the buffer, not its total capacity.
FFI buffers are included in this estimation.
# File 'lib/polars/data_frame.rb', line 1239

def estimated_size(unit = "b")
  sz = _df.estimated_size
  Utils.scale_bytes(sz, to: unit)
end
#explode(columns) ⇒ DataFrame
Explode DataFrame to long format by exploding a column with Lists.
# File 'lib/polars/data_frame.rb', line 3957

def explode(columns)
  lazy.explode(columns).collect(no_optimization: true)
end
#extend(other) ⇒ DataFrame
Extend the memory backed by this DataFrame with the values from other.

Different from vstack, which adds the chunks from other to the chunks of this DataFrame, extend appends the data from other to the underlying memory locations and thus may cause a reallocation.

If this does not cause a reallocation, the resulting data structure will not have any extra chunks and thus will yield faster queries.

Prefer extend over vstack when you want to do a query after a single append. For instance, during online operations where you add n rows and rerun a query.

Prefer vstack over extend when you want to append many times before doing a query. For instance, when you read in multiple files and want to store them in a single DataFrame. In the latter case, finish the sequence of vstack operations with a rechunk.
# File 'lib/polars/data_frame.rb', line 3495

def extend(other)
  _df.extend(other._df)
  self
end
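A sketch of the guidance above (df, batch, and batches are hypothetical):

df.extend(batch)                                    # single append, query right away
batches.each { |b| df.vstack(b, in_place: true) }   # many appends...
df = df.rechunk                                     # ...then rechunk before querying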
#fill_nan(fill_value) ⇒ DataFrame
Note that floating point NaNs (Not a Number) are not missing values! To replace missing values, use fill_null.
Fill floating point NaN values by an Expression evaluation.
# File 'lib/polars/data_frame.rb', line 3922

def fill_nan(fill_value)
  lazy.fill_nan(fill_value).collect(no_optimization: true)
end
#fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true) ⇒ DataFrame
Fill null values using the specified value or strategy.
# File 'lib/polars/data_frame.rb', line 3882

def fill_null(value = nil, strategy: nil, limit: nil, matches_supertype: true)
  _from_rbdf(
    lazy
      .fill_null(value, strategy: strategy, limit: limit, matches_supertype: matches_supertype)
      .collect(no_optimization: true)
      ._df
  )
end
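Illustrative calls (the data is hypothetical; "forward" is one of the fill strategies in Polars):

df = Polars::DataFrame.new({"a" => [1, nil, 3]})
df.fill_null(99)                   # fill with a fixed value
df.fill_null(strategy: "forward")  # or propagate the last seen value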
#filter(predicate) ⇒ DataFrame
Filter the rows in the DataFrame based on a predicate expression.
# File 'lib/polars/data_frame.rb', line 1462

def filter(predicate)
  lazy.filter(predicate).collect
end
#flags ⇒ Hash
Get flags that are set on the columns of this DataFrame.
# File 'lib/polars/data_frame.rb', line 193

def flags
  columns.to_h { |name| [name, self[name].flags] }
end
#fold ⇒ Series
Apply a horizontal reduction on a DataFrame.
This can be used to effectively determine aggregations on a row level, and can be applied to any DataType that can be supercasted (casted to a similar parent type).
For example, the supercast rules when applying an arithmetic operation on two DataTypes are:
- i8 + str = str
- f32 + i64 = f32
- f32 + f64 = f64
# File 'lib/polars/data_frame.rb', line 5446

def fold
  acc = to_series(0)
  1.upto(width - 1) do |i|
    acc = yield(acc, to_series(i))
  end
  acc
end
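For example, summing across columns row-wise (hypothetical data):

df = Polars::DataFrame.new({"a" => [1, 2], "b" => [10, 20]})
df.fold { |s1, s2| s1 + s2 }   # => Series [11, 22]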
#gather_every(n, offset = 0) ⇒ DataFrame Also known as: take_every
Take every nth row in the DataFrame and return as a new DataFrame.
# File 'lib/polars/data_frame.rb', line 5837

def gather_every(n, offset = 0)
  select(F.col("*").gather_every(n, offset))
end
#get_column(name) ⇒ Series
Get a single column as Series by name.
# File 'lib/polars/data_frame.rb', line 3799

def get_column(name)
  self[name]
end
#get_column_index(name) ⇒ Integer Also known as: find_idx_by_name
Find the index of a column by name.
# File 'lib/polars/data_frame.rb', line 1669

def get_column_index(name)
  _df.get_column_index(name)
end
#get_columns ⇒ Array
Get the DataFrame as a Array of Series.
# File 'lib/polars/data_frame.rb', line 3777

def get_columns
  _df.get_columns.map { |s| Utils.wrap_s(s) }
end
#group_by(by, maintain_order: false) ⇒ GroupBy Also known as: groupby, group
Start a group by operation.
# File 'lib/polars/data_frame.rb', line 2383

def group_by(by, maintain_order: false)
  if !Utils.bool?(maintain_order)
    raise TypeError, "invalid input for group_by arg `maintain_order`: #{maintain_order}."
  end
  GroupBy.new(
    self,
    by,
    maintain_order: maintain_order
  )
end
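A typical aggregation sketch (assumes GroupBy responds to agg; the columns are hypothetical):

df.group_by("color").agg(Polars.col("price").sum)
df.group_by("color", maintain_order: true).agg(Polars.col("price").mean)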
#group_by_dynamic(index_column, every:, period: nil, offset: nil, truncate: true, include_boundaries: false, closed: "left", by: nil, start_by: "window") ⇒ DataFrame Also known as: groupby_dynamic
Group based on a time value (or index value of type :i32, :i64).
Time windows are calculated and rows are assigned to windows. Different from a normal group by, a row can be a member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.
A window is defined by:
- every: interval of the window
- period: length of the window
- offset: offset of the window
The every, period and offset arguments are created with the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_dynamic on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
# File 'lib/polars/data_frame.rb', line 2739

def group_by_dynamic(
  index_column,
  every:,
  period: nil,
  offset: nil,
  truncate: true,
  include_boundaries: false,
  closed: "left",
  by: nil,
  start_by: "window"
)
  DynamicGroupBy.new(
    self,
    index_column,
    every,
    period,
    offset,
    truncate,
    include_boundaries,
    closed,
    by,
    start_by
  )
end
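An illustrative sketch (assumes a sorted datetime column "time", a "value" column, and that the returned group object responds to agg):

df.group_by_dynamic("time", every: "1h", period: "2h").agg(
  Polars.col("value").sum
)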
#hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil) ⇒ Series
Hash and combine the rows in this DataFrame.
The hash value is of type :u64.
# File 'lib/polars/data_frame.rb', line 5874

def hash_rows(seed: 0, seed_1: nil, seed_2: nil, seed_3: nil)
  k0 = seed
  k1 = seed_1.nil? ? seed : seed_1
  k2 = seed_2.nil? ? seed : seed_2
  k3 = seed_3.nil? ? seed : seed_3
  Utils.wrap_s(_df.hash_rows(k0, k1, k2, k3))
end
#head(n = 5) ⇒ DataFrame
Get the first n rows.
# File 'lib/polars/data_frame.rb', line 2153

def head(n = 5)
  _from_rbdf(_df.head(n))
end
#height ⇒ Integer Also known as: count, length, size
Get the height of the DataFrame.
# File 'lib/polars/data_frame.rb', line 102

def height
  _df.height
end
#hstack(columns, in_place: false) ⇒ DataFrame
Return a new DataFrame grown horizontally by stacking multiple Series to it.
# File 'lib/polars/data_frame.rb', line 3397

def hstack(columns, in_place: false)
  if !columns.is_a?(::Array)
    columns = columns.get_columns
  end
  if in_place
    _df.hstack_mut(columns.map(&:_s))
    self
  else
    _from_rbdf(_df.hstack(columns.map(&:_s)))
  end
end
#include?(name) ⇒ Boolean
Check if DataFrame includes column.
# File 'lib/polars/data_frame.rb', line 335

def include?(name)
  columns.include?(name)
end
#insert_column(index, series) ⇒ DataFrame Also known as: insert_at_idx
Insert a Series at a certain column index. This operation is in place.
# File 'lib/polars/data_frame.rb', line 1415

def insert_column(index, series)
  if index < 0
    index = columns.length + index
  end
  _df.insert_column(index, series._s)
  self
end
#interpolate ⇒ DataFrame
Interpolate intermediate values. The interpolation method is linear.
# File 'lib/polars/data_frame.rb', line 5907

def interpolate
  select(F.col("*").interpolate)
end
#is_duplicated ⇒ Series
Get a mask of all duplicated rows in this DataFrame.
# File 'lib/polars/data_frame.rb', line 4439

def is_duplicated
  Utils.wrap_s(_df.is_duplicated)
end
#is_empty ⇒ Boolean Also known as: empty?
Check if the dataframe is empty.
# File 'lib/polars/data_frame.rb', line 5921

def is_empty
  height == 0
end
#is_unique ⇒ Series
Get a mask of all unique rows in this DataFrame.
# File 'lib/polars/data_frame.rb', line 4464

def is_unique
  Utils.wrap_s(_df.is_unique)
end
#item ⇒ Object
Return the dataframe as a scalar.
Equivalent to df[0, 0], with a check that the shape is (1,1).
# File 'lib/polars/data_frame.rb', line 543

def item
  if shape != [1, 1]
    raise ArgumentError, "Can only call .item if the dataframe is of shape (1,1), dataframe is of shape #{shape}"
  end
  self[0, 0]
end
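For example, extracting a single aggregated value (hypothetical column):

total = df.select(Polars.col("amount").sum).item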
#iter_columns ⇒ Object
Consider whether you can use all instead.
If you can, it will be more efficient.
Returns an iterator over the columns of this DataFrame.
# File 'lib/polars/data_frame.rb', line 5766

def iter_columns
  return to_enum(:iter_columns) unless block_given?

  _df.get_columns.each do |s|
    yield Utils.wrap_s(s)
  end
end
#iter_rows(named: false, buffer_size: 500, &block) ⇒ Object
Returns an iterator over the DataFrame of rows of Ruby-native values.
# File 'lib/polars/data_frame.rb', line 5669

def iter_rows(named: false, buffer_size: 500, &block)
  return to_enum(:iter_rows, named: named, buffer_size: buffer_size) unless block_given?

  # load into the local namespace for a modest performance boost in the hot loops
  columns = self.columns

  # note: buffering rows results in a 2-4x speedup over individual calls
  # to ".row(i)", so it should only be disabled in extremely specific cases.
  if buffer_size
    offset = 0
    while offset < height
      zerocopy_slice = slice(offset, buffer_size)
      rows_chunk = zerocopy_slice.rows(named: false)
      if named
        rows_chunk.each do |row|
          yield columns.zip(row).to_h
        end
      else
        rows_chunk.each(&block)
      end
      offset += buffer_size
    end
  elsif named
    height.times do |i|
      yield columns.zip(row(i)).to_h
    end
  else
    height.times do |i|
      yield row(i)
    end
  end
end
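Illustrative iteration (hypothetical column "a"):

df.iter_rows { |row| p row }                    # arrays of values
df.iter_rows(named: true) { |row| p row["a"] }  # hashes keyed by column name
rows = df.iter_rows.to_a                        # enumerator form, no block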
#iter_slices(n_rows: 10_000) ⇒ Object
Returns a non-copying iterator of slices over the underlying DataFrame.
# File 'lib/polars/data_frame.rb', line 5794

def iter_slices(n_rows: 10_000)
  return to_enum(:iter_slices, n_rows: n_rows) unless block_given?

  offset = 0
  while offset < height
    yield slice(offset, n_rows)
    offset += n_rows
  end
end
#join(other, left_on: nil, right_on: nil, on: nil, how: "inner", suffix: "_right", validate: "m:m", join_nulls: false, coalesce: nil, maintain_order: nil) ⇒ DataFrame
Join in SQL-like fashion.
# File 'lib/polars/data_frame.rb', line 3128

def join(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  how: "inner",
  suffix: "_right",
  validate: "m:m",
  join_nulls: false,
  coalesce: nil,
  maintain_order: nil
)
  lazy
    .join(
      other.lazy,
      left_on: left_on,
      right_on: right_on,
      on: on,
      how: how,
      suffix: suffix,
      validate: validate,
      join_nulls: join_nulls,
      coalesce: coalesce,
      maintain_order: maintain_order
    )
    .collect(no_optimization: true)
end
#join_asof(other, left_on: nil, right_on: nil, on: nil, by_left: nil, by_right: nil, by: nil, strategy: "backward", suffix: "_right", tolerance: nil, allow_parallel: true, force_parallel: false, coalesce: true, allow_exact_matches: true, check_sortedness: true) ⇒ DataFrame
Perform an asof join.
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the asof_join key.
For each row in the left DataFrame:
- A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
- A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.
The default is "backward".
# File 'lib/polars/data_frame.rb', line 2962

def join_asof(
  other,
  left_on: nil,
  right_on: nil,
  on: nil,
  by_left: nil,
  by_right: nil,
  by: nil,
  strategy: "backward",
  suffix: "_right",
  tolerance: nil,
  allow_parallel: true,
  force_parallel: false,
  coalesce: true,
  allow_exact_matches: true,
  check_sortedness: true
)
  lazy
    .join_asof(
      other.lazy,
      left_on: left_on,
      right_on: right_on,
      on: on,
      by_left: by_left,
      by_right: by_right,
      by: by,
      strategy: strategy,
      suffix: suffix,
      tolerance: tolerance,
      allow_parallel: allow_parallel,
      force_parallel: force_parallel,
      coalesce: coalesce,
      allow_exact_matches: allow_exact_matches,
      check_sortedness: check_sortedness
    )
    .collect(no_optimization: true)
end
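A classic asof-join sketch (trades and quotes are hypothetical frames, both sorted by "time"; "2m" uses the duration string language for two minutes):

trades.join_asof(quotes, on: "time", strategy: "backward")
trades.join_asof(quotes, on: "time", tolerance: "2m")  # match only within a tolerance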
#join_where(other, *predicates, suffix: "_right") ⇒ DataFrame
The row order of the input DataFrames is not preserved.
This functionality is experimental. It may be changed at any point without it being considered a breaking change.
Perform a join based on one or multiple (in)equality predicates.
This performs an inner join, so only rows where all predicates are true are included in the result, and a row from either DataFrame may be included multiple times in the result.
# File 'lib/polars/data_frame.rb', line 3235

def join_where(
  other,
  *predicates,
  suffix: "_right"
)
  Utils.require_same_type(self, other)
  lazy
    .join_where(
      other.lazy,
      *predicates,
      suffix: suffix
    )
    .collect(_eager: true)
end
#lazy ⇒ LazyFrame
Start a lazy query from this point.
# File 'lib/polars/data_frame.rb', line 4471

def lazy
  wrap_ldf(_df.lazy)
end
#limit(n = 5) ⇒ DataFrame
Get the first n rows.
Alias for #head.
# File 'lib/polars/data_frame.rb', line 2122

def limit(n = 5)
  head(n)
end
#map_rows(return_dtype: nil, inference_size: 256, &f) ⇒ Object Also known as: apply
The frame-level apply cannot track column names (as the UDF is a black-box that may arbitrarily drop, rearrange, transform, or add new columns); if you want to apply a UDF such that column names are preserved, you should use the expression-level apply syntax instead.
Apply a custom/user-defined function (UDF) over the rows of the DataFrame.
The UDF will receive each row as a tuple of values: udf(row).
Implementing logic using a Ruby function is almost always significantly slower and more memory intensive than implementing the same logic using the native expression API because:
- The native expression engine runs in Rust; UDFs run in Ruby.
- Use of Ruby UDFs forces the DataFrame to be materialized in memory.
- Polars-native expressions can be parallelised (UDFs cannot).
- Polars-native expressions can be logically optimised (UDFs cannot).
Wherever possible you should strongly prefer the native expression API to achieve the best performance.
# File 'lib/polars/data_frame.rb', line 3311

def map_rows(return_dtype: nil, inference_size: 256, &f)
  out, is_df = _df.map_rows(f, return_dtype, inference_size)
  if is_df
    _from_rbdf(out)
  else
    _from_rbdf(Utils.wrap_s(out).to_frame._df)
  end
end
#max ⇒ DataFrame
Aggregate the columns of this DataFrame to their maximum value.
# File 'lib/polars/data_frame.rb', line 4776

def max
  lazy.max.collect(_eager: true)
end
#max_horizontal ⇒ Series
Get the maximum value horizontally across columns.
# File 'lib/polars/data_frame.rb', line 4800

def max_horizontal
  select(max: F.max_horizontal(F.all)).to_series
end
#mean ⇒ DataFrame
Aggregate the columns of this DataFrame to their mean value.
# File 'lib/polars/data_frame.rb', line 4932

def mean
  lazy.mean.collect(_eager: true)
end
#mean_horizontal(ignore_nulls: true) ⇒ Series
Take the mean of all values horizontally across columns.
# File 'lib/polars/data_frame.rb', line 4960

def mean_horizontal(ignore_nulls: true)
  select(
    mean: F.mean_horizontal(F.all, ignore_nulls: ignore_nulls)
  ).to_series
end
#median ⇒ DataFrame
Aggregate the columns of this DataFrame to their median value.
# File 'lib/polars/data_frame.rb', line 5070

def median
  lazy.median.collect(_eager: true)
end
#merge_sorted(other, key) ⇒ DataFrame
Take two sorted DataFrames and merge them by the sorted key.
The output of this operation will also be sorted. It is the caller's responsibility that the frames are sorted by that key, otherwise the output will not make sense.
The schemas of both DataFrames must be equal.
# File 'lib/polars/data_frame.rb', line 6036

def merge_sorted(other, key)
  lazy.merge_sorted(other.lazy, key).collect(_eager: true)
end
#min ⇒ DataFrame
Aggregate the columns of this DataFrame to their minimum value.
# File 'lib/polars/data_frame.rb', line 4826

def min
  lazy.min.collect(_eager: true)
end
#min_horizontal ⇒ Series
Get the minimum value horizontally across columns.
# File 'lib/polars/data_frame.rb', line 4850

def min_horizontal
  select(min: F.min_horizontal(F.all)).to_series
end
#n_chunks(strategy: "first") ⇒ Object
Get number of chunks used by the ChunkedArrays of this DataFrame.
# File 'lib/polars/data_frame.rb', line 4744

def n_chunks(strategy: "first")
  if strategy == "first"
    _df.n_chunks
  elsif strategy == "all"
    get_columns.map(&:n_chunks)
  else
    raise ArgumentError, "Strategy: '#{strategy}' not understood. Choose one of 'first' or 'all'"
  end
end
#n_unique(subset: nil) ⇒ DataFrame
Return the number of unique rows, or the number of unique row-subsets.
# File 'lib/polars/data_frame.rb', line 5249

def n_unique(subset: nil)
  if subset.is_a?(StringIO)
    subset = [Polars.col(subset)]
  elsif subset.is_a?(Expr)
    subset = [subset]
  end

  if subset.is_a?(::Array) && subset.length == 1
    expr = Utils.wrap_expr(Utils.parse_into_expression(subset[0], str_as_lit: false))
  else
    struct_fields = subset.nil? ? Polars.all : subset
    expr = Polars.struct(struct_fields)
  end

  df = lazy.select(expr.n_unique).collect
  df.is_empty ? 0 : df.row(0)[0]
end
#null_count ⇒ DataFrame
Create a new DataFrame that shows the null counts per column.
# File 'lib/polars/data_frame.rb', line 5299

def null_count
  _from_rbdf(_df.null_count)
end
#partition_by(groups, maintain_order: true, include_key: true, as_dict: false) ⇒ Object
Split into multiple DataFrames partitioned by groups.
# File 'lib/polars/data_frame.rb', line 4312

def partition_by(groups, maintain_order: true, include_key: true, as_dict: false)
  if groups.is_a?(::String)
    groups = [groups]
  elsif !groups.is_a?(::Array)
    groups = Array(groups)
  end

  if as_dict
    out = {}
    if groups.length == 1
      _df.partition_by(groups, maintain_order, include_key).each do |df|
        df = _from_rbdf(df)
        out[df[groups][0, 0]] = df
      end
    else
      _df.partition_by(groups, maintain_order, include_key).each do |df|
        df = _from_rbdf(df)
        out[df[groups].row(0)] = df
      end
    end
    out
  else
    _df.partition_by(groups, maintain_order, include_key).map { |df| _from_rbdf(df) }
  end
end
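For example (hypothetical "category" column):

df.partition_by("category")                 # array of DataFrames
df.partition_by("category", as_dict: true)  # hash keyed by group value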
#pipe(func, *args, **kwargs, &block) ⇒ Object
It is recommended to use LazyFrame when piping operations, in order to fully take advantage of query optimization and parallelization. See #lazy.
Offers a structured way to apply a sequence of user-defined functions (UDFs).
# File 'lib/polars/data_frame.rb', line 2315

def pipe(func, *args, **kwargs, &block)
  func.call(self, *args, **kwargs, &block)
end
#pivot(on, index: nil, values: nil, aggregate_function: nil, maintain_order: true, sort_columns: false, separator: "_") ⇒ DataFrame
Create a spreadsheet-style pivot table as a DataFrame.
# File 'lib/polars/data_frame.rb', line 4001

def pivot(
  on,
  index: nil,
  values: nil,
  aggregate_function: nil,
  maintain_order: true,
  sort_columns: false,
  separator: "_"
)
  index = Utils._expand_selectors(self, index)
  on = Utils._expand_selectors(self, on)
  if !values.nil?
    values = Utils._expand_selectors(self, values)
  end

  if aggregate_function.is_a?(::String)
    case aggregate_function
    when "first"
      aggregate_expr = F.element.first._rbexpr
    when "sum"
      aggregate_expr = F.element.sum._rbexpr
    when "max"
      aggregate_expr = F.element.max._rbexpr
    when "min"
      aggregate_expr = F.element.min._rbexpr
    when "mean"
      aggregate_expr = F.element.mean._rbexpr
    when "median"
      aggregate_expr = F.element.median._rbexpr
    when "last"
      aggregate_expr = F.element.last._rbexpr
    when "len"
      aggregate_expr = F.len._rbexpr
    when "count"
      warn "`aggregate_function: \"count\"` input for `pivot` is deprecated. Use `aggregate_function: \"len\"` instead."
      aggregate_expr = F.len._rbexpr
    else
      raise ArgumentError, "Argument aggregate fn: '#{aggregate_function}' was not expected."
    end
  elsif aggregate_function.nil?
    aggregate_expr = nil
  else
    aggregate_expr = aggregate_function._rbexpr
  end

  _from_rbdf(
    _df.pivot_expr(
      on,
      index,
      values,
      maintain_order,
      sort_columns,
      aggregate_expr,
      separator
    )
  )
end
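An illustrative pivot (hypothetical columns):

df.pivot("month", index: "product", values: "revenue", aggregate_function: "sum")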
#plot(x = nil, y = nil, type: nil, group: nil, stacked: nil) ⇒ Vega::LiteChart Originally defined in module Plot
Plot data.
#product ⇒ DataFrame
Aggregate the columns of this DataFrame to their product values.
# File 'lib/polars/data_frame.rb', line 5096

def product
  select(Polars.all.product)
end
#quantile(quantile, interpolation: "nearest") ⇒ DataFrame
Aggregate the columns of this DataFrame to their quantile value.
# File 'lib/polars/data_frame.rb', line 5127

def quantile(quantile, interpolation: "nearest")
  lazy.quantile(quantile, interpolation: interpolation).collect(_eager: true)
end
#rechunk ⇒ DataFrame
Rechunk the data in this DataFrame to a contiguous allocation. This will make sure all subsequent operations have optimal and predictable performance.
# File 'lib/polars/data_frame.rb', line 5273

def rechunk
  _from_rbdf(_df.rechunk)
end
#remove(*predicates, **constraints) ⇒ DataFrame
Remove rows, dropping those that match the given predicate expression(s).
The original order of the remaining rows is preserved.
Rows where the filter predicate does not evaluate to true are retained (this includes rows where the predicate evaluates as null).
# File 'lib/polars/data_frame.rb', line 1577

def remove(
  *predicates,
  **constraints
)
  lazy
    .remove(*predicates, **constraints)
    .collect(_eager: true)
end
#rename(mapping, strict: true) ⇒ DataFrame
Rename column names.
# File 'lib/polars/data_frame.rb', line 1364

def rename(mapping, strict: true)
  lazy.rename(mapping, strict: strict).collect(no_optimization: true)
end
#replace(column, new_col) ⇒ DataFrame
Replace a column by a new Series.
# File 'lib/polars/data_frame.rb', line 2055

def replace(column, new_col)
  _df.replace(column.to_s, new_col._s)
  self
end
#replace_column(index, series) ⇒ DataFrame Also known as: replace_at_idx
Replace a column at an index location.
# File 'lib/polars/data_frame.rb', line 1704

def replace_column(index, series)
  if index < 0
    index = columns.length + index
  end
  _df.replace_column(index, series._s)
  self
end
#reverse ⇒ DataFrame
Reverse the DataFrame.
# File 'lib/polars/data_frame.rb', line 1329

def reverse
  select(Polars.col("*").reverse)
end
#rolling(index_column:, period:, offset: nil, closed: "right", by: nil) ⇒ RollingGroupBy Also known as: groupby_rolling, group_by_rolling
Create rolling groups based on a time column.
Also works for index values of type :i32 or :i64.
Different from a dynamic_group_by, the windows are now determined by the individual values and are not of constant intervals. For constant intervals use group_by_dynamic.
The period and offset arguments are created either from a timedelta, or by using the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a group_by_rolling on an integer column, the windows are defined by:
- "1i" # length 1
- "10i" # length 10
# File 'lib/polars/data_frame.rb', line 2480

def rolling(
  index_column:,
  period:,
  offset: nil,
  closed: "right",
  by: nil
)
  RollingGroupBy.new(self, index_column, period, offset, closed, by)
end
#row(index = nil, by_predicate: nil, named: false) ⇒ Object
The index and by_predicate params are mutually exclusive. Additionally, to ensure clarity, the by_predicate parameter must be supplied by keyword.
When using by_predicate it is an error condition if anything other than one row is returned; more than one row raises TooManyRowsReturned, and zero rows will raise NoRowsReturned (both inherit from RowsException).
Get a row as tuple, either by index or by predicate.
# File 'lib/polars/data_frame.rb', line 5494

def row(index = nil, by_predicate: nil, named: false)
  if !index.nil? && !by_predicate.nil?
    raise ArgumentError, "Cannot set both 'index' and 'by_predicate'; mutually exclusive"
  elsif index.is_a?(Expr)
    raise TypeError, "Expressions should be passed to the 'by_predicate' param"
  end

  if !index.nil?
    row = _df.row_tuple(index)
    if named
      columns.zip(row).to_h
    else
      row
    end
  elsif !by_predicate.nil?
    if !by_predicate.is_a?(Expr)
      raise TypeError, "Expected by_predicate to be an expression; found #{by_predicate.class.name}"
    end
    rows = filter(by_predicate).rows
    n_rows = rows.length
    if n_rows > 1
      raise TooManyRowsReturned, "Predicate #{by_predicate} returned #{n_rows} rows"
    elsif n_rows == 0
      raise NoRowsReturned, "Predicate #{by_predicate} returned no rows"
    end
    row = rows[0]
    if named
      columns.zip(row).to_h
    else
      row
    end
  else
    raise ArgumentError, "One of 'index' or 'by_predicate' must be set"
  end
end
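Illustrative lookups (hypothetical data):

df.row(0)                                     # first row as an array
df.row(0, named: true)                        # hash keyed by column name
df.row(by_predicate: Polars.col("id") == 42)  # exactly one matching row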
#rows(named: false) ⇒ Array
Convert columnar data to rows as Ruby arrays.
# File 'lib/polars/data_frame.rb', line 5551

def rows(named: false)
  if named
    columns = self.columns
    _df.row_tuples.map do |v|
      columns.zip(v).to_h
    end
  else
    _df.row_tuples
  end
end
#rows_by_key(key, named: false, include_key: false, unique: false) ⇒ Hash
Convert columnar data to rows as Ruby arrays in a hash keyed by some column.
This method is like rows, but instead of returning rows in a flat list, rows are grouped by the values in the key column(s) and returned as a hash.
Note that this method should not be used in place of native operations, due to the high cost of materializing all frame data out into a hash; it should be used only when you need to move the values out into a Ruby data structure or other object that cannot operate directly with Polars/Arrow.
# File 'lib/polars/data_frame.rb', line 5618

def rows_by_key(key, named: false, include_key: false, unique: false)
  key = Utils._expand_selectors(self, key)

  keys = key.size == 1 ? get_column(key[0]) : select(key).iter_rows

  if include_key
    values = self
  else
    data_cols = schema.keys - key
    values = select(data_cols)
  end

  zipped = keys.each.zip(values.iter_rows(named: named))

  # if unique, we expect to write just one entry per key; otherwise, we're
  # returning a list of rows for each key, so append into a hash of arrays.
  if unique
    zipped.to_h
  else
    zipped.each_with_object({}) { |(key, data), h| (h[key] ||= []) << data }
  end
end
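For example (hypothetical "id" key column):

df.rows_by_key("id")                             # id => array of rows
df.rows_by_key("id", named: true, unique: true)  # id => single row hash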
#sample(n: nil, frac: nil, with_replacement: false, shuffle: false, seed: nil) ⇒ DataFrame
Sample from this DataFrame.
# File 'lib/polars/data_frame.rb', line 5339

def sample(
  n: nil,
  frac: nil,
  with_replacement: false,
  shuffle: false,
  seed: nil
)
  if !n.nil? && !frac.nil?
    raise ArgumentError, "cannot specify both `n` and `frac`"
  end

  if n.nil? && !frac.nil?
    frac = Series.new("frac", [frac]) unless frac.is_a?(Series)
    return _from_rbdf(
      _df.sample_frac(frac._s, with_replacement, shuffle, seed)
    )
  end

  if n.nil?
    n = 1
  end
  n = Series.new("", [n]) unless n.is_a?(Series)
  _from_rbdf(_df.sample_n(n._s, with_replacement, shuffle, seed))
end
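Illustrative calls:

df.sample(n: 3, seed: 42)            # three rows, reproducibly
df.sample(frac: 0.5, shuffle: true)  # half of the rows, shuffled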
#schema ⇒ Hash
Get the schema.
# File 'lib/polars/data_frame.rb', line 211

def schema
  columns.zip(dtypes).to_h
end
#select(*exprs, **named_exprs) ⇒ DataFrame
Select columns from this DataFrame.
# File 'lib/polars/data_frame.rb', line 4563

def select(*exprs, **named_exprs)
  lazy.select(*exprs, **named_exprs).collect(_eager: true)
end
#select_seq(*exprs, **named_exprs) ⇒ DataFrame
Select columns from this DataFrame.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
# File 'lib/polars/data_frame.rb', line 4581

def select_seq(*exprs, **named_exprs)
  lazy
    .select_seq(*exprs, **named_exprs)
    .collect(_eager: true)
end
#set_sorted(column, descending: false) ⇒ DataFrame
This can lead to incorrect results if the data is NOT sorted! Use with care!
Flag a column as sorted.
This can speed up future operations.
# File 'lib/polars/data_frame.rb', line 6053

def set_sorted(
  column,
  descending: false
)
  lazy
    .set_sorted(column, descending: descending)
    .collect(no_optimization: true)
end
#shape ⇒ Array
Get the shape of the DataFrame.
# File 'lib/polars/data_frame.rb', line 90

def shape
  _df.shape
end
#shift(n, fill_value: nil) ⇒ DataFrame
Shift values by the given period.
# File 'lib/polars/data_frame.rb', line 4381

def shift(n, fill_value: nil)
  lazy.shift(n, fill_value: fill_value).collect(_eager: true)
end
#shift_and_fill(periods, fill_value) ⇒ DataFrame
Shift the values by a given period and fill the resulting null values.
# File 'lib/polars/data_frame.rb', line 4414

def shift_and_fill(periods, fill_value)
  shift(periods, fill_value: fill_value)
end
#shrink_to_fit(in_place: false) ⇒ DataFrame
Shrink DataFrame memory usage.
Shrinks to fit the exact capacity needed to hold the data.
# File 'lib/polars/data_frame.rb', line 5809

def shrink_to_fit(in_place: false)
  if in_place
    _df.shrink_to_fit
    self
  else
    df = clone
    df._df.shrink_to_fit
    df
  end
end
#slice(offset, length = nil) ⇒ DataFrame
Get a slice of this DataFrame.
# File 'lib/polars/data_frame.rb', line 2089

def slice(offset, length = nil)
  if !length.nil? && length < 0
    length = height - offset + length
  end
  _from_rbdf(_df.slice(offset, length))
end
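Note the negative-length convention (hypothetical frame):

  df = Polars::DataFrame.new({"a" => [1, 2, 3, 4]})
  df.slice(1, 2)   # two rows starting at offset 1
  df.slice(1, -1)  # negative length: stop one row short of the end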
#sort(by, reverse: false, nulls_last: false) ⇒ DataFrame
Sort the DataFrame by column.
# File 'lib/polars/data_frame.rb', line 1761

def sort(by, reverse: false, nulls_last: false)
  lazy
    .sort(by, reverse: reverse, nulls_last: nulls_last)
    .collect(no_optimization: true)
end
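For example:

  df = Polars::DataFrame.new({"a" => [3, 1, nil]})
  df.sort("a")                                   # nulls placed first by default
  df.sort("a", reverse: true, nulls_last: true)  # descending, nulls at the end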
#sort!(by, reverse: false, nulls_last: false) ⇒ DataFrame
Sort the DataFrame by column in-place.
# File 'lib/polars/data_frame.rb', line 1777

def sort!(by, reverse: false, nulls_last: false)
  self._df = sort(by, reverse: reverse, nulls_last: nulls_last)._df
end
#sql(query, table_name: "self") ⇒ DataFrame
This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.
- The calling frame is automatically registered as a table in the SQL context under the name "self". If you want access to the DataFrames and LazyFrames found in the current scope, use the top-level Polars.sql.
- More control over registration and execution behaviour is available by using the SQLContext object.
- The SQL query executes in lazy mode before being collected and returned as a DataFrame.
Execute a SQL query against the DataFrame.
# File 'lib/polars/data_frame.rb', line 1849

def sql(query, table_name: "self")
  ctx = SQLContext.new(eager_execution: true)
  name = table_name || "self"
  ctx.register(name, self)
  ctx.execute(query)
end
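A usage sketch (the frame is addressable as "self" unless table_name is changed):

  df = Polars::DataFrame.new({"a" => [1, 2, 3], "b" => ["x", "y", "z"]})
  df.sql("SELECT b, a * 2 AS a2 FROM self WHERE a >= 2")
  df.sql("SELECT COUNT(*) AS n FROM frame", table_name: "frame")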
#std(ddof: 1) ⇒ DataFrame
Aggregate the columns of this DataFrame to their standard deviation value.
# File 'lib/polars/data_frame.rb', line 5003

def std(ddof: 1)
  lazy.std(ddof: ddof).collect(_eager: true)
end
#sum ⇒ DataFrame
Aggregate the columns of this DataFrame to their sum value.
# File 'lib/polars/data_frame.rb', line 4876

def sum
  lazy.sum.collect(_eager: true)
end
#sum_horizontal(ignore_nulls: true) ⇒ Series
Sum all values horizontally across columns.
# File 'lib/polars/data_frame.rb', line 4904

def sum_horizontal(ignore_nulls: true)
  select(
    sum: F.sum_horizontal(F.all, ignore_nulls: ignore_nulls)
  ).to_series
end
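For example (the result Series is named "sum"; comments indicative):

  df = Polars::DataFrame.new({"a" => [1, 2], "b" => [10, nil]})
  df.sum_horizontal                       # => Series "sum": [11, 2]
  df.sum_horizontal(ignore_nulls: false)  # => Series "sum": [11, nil]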
#tail(n = 5) ⇒ DataFrame
Get the last n rows.
# File 'lib/polars/data_frame.rb', line 2184

def tail(n = 5)
  _from_rbdf(_df.tail(n))
end
#to_a ⇒ Array
Returns an array representing the DataFrame.
# File 'lib/polars/data_frame.rb', line 328

def to_a
  rows(named: true)
end
#to_csv(**options) ⇒ String
Write to comma-separated values (CSV) string.
# File 'lib/polars/data_frame.rb', line 819

def to_csv(**options)
  write_csv(**options)
end
#to_dummies(columns: nil, separator: "_", drop_first: false, drop_nulls: false) ⇒ DataFrame
Get one hot encoded dummy variables.
# File 'lib/polars/data_frame.rb', line 5164

def to_dummies(columns: nil, separator: "_", drop_first: false, drop_nulls: false)
  if columns.is_a?(::String)
    columns = [columns]
  end
  _from_rbdf(_df.to_dummies(columns, separator, drop_first, drop_nulls))
end
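A sketch of the encoding (hypothetical data; resulting column names follow the separator convention):

  df = Polars::DataFrame.new({"color" => ["red", "blue", "red"]})
  df.to_dummies                    # indicator columns "color_blue", "color_red"
  df.to_dummies(drop_first: true)  # drop the first category per column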
#to_h(as_series: true) ⇒ Hash
Convert DataFrame to a hash mapping column name to values.
# File 'lib/polars/data_frame.rb', line 555

def to_h(as_series: true)
  if as_series
    get_columns.to_h { |s| [s.name, s] }
  else
    get_columns.to_h { |s| [s.name, s.to_a] }
  end
end
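For example:

  df = Polars::DataFrame.new({"a" => [1, 2], "b" => ["x", "y"]})
  df.to_h(as_series: false)  # => {"a" => [1, 2], "b" => ["x", "y"]}
  df.to_h                    # same keys, values are Polars::Series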
#to_hashes ⇒ Array
Convert every row to a hash.
Note that this is slow.
# File 'lib/polars/data_frame.rb', line 574

def to_hashes
  rbdf = _df
  names = columns

  height.times.map do |i|
    names.zip(rbdf.row_tuple(i)).to_h
  end
end
#to_numo ⇒ Numo::NArray
Convert DataFrame to a 2D Numo array.
This operation clones data.
# File 'lib/polars/data_frame.rb', line 595

def to_numo
  out = _df.to_numo
  if out.nil?
    Numo::NArray.vstack(width.times.map { |i| to_series(i).to_numo }).transpose
  else
    out
  end
end
#to_s ⇒ String Also known as: inspect
Returns a string representing the DataFrame.
# File 'lib/polars/data_frame.rb', line 320

def to_s
  _df.to_s
end
#to_series(index = 0) ⇒ Series
Select column as Series at index location.
# File 'lib/polars/data_frame.rb', line 630

def to_series(index = 0)
  if index < 0
    index = columns.length + index
  end
  Utils.wrap_s(_df.select_at_idx(index))
end
#to_struct(name) ⇒ Series
Convert a DataFrame to a Series of type Struct.
# File 'lib/polars/data_frame.rb', line 5951

def to_struct(name)
  Utils.wrap_s(_df.to_struct(name))
end
#top_k(k, by:, reverse: false) ⇒ DataFrame
Return the k largest rows.
Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort after this function if you wish the output to be sorted.
# File 'lib/polars/data_frame.rb', line 1910

def top_k(
  k,
  by:,
  reverse: false
)
  lazy
    .top_k(k, by: by, reverse: reverse)
    .collect(
      # optimizations=QueryOptFlags(
      #   projection_pushdown=False,
      #   predicate_pushdown=False,
      #   comm_subplan_elim=False,
      #   slice_pushdown=True
      # )
    )
end
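For example (hypothetical frame; remember the output order is not guaranteed):

  df = Polars::DataFrame.new({"a" => [2, 5, 1, 4]})
  df.top_k(2, by: "a")                 # the rows with a = 5 and a = 4
  df.top_k(2, by: "a", reverse: true)  # the smallest rows instead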
#transpose(include_header: false, header_name: "column", column_names: nil) ⇒ DataFrame
This is a very expensive operation. Perhaps you can do it differently.
Transpose a DataFrame over the diagonal.
# File 'lib/polars/data_frame.rb', line 1301

def transpose(include_header: false, header_name: "column", column_names: nil)
  keep_names_as = include_header ? header_name : nil
  _from_rbdf(_df.transpose(keep_names_as, column_names))
end
#unique(maintain_order: true, subset: nil, keep: "first") ⇒ DataFrame
Note that this fails if there is a column of type List in the DataFrame or subset.
Drop duplicate rows from this DataFrame.
# File 'lib/polars/data_frame.rb', line 5209

def unique(maintain_order: true, subset: nil, keep: "first")
  self._from_rbdf(
    lazy
      .unique(maintain_order: maintain_order, subset: subset, keep: keep)
      .collect(no_optimization: true)
      ._df
  )
end
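For example:

  df = Polars::DataFrame.new({"a" => [1, 1, 2], "b" => ["x", "y", "y"]})
  df.unique                             # full-row duplicates only; all 3 rows kept here
  df.unique(subset: "a", keep: "last")  # one row per value of "a"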
#unnest(names) ⇒ DataFrame
Decompose a struct into its fields.
The fields will be inserted into the DataFrame at the location of the struct column.
# File 'lib/polars/data_frame.rb', line 5987

def unnest(names)
  if names.is_a?(::String)
    names = [names]
  end
  _from_rbdf(_df.unnest(names))
end
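A round-trip sketch, assuming Polars.struct for building the struct column:

  df = Polars::DataFrame.new({"a" => [1, 2], "b" => ["x", "y"]})
  nested = df.select(Polars.struct(["a", "b"]).alias("s"))
  nested.unnest("s")  # back to separate "a" and "b" columns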
#unpivot(on, index: nil, variable_name: nil, value_name: nil) ⇒ DataFrame Also known as: melt
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are "unpivoted" to the row axis leaving just two non-identifier columns, 'variable' and 'value'.
# File 'lib/polars/data_frame.rb', line 4103

def unpivot(on, index: nil, variable_name: nil, value_name: nil)
  on = on.nil? ? [] : Utils._expand_selectors(self, on)
  index = index.nil? ? [] : Utils._expand_selectors(self, index)

  _from_rbdf(_df.unpivot(on, index, value_name, variable_name))
end
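For example (hypothetical frame; comments indicative):

  df = Polars::DataFrame.new({"id" => [1, 2], "x" => [3, 4], "y" => [5, 6]})
  df.unpivot(["x", "y"], index: "id")
  # columns: "id", "variable" ("x"/"y"), "value" (3, 4, 5, 6)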
#unstack(step:, how: "vertical", columns: nil, fill_values: nil) ⇒ DataFrame
This functionality is experimental and may be subject to changes without it being considered a breaking change.
Unstack a long table to a wide form without doing an aggregation.
This can be much faster than a pivot, because it can skip the grouping phase.
# File 'lib/polars/data_frame.rb', line 4182

def unstack(step:, how: "vertical", columns: nil, fill_values: nil)
  if !columns.nil?
    df = select(columns)
  else
    df = self
  end

  height = df.height
  if how == "vertical"
    n_rows = step
    n_cols = (height / n_rows.to_f).ceil
  else
    n_cols = step
    n_rows = (height / n_cols.to_f).ceil
  end

  n_fill = n_cols * n_rows - height

  if n_fill > 0
    if !fill_values.is_a?(::Array)
      fill_values = [fill_values] * df.width
    end

    df = df.select(
      df.get_columns.zip(fill_values).map do |s, next_fill|
        s.extend_constant(next_fill, n_fill)
      end
    )
  end

  if how == "horizontal"
    df = (
      df.with_column(
        (Polars.arange(0, n_cols * n_rows, eager: true) % n_cols).alias(
          "__sort_order"
        )
      )
      .sort("__sort_order")
      .drop("__sort_order")
    )
  end

  zfill_val = Math.log10(n_cols).floor + 1
  slices =
    df.get_columns.flat_map do |s|
      n_cols.times.map do |slice_nbr|
        s.slice(slice_nbr * n_rows, n_rows).alias("%s_%0#{zfill_val}d" % [s.name, slice_nbr])
      end
    end

  _from_rbdf(DataFrame.new(slices)._df)
end
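A small sketch of the vertical layout (hypothetical data; comments indicative):

  df = Polars::DataFrame.new({"x" => ["A", "B", "C", "D"]})
  df.unstack(step: 2, how: "vertical")
  # columns: "x_0" => ["A", "B"], "x_1" => ["C", "D"]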
#update(other, on: nil, how: "left", left_on: nil, right_on: nil, include_nulls: false, maintain_order: "left") ⇒ DataFrame
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
This is syntactic sugar for a left/inner join that preserves the order of the left DataFrame by default, with an optional coalesce when include_nulls: false.
Update the values in this DataFrame with the values in other.
# File 'lib/polars/data_frame.rb', line 6170

def update(
  other,
  on: nil,
  how: "left",
  left_on: nil,
  right_on: nil,
  include_nulls: false,
  maintain_order: "left"
)
  Utils.require_same_type(self, other)

  lazy
    .update(
      other.lazy,
      on: on,
      how: how,
      left_on: left_on,
      right_on: right_on,
      include_nulls: include_nulls,
      maintain_order: maintain_order
    )
    .collect(_eager: true)
end
#upsample(time_column:, every:, by: nil, maintain_order: false) ⇒ DataFrame
Upsample a DataFrame at a regular frequency.
The every argument is created with the following string language:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 day)
- 1w (1 week)
- 1mo (1 calendar month)
- 1y (1 calendar year)
- 1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
# File 'lib/polars/data_frame.rb', line 2828

def upsample(
  time_column:,
  every:,
  by: nil,
  maintain_order: false
)
  if by.nil?
    by = []
  end
  if by.is_a?(::String)
    by = [by]
  end

  every = Utils.parse_as_duration_string(every)

  _from_rbdf(
    _df.upsample(by, time_column, every, maintain_order)
  )
end
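A sketch using a date column (the column must be sorted; set_sorted marks it as such):

  require "date"

  df = Polars::DataFrame.new({
    "time" => [Date.new(2021, 1, 1), Date.new(2021, 1, 4)],
    "value" => [1, 4]
  }).set_sorted("time")
  df.upsample(time_column: "time", every: "1d")
    .fill_null(strategy: "forward")  # fill the rows inserted by the upsample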
#var(ddof: 1) ⇒ DataFrame
Aggregate the columns of this DataFrame to their variance value.
# File 'lib/polars/data_frame.rb', line 5044

def var(ddof: 1)
  lazy.var(ddof: ddof).collect(_eager: true)
end
#vstack(df, in_place: false) ⇒ DataFrame
Grow this DataFrame vertically by stacking a DataFrame to it.
# File 'lib/polars/data_frame.rb', line 3446

def vstack(df, in_place: false)
  if in_place
    _df.vstack_mut(df._df)
    self
  else
    _from_rbdf(_df.vstack(df._df))
  end
end
#width ⇒ Integer
Get the width of the DataFrame.
# File 'lib/polars/data_frame.rb', line 117

def width
  _df.width
end
#with_column(column) ⇒ DataFrame
Return a new DataFrame with the column added or replaced.
# File 'lib/polars/data_frame.rb', line 3361

def with_column(column)
  lazy
    .with_column(column)
    .collect(no_optimization: true, string_cache: false)
end
#with_columns(*exprs, **named_exprs) ⇒ DataFrame
Add columns to this DataFrame.
Added columns will replace existing columns with the same name.
# File 'lib/polars/data_frame.rb', line 4695

def with_columns(*exprs, **named_exprs)
  lazy.with_columns(*exprs, **named_exprs).collect(_eager: true)
end
#with_columns_seq(*exprs, **named_exprs) ⇒ DataFrame
Add columns to this DataFrame.
Added columns will replace existing columns with the same name.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
# File 'lib/polars/data_frame.rb', line 4715

def with_columns_seq(
  *exprs,
  **named_exprs
)
  lazy
    .with_columns_seq(*exprs, **named_exprs)
    .collect(_eager: true)
end
#with_row_index(name: "index", offset: 0) ⇒ DataFrame Also known as: with_row_count
Add a column at index 0 that counts the rows.
# File 'lib/polars/data_frame.rb', line 2347

def with_row_index(name: "index", offset: 0)
  _from_rbdf(_df.with_row_index(name, offset))
end
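For example:

  df = Polars::DataFrame.new({"a" => ["x", "y"]})
  df.with_row_index                         # adds "index" column: 0, 1
  df.with_row_index(name: "id", offset: 1)  # adds "id" column: 1, 2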
#write_avro(file, compression = "uncompressed", name: "") ⇒ nil
Write to Apache Avro file.
# File 'lib/polars/data_frame.rb', line 833

def write_avro(file, compression = "uncompressed", name: "")
  if compression.nil?
    compression = "uncompressed"
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  if name.nil?
    name = ""
  end

  _df.write_avro(file, compression, name)
end
#write_csv(file = nil, include_header: true, sep: ",", quote: '"', batch_size: 1024, datetime_format: nil, date_format: nil, time_format: nil, float_precision: nil, null_value: nil) ⇒ String?
Write to comma-separated values (CSV) file.
# File 'lib/polars/data_frame.rb', line 759

def write_csv(
  file = nil,
  include_header: true,
  sep: ",",
  quote: '"',
  batch_size: 1024,
  datetime_format: nil,
  date_format: nil,
  time_format: nil,
  float_precision: nil,
  null_value: nil
)
  if sep.length > 1
    raise ArgumentError, "only single byte separator is allowed"
  elsif quote.length > 1
    raise ArgumentError, "only single byte quote char is allowed"
  elsif null_value == ""
    null_value = nil
  end

  if file.nil?
    buffer = StringIO.new
    buffer.set_encoding(Encoding::BINARY)
    _df.write_csv(
      buffer,
      include_header,
      sep.ord,
      quote.ord,
      batch_size,
      datetime_format,
      date_format,
      time_format,
      float_precision,
      null_value
    )
    return buffer.string.force_encoding(Encoding::UTF_8)
  end

  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  _df.write_csv(
    file,
    include_header,
    sep.ord,
    quote.ord,
    batch_size,
    datetime_format,
    date_format,
    time_format,
    float_precision,
    null_value
  )
  nil
end
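A usage sketch (hypothetical paths):

  df = Polars::DataFrame.new({"a" => [1, 2], "b" => ["x", "y"]})
  df.write_csv("data.csv")  # write to a file, returns nil
  csv = df.write_csv        # no file given: returns the CSV as a String
  df.write_csv("data.tsv", sep: "\t", include_header: false)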
#write_database(table_name, connection = nil, if_table_exists: "fail") ⇒ Integer
This functionality is experimental. It may be changed at any point without it being considered a breaking change.
Write the data in a Polars DataFrame to a database.
# File 'lib/polars/data_frame.rb', line 1036

def write_database(table_name, connection = nil, if_table_exists: "fail")
  if !defined?(ActiveRecord)
    raise Error, "Active Record not available"
  elsif ActiveRecord::VERSION::MAJOR < 7
    raise Error, "Requires Active Record 7+"
  end

  valid_write_modes = ["append", "replace", "fail"]
  if !valid_write_modes.include?(if_table_exists)
    msg = "write_database `if_table_exists` must be one of #{valid_write_modes.inspect}, got #{if_table_exists.inspect}"
    raise ArgumentError, msg
  end

  with_connection(connection) do |connection|
    table_exists = connection.table_exists?(table_name)
    if table_exists && if_table_exists == "fail"
      raise ArgumentError, "Table already exists"
    end
    create_table = !table_exists || if_table_exists == "replace"

    maybe_transaction(connection, create_table) do
      if create_table
        mysql = connection.adapter_name.match?(/mysql|trilogy/i)
        force = if_table_exists == "replace"
        connection.create_table(table_name, id: false, force: force) do |t|
          schema.each do |c, dtype|
            options = {}
            column_type =
              case dtype
              when Binary
                :binary
              when Boolean
                :boolean
              when Date
                :date
              when Datetime
                :datetime
              when Decimal
                if mysql
                  options[:precision] = dtype.precision || 65
                  options[:scale] = dtype.scale || 30
                end
                :decimal
              when Float32
                options[:limit] = 24
                :float
              when Float64
                options[:limit] = 53
                :float
              when Int8
                options[:limit] = 1
                :integer
              when Int16
                options[:limit] = 2
                :integer
              when Int32
                options[:limit] = 4
                :integer
              when Int64
                options[:limit] = 8
                :integer
              when UInt8
                if mysql
                  options[:limit] = 1
                  options[:unsigned] = true
                else
                  options[:limit] = 2
                end
                :integer
              when UInt16
                if mysql
                  options[:limit] = 2
                  options[:unsigned] = true
                else
                  options[:limit] = 4
                end
                :integer
              when UInt32
                if mysql
                  options[:limit] = 4
                  options[:unsigned] = true
                else
                  options[:limit] = 8
                end
                :integer
              when UInt64
                if mysql
                  options[:limit] = 8
                  options[:unsigned] = true
                  :integer
                else
                  options[:precision] = 20
                  options[:scale] = 0
                  :decimal
                end
              when String
                :text
              when Time
                :time
              else
                raise ArgumentError, "column type not supported yet: #{dtype}"
              end
            t.column c, column_type, **options
          end
        end
      end

      quoted_table = connection.quote_table_name(table_name)
      quoted_columns = columns.map { |c| connection.quote_column_name(c) }
      rows =
        cast({Polars::UInt64 => Polars::String})
          .rows(named: false)
          .map { |row| "(#{row.map { |v| connection.quote(v) }.join(", ")})" }
      connection.exec_update("INSERT INTO #{quoted_table} (#{quoted_columns.join(", ")}) VALUES #{rows.join(", ")}")
    end
  end
end
#write_delta(target, mode: "error", storage_options: nil, delta_write_options: nil, delta_merge_options: nil) ⇒ nil
Write DataFrame as delta table.
# File 'lib/polars/data_frame.rb', line 1165

def write_delta(
  target,
  mode: "error",
  storage_options: nil,
  delta_write_options: nil,
  delta_merge_options: nil
)
  Polars.send(:_check_if_delta_available)

  if Utils.pathlike?(target)
    target = Polars.send(:_resolve_delta_lake_uri, target.to_s, strict: false)
  end

  data = self

  if mode == "merge"
    if delta_merge_options.nil?
      msg = "You need to pass delta_merge_options with at least a given predicate for `MERGE` to work."
      raise ArgumentError, msg
    end
    if target.is_a?(::String)
      dt = DeltaLake::Table.new(target, storage_options: storage_options)
    else
      dt = target
    end

    predicate = delta_merge_options.delete(:predicate)
    dt.merge(data, predicate, **delta_merge_options)
  else
    delta_write_options ||= {}

    DeltaLake.write(
      target,
      data,
      mode: mode,
      storage_options: storage_options,
      **delta_write_options
    )
  end
end
#write_ipc(file, compression: "uncompressed", compat_level: nil, storage_options: nil, retries: 2) ⇒ nil
Write to Arrow IPC binary stream or Feather file.
# File 'lib/polars/data_frame.rb', line 873

def write_ipc(
  file,
  compression: "uncompressed",
  compat_level: nil,
  storage_options: nil,
  retries: 2
)
  return_bytes = file.nil?
  if return_bytes
    file = StringIO.new
    file.set_encoding(Encoding::BINARY)
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  if compat_level.nil?
    compat_level = true
  end

  if compression.nil?
    compression = "uncompressed"
  end

  if storage_options&.any?
    storage_options = storage_options.to_a
  else
    storage_options = nil
  end

  _df.write_ipc(file, compression, compat_level, storage_options, retries)
  return_bytes ? file.string : nil
end
#write_ipc_stream(file, compression: "uncompressed", compat_level: nil) ⇒ Object
Write to Arrow IPC record batch stream.
See "Streaming format" in https://arrow.apache.org/docs/python/ipc.html.
# File 'lib/polars/data_frame.rb', line 931

def write_ipc_stream(
  file,
  compression: "uncompressed",
  compat_level: nil
)
  return_bytes = file.nil?
  if return_bytes
    file = StringIO.new
    file.set_encoding(Encoding::BINARY)
  elsif Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  if compat_level.nil?
    compat_level = true
  end

  if compression.nil?
    compression = "uncompressed"
  end

  _df.write_ipc_stream(file, compression, compat_level)
  return_bytes ? file.string : nil
end
#write_json(file = nil) ⇒ nil
Serialize to JSON representation.
# File 'lib/polars/data_frame.rb', line 653

def write_json(file = nil)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  to_string_io = !file.nil? && file.is_a?(StringIO)
  if file.nil? || to_string_io
    buf = StringIO.new
    buf.set_encoding(Encoding::BINARY)
    _df.write_json(buf)
    json_bytes = buf.string

    json_str = json_bytes.force_encoding(Encoding::UTF_8)
    if to_string_io
      file.write(json_str)
    else
      return json_str
    end
  else
    _df.write_json(file)
  end
  nil
end
#write_ndjson(file = nil) ⇒ nil
Serialize to newline delimited JSON representation.
# File 'lib/polars/data_frame.rb', line 692

def write_ndjson(file = nil)
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end
  to_string_io = !file.nil? && file.is_a?(StringIO)
  if file.nil? || to_string_io
    buf = StringIO.new
    buf.set_encoding(Encoding::BINARY)
    _df.write_ndjson(buf)
    json_bytes = buf.string

    json_str = json_bytes.force_encoding(Encoding::UTF_8)
    if to_string_io
      file.write(json_str)
    else
      return json_str
    end
  else
    _df.write_ndjson(file)
  end
  nil
end
#write_parquet(file, compression: "zstd", compression_level: nil, statistics: false, row_group_size: nil, data_page_size: nil) ⇒ nil
Write to Apache Parquet file.
# File 'lib/polars/data_frame.rb', line 980

def write_parquet(
  file,
  compression: "zstd",
  compression_level: nil,
  statistics: false,
  row_group_size: nil,
  data_page_size: nil
)
  if compression.nil?
    compression = "uncompressed"
  end
  if Utils.pathlike?(file)
    file = Utils.normalize_filepath(file)
  end

  if statistics == true
    statistics = {
      min: true,
      max: true,
      distinct_count: false,
      null_count: true
    }
  elsif statistics == false
    statistics = {}
  elsif statistics == "full"
    statistics = {
      min: true,
      max: true,
      distinct_count: true,
      null_count: true
    }
  end

  _df.write_parquet(
    file, compression, compression_level, statistics, row_group_size, data_page_size
  )
end
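A usage sketch (hypothetical path):

  df = Polars::DataFrame.new({"a" => [1, 2, 3]})
  df.write_parquet("data.parquet")  # zstd compression by default
  df.write_parquet("data.parquet", compression: "lz4", statistics: true)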