Module: EnumerableStats::EnumerableExt

Included in: Enumerable

Defined in: lib/enumerable_stats/enumerable_ext.rb

Overview

Extension module that adds statistical methods to all Enumerable objects.

This module provides essential statistical functions including measures of central tendency (mean, median), measures of dispersion (variance, standard deviation), percentile calculations, outlier detection using the IQR method, and statistical comparison methods.

When included, these methods become available on all Ruby collections that include Enumerable (Arrays, Ranges, Sets, etc.), enabling seamless statistical analysis without external dependencies.

Examples:

Basic statistical calculations

[1, 2, 3, 4, 5].mean          #=> 3.0
[1, 2, 3, 4, 5].median        #=> 3
[1, 2, 3, 4, 5].percentile(75) #=> 4

Outlier detection

data = [1, 2, 3, 4, 100]
data.remove_outliers           #=> [1, 2, 3, 4]
data.outlier_stats             #=> { original_count: 5, filtered_count: 4, outliers_removed: 1, outlier_percentage: 20.0 }

Statistical testing

control = [10, 12, 14, 16, 18]
treatment = [15, 17, 19, 21, 23]
control.t_value(treatment)     #=> ~-2.5 (negative: control mean < treatment mean)
control.degrees_of_freedom(treatment) #=> 8.0 (degrees of freedom from Welch's formula)
treatment.greater_than?(control) #=> true (treatment mean significantly > control mean)
control.less_than?(treatment)    #=> true (control mean significantly < treatment mean)

Constant Summary

EPSILON =

Epsilon for floating point comparisons to avoid precision issues.

Since: 0.1.0

1e-10

COMMON_ALPHA_VALUES =

Common alpha levels with their corresponding high-precision z-scores. Used to avoid floating point comparison issues while maintaining backward compatibility.

Since: 0.1.0

{
  0.10 => 1.2815515655446004,
  0.05 => 1.6448536269514722,
  0.025 => 1.9599639845400545,
  0.01 => 2.3263478740408408,
  0.005 => 2.5758293035489004,
  0.001 => 3.0902323061678132
}.freeze
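
The following is a minimal sketch of how a table like this can be combined with EPSILON to make alpha lookups tolerant of floating point noise. It is an illustration only: `lookup_z_score` is a hypothetical helper that does not appear in this documentation, and the library's actual lookup logic is not shown in this section.

```ruby
# Constants copied from the summary above.
EPSILON = 1e-10

COMMON_ALPHA_VALUES = {
  0.10 => 1.2815515655446004,
  0.05 => 1.6448536269514722,
  0.025 => 1.9599639845400545,
  0.01 => 2.3263478740408408,
  0.005 => 2.5758293035489004,
  0.001 => 3.0902323061678132
}.freeze

# Hypothetical helper (an assumption, not the library's API): returns the
# stored z-score whose alpha is within EPSILON of the requested alpha,
# or nil when no common alpha level matches.
def lookup_z_score(alpha)
  match = COMMON_ALPHA_VALUES.find { |known_alpha, _z| (alpha - known_alpha).abs < EPSILON }
  match&.last
end

# An alpha produced by arithmetic is close to, but not exactly, 0.05.
computed_alpha = 0.1 + 0.2 - 0.25

p COMMON_ALPHA_VALUES[computed_alpha] # exact Hash lookup misses the entry
p lookup_z_score(computed_alpha)      # tolerant lookup finds the 0.05 entry
```

An exact Hash lookup depends on bit-for-bit equality of the Float key, which is why a tolerance-based comparison is the safer choice for computed alpha values.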

CORNISH_FISHER_FOURTH_ORDER_DENOMINATOR =

Since: 0.1.0

92_160.0

EDGEWORTH_SMALL_SAMPLE_COEFF =

Since: 0.1.0

4.0

BSM_THRESHOLD =

Since: 0.1.0

1e-20

Instance Method Summary

#<(other, alpha: 0.05) ⇒ Boolean

Operator alias for less_than?. Tests whether this collection’s mean is significantly less than another collection’s mean.

#<=>(other, alpha: 0.05) ⇒ Integer

Compares this collection’s mean with another collection’s mean using one-tailed Welch’s t-tests, returning 1, -1, or 0.

#>(other, alpha: 0.05) ⇒ Boolean

Operator alias for greater_than?. Tests whether this collection’s mean is significantly greater than another collection’s mean.

#degrees_of_freedom(other) ⇒ Float

Calculates the degrees of freedom for comparing two samples using Welch’s formula. This is used in statistical hypothesis testing when sample variances are unequal. The formula accounts for different sample sizes and variances between groups.

#greater_than?(other, alpha: 0.05) ⇒ Boolean

Tests if this collection’s mean is significantly greater than another collection’s mean using a one-tailed Welch’s t-test.

#less_than?(other, alpha: 0.05) ⇒ Boolean

Tests if this collection’s mean is significantly less than another collection’s mean using a one-tailed Welch’s t-test.

#mean ⇒ Float

Calculates the arithmetic mean (average) of the collection.

#median ⇒ Numeric?

Calculates the median (middle value) of the collection. For collections with an even number of elements, returns the average of the two middle values.

#outlier_stats(multiplier: 1.5) ⇒ Hash

Returns statistics about outlier removal for debugging and logging. Reports how many outliers were removed and what percentage of the collection they represent.

#percentage_difference(other) ⇒ Float

Calculates the percentage difference between this collection’s mean and another value or collection’s mean. Uses the symmetric percentage difference formula: |a - b| / ((a + b) / 2) * 100. This is useful for comparing datasets or metrics where direction doesn’t matter.

#percentile(percentile) ⇒ Numeric?

Calculates the specified percentile of the collection. Uses linear interpolation between data points when the exact percentile falls between values. This is equivalent to the “linear” method used by many statistical software packages.

#remove_outliers(multiplier: 1.5) ⇒ Array

Removes extreme outliers using the IQR (Interquartile Range) method. This is particularly effective for performance data, which often has extreme values due to network issues, CPU scheduling, GC pauses, etc.

#signed_percentage_difference(other) ⇒ Float

Calculates the signed percentage difference between this collection’s mean and another value or collection’s mean. Uses the signed percentage difference formula: (a - b) / ((a + b) / 2) * 100. Useful for performance comparisons where direction matters (e.g., improvements vs regressions).

#standard_deviation ⇒ Float

Calculates the sample standard deviation of the collection. Returns the square root of the sample variance.

#t_value(other) ⇒ Float

Calculates the t-statistic for comparing the means of two samples. Uses Welch’s t-test formula, which doesn’t assume equal variances. A larger absolute t-value indicates a greater difference between sample means.

#variance ⇒ Float

Calculates the sample variance of the collection. Uses the unbiased formula with n-1 degrees of freedom (Bessel’s correction).

Instance Method Details

#<(other, alpha: 0.05) ⇒ `Boolean`

Operator alias for less_than?. Tests whether this collection’s mean is significantly less than another collection’s mean.

Examples:

optimized = [85, 95, 90, 100, 80]
baseline = [100, 110, 105, 115, 95]
optimized < baseline  # => true (optimized is significantly less)

Parameters:

other (Enumerable) —

Another collection to compare against
alpha (Float) (defaults to: 0.05) —

The significance level (default: 0.05 for 95% confidence)

Returns:

(Boolean) —

true if this collection’s mean is significantly less

Since:

0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 192

def <(other, alpha: 0.05)
  less_than?(other, alpha: alpha)
end

#<=>(other, alpha: 0.05) ⇒ `Integer`

Compares this collection’s mean with another collection’s mean using one-tailed Welch’s t-tests at the specified alpha level. Returns 1 if this collection’s mean is significantly greater, -1 if it is significantly less, and 0 if neither test indicates a significant difference.

Examples:

control = [10, 12, 11, 13, 12]     # mean = 11.6
treatment = [15, 17, 16, 18, 14]   # mean = 16.0
control <=> treatment              # => -1 (control is significantly less than treatment)
treatment <=> control              # => 1 (treatment is significantly greater than control)
control <=> control                # => 0 (control is not significantly different from itself)

Parameters:

other (Enumerable) —

Another collection to compare against
alpha (Float) (defaults to: 0.05) —

Significance level (default: 0.05 for 95% confidence)

Returns:

(Integer) —

1 if this collection’s mean is significantly greater, -1 if this collection’s mean is significantly less, 0 if this collection’s mean is not significantly different

Since:

0.1.0

# File 'lib/enumerable_stats/enumerable_ext.rb', line 212

def <=>(other, alpha: 0.05)
  if greater_than?(other, alpha: alpha)
    1
  elsif less_than?(other, alpha: alpha)
    -1
  else
    0
  end
end
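
As a rough illustration of the three-way decision above, the sketch below applies the same rule to a precomputed t-statistic. The helper name `three_way_significance` is hypothetical, and the critical value is passed in by hand because `critical_t_value` is not shown in this section.

```ruby
# Hypothetical stand-in for the comparison logic (not the library's API).
# Returns 1, -1, or 0 depending on where the t-statistic falls relative
# to the one-tailed critical value.
def three_way_significance(t_stat, critical_value)
  return 1 if t_stat > critical_value    # significantly greater
  return -1 if t_stat < -critical_value  # significantly less

  0                                      # no significant difference
end

# One-tailed t-table value for 8 degrees of freedom at alpha = 0.05.
critical = 1.860

puts three_way_significance(-2.5, critical) # -1
puts three_way_significance(2.5, critical)  # 1
puts three_way_significance(0.4, critical)  # 0
```

Because each direction is checked with its own one-tailed test at the given alpha, a result of 0 means only that neither test reached significance, not that the means are equal.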

#>(other, alpha: 0.05) ⇒ `Boolean`

Operator alias for greater_than?. Tests whether this collection’s mean is significantly greater than another collection’s mean.

Examples:

baseline = [100, 110, 105, 115, 95]
optimized = [85, 95, 90, 100, 80]
baseline > optimized  # => true (baseline is significantly greater)

Parameters:

other (Enumerable) —

Another collection to compare against
alpha (Float) (defaults to: 0.05) —

The significance level (default: 0.05 for 95% confidence)

Returns:

(Boolean) —

true if this collection’s mean is significantly greater

Since:

0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 179

def >(other, alpha: 0.05)
  greater_than?(other, alpha: alpha)
end

#degrees_of_freedom(other) ⇒ `Float`

Calculates the degrees of freedom for comparing two samples using Welch’s formula. This is used in statistical hypothesis testing when sample variances are unequal. The formula accounts for different sample sizes and variances between groups.

Examples:

sample_a = [10, 12, 14, 16, 18]
sample_b = [5, 15, 25, 35, 45, 55]
df = sample_a.degrees_of_freedom(sample_b)  # => ~5.3

Parameters:

other (Enumerable) —

Another collection to compare against

Returns:

(Float) —

Degrees of freedom for statistical testing

Since:

0.1.0

# File 'lib/enumerable_stats/enumerable_ext.rb', line 118

def degrees_of_freedom(other)
  n1 = variance / count
  n2 = other.variance / other.count

  n = (n1 + n2)**2

  d1 = (variance**2) / ((count**2) * (count - 1))
  d2 = (other.variance**2) / ((other.count**2) * (other.count - 1))

  n / (d1 + d2)
end
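
To make the formula easier to check by hand, here is a standalone sketch that takes the sample variances and sizes directly. The name `welch_degrees_of_freedom` is hypothetical; the arithmetic is an algebraic rearrangement of the listing above, using the squared standard error of each mean.

```ruby
# Hypothetical standalone version of the Welch-Satterthwaite calculation.
# var1/var2 are sample variances, n1/n2 are sample sizes.
def welch_degrees_of_freedom(var1, n1, var2, n2)
  se1 = var1.to_f / n1 # squared standard error of the first mean
  se2 = var2.to_f / n2 # squared standard error of the second mean

  ((se1 + se2)**2) / ((se1**2 / (n1 - 1)) + (se2**2 / (n2 - 1)))
end

# Equal variances and equal sizes give n1 + n2 - 2 degrees of freedom.
puts welch_degrees_of_freedom(10, 5, 10, 5) # 8.0

# Unequal variances pull the result toward the smaller, noisier sample.
puts welch_degrees_of_freedom(10, 5, 350, 6).round(2)
```

The result always lies between the smaller of `n1 - 1` and `n2 - 1` and the pooled value `n1 + n2 - 2`, which is a quick sanity check on any computed value.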

#greater_than?(other, alpha: 0.05) ⇒ `Boolean`

Tests if this collection’s mean is significantly greater than another collection’s mean using a one-tailed Welch’s t-test. Returns true if the test indicates statistical significance at the specified alpha level.

Examples:

control = [10, 12, 11, 13, 12]     # mean ≈ 11.6
treatment = [15, 17, 16, 18, 14]   # mean = 16.0
treatment.greater_than?(control)   # => true (treatment significantly > control)
control.greater_than?(treatment)   # => false

Parameters:

other (Enumerable) —

Another collection to compare against
alpha (Float) (defaults to: 0.05) —

Significance level (default: 0.05 for 95% confidence)

Returns:

(Boolean) —

True if this collection’s mean is significantly greater

Since:

0.1.0

# File 'lib/enumerable_stats/enumerable_ext.rb', line 142

def greater_than?(other, alpha: 0.05)
  t_stat = t_value(other)
  df = degrees_of_freedom(other)
  critical_value = critical_t_value(df, alpha)

  t_stat > critical_value
end

#less_than?(other, alpha: 0.05) ⇒ `Boolean`

Tests if this collection’s mean is significantly less than another collection’s mean using a one-tailed Welch’s t-test. Returns true if the test indicates statistical significance at the specified alpha level.

Examples:

control = [10, 12, 11, 13, 12]     # mean ≈ 11.6
treatment = [15, 17, 16, 18, 14]   # mean = 16.0
control.less_than?(treatment)      # => true (control significantly < treatment)
treatment.less_than?(control)      # => false

Parameters:

other (Enumerable) —

Another collection to compare against
alpha (Float) (defaults to: 0.05) —

Significance level (default: 0.05 for 95% confidence)

Returns:

(Boolean) —

True if this collection’s mean is significantly less

Since:

0.1.0

# File 'lib/enumerable_stats/enumerable_ext.rb', line 162

def less_than?(other, alpha: 0.05)
  t_stat = t_value(other)
  df = degrees_of_freedom(other)
  critical_value = critical_t_value(df, alpha)

  t_stat < -critical_value
end

#mean ⇒ `Float`

Calculates the arithmetic mean (average) of the collection.

Examples:

[1, 2, 3, 4, 5].mean  # => 3.0
(1..10).mean          # => 5.5

Returns:

(Float) —

The arithmetic mean of all numeric values

Since:

0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 228

def mean
  sum / size.to_f
end

#median ⇒ `Numeric`?

Calculates the median (middle value) of the collection. For collections with an even number of elements, returns the average of the two middle values.

Examples:

[1, 2, 3, 4, 5].median        # => 3
[1, 2, 3, 4].median           # => 2.5
[5, 1, 3, 2, 4].median        # => 3 (automatically sorts)
[].median                     # => nil

Returns:

(Numeric, nil) —

The median value, or nil if the collection is empty

Since:

0.1.0

# File 'lib/enumerable_stats/enumerable_ext.rb', line 241

def median
  return nil if size.zero?

  sorted = sort
  midpoint = size / 2

  if size.even?
    sorted[midpoint - 1, 2].sum / 2.0
  else
    sorted[midpoint]
  end
end

#outlier_stats(multiplier: 1.5) ⇒ `Hash`

Returns statistics about outlier removal for debugging and logging. Reports how many outliers were removed and what percentage of the collection they represent.

Examples:

data = [1, 2, 3, 4, 5, 100]
stats = data.outlier_stats
# => {original_count: 6, filtered_count: 5, outliers_removed: 1, outlier_percentage: 16.67}

Parameters:

multiplier (Float) (defaults to: 1.5) —

IQR multiplier for outlier detection (1.5 is standard, 2.0 is more conservative)

Returns:

(Hash) —

Statistics hash containing :original_count, :filtered_count, :outliers_removed, :outlier_percentage

Since:

0.1.0

# File 'lib/enumerable_stats/enumerable_ext.rb', line 382

def outlier_stats(multiplier: 1.5)
  original_count = size
  filtered = remove_outliers(multiplier: multiplier)

  {
    original_count: original_count,
    filtered_count: filtered.size,
    outliers_removed: original_count - filtered.size,
    outlier_percentage: ((original_count - filtered.size).to_f / original_count * 100).round(2)
  }
end

#percentage_difference(other) ⇒ `Float`

Calculates the percentage difference between this collection’s mean and another value or collection’s mean. Uses the symmetric percentage difference formula: |a - b| / ((a + b) / 2) * 100. This is useful for comparing datasets or metrics where direction doesn’t matter.

Parameters:

other (Numeric, Enumerable) —

Value or collection to compare against

Returns:

(Float) —

Absolute percentage difference (always positive)

Since:

0.1.0

# File 'lib/enumerable_stats/enumerable_ext.rb', line 59

def percentage_difference(other)
  a = mean.to_f
  b = other.respond_to?(:mean) ? other.mean.to_f : other.to_f

  return 0.0 if a == b
  return Float::INFINITY if (a + b).zero?

  ((a - b).abs / ((a + b) / 2.0).abs) * 100
end
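
A short worked example of the symmetric formula may help here. The sketch below operates on two plain numbers rather than collections, and `symmetric_percentage_difference` is a hypothetical name used only for illustration.

```ruby
# Hypothetical standalone version of the symmetric formula
# |a - b| / ((a + b) / 2) * 100, applied to two plain numbers.
def symmetric_percentage_difference(a, b)
  return 0.0 if a == b
  return Float::INFINITY if (a + b).zero? # the midpoint is zero, so the ratio is undefined

  ((a - b).abs / ((a + b) / 2.0).abs) * 100
end

# The midpoint of 100 and 110 is 105, so the difference is 10 / 105 * 100.
puts symmetric_percentage_difference(100.0, 110.0).round(2)

# Swapping the arguments gives the same result.
puts symmetric_percentage_difference(110.0, 100.0).round(2)

puts symmetric_percentage_difference(50.0, 50.0)  # 0.0
puts symmetric_percentage_difference(5.0, -5.0)   # Infinity
```

Dividing by the midpoint rather than by either value is what makes the measure independent of argument order.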

#percentile(percentile) ⇒ `Numeric`?

Calculates the specified percentile of the collection. Uses linear interpolation between data points when the exact percentile falls between values. This is equivalent to the “linear” method used by many statistical software packages.

Examples:

[1, 2, 3, 4, 5].percentile(50)    # => 3 (same as median)
[1, 2, 3, 4, 5].percentile(25)    # => 2 (25th percentile)
[1, 2, 3, 4, 5].percentile(75)    # => 4 (75th percentile)
[1, 2, 3, 4, 5].percentile(0)     # => 1 (minimum value)
[1, 2, 3, 4, 5].percentile(100)   # => 5 (maximum value)
[].percentile(50)                 # => nil (empty collection)

Parameters:

percentile (Numeric) —

The percentile to calculate (0-100)

Returns:

(Numeric, nil) —

The value at the specified percentile, or nil if the collection is empty

Raises:

(ArgumentError) —

If percentile is not between 0 and 100

Since:

0.1.0

# File 'lib/enumerable_stats/enumerable_ext.rb', line 268

def percentile(percentile)
  return nil if size.zero?

  unless percentile.is_a?(Numeric) && percentile >= 0 && percentile <= 100
    raise ArgumentError, "Percentile must be a number between 0 and 100, got #{percentile}"
  end

  sorted = sort

  # Handle edge cases
  return sorted.first if percentile.zero?
  return sorted.last if percentile == 100

  # Calculate the position using the "linear" method (R-7/Excel method)
  # This is the most commonly used method in statistical software
  position = (size - 1) * (percentile / 100.0)

  # If position is an integer, return that exact element
  if position == position.floor
    sorted[position.to_i]
  else
    # Linear interpolation between the two surrounding values
    lower_index = position.floor
    upper_index = position.ceil
    weight = position - position.floor

    lower_value = sorted[lower_index]
    upper_value = sorted[upper_index]

    lower_value + (weight * (upper_value - lower_value))
  end
end
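
The interpolation step is easiest to see on data where the position falls between two elements. The following sketch is a simplified standalone version with a hypothetical name, `interpolated_percentile`; it assumes a non-empty collection and a percentile between 0 and 100, and it skips the validation and edge-case shortcuts of the listing above.

```ruby
# Hypothetical simplified version of the "linear" (R-7) percentile.
# Assumes values is non-empty and 0 <= percentile <= 100.
def interpolated_percentile(values, percentile)
  sorted = values.sort
  position = (sorted.size - 1) * (percentile / 100.0) # fractional index into sorted

  lower = sorted[position.floor]
  upper = sorted[position.ceil]
  weight = position - position.floor # how far the position sits past the lower index

  # Unlike the listing above, this always returns a Float, even when the
  # position lands exactly on an element.
  lower + (weight * (upper - lower))
end

# Position is 3 * 0.25 = 0.75, three quarters of the way from 10 to 20.
puts interpolated_percentile([10, 20, 30, 40], 25) # 17.5

# Position is 3 * 0.5 = 1.5, halfway between 20 and 30.
puts interpolated_percentile([10, 20, 30, 40], 50) # 25.0
```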

#remove_outliers(multiplier: 1.5) ⇒ `Array`

Removes extreme outliers using the IQR (Interquartile Range) method. This is particularly effective for performance data, which often has extreme values due to network issues, CPU scheduling, GC pauses, etc.

Parameters:

multiplier (Float) (defaults to: 1.5) —

IQR multiplier (1.5 is standard, 2.0 is more conservative)

Returns:

(Array) —

Array with outliers removed

Since:

0.1.0

# File 'lib/enumerable_stats/enumerable_ext.rb', line 331

def remove_outliers(multiplier: 1.5)
  return self if size < 4 # Need minimum data points for quartiles

  sorted = sort
  n = size

  # Use the standard quartile calculation with interpolation
  # Q1 position = (n-1) * 0.25
  # Q3 position = (n-1) * 0.75
  q1_pos = (n - 1) * 0.25
  q3_pos = (n - 1) * 0.75

  # Calculate Q1
  if q1_pos == q1_pos.floor
    q1 = sorted[q1_pos.to_i]
  else
    lower_index = q1_pos.floor
    upper_index = q1_pos.ceil
    weight = q1_pos - q1_pos.floor
    q1 = sorted[lower_index] + (weight * (sorted[upper_index] - sorted[lower_index]))
  end

  # Calculate Q3
  if q3_pos == q3_pos.floor
    q3 = sorted[q3_pos.to_i]
  else
    lower_index = q3_pos.floor
    upper_index = q3_pos.ceil
    weight = q3_pos - q3_pos.floor
    q3 = sorted[lower_index] + (weight * (sorted[upper_index] - sorted[lower_index]))
  end

  iqr = q3 - q1

  # Calculate bounds
  lower_bound = q1 - (multiplier * iqr)
  upper_bound = q3 + (multiplier * iqr)

  # Filter out outliers
  select { |value| value.between?(lower_bound, upper_bound) }
end
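
The sketch below walks through the same bounds calculation on a small dataset so the intermediate values are visible. `iqr_fences` is a hypothetical helper that assumes at least four values; it mirrors the quartile interpolation in the listing above but is not part of the documented API.

```ruby
# Hypothetical helper: returns [lower_bound, upper_bound] for the IQR method.
# Assumes values has at least four elements.
def iqr_fences(values, multiplier: 1.5)
  sorted = values.sort

  # Interpolated quartile at the given fraction (0.25 for Q1, 0.75 for Q3).
  quartile = lambda do |fraction|
    position = (sorted.size - 1) * fraction
    lower = sorted[position.floor]
    upper = sorted[position.ceil]
    lower + ((position - position.floor) * (upper - lower))
  end

  q1 = quartile.call(0.25)
  q3 = quartile.call(0.75)
  iqr = q3 - q1

  [q1 - (multiplier * iqr), q3 + (multiplier * iqr)]
end

data = [1, 2, 3, 4, 5, 100]

# Q1 = 2.25, Q3 = 4.75, IQR = 2.5, so the bounds are 2.25 - 3.75 and 4.75 + 3.75.
lower_bound, upper_bound = iqr_fences(data)
p [lower_bound, upper_bound] # [-1.5, 8.5]

# Only values inside the bounds are kept.
p data.select { |value| value.between?(lower_bound, upper_bound) } # [1, 2, 3, 4, 5]
```

A larger multiplier widens the bounds and removes fewer values, which is why 2.0 is described as the more conservative setting.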

#signed_percentage_difference(other) ⇒ `Float`

Calculates the signed percentage difference between this collection’s mean and another value or collection’s mean. Uses the signed percentage difference formula: (a - b) / ((a + b) / 2) * 100. Useful for performance comparisons where direction matters (e.g., improvements vs regressions).

Parameters:

other (Numeric, Enumerable) —

Value or collection to compare against

Returns:

(Float) —

Signed percentage difference (positive = this collection is higher, negative = other is higher)

Since:

0.1.0

# File 'lib/enumerable_stats/enumerable_ext.rb', line 75

def signed_percentage_difference(other)
  a = mean.to_f
  b = other.respond_to?(:mean) ? other.mean.to_f : other.to_f

  return 0.0 if a == b
  return Float::INFINITY if (a + b).zero?

  ((a - b) / ((a + b) / 2.0).abs) * 100
end
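
For a performance comparison, the sign shows which side is higher. The sketch below computes the means by hand and applies the signed formula; `signed_difference_of_means` and `mean_of` are hypothetical names, and the sample timings are made up for illustration.

```ruby
# Hypothetical helper: arithmetic mean of a list of numbers.
def mean_of(values)
  values.sum / values.size.to_f
end

# Hypothetical standalone version of (a - b) / ((a + b) / 2) * 100,
# where a and b are the means of the two collections.
def signed_difference_of_means(this_values, other_values)
  a = mean_of(this_values)
  b = mean_of(other_values)

  return 0.0 if a == b
  return Float::INFINITY if (a + b).zero?

  ((a - b) / ((a + b) / 2.0).abs) * 100
end

baseline = [100, 110, 105, 115, 95]  # mean = 105.0
optimized = [85, 95, 90, 100, 80]    # mean = 90.0

# Negative: the optimized timings are lower than the baseline.
puts signed_difference_of_means(optimized, baseline).round(2)

# Positive: the baseline timings are higher than the optimized ones.
puts signed_difference_of_means(baseline, optimized).round(2)
```

Swapping the receiver and the argument flips the sign but keeps the magnitude, so the same number can be read as either an improvement or a regression depending on which collection is treated as the reference.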

#standard_deviation ⇒ `Float`

Calculates the sample standard deviation of the collection. Returns the square root of the sample variance.

Examples:

[1, 2, 3, 4, 5].standard_deviation    # => ~1.58
[5, 5, 5, 5].standard_deviation       # => 0.0

Returns:

(Float) —

The sample standard deviation

Since:

0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 321

def standard_deviation
  Math.sqrt variance
end

#t_value(other) ⇒ `Float`

Calculates the t-statistic for comparing the means of two samples. Uses Welch’s t-test formula, which doesn’t assume equal variances. A larger absolute t-value indicates a greater difference between sample means.

Examples:

control = [10, 12, 11, 13, 12]
treatment = [15, 17, 16, 18, 14]
t_stat = control.t_value(treatment)  # => ~-5.05 (negative means treatment > control)

Parameters:

other (Enumerable) —

Another collection to compare against

Returns:

(Float) —

The t-statistic value (can be positive or negative)

Raises:

(ArgumentError)

Since:

0.1.0

# File 'lib/enumerable_stats/enumerable_ext.rb', line 95

def t_value(other)
  raise ArgumentError, "Cannot compare with an empty collection" if empty? || other.empty?
  raise ArgumentError, "Parameter must be an Enumerable" unless other.respond_to?(:mean)

  signal = (mean - other.mean)
  noise = Math.sqrt(
    ((standard_deviation**2) / count) +
      ((other.standard_deviation**2) / other.count)
  )

  (signal / noise)
end
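
The signal-to-noise structure of the statistic can be reproduced without the extension methods. This is a minimal standalone sketch; `sample_mean`, `sample_variance`, and `welch_t` are hypothetical names, and the code assumes both inputs are arrays with at least two elements.

```ruby
# Hypothetical helper: arithmetic mean.
def sample_mean(values)
  values.sum / values.size.to_f
end

# Hypothetical helper: sample variance with n - 1 in the denominator.
def sample_variance(values)
  m = sample_mean(values)
  values.sum { |v| (v - m)**2 } / (values.size - 1).to_f
end

# Welch's t-statistic: difference of means divided by the combined
# standard error, with no assumption of equal variances.
def welch_t(a, b)
  signal = sample_mean(a) - sample_mean(b)
  noise = Math.sqrt((sample_variance(a) / a.size) + (sample_variance(b) / b.size))
  signal / noise
end

control = [10, 12, 14, 16, 18]   # mean 14.0, variance 10.0
treatment = [15, 17, 19, 21, 23] # mean 19.0, variance 10.0

# signal = -5.0, noise = sqrt(2.0 + 2.0) = 2.0
puts welch_t(control, treatment) # -2.5
puts welch_t(treatment, control) # 2.5
```

Swapping the two samples only flips the sign, which is why the one-tailed tests above can share the same statistic and differ only in the direction of the comparison.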

#variance ⇒ `Float`

Calculates the sample variance of the collection. Uses the unbiased formula with n-1 degrees of freedom (Bessel’s correction).

Examples:

[1, 2, 3, 4, 5].variance      # => 2.5
[5, 5, 5, 5].variance         # => 0.0 (no variation)

Returns:

(Float) —

The sample variance

Since:

0.1.0

# File 'lib/enumerable_stats/enumerable_ext.rb', line 308

def variance
  mean = self.mean
  sum_of_squares = sum { |r| (r - mean)**2 }
  sum_of_squares / (count - 1).to_f
end
