Module: EnumerableStats::EnumerableExt

Included in:
Enumerable
Defined in:
lib/enumerable_stats/enumerable_ext.rb

Overview

Extension module that adds statistical methods to all Enumerable objects.

This module provides essential statistical functions including measures of central tendency (mean, median), measures of dispersion (variance, standard deviation), percentile calculations, outlier detection using the IQR method, and statistical comparison methods.

When included, these methods become available on all Ruby collections that include Enumerable (Arrays, Ranges, Sets, etc.), enabling seamless statistical analysis without external dependencies.

Examples:

Basic statistical calculations

[1, 2, 3, 4, 5].mean          #=> 3.0
[1, 2, 3, 4, 5].median        #=> 3
[1, 2, 3, 4, 5].percentile(75) #=> 4

Outlier detection

data = [1, 2, 3, 4, 100]
data.remove_outliers           #=> [1, 2, 3, 4]
data.outlier_stats             #=> { outliers_removed: 1, outlier_percentage: 20.0, ... }

Statistical testing

control = [10, 12, 14, 16, 18]
treatment = [15, 17, 19, 21, 23]
control.t_value(treatment)     #=> negative t-statistic
control.degrees_of_freedom(treatment) #=> degrees of freedom for Welch's t-test
treatment.greater_than?(control) #=> true (treatment mean significantly > control mean)
control.less_than?(treatment)    #=> true (control mean significantly < treatment mean)

Since:

  • 0.1.0

Constant Summary

EPSILON =

Epsilon for floating point comparisons to avoid precision issues

Since:

  • 0.1.0

1e-10
COMMON_ALPHA_VALUES =

Common alpha levels with their corresponding high-precision z-scores (upper-tail critical values of the standard normal distribution). Used to avoid floating point comparison issues while maintaining backward compatibility.

Since:

  • 0.1.0

{
  0.10 => 1.2815515655446004,
  0.05 => 1.6448536269514722,
  0.025 => 1.9599639845400545,
  0.01 => 2.3263478740408408,
  0.005 => 2.5758293035489004,
  0.001 => 3.0902323061678132
}.freeze
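
The z-scores in this table appear to be upper-tail quantiles of the standard normal distribution, so each pair should satisfy P(Z > z) = alpha. A minimal sketch that checks this numerically with Ruby's Math.erfc follows; the helper names upper_tail_probability and alpha_to_z are hypothetical and are not part of the library.

```ruby
# Hypothetical helper (not part of EnumerableStats): P(Z > z) for a
# standard normal variable, via the complementary error function.
def upper_tail_probability(z)
  0.5 * Math.erfc(z / Math.sqrt(2.0))
end

# Values copied from COMMON_ALPHA_VALUES; wrapped in a method for the sketch.
def alpha_to_z
  {
    0.10 => 1.2815515655446004,
    0.05 => 1.6448536269514722,
    0.025 => 1.9599639845400545,
    0.01 => 2.3263478740408408,
    0.005 => 2.5758293035489004,
    0.001 => 3.0902323061678132
  }
end

alpha_to_z.each do |alpha, z|
  # The recovered tail probability should match the alpha key closely.
  puts format("alpha=%.3f z=%.4f P(Z > z)=%.6f", alpha, z, upper_tail_probability(z))
end
```

Each printed tail probability should agree with its alpha key to many decimal places, which is consistent with the table holding one-tailed critical values.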
CORNISH_FISHER_FOURTH_ORDER_DENOMINATOR =

Since:

  • 0.1.0

92_160.0
EDGEWORTH_SMALL_SAMPLE_COEFF =

Since:

  • 0.1.0

4.0
BSM_THRESHOLD =

Since:

  • 0.1.0

1e-20

Instance Method Details

#<(other, alpha: 0.05) ⇒ Boolean

Operator alias for less_than?. Tests if this collection’s mean is significantly less than another collection’s mean.

Examples:

optimized = [85, 95, 90, 100, 80]
baseline = [100, 110, 105, 115, 95]
optimized < baseline  # => true (optimized is significantly less)

Parameters:

  • other (Enumerable)

    Another collection to compare against

  • alpha (Float) (defaults to: 0.05)

    The significance level (default: 0.05 for 95% confidence)

Returns:

  • (Boolean)

    true if this collection’s mean is significantly less

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 192

def <(other, alpha: 0.05)
  less_than?(other, alpha: alpha)
end

#<=>(other, alpha: 0.05) ⇒ Integer

Compares this collection’s mean with another collection’s mean using two one-tailed t-tests (greater_than? and less_than?), each evaluated at the specified alpha level. Returns 1 if this collection’s mean is significantly greater, -1 if it is significantly less, and 0 if neither test indicates statistical significance.

Examples:

control = [10, 12, 11, 13, 12]     # mean = 11.6
treatment = [15, 17, 16, 18, 14]   # mean = 16.0
control <=> treatment              # => -1 (control is significantly less than treatment)
treatment <=> control              # => 1 (treatment is significantly greater than control)
control <=> control                # => 0 (control is not significantly different from itself)

Parameters:

  • other (Enumerable)

    Another collection to compare against

  • alpha (Float) (defaults to: 0.05)

    Significance level (default: 0.05 for 95% confidence)

Returns:

  • (Integer)

    1 if this collection’s mean is significantly greater, -1 if this collection’s mean is significantly less, 0 if this collection’s mean is not significantly different

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 212

def <=>(other, alpha: 0.05)
  if greater_than?(other, alpha: alpha)
    1
  elsif less_than?(other, alpha: alpha)
    -1
  else
    0
  end
end

#>(other, alpha: 0.05) ⇒ Boolean

Operator alias for greater_than?. Tests if this collection’s mean is significantly greater than another collection’s mean.

Examples:

baseline = [100, 110, 105, 115, 95]
optimized = [85, 95, 90, 100, 80]
baseline > optimized  # => true (baseline is significantly greater)

Parameters:

  • other (Enumerable)

    Another collection to compare against

  • alpha (Float) (defaults to: 0.05)

    The significance level (default: 0.05 for 95% confidence)

Returns:

  • (Boolean)

    true if this collection’s mean is significantly greater

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 179

def >(other, alpha: 0.05)
  greater_than?(other, alpha: alpha)
end

#degrees_of_freedom(other) ⇒ Float

Calculates the degrees of freedom for comparing two samples using Welch’s formula. This is used in statistical hypothesis testing when sample variances are unequal. The formula accounts for different sample sizes and variances between groups.

Examples:

sample_a = [10, 12, 14, 16, 18]
sample_b = [5, 15, 25, 35, 45, 55]
df = sample_a.degrees_of_freedom(sample_b)  # => ~5.3

Parameters:

  • other (Enumerable)

    Another collection to compare against

Returns:

  • (Float)

    Degrees of freedom for statistical testing

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 118

def degrees_of_freedom(other)
  n1 = variance / count
  n2 = other.variance / other.count

  n = (n1 + n2)**2

  d1 = (variance**2) / ((count**2) * (count - 1))
  d2 = (other.variance**2) / ((other.count**2) * (other.count - 1))

  n / (d1 + d2)
end
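
The listing above is an instance of the Welch-Satterthwaite approximation. A minimal standalone sketch of the same arithmetic is given below so the intermediate terms can be inspected; sample_variance and welch_df are hypothetical helper names, and the two samples are chosen here only for illustration.

```ruby
# Hypothetical standalone helpers mirroring the formula in the listing above.
def sample_variance(xs)
  mean = xs.sum / xs.size.to_f
  xs.sum { |x| (x - mean)**2 } / (xs.size - 1).to_f
end

# Welch-Satterthwaite degrees of freedom for two samples.
def welch_df(a, b)
  va = sample_variance(a) / a.size   # s_a^2 / n_a
  vb = sample_variance(b) / b.size   # s_b^2 / n_b
  (va + vb)**2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))
end

tight = [1, 2, 3, 4, 5]       # variance 2.5, n = 5
noisy = [10, 20, 30, 40]      # variance about 166.67, n = 4

puts welch_df(tight, noisy).round(2)  # about 3.07
puts welch_df(tight, tight)           # 8.0, i.e. 2 * (n - 1) for identical samples
```

When one sample is much noisier than the other, the result is pulled toward that sample's own n - 1 (here 3), which is why the value is generally not an integer.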

#greater_than?(other, alpha: 0.05) ⇒ Boolean

Tests if this collection’s mean is significantly greater than another collection’s mean using a one-tailed t-test based on Welch’s t-statistic and degrees of freedom. Returns true if the test indicates statistical significance at the specified alpha level.

Examples:

control = [10, 12, 11, 13, 12]     # mean ≈ 11.6
treatment = [15, 17, 16, 18, 14]   # mean = 16.0
treatment.greater_than?(control)   # => true (treatment significantly > control)
control.greater_than?(treatment)   # => false

Parameters:

  • other (Enumerable)

    Another collection to compare against

  • alpha (Float) (defaults to: 0.05)

    Significance level (default: 0.05 for 95% confidence)

Returns:

  • (Boolean)

    True if this collection’s mean is significantly greater

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 142

def greater_than?(other, alpha: 0.05)
  t_stat = t_value(other)
  df = degrees_of_freedom(other)
  critical_value = critical_t_value(df, alpha)

  t_stat > critical_value
end
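
The decision rule above compares the t-statistic with a one-tailed critical value. The sketch below reproduces that rule outside the library, under two stated assumptions: the helpers sample_mean, sample_variance and welch_t are hypothetical stand-ins, and the critical value 1.860 is the standard t-table entry for alpha = 0.05 with 8 degrees of freedom rather than the output of critical_t_value, whose implementation is not shown here.

```ruby
# Hypothetical stand-ins for the library's internals; only the decision
# rule (compare t against a critical value) mirrors the listing above.
def sample_mean(xs)
  xs.sum / xs.size.to_f
end

def sample_variance(xs)
  m = sample_mean(xs)
  xs.sum { |x| (x - m)**2 } / (xs.size - 1).to_f
end

def welch_t(a, b)
  (sample_mean(a) - sample_mean(b)) /
    Math.sqrt(sample_variance(a) / a.size + sample_variance(b) / b.size)
end

control = [10, 12, 14, 16, 18]
treatment = [15, 17, 19, 21, 23]

# Assumption: one-tailed t-table value for alpha = 0.05 and df = 8
# (Welch's formula gives exactly 8 for these two samples).
critical_value = 1.860

t_stat = welch_t(treatment, control)                 # 2.5 for these samples
puts t_stat > critical_value                         # greater_than? rule: true
puts welch_t(control, treatment) < -critical_value   # less_than? rule: true
```

Swapping the receiver and the argument flips the sign of the statistic, so at most one of the two one-tailed rules can hold for a given pair.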

#less_than?(other, alpha: 0.05) ⇒ Boolean

Tests if this collection’s mean is significantly less than another collection’s mean using a one-tailed t-test based on Welch’s t-statistic and degrees of freedom. Returns true if the test indicates statistical significance at the specified alpha level.

Examples:

control = [10, 12, 11, 13, 12]     # mean ≈ 11.6
treatment = [15, 17, 16, 18, 14]   # mean = 16.0
control.less_than?(treatment)      # => true (control significantly < treatment)
treatment.less_than?(control)      # => false

Parameters:

  • other (Enumerable)

    Another collection to compare against

  • alpha (Float) (defaults to: 0.05)

    Significance level (default: 0.05 for 95% confidence)

Returns:

  • (Boolean)

    True if this collection’s mean is significantly less

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 162

def less_than?(other, alpha: 0.05)
  t_stat = t_value(other)
  df = degrees_of_freedom(other)
  critical_value = critical_t_value(df, alpha)

  t_stat < -critical_value
end

#mean ⇒ Float

Calculates the arithmetic mean (average) of the collection

Examples:

[1, 2, 3, 4, 5].mean  # => 3.0
(1..10).mean          # => 5.5

Returns:

  • (Float)

    The arithmetic mean of all numeric values

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 228

def mean
  sum / size.to_f
end

#median ⇒ Numeric?

Calculates the median (middle value) of the collection. For collections with an even number of elements, returns the average of the two middle values.

Examples:

[1, 2, 3, 4, 5].median        # => 3
[1, 2, 3, 4].median           # => 2.5
[5, 1, 3, 2, 4].median        # => 3 (automatically sorts)
[].median                     # => nil

Returns:

  • (Numeric, nil)

    The median value, or nil if the collection is empty

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 241

def median
  return nil if size.zero?

  sorted = sort
  midpoint = size / 2

  if size.even?
    sorted[midpoint - 1, 2].sum / 2.0
  else
    sorted[midpoint]
  end
end

#outlier_stats(multiplier: 1.5) ⇒ Hash

Returns statistics about outlier removal for debugging/logging. Provides detailed information about how many outliers were removed and their percentage.

Examples:

data = [1, 2, 3, 4, 5, 100]
stats = data.outlier_stats
# => {original_count: 6, filtered_count: 5, outliers_removed: 1, outlier_percentage: 16.67}

Parameters:

  • multiplier (Float) (defaults to: 1.5)

    IQR multiplier for outlier detection (1.5 is standard, 2.0 is more conservative)

Returns:

  • (Hash)

    Statistics hash containing :original_count, :filtered_count, :outliers_removed, :outlier_percentage

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 382

def outlier_stats(multiplier: 1.5)
  original_count = size
  filtered = remove_outliers(multiplier: multiplier)

  {
    original_count: original_count,
    filtered_count: filtered.size,
    outliers_removed: original_count - filtered.size,
    outlier_percentage: ((original_count - filtered.size).to_f / original_count * 100).round(2)
  }
end

#percentage_difference(other) ⇒ Float

Calculates the percentage difference between this collection’s mean and another value or collection’s mean. Uses the symmetric percentage difference formula: |a - b| / ((a + b) / 2) * 100. This is useful for comparing datasets or metrics where direction doesn’t matter.

Parameters:

  • other (Numeric, Enumerable)

    Value or collection to compare against

Returns:

  • (Float)

    Absolute percentage difference (never negative); 0.0 when the two means are equal, Float::INFINITY when they sum to zero

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 59

def percentage_difference(other)
  a = mean.to_f
  b = other.respond_to?(:mean) ? other.mean.to_f : other.to_f

  return 0.0 if a == b
  return Float::INFINITY if (a + b).zero?

  ((a - b).abs / ((a + b) / 2.0).abs) * 100
end
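
No example is given for this method, so here is a minimal sketch of the same formula applied to two plain numbers. The function name symmetric_percentage_difference is hypothetical; it only mirrors the arithmetic and guard clauses of the listing above.

```ruby
# Hypothetical standalone version of the symmetric formula above,
# taking two plain numbers instead of collections.
def symmetric_percentage_difference(a, b)
  a = a.to_f
  b = b.to_f
  return 0.0 if a == b
  return Float::INFINITY if (a + b).zero?

  (a - b).abs / ((a + b) / 2.0).abs * 100
end

puts symmetric_percentage_difference(100, 110).round(2)  # 9.52
puts symmetric_percentage_difference(110, 100).round(2)  # 9.52, order does not matter
puts symmetric_percentage_difference(50, 50)             # 0.0
puts symmetric_percentage_difference(5, -5)              # Infinity, the midpoint is zero
```

Because the denominator is the midpoint of the two values rather than either one of them, swapping the arguments gives the same result.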

#percentile(percentile) ⇒ Numeric?

Calculates the specified percentile of the collection. Uses linear interpolation between data points when the exact percentile falls between values. This is equivalent to the “linear” method used by many statistical software packages.

Examples:

[1, 2, 3, 4, 5].percentile(50)    # => 3 (same as median)
[1, 2, 3, 4, 5].percentile(25)    # => 2 (25th percentile)
[1, 2, 3, 4, 5].percentile(75)    # => 4 (75th percentile)
[1, 2, 3, 4, 5].percentile(0)     # => 1 (minimum value)
[1, 2, 3, 4, 5].percentile(100)   # => 5 (maximum value)
[].percentile(50)                 # => nil (empty collection)

Parameters:

  • percentile (Numeric)

    The percentile to calculate (0-100)

Returns:

  • (Numeric, nil)

    The value at the specified percentile, or nil if the collection is empty

Raises:

  • (ArgumentError)

    If percentile is not between 0 and 100

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 268

def percentile(percentile)
  return nil if size.zero?

  unless percentile.is_a?(Numeric) && percentile >= 0 && percentile <= 100
    raise ArgumentError, "Percentile must be a number between 0 and 100, got #{percentile}"
  end

  sorted = sort

  # Handle edge cases
  return sorted.first if percentile.zero?
  return sorted.last if percentile == 100

  # Calculate the position using the "linear" method (R-7/Excel method)
  # This is the most commonly used method in statistical software
  position = (size - 1) * (percentile / 100.0)

  # If position is an integer, return that exact element
  if position == position.floor
    sorted[position.to_i]
  else
    # Linear interpolation between the two surrounding values
    lower_index = position.floor
    upper_index = position.ceil
    weight = position - position.floor

    lower_value = sorted[lower_index]
    upper_value = sorted[upper_index]

    lower_value + (weight * (upper_value - lower_value))
  end
end
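
The examples for this method all land on an exact element. The sketch below isolates the interpolation branch for positions that fall between two elements; linear_percentile is a hypothetical helper that omits the argument validation and the empty-collection check of the listing above.

```ruby
# Hypothetical standalone version of the interpolation step above.
# Unlike the library method, it always goes through the arithmetic,
# so it returns a Float even when the position is a whole number.
def linear_percentile(values, pct)
  sorted = values.sort
  position = (sorted.size - 1) * (pct / 100.0)
  lower = sorted[position.floor]
  upper = sorted[position.ceil]
  lower + (position - position.floor) * (upper - lower)
end

puts linear_percentile([1, 2, 3, 4], 50)          # 2.5, halfway between 2 and 3
puts linear_percentile([10, 20, 30, 40, 50], 90)  # close to 46.0 (position 3.6)
```

For four elements the 50th percentile sits at position 1.5, so the result is the midpoint of the second and third sorted values, matching the median.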

#remove_outliers(multiplier: 1.5) ⇒ Array

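Removes extreme outliers using the IQR (Interquartile Range) method: values outside the range Q1 - multiplier * IQR to Q3 + multiplier * IQR are dropped. Collections with fewer than 4 elements are returned unchanged. This is particularly effective for performance data, which often has extreme values due to network issues, CPU scheduling, GC pauses, etc.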

Parameters:

  • multiplier (Float) (defaults to: 1.5)

    IQR multiplier (1.5 is standard, 2.0 is more conservative)

Returns:

  • (Array)

    Array with outliers removed

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 331

def remove_outliers(multiplier: 1.5)
  return self if size < 4 # Need minimum data points for quartiles

  sorted = sort
  n = size

  # Use the standard quartile calculation with interpolation
  # Q1 position = (n-1) * 0.25
  # Q3 position = (n-1) * 0.75
  q1_pos = (n - 1) * 0.25
  q3_pos = (n - 1) * 0.75

  # Calculate Q1
  if q1_pos == q1_pos.floor
    q1 = sorted[q1_pos.to_i]
  else
    lower_index = q1_pos.floor
    upper_index = q1_pos.ceil
    weight = q1_pos - q1_pos.floor
    q1 = sorted[lower_index] + (weight * (sorted[upper_index] - sorted[lower_index]))
  end

  # Calculate Q3
  if q3_pos == q3_pos.floor
    q3 = sorted[q3_pos.to_i]
  else
    lower_index = q3_pos.floor
    upper_index = q3_pos.ceil
    weight = q3_pos - q3_pos.floor
    q3 = sorted[lower_index] + (weight * (sorted[upper_index] - sorted[lower_index]))
  end

  iqr = q3 - q1

  # Calculate bounds
  lower_bound = q1 - (multiplier * iqr)
  upper_bound = q3 + (multiplier * iqr)

  # Filter out outliers
  select { |value| value.between?(lower_bound, upper_bound) }
end
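
To make the bounds concrete, the following sketch computes them for a small data set using the same (n - 1) * p quartile positions as the listing above. The helper iqr_bounds is hypothetical and exists only for this illustration.

```ruby
# Hypothetical helper: returns [lower_bound, upper_bound] using the same
# (n - 1) * p quartile positions as the method above.
def iqr_bounds(values, multiplier: 1.5)
  sorted = values.sort
  quartile = lambda do |fraction|
    pos = (sorted.size - 1) * fraction
    low = sorted[pos.floor]
    high = sorted[pos.ceil]
    low + (pos - pos.floor) * (high - low)
  end
  q1 = quartile.call(0.25)
  q3 = quartile.call(0.75)
  iqr = q3 - q1
  [q1 - multiplier * iqr, q3 + multiplier * iqr]
end

data = [1, 2, 3, 4, 5, 100]
lower, upper = iqr_bounds(data)    # Q1 = 2.25, Q3 = 4.75, IQR = 2.5
puts "bounds: #{lower}..#{upper}"  # bounds: -1.5..8.5
puts data.select { |v| v.between?(lower, upper) }.inspect  # [1, 2, 3, 4, 5]
```

A larger multiplier widens both bounds, so fewer values are treated as outliers.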

#signed_percentage_difference(other) ⇒ Float

Calculates the signed percentage difference between this collection’s mean and another value or collection’s mean. Uses the signed percentage difference formula: (a - b) / ((a + b) / 2) * 100. Useful for performance comparisons where direction matters (e.g., improvements vs regressions).

Parameters:

  • other (Numeric, Enumerable)

    Value or collection to compare against

Returns:

  • (Float)

    Signed percentage difference (positive = this collection is higher, negative = other is higher)

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 75

def signed_percentage_difference(other)
  a = mean.to_f
  b = other.respond_to?(:mean) ? other.mean.to_f : other.to_f

  return 0.0 if a == b
  return Float::INFINITY if (a + b).zero?

  ((a - b) / ((a + b) / 2.0).abs) * 100
end
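
As a quick illustration of how the sign carries direction, here is a minimal sketch of the same formula on plain numbers. The function name signed_difference is hypothetical, and the inputs 90.0 and 105.0 are simply two example means.

```ruby
# Hypothetical standalone version of the signed formula above,
# taking two plain numbers instead of collections.
def signed_difference(a, b)
  a = a.to_f
  b = b.to_f
  return 0.0 if a == b
  return Float::INFINITY if (a + b).zero?

  (a - b) / ((a + b) / 2.0).abs * 100
end

puts signed_difference(90.0, 105.0).round(2)   # -15.38, the first value is lower
puts signed_difference(105.0, 90.0).round(2)   # 15.38, the first value is higher
```

Swapping the arguments changes only the sign, so the magnitude matches what the symmetric formula would report.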

#standard_deviation ⇒ Float

Calculates the sample standard deviation of the collection. Returns the square root of the sample variance.

Examples:

[1, 2, 3, 4, 5].standard_deviation    # => ~1.58
[5, 5, 5, 5].standard_deviation       # => 0.0

Returns:

  • (Float)

    The sample standard deviation

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 321

def standard_deviation
  Math.sqrt variance
end

#t_value(other) ⇒ Float

Calculates the t-statistic for comparing the means of two samples. Uses Welch’s t-test formula, which doesn’t assume equal variances. A larger absolute t-value indicates a greater difference between sample means.

Examples:

control = [10, 12, 11, 13, 12]
treatment = [15, 17, 16, 18, 14]
t_stat = control.t_value(treatment)  # => ~-5.05 (negative means treatment > control)

Parameters:

  • other (Enumerable)

    Another collection to compare against

Returns:

  • (Float)

    The t-statistic value (can be positive or negative)

Raises:

  • (ArgumentError)

    If either collection is empty, or if other does not respond to mean

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 95

def t_value(other)
  raise ArgumentError, "Cannot compare with an empty collection" if empty? || other.empty?
  raise ArgumentError, "Parameter must be an Enumerable" unless other.respond_to?(:mean)

  signal = (mean - other.mean)
  noise = Math.sqrt(
    ((standard_deviation**2) / count) +
      ((other.standard_deviation**2) / other.count)
  )

  (signal / noise)
end
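
Written out, the signal and noise terms in the listing above correspond to the usual Welch statistic. The symbols are introduced here for illustration only: subscript 1 refers to this collection and subscript 2 to other, with sample means \bar{x}, sample variances s^2 and sample sizes n.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```

As a worked case, the samples [10, 12, 14, 16, 18] and [15, 17, 19, 21, 23] have means 14 and 19 and a sample variance of 10 each, so t = (14 - 19) / sqrt(10/5 + 10/5) = -5 / 2 = -2.5.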

#variance ⇒ Float

Calculates the sample variance of the collection. Uses the unbiased formula with n-1 degrees of freedom (Bessel’s correction).

Examples:

[1, 2, 3, 4, 5].variance      # => 2.5
[5, 5, 5, 5].variance         # => 0.0 (no variation)

Returns:

  • (Float)

    The sample variance

Since:

  • 0.1.0



# File 'lib/enumerable_stats/enumerable_ext.rb', line 308

def variance
  mean = self.mean
  sum_of_squares = sum { |r| (r - mean)**2 }
  sum_of_squares / (count - 1).to_f
end
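
To show what Bessel’s correction changes, the sketch below compares the n - 1 divisor used above with the population formula that divides by n. Both function names are hypothetical; population_variance in particular is not something this module is documented to provide.

```ruby
# Hypothetical standalone helpers contrasting the two divisors.
def sample_variance(xs)
  mean = xs.sum / xs.size.to_f
  xs.sum { |x| (x - mean)**2 } / (xs.size - 1).to_f   # divides by n - 1
end

def population_variance(xs)
  mean = xs.sum / xs.size.to_f
  xs.sum { |x| (x - mean)**2 } / xs.size.to_f          # divides by n
end

values = [1, 2, 3, 4, 5]
puts sample_variance(values)       # 2.5
puts population_variance(values)   # 2.0

# With a single element the n - 1 divisor is zero, so the float
# division 0.0 / 0.0 yields NaN rather than raising.
puts sample_variance([7]).nan?     # true
```

The single-element case suggests that callers may want at least two data points before relying on variance or anything derived from it.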