Module: EnumerableStats::EnumerableExt
- Included in:
- Enumerable
- Defined in:
- lib/enumerable_stats/enumerable_ext.rb
Overview
Extension module that adds statistical methods to all Enumerable objects.
This module provides essential statistical functions including measures of central tendency (mean, median), measures of dispersion (variance, standard deviation), percentile calculations, outlier detection using the IQR method, and statistical comparison methods.
When included, these methods become available on all Ruby collections that include Enumerable (Arrays, Ranges, Sets, etc.), enabling seamless statistical analysis without external dependencies.
Constant Summary
- EPSILON =
Epsilon for floating point comparisons to avoid precision issues
1e-10
- COMMON_ALPHA_VALUES =
Common alpha levels with their corresponding high-precision z-scores (one-tailed upper critical values of the standard normal distribution). Used to avoid floating point comparison issues while maintaining backward compatibility.
{ 0.10 => 1.2815515655446004, 0.05 => 1.6448536269514722, 0.025 => 1.9599639845400545, 0.01 => 2.3263478740408408, 0.005 => 2.5758293035489004, 0.001 => 3.0902323061678132 }.freeze
- CORNISH_FISHER_FOURTH_ORDER_DENOMINATOR =
92_160.0
- EDGEWORTH_SMALL_SAMPLE_COEFF =
4.0
- BSM_THRESHOLD =
1e-20
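This reference does not show how EPSILON and COMMON_ALPHA_VALUES are consumed, so the following is only a sketch of one way a tolerance-based lookup could work. The helper name `lookup_z_score` and its parameters are hypothetical and not part of the documented API; the table values are copied from the constant summary above.

```ruby
# Values copied from the constant summary above.
epsilon = 1e-10
common_alpha_values = {
  0.10 => 1.2815515655446004,
  0.05 => 1.6448536269514722,
  0.025 => 1.9599639845400545,
  0.01 => 2.3263478740408408,
  0.005 => 2.5758293035489004,
  0.001 => 3.0902323061678132
}.freeze

# Hypothetical helper: return the z-score whose alpha key lies within
# `epsilon` of the requested alpha, or nil when no key is close enough.
def lookup_z_score(alpha, table, epsilon: 1e-10)
  match = table.find { |known_alpha, _z| (known_alpha - alpha).abs < epsilon }
  match&.last
end

# An alpha produced by arithmetic is not bit-identical to the literal 0.05.
computed_alpha = 0.15 - 0.10

p common_alpha_values[computed_alpha]                                   # exact Hash lookup
p lookup_z_score(computed_alpha, common_alpha_values, epsilon: epsilon) # tolerant lookup
```

The exact Hash lookup misses because the computed key differs from 0.05 in its last bits, while the tolerant lookup still finds the entry for 0.05.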
Instance Method Summary
-
#<(other, alpha: 0.05) ⇒ Boolean
Operator alias for less_than?: tests if this collection’s mean is significantly less than another collection’s mean.
-
#<=>(other, alpha: 0.05) ⇒ Integer
Compares this collection’s mean with another collection’s mean and reports the direction of any statistically significant difference, using the one-tailed tests behind greater_than? and less_than?.
-
#>(other, alpha: 0.05) ⇒ Boolean
Operator alias for greater_than?: tests if this collection’s mean is significantly greater than another collection’s mean.
-
#degrees_of_freedom(other) ⇒ Float
Calculates the degrees of freedom for comparing two samples using Welch’s formula. This is used in statistical hypothesis testing when sample variances are unequal. The formula accounts for different sample sizes and variances between groups.
-
#greater_than?(other, alpha: 0.05) ⇒ Boolean
Tests if this collection’s mean is significantly greater than another collection’s mean using a one-tailed t-test (Welch’s variant, which does not assume equal variances).
-
#less_than?(other, alpha: 0.05) ⇒ Boolean
Tests if this collection’s mean is significantly less than another collection’s mean using a one-tailed t-test (Welch’s variant, which does not assume equal variances).
-
#mean ⇒ Float
Calculates the arithmetic mean (average) of the collection.
-
#median ⇒ Numeric?
Calculates the median (middle value) of the collection. For collections with an even number of elements, returns the average of the two middle values.
-
#outlier_stats(multiplier: 1.5) ⇒ Hash
Returns statistics about outlier removal for debugging and logging. Provides detailed information about how many outliers were removed and their percentage.
-
#percentage_difference(other) ⇒ Float
Calculates the percentage difference between this collection’s mean and another value or collection’s mean. Uses the symmetric percentage difference formula: |a - b| / ((a + b) / 2) * 100. This is useful for comparing datasets or metrics where direction doesn’t matter.
-
#percentile(percentile) ⇒ Numeric?
Calculates the specified percentile of the collection. Uses linear interpolation between data points when the exact percentile falls between values. This is equivalent to the “linear” method used by many statistical software packages.
-
#remove_outliers(multiplier: 1.5) ⇒ Array
Removes extreme outliers using the IQR (Interquartile Range) method. This is particularly effective for performance data, which often has extreme values due to network issues, CPU scheduling, GC pauses, etc.
-
#signed_percentage_difference(other) ⇒ Float
Calculates the signed percentage difference between this collection’s mean and another value or collection’s mean. Uses the signed percentage difference formula: (a - b) / ((a + b) / 2) * 100. Useful for performance comparisons where direction matters (e.g., improvements vs regressions).
-
#standard_deviation ⇒ Float
Calculates the sample standard deviation of the collection. Returns the square root of the sample variance.
-
#t_value(other) ⇒ Float
Calculates the t-statistic for comparing the means of two samples. Uses Welch’s t-test formula, which doesn’t assume equal variances. A larger absolute t-value indicates a greater difference between sample means.
-
#variance ⇒ Float
Calculates the sample variance of the collection. Uses the unbiased formula with n-1 degrees of freedom (Bessel’s correction).
Instance Method Details
#<(other, alpha: 0.05) ⇒ Boolean
Operator alias for less_than?: tests if this collection’s mean is significantly less than another collection’s mean.

# File 'lib/enumerable_stats/enumerable_ext.rb', line 192

def <(other, alpha: 0.05)
  less_than?(other, alpha: alpha)
end
#<=>(other, alpha: 0.05) ⇒ Integer
Compares this collection’s mean with another collection’s mean. Returns 1 if this collection’s mean is significantly greater at the specified alpha level, -1 if it is significantly less, and 0 if neither test indicates statistical significance. The result comes from the one-tailed greater_than? and less_than? tests, each evaluated at alpha.

# File 'lib/enumerable_stats/enumerable_ext.rb', line 212

def <=>(other, alpha: 0.05)
  if greater_than?(other, alpha: alpha)
    1
  elsif less_than?(other, alpha: alpha)
    -1
  else
    0
  end
end
#>(other, alpha: 0.05) ⇒ Boolean
Operator alias for greater_than?: tests if this collection’s mean is significantly greater than another collection’s mean.

# File 'lib/enumerable_stats/enumerable_ext.rb', line 179

def >(other, alpha: 0.05)
  greater_than?(other, alpha: alpha)
end
#degrees_of_freedom(other) ⇒ Float
Calculates the degrees of freedom for comparing two samples using Welch’s formula. This is used in statistical hypothesis testing when sample variances are unequal. The formula accounts for different sample sizes and variances between groups.

# File 'lib/enumerable_stats/enumerable_ext.rb', line 118

def degrees_of_freedom(other)
  n1 = variance / count
  n2 = other.variance / other.count

  n = (n1 + n2)**2

  d1 = (variance**2) / ((count**2) * (count - 1))
  d2 = (other.variance**2) / ((other.count**2) * (other.count - 1))

  n / (d1 + d2)
end
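As a worked illustration of Welch’s degrees-of-freedom formula, here is a minimal standalone sketch that follows the same arithmetic on plain arrays. The helper names `sample_variance` and `welch_degrees_of_freedom` are made up for this example and are not part of the module.

```ruby
# Sample variance with Bessel's correction (divide by n - 1).
def sample_variance(values)
  mean = values.sum / values.size.to_f
  values.sum { |v| (v - mean)**2 } / (values.size - 1).to_f
end

# Welch-Satterthwaite degrees of freedom for two samples.
def welch_degrees_of_freedom(a, b)
  va = sample_variance(a)
  vb = sample_variance(b)

  numerator = ((va / a.size) + (vb / b.size))**2
  denominator = (va**2 / ((a.size**2) * (a.size - 1))) +
                (vb**2 / ((b.size**2) * (b.size - 1)))

  numerator / denominator
end

a = [1, 2, 3, 4, 5]
b = [2, 3, 4, 5, 6]
wide = [0, 10, 20, 30, 40]

# Equal sizes and equal variances: the formula reduces to 2 * (n - 1).
puts welch_degrees_of_freedom(a, b)

# Very different variances pull the result down toward the smaller n - 1.
puts welch_degrees_of_freedom(a, wide)
```

The result is generally not an integer, which is why the method is documented as returning a Float.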
#greater_than?(other, alpha: 0.05) ⇒ Boolean
Tests if this collection’s mean is significantly greater than another collection’s mean using a one-tailed t-test. The test statistic comes from t_value and the degrees of freedom from degrees_of_freedom, both of which use Welch’s formulas. Returns true if the test indicates statistical significance at the specified alpha level.

# File 'lib/enumerable_stats/enumerable_ext.rb', line 142

def greater_than?(other, alpha: 0.05)
  t_stat = t_value(other)
  df = degrees_of_freedom(other)
  critical_value = critical_t_value(df, alpha)

  t_stat > critical_value
end
#less_than?(other, alpha: 0.05) ⇒ Boolean
Tests if this collection’s mean is significantly less than another collection’s mean using a one-tailed t-test. The test statistic comes from t_value and the degrees of freedom from degrees_of_freedom, both of which use Welch’s formulas. Returns true if the test indicates statistical significance at the specified alpha level.

# File 'lib/enumerable_stats/enumerable_ext.rb', line 162

def less_than?(other, alpha: 0.05)
  t_stat = t_value(other)
  df = degrees_of_freedom(other)
  critical_value = critical_t_value(df, alpha)

  t_stat < -critical_value
end
#mean ⇒ Float
Calculates the arithmetic mean (average) of the collection.

# File 'lib/enumerable_stats/enumerable_ext.rb', line 228

def mean
  sum / size.to_f
end
#median ⇒ Numeric?
Calculates the median (middle value) of the collection. For collections with an even number of elements, returns the average of the two middle values. Returns nil for an empty collection.

# File 'lib/enumerable_stats/enumerable_ext.rb', line 241

def median
  return nil if size.zero?

  sorted = sort
  midpoint = size / 2

  if size.even?
    sorted[midpoint - 1, 2].sum / 2.0
  else
    sorted[midpoint]
  end
end
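The odd and even cases can be checked with a small standalone sketch. The function name `median_of` is hypothetical; it applies the same steps to a plain array rather than calling the module.

```ruby
# Standalone median: sort, then take the middle element (odd count)
# or the average of the two middle elements (even count).
def median_of(values)
  return nil if values.empty?

  sorted = values.sort
  midpoint = sorted.size / 2

  if sorted.size.even?
    sorted[midpoint - 1, 2].sum / 2.0
  else
    sorted[midpoint]
  end
end

p median_of([3, 1, 2])    # odd count: the middle element, 2
p median_of([4, 1, 3, 2]) # even count: (2 + 3) / 2.0 = 2.5
p median_of([])           # empty input: nil
```

Note that the odd case returns the element itself, so the return type follows the input (an Integer here), while the even case always returns a Float.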
#outlier_stats(multiplier: 1.5) ⇒ Hash
Returns statistics about outlier removal for debugging and logging. Provides detailed information about how many outliers were removed and their percentage.

# File 'lib/enumerable_stats/enumerable_ext.rb', line 382

def outlier_stats(multiplier: 1.5)
  original_count = size
  filtered = remove_outliers(multiplier: multiplier)

  {
    original_count: original_count,
    filtered_count: filtered.size,
    outliers_removed: original_count - filtered.size,
    outlier_percentage: ((original_count - filtered.size).to_f / original_count * 100).round(2)
  }
end
#percentage_difference(other) ⇒ Float
Calculates the percentage difference between this collection’s mean and another value or collection’s mean. Uses the symmetric percentage difference formula: |a - b| / |(a + b) / 2| * 100. This is useful for comparing datasets or metrics where direction doesn’t matter. Returns 0.0 when the two means are equal and Float::INFINITY when they sum to zero.

# File 'lib/enumerable_stats/enumerable_ext.rb', line 59

def percentage_difference(other)
  a = mean.to_f
  b = other.respond_to?(:mean) ? other.mean.to_f : other.to_f

  return 0.0 if a == b
  return Float::INFINITY if (a + b).zero?

  ((a - b).abs / ((a + b) / 2.0).abs) * 100
end
#percentile(percentile) ⇒ Numeric?
Calculates the specified percentile of the collection. Uses linear interpolation between data points when the exact percentile falls between values. This is equivalent to the “linear” method used by many statistical software packages. Returns nil for an empty collection and raises ArgumentError if the argument is not a number between 0 and 100.

# File 'lib/enumerable_stats/enumerable_ext.rb', line 268

def percentile(percentile)
  return nil if size.zero?

  unless percentile.is_a?(Numeric) && percentile >= 0 && percentile <= 100
    raise ArgumentError, "Percentile must be a number between 0 and 100, got #{percentile}"
  end

  sorted = sort

  # Handle edge cases
  return sorted.first if percentile.zero?
  return sorted.last if percentile == 100

  # Calculate the position using the "linear" method (R-7/Excel method)
  # This is the most commonly used method in statistical software
  position = (size - 1) * (percentile / 100.0)

  # If position is an integer, return that exact element
  if position == position.floor
    sorted[position.to_i]
  else
    # Linear interpolation between the two surrounding values
    lower_index = position.floor
    upper_index = position.ceil
    weight = position - position.floor

    lower_value = sorted[lower_index]
    upper_value = sorted[upper_index]

    lower_value + (weight * (upper_value - lower_value))
  end
end
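To make the interpolation step concrete, here is a condensed standalone sketch of the same position rule, (n - 1) * p / 100. The name `percentile_r7` is hypothetical, and the sketch deliberately omits the argument validation and the integer-position shortcut shown in the listing, so it always returns a Float.

```ruby
# Linear-interpolation percentile (the R-7 position rule) on a plain array.
# Assumes 0 <= pct <= 100; no validation is performed in this sketch.
def percentile_r7(values, pct)
  return nil if values.empty?

  sorted = values.sort
  position = (sorted.size - 1) * (pct / 100.0)

  lower = position.floor
  upper = position.ceil
  weight = position - lower # fractional distance between the two neighbours

  sorted[lower] + (weight * (sorted[upper] - sorted[lower]))
end

data = [40, 10, 30, 20]

p percentile_r7(data, 50)  # position 1.5: halfway between 20 and 30
p percentile_r7(data, 25)  # position 0.75: three quarters of the way from 10 to 20
p percentile_r7(data, 100) # position 3.0: the maximum
```

With four values the 50th percentile falls between the second and third sorted elements, so the result is 25.0, the same value the median would give.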
#remove_outliers(multiplier: 1.5) ⇒ Array
Removes extreme outliers using the IQR (Interquartile Range) method. This is particularly effective for performance data, which often has extreme values due to network issues, CPU scheduling, GC pauses, etc. Values outside [Q1 - multiplier * IQR, Q3 + multiplier * IQR] are dropped; collections with fewer than four elements are returned unchanged.

# File 'lib/enumerable_stats/enumerable_ext.rb', line 331

def remove_outliers(multiplier: 1.5)
  return self if size < 4 # Need minimum data points for quartiles

  sorted = sort
  n = size

  # Use the standard quartile calculation with interpolation
  # Q1 position = (n-1) * 0.25
  # Q3 position = (n-1) * 0.75
  q1_pos = (n - 1) * 0.25
  q3_pos = (n - 1) * 0.75

  # Calculate Q1
  if q1_pos == q1_pos.floor
    q1 = sorted[q1_pos.to_i]
  else
    lower_index = q1_pos.floor
    upper_index = q1_pos.ceil
    weight = q1_pos - q1_pos.floor
    q1 = sorted[lower_index] + (weight * (sorted[upper_index] - sorted[lower_index]))
  end

  # Calculate Q3
  if q3_pos == q3_pos.floor
    q3 = sorted[q3_pos.to_i]
  else
    lower_index = q3_pos.floor
    upper_index = q3_pos.ceil
    weight = q3_pos - q3_pos.floor
    q3 = sorted[lower_index] + (weight * (sorted[upper_index] - sorted[lower_index]))
  end

  iqr = q3 - q1

  # Calculate bounds
  lower_bound = q1 - (multiplier * iqr)
  upper_bound = q3 + (multiplier * iqr)

  # Filter out outliers
  select { |value| value.between?(lower_bound, upper_bound) }
end
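The IQR filter can be traced by hand on a small data set. The sketch below is a standalone approximation of the same steps; `quantile` and `iqr_filter` are hypothetical helper names, and the timing values are made up for illustration.

```ruby
# Interpolated quantile at a fraction in [0, 1], using position (n - 1) * fraction.
def quantile(sorted, fraction)
  position = (sorted.size - 1) * fraction
  lower = position.floor
  upper = position.ceil
  sorted[lower] + ((position - lower) * (sorted[upper] - sorted[lower]))
end

# Keep only values inside [Q1 - k * IQR, Q3 + k * IQR].
def iqr_filter(values, multiplier: 1.5)
  return values if values.size < 4 # too few points for meaningful quartiles

  sorted = values.sort
  q1 = quantile(sorted, 0.25)
  q3 = quantile(sorted, 0.75)
  iqr = q3 - q1

  lower_bound = q1 - (multiplier * iqr)
  upper_bound = q3 + (multiplier * iqr)

  values.select { |value| value.between?(lower_bound, upper_bound) }
end

# Made-up response times with one obvious spike.
timings = [10, 12, 11, 13, 12, 11, 100]

# Q1 = 11.0, Q3 = 12.5, IQR = 1.5, so the bounds are 8.75 and 14.75.
p iqr_filter(timings)
```

Only the value 100 falls outside the bounds, and the surviving elements keep their original order because the filter runs over the unsorted input.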
#signed_percentage_difference(other) ⇒ Float
Calculates the signed percentage difference between this collection’s mean and another value or collection’s mean. Uses the signed percentage difference formula: (a - b) / |(a + b) / 2| * 100. Useful for performance comparisons where direction matters (e.g., improvements vs regressions). Returns 0.0 when the two means are equal and Float::INFINITY when they sum to zero.

# File 'lib/enumerable_stats/enumerable_ext.rb', line 75

def signed_percentage_difference(other)
  a = mean.to_f
  b = other.respond_to?(:mean) ? other.mean.to_f : other.to_f

  return 0.0 if a == b
  return Float::INFINITY if (a + b).zero?

  ((a - b) / ((a + b) / 2.0).abs) * 100
end
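The symmetric and signed formulas differ only in whether the sign of a - b is kept. A minimal sketch on two plain numbers, standing in for the two means, shows the contrast; `symmetric_pct_diff` and `signed_pct_diff` are hypothetical names used only here.

```ruby
# Symmetric form: direction is discarded.
def symmetric_pct_diff(a, b)
  return 0.0 if a == b
  return Float::INFINITY if (a + b).zero?

  ((a - b).abs / ((a + b) / 2.0).abs) * 100
end

# Signed form: negative when a is below b, positive when a is above b.
def signed_pct_diff(a, b)
  return 0.0 if a == b
  return Float::INFINITY if (a + b).zero?

  ((a - b) / ((a + b) / 2.0).abs) * 100
end

baseline = 100.0
candidate = 110.0

puts symmetric_pct_diff(baseline, candidate) # same value in either argument order
puts signed_pct_diff(baseline, candidate)    # negative: baseline is lower
puts signed_pct_diff(candidate, baseline)    # positive: candidate is higher
```

Both forms divide by the average of the two values, so a difference of 10 between 100 and 110 comes out at roughly 9.52 percent rather than 10 percent.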
#standard_deviation ⇒ Float
Calculates the sample standard deviation of the collection. Returns the square root of the sample variance.

# File 'lib/enumerable_stats/enumerable_ext.rb', line 321

def standard_deviation
  Math.sqrt variance
end
#t_value(other) ⇒ Float
Calculates the t-statistic for comparing the means of two samples. Uses Welch’s t-test formula, which doesn’t assume equal variances. A larger absolute t-value indicates a greater difference between sample means. Raises ArgumentError if either collection is empty or if the argument does not respond to mean.

# File 'lib/enumerable_stats/enumerable_ext.rb', line 95

def t_value(other)
  raise ArgumentError, "Cannot compare with an empty collection" if empty? || other.empty?
  raise ArgumentError, "Parameter must be an Enumerable" unless other.respond_to?(:mean)

  signal = (mean - other.mean)
  noise = Math.sqrt(
    ((standard_deviation**2) / count) +
    ((other.standard_deviation**2) / other.count)
  )

  (signal / noise)
end
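The signal-over-noise structure of Welch’s t-statistic is easy to verify on small samples. The following standalone sketch uses the sample variance directly, which equals the squared standard deviation used in the listing up to rounding; `mean_of`, `variance_of`, and `welch_t` are hypothetical names for this example.

```ruby
def mean_of(values)
  values.sum / values.size.to_f
end

# Sample variance with Bessel's correction.
def variance_of(values)
  mean = mean_of(values)
  values.sum { |v| (v - mean)**2 } / (values.size - 1).to_f
end

# Welch's t: difference of means divided by the combined standard error.
def welch_t(a, b)
  signal = mean_of(a) - mean_of(b)
  noise = Math.sqrt((variance_of(a) / a.size) + (variance_of(b) / b.size))
  signal / noise
end

a = [1, 2, 3, 4, 5]
b = [2, 3, 4, 5, 6]

# Means differ by -1 and the combined standard error is sqrt(0.5 + 0.5) = 1.
puts welch_t(a, b)
puts welch_t(b, a) # swapping the samples flips the sign
```

The sign carries the direction of the difference, which is what lets the one-tailed comparisons distinguish "greater" from "less".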
#variance ⇒ Float
Calculates the sample variance of the collection. Uses the unbiased formula with n-1 degrees of freedom (Bessel’s correction).

# File 'lib/enumerable_stats/enumerable_ext.rb', line 308

def variance
  mean = self.mean
  sum_of_squares = sum { |r| (r - mean)**2 }
  sum_of_squares / (count - 1).to_f
end
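The effect of Bessel’s correction is easiest to see by comparing the n - 1 divisor with the plain n divisor on the same data. This is a standalone sketch with hypothetical helper names (`sum_of_squared_deviations`, `sample_variance_of`, `population_variance_of`); only the n - 1 form corresponds to the method documented here.

```ruby
def sum_of_squared_deviations(values)
  mean = values.sum / values.size.to_f
  values.sum { |v| (v - mean)**2 }
end

# Sample variance: divide by n - 1 (Bessel's correction).
def sample_variance_of(values)
  sum_of_squared_deviations(values) / (values.size - 1).to_f
end

# Population variance: divide by n. Shown only for contrast.
def population_variance_of(values)
  sum_of_squared_deviations(values) / values.size.to_f
end

data = [2, 4, 4, 4, 5, 5, 7, 9] # mean 5, squared deviations sum to 32

puts sample_variance_of(data)            # 32 / 7
puts population_variance_of(data)        # 32 / 8
puts Math.sqrt(sample_variance_of(data)) # sample standard deviation
```

The sample variance is always the larger of the two, and the gap shrinks as the collection grows.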