Class: Aggregate

Inherits:

Object


Defined in: lib/aggregate.rb

Overview

Implements aggregate statistics and maintains a configurable histogram for a set of given samples. Convenient for tracking high-throughput data.

Constant Summary

@@LOG_BUCKETS =

The number of buckets in the binary logarithmic histogram (low => 2**0, high => 2**@@LOG_BUCKETS)

Instance Attribute Summary

#count ⇒ Object readonly

The current number of samples.
#max ⇒ Object readonly

The maximum sample value.
#mean ⇒ Object readonly

The current average of all samples.
#min ⇒ Object readonly

The minimum sample value.
#outliers_high ⇒ Object readonly

The number of samples falling above the highest valued histogram bucket.
#outliers_low ⇒ Object readonly

The number of samples falling below the lowest valued histogram bucket.
#sum ⇒ Object readonly

The sum of all samples.

Instance Method Summary

#<<(data) ⇒ Object

Include a sample in the aggregate.
#each ⇒ Object

Iterate through each bucket in the histogram regardless of its contents.
#each_nonzero ⇒ Object

Iterate through only the buckets in the histogram that contain samples.
#initialize(low = nil, high = nil, width = nil) ⇒ Aggregate constructor

Create a new Aggregate that maintains a binary logarithmic histogram by default.
#skip_row(value_width) ⇒ Object

We denote empty buckets with a ‘~’.
#std_dev ⇒ Object

Calculate the standard deviation.
#to_s(columns = nil) ⇒ Object

Generate a pretty-printed ASCII representation of the histogram.

Constructor Details

#initialize(low = nil, high = nil, width = nil) ⇒ `Aggregate`

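Create a new Aggregate that maintains a binary logarithmic histogram by default. Specifying values for all three of low, high, and width configures the aggregate to maintain a linear histogram with (high - low)/width buckets instead. In that case high must be greater than low, width must not exceed high - low, and high - low must be a multiple of width; otherwise an ArgumentError is raised.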

# File 'lib/aggregate.rb', line 32

def initialize(low = nil, high = nil, width = nil)
  @count = 0
  @sum = 0.0
  @sum2 = 0.0
  @outliers_low = 0
  @outliers_high = 0

  # If the user asks we maintain a linear histogram where
  # values in the range [low, high) are bucketed in multiples
  # of width
  if (nil != low && nil != high && nil != width)

    # Validate linear specification
    if high <= low
      raise ArgumentError, "High bucket must be > Low bucket"
    end

    if high - low < width
      raise ArgumentError, "Histogram width must be <= histogram range"
    end

    if 0 != (high - low).modulo(width)
      raise ArgumentError, "Histogram range (high - low) must be a multiple of width"
    end

    @low = low
    @high = high
    @width = width
  else
    @low = 1
    @high = to_bucket(@@LOG_BUCKETS - 1)
  end

  #Initialize all buckets to 0
  @buckets = Array.new(bucket_count, 0)
end
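
The three checks in the constructor can be tried out in isolation. The following is a minimal sketch, not code from lib/aggregate.rb: linear_bucket_count is a hypothetical helper that repeats the same validations and returns the (high - low)/width bucket count given in the description above.

```ruby
# Hypothetical helper (not part of lib/aggregate.rb) that mirrors the
# constructor's linear-histogram checks and returns the bucket count.
def linear_bucket_count(low, high, width)
  raise ArgumentError, "High bucket must be > Low bucket" if high <= low
  raise ArgumentError, "Histogram width must be <= histogram range" if high - low < width
  unless (high - low).modulo(width).zero?
    raise ArgumentError, "Histogram range (high - low) must be a multiple of width"
  end

  # Number of equal-width buckets covering [low, high)
  (high - low) / width
end

# Ten buckets: [0, 10), [10, 20), ... [90, 100)
linear_bucket_count(0, 100, 10)

# A range of 100 is not a multiple of 30, so this raises ArgumentError
begin
  linear_bucket_count(0, 100, 30)
rescue ArgumentError => e
  warn e.message
end
```

Under these rules a specification such as (0, 100, 10) is accepted, while (10, 0, 5), (0, 5, 10), and (0, 100, 30) are each rejected by one of the three checks.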

Instance Attribute Details

#count ⇒ `Object` (readonly)

The current number of samples



# File 'lib/aggregate.rb', line 9

def count
  @count
end

#max ⇒ `Object` (readonly)

The maximum sample value



# File 'lib/aggregate.rb', line 12

def max
  @max
end

#mean ⇒ `Object` (readonly)

The current average of all samples



# File 'lib/aggregate.rb', line 6

def mean
  @mean
end

#min ⇒ `Object` (readonly)

The minimum sample value



# File 'lib/aggregate.rb', line 15

def min
  @min
end

#outliers_high ⇒ `Object` (readonly)

The number of samples falling above the highest valued histogram bucket



# File 'lib/aggregate.rb', line 24

def outliers_high
  @outliers_high
end

#outliers_low ⇒ `Object` (readonly)

The number of samples falling below the lowest valued histogram bucket



# File 'lib/aggregate.rb', line 21

def outliers_low
  @outliers_low
end

#sum ⇒ `Object` (readonly)

The sum of all samples



# File 'lib/aggregate.rb', line 18

def sum
  @sum
end

Instance Method Details

#<<(data) ⇒ `Object`

Include a sample in the aggregate

# File 'lib/aggregate.rb', line 70

def << data

  # Update min/max
  if 0 == @count
    @min = data
    @max = data
  else
    @max = [data, @max].max
    @min = [data, @min].min
  end

  # Update the running info
  @count += 1
  @sum += data
  @sum2 += (data * data)

  # Update the bucket
  @buckets[to_index(data)] += 1 unless outlier?(data)
end
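
The bookkeeping half of this method (min, max, count, sum, and sum of squares) can be illustrated without the histogram. The sketch below uses a hypothetical RunningStats class, not the Aggregate class itself, and it leaves the bucket update out entirely. Returning self from << is a choice made for this sketch only; the method shown above returns the value of its final expression instead.

```ruby
# Hypothetical stand-in for the bookkeeping portion of Aggregate#<<.
# Histogram buckets are deliberately left out.
class RunningStats
  attr_reader :count, :sum, :sum2, :min, :max

  def initialize
    @count = 0
    @sum = 0.0
    @sum2 = 0.0
  end

  def <<(data)
    # The first sample seeds both extremes; later samples only widen them
    @min = @count.zero? ? data : [data, @min].min
    @max = @count.zero? ? data : [data, @max].max

    # Running totals; sum2 (sum of squares) is what std_dev needs later
    @count += 1
    @sum += data
    @sum2 += data * data

    self # sketch-only: lets calls be chained as stats << 1 << 2
  end
end

stats = RunningStats.new
stats << 3 << 1 << 2
puts "count=#{stats.count} min=#{stats.min} max=#{stats.max} sum=#{stats.sum}"
```

Because only running totals are stored, memory use stays constant no matter how many samples are added.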

#each ⇒ `Object`

Iterate through each bucket in the histogram regardless of its contents

# File 'lib/aggregate.rb', line 188

def each
  @buckets.each_with_index do |count, index|
    yield(to_bucket(index), count)
  end
end

#each_nonzero ⇒ `Object`

Iterate through only the buckets in the histogram that contain samples

# File 'lib/aggregate.rb', line 196

def each_nonzero
  @buckets.each_with_index do |count, index|
    yield(to_bucket(index), count) if count != 0
  end
end
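
The difference between each and each_nonzero is easiest to see side by side. The sketch below is a hypothetical miniature (TinyHistogram is not part of the library), and it assumes a linear mapping of low + index * width from bucket index to bucket value, because the real to_bucket helper is not shown on this page.

```ruby
# Hypothetical miniature histogram with linear buckets. The mapping
# low + index * width is an assumption made for this sketch.
class TinyHistogram
  def initialize(counts, low: 0, width: 10)
    @buckets = counts
    @low = low
    @width = width
  end

  # Yield every bucket, including empty ones
  def each
    @buckets.each_with_index { |count, index| yield(@low + index * @width, count) }
  end

  # Yield only buckets that hold at least one sample
  def each_nonzero
    each { |bucket, count| yield(bucket, count) unless count.zero? }
  end
end

hist = TinyHistogram.new([4, 0, 7])

all = []
hist.each { |bucket, count| all << [bucket, count] }

nonzero = []
hist.each_nonzero { |bucket, count| nonzero << [bucket, count] }

p all     # every bucket, including the empty one at 10
p nonzero # the empty bucket is skipped
```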

#skip_row(value_width) ⇒ `Object`

We denote empty buckets with a ‘~’



# File 'lib/aggregate.rb', line 150

def skip_row(value_width)
  sprintf("%#{value_width}s ~\n", " ")
end

#std_dev ⇒ `Object`

Calculate the sample standard deviation (dividing by count - 1) from the running sum and sum of squares



# File 'lib/aggregate.rb', line 95

def std_dev
  Math.sqrt((@sum2.to_f - ((@sum.to_f * @sum.to_f)/@count.to_f)) / (@count.to_f - 1))
end
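
Written out, the expression above is sqrt((sum2 - sum*sum/n) / (n - 1)), where sum is the total of the samples, sum2 is the total of their squares, and n is the sample count. The sketch below applies the same formula to a plain array. sample_std_dev is a hypothetical helper, and returning nil for fewer than two samples is an assumption of this sketch, not behavior of the method shown above.

```ruby
# Sample standard deviation from a sum and a sum of squares, using the
# same formula as std_dev above. Returning nil for fewer than two
# samples is an assumption of this sketch.
def sample_std_dev(samples)
  n = samples.length
  return nil if n < 2

  sum = samples.sum.to_f
  sum2 = samples.sum { |x| x * x }.to_f

  Math.sqrt((sum2 - (sum * sum) / n) / (n - 1))
end

# sum = 40, sum2 = 232, n = 8, so the variance is (232 - 200) / 7 = 32/7
puts sample_std_dev([2, 4, 4, 4, 5, 5, 7, 9])
```

This one-pass form needs only two running totals, which suits high-throughput data, but it can lose precision when the mean is large relative to the spread of the samples.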

#to_s(columns = nil) ⇒ `Object`

Generate a pretty-printed ASCII representation of the histogram

# File 'lib/aggregate.rb', line 108

def to_s(columns=nil)

  #default to an 80 column terminal, don't support < 80 for now
  if nil == columns
    columns = 80
  else
    raise ArgumentError if columns < 80
  end

  #Find the largest bucket and create an array of the rows we intend to print
  disp_buckets = Array.new
  max_count = 0
  total = 0
  @buckets.each_with_index do |count, idx|
    next if 0 == count
    max_count = [max_count, count].max
    disp_buckets << [idx, to_bucket(idx), count]
    total += count
  end

  #XXX: Better to print just header --> footer
  return "Empty histogram" if 0 == disp_buckets.length

  #Figure out how wide the value and count columns need to be based on their
  #largest respective numbers
  value_str = "value"
  count_str = "count"
  total_str = "Total"
  value_width = [disp_buckets.last[1].to_s.length, value_str.length].max
  value_width = [value_width, total_str.length].max
  count_width = [total.to_s.length, count_str.length].max
  max_bar_width  = columns - (value_width + " |".length + "| ".length + count_width)

  #Determine the value of a '@'
  weight = [max_count.to_f/max_bar_width.to_f, 1.0].max

  #format the header
  histogram = sprintf("%#{value_width}s |", value_str)
  max_bar_width.times { histogram << "-"}
  histogram << sprintf("| %#{count_width}s\n", count_str)

  # We denote empty buckets with a '~'
  def skip_row(value_width)
    sprintf("%#{value_width}s ~\n", " ")
  end

  #Loop through each bucket to be displayed and output the correct number
  prev_index = disp_buckets[0][0] - 1

  disp_buckets.each do |x|
    #Denote skipped empty buckets with a ~
    histogram << skip_row(value_width) unless prev_index == x[0] - 1
    prev_index = x[0]

    #Add the value
    row = sprintf("%#{value_width}d |", x[1])

    #Add the bar
    bar_size = (x[2]/weight).to_i
    bar_size.times { row += "@"}
    (max_bar_width - bar_size).times { row += " " }

    #Add the count
    row << sprintf("| %#{count_width}d\n", x[2])

    #Append the finished row onto the histogram
    histogram << row
  end

  #End the table
  histogram << skip_row(value_width) if disp_buckets.last[0] != bucket_count-1
  histogram << sprintf("%#{value_width}s", "Total")
  histogram << " |"
  max_bar_width.times {histogram << "-"}
  histogram << "| "
  histogram << sprintf("%#{count_width}d\n", total)
end
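
The bar scaling inside to_s is compact, so a standalone sketch may help. bar_for below is a hypothetical helper that repeats only the weight and bar_size arithmetic from the method above: each '@' stands for weight samples, and weight never drops below 1.0, so small counts are drawn one '@' per sample.

```ruby
# Hypothetical helper reproducing the bar-scaling arithmetic from to_s.
def bar_for(count, max_count, max_bar_width)
  # Value of a single '@'; at least 1.0 so that bars are never stretched
  weight = [max_count.to_f / max_bar_width, 1.0].max

  # Truncate to whole characters, then pad the bar to the full width
  bar_size = (count / weight).to_i
  ("@" * bar_size).ljust(max_bar_width)
end

puts "|#{bar_for(200, 200, 10)}|" # the largest bucket fills the bar
puts "|#{bar_for(50, 200, 10)}|"  # weight is 20.0, so 50 samples give 2 '@'
puts "|#{bar_for(3, 5, 10)}|"     # weight is clamped to 1.0, giving 3 '@'
```

Because bar_size is truncated, a non-empty bucket holding fewer than weight samples is drawn with an empty bar; its count still appears in the count column.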
