Class: Wukong::Processor::Bin

Inherits:
Accumulator
Includes:
DynamicGet
Defined in:
lib/wukong/widget/reducers/bin.rb

Overview

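A widget for binning input data. Will emit one record per bin: the bin's lower edge, its upper edge, and the count of values that fall within it.

This widget works nicely with the Extract widget at the end of a data flow.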

Examples:

Binning some input data on the command-line


$ cat input
0.94628
0.03480
0.74418
...
$ cat input | wu-local bin --to=tsv

0.02935	0.12638500000000003	7
0.12638500000000003	0.22342000000000004	11
0.22342000000000004	0.32045500000000005	15

Control how the bins are defined and displayed


$ cat input | wu-local bin --min=0.0 --max=1.0 --num_bins=10 --precision=1 --to=tsv
0.0	0.1	10.0
0.1	0.2	12.0
0.2	0.3	8.0
...

Include an additional column of normalized (fractional) counts


$ cat input | wu-local bin --min=0.0 --max=1.0 --num_bins=10 --precision=1 --normalize --to=tsv
0.0	0.1	10.0	0.3
0.1	0.2	12.0	0.36
0.2	0.3	8.0	0.24
...
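
The normalized column appears to be each bin's count divided by the total number of accumulated values, which matches the `count.to_f / total_count.to_f` expression in `#finalize`. A minimal Ruby sketch of that arithmetic, using made-up counts rather than the data above:

```ruby
# Hypothetical bin counts; not taken from the example output above.
counts = [10, 12, 8, 20]
total  = counts.sum

# Fraction of all values that landed in each bin.
normalized = counts.map { |count| count.to_f / total }

counts.zip(normalized).each do |count, fraction|
  puts [count, fraction.round(2)].join("\t")
end
```

By construction the fractions sum to 1 when every accumulated value falls inside some bin.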

Make a log-log histogram


$ cat input | wu-local bin --log_bins --log_counts --to=tsv
1.000	3.162	1.099
3.162	10.000	1.946
10.000	31.623	3.045
31.623	100.000	4.234
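
The edges in this output are evenly spaced on a logarithmic scale (each is roughly 3.162 times the previous one). How `--log_bins` computes its edges internally is not shown on this page, so the following is only a sketch of one way to produce such edges; `log_spaced_edges` is a hypothetical helper, not part of Wukong:

```ruby
# Evenly spaced edges in log space between min and max (both must be > 0).
# Illustrative helper only; not Wukong's implementation.
def log_spaced_edges(min, max, num_bins)
  log_min = Math.log10(min)
  step    = (Math.log10(max) - log_min) / num_bins
  (0..num_bins).map { |i| 10.0**(log_min + i * step) }
end

edges = log_spaced_edges(1.0, 100.0, 4)

# Print each bin as a lower/upper pair, like the first two columns above.
edges.each_cons(2) { |lower, upper| puts format("%.3f\t%.3f", lower, upper) }
```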

Use the bin at the end of a dataflow


Wukong.processor(:bins_at_end) do
  ... | extract(part: 'age') | bin(num_bins: 10) | to_tsv
end

Constant Summary

Constants inherited from Wukong::Processor

SerializerError

Instance Attribute Summary

Attributes inherited from Accumulator

#group, #key

Attributes included from Hanuman::StageInstanceMethods

#graph

Instance Method Summary

Methods included from DynamicGet

#get, #get_nested, included

Methods inherited from Accumulator

#process, #start

Methods inherited from Wukong::Processor

configure, description, #perform_action, #process, #receive_action, #stop

Methods included from Logging

included

Methods inherited from Hanuman::Stage

#clone

Methods included from Hanuman::StageClassMethods

#builder, #label, #register, #set_builder

Methods included from Hanuman::StageInstanceMethods

#add_link, #linkable_name, #root

Instance Attribute Details

#bins ⇒ Object

The bins (pairs of edges).



# File 'lib/wukong/widget/reducers/bin.rb', line 129

def bins
  @bins
end

#counts ⇒ Object

The value counts within each bin.



# File 'lib/wukong/widget/reducers/bin.rb', line 132

def counts
  @counts
end

#total_count ⇒ Object

The total number of accumulated values.



# File 'lib/wukong/widget/reducers/bin.rb', line 135

def total_count
  @total_count
end

#values ⇒ Object

The accumulated values.



# File 'lib/wukong/widget/reducers/bin.rb', line 126

def values
  @values
end

Instance Method Details

#accumulate(record) ⇒ Object

Accumulates a single record.

First we extract the value from the record. If we already have bins, add the value to the appropriate bin. Otherwise, store the value, updating any properties like max or min as necessary.

Parameters:

  • record (Object)


# File 'lib/wukong/widget/reducers/bin.rb', line 169

def accumulate record
  value = (value_from(record) or return)
  self.total_count += 1
  if bins?
    add_to_some_bin(value)
  else
    self.min ||= value
    self.min = value if value < min
    self.max ||= value
    self.max = value if value > max
    self.values << value
  end
end
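
When no bins exist yet, the `else` branch above buffers each value while tracking the running minimum and maximum, so that edges can be derived later. A self-contained sketch of that pattern follows; `ValueBuffer` is a hypothetical stand-in and not a Wukong class:

```ruby
# Hypothetical stand-in for the "no bins yet" branch of #accumulate.
class ValueBuffer
  attr_reader :min, :max, :values, :total_count

  def initialize
    @min         = nil
    @max         = nil
    @values      = []
    @total_count = 0
  end

  # Store the value and widen the observed range if needed.
  def accumulate(value)
    @total_count += 1
    @min = value if @min.nil? || value < @min
    @max = value if @max.nil? || value > @max
    @values << value
  end
end

buffer = ValueBuffer.new
[0.94628, 0.03480, 0.74418].each { |v| buffer.accumulate(v) }
puts "min=#{buffer.min} max=#{buffer.max} n=#{buffer.total_count}"
```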

#bin! ⇒ Object

Bins the accumulated values.

# File 'lib/wukong/widget/reducers/bin.rb', line 233

def bin!
  set_num_bins_from_total_count! unless self.num_bins
  set_edges_from_min_max_and_num_bins!
  until values.empty?
    value = values.shift
    add_to_some_bin(value.to_f) if value
  end
end
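
The helpers `set_edges_from_min_max_and_num_bins!` and `add_to_some_bin` are not shown on this page. As a rough illustration of what deferred binning involves, the sketch below builds equal-width edges and counts values per bin. The helper names and the choice to place a value equal to `max` in the last bin are assumptions, not Wukong's documented behavior:

```ruby
# Illustrative equal-width binning; these helper names are not Wukong's.
def equal_width_edges(min, max, num_bins)
  width = (max - min) / num_bins.to_f
  (0..num_bins).map { |i| min + i * width }
end

def bin_counts(values, min, max, num_bins)
  width  = (max - min) / num_bins.to_f
  counts = Array.new(num_bins, 0)
  values.each do |value|
    next if value < min || value > max
    # Clamp so that value == max lands in the last bin (an assumption).
    index = [((value - min) / width).floor, num_bins - 1].min
    counts[index] += 1
  end
  counts
end

values = [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]
edges  = equal_width_edges(0.0, 1.0, 4)
counts = bin_counts(values, 0.0, 1.0, 4)
p edges
p counts
```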

#bins? ⇒ true, false

Does this widget have a populated list of bins?

Returns:

  • (true, false)


# File 'lib/wukong/widget/reducers/bin.rb', line 245

def bins?
  bins && (! bins.empty?)
end

#finalize {|lower, upper, count, normalized_count| ... } ⇒ Object

Emits each bin with its edges and count. Adds the normalized count if requested.

Bins the values first if that has not already been done on the fly.

Yields:

  • (lower, upper, count, normalized_count)

Yield Parameters:

  • lower (String)

    the lower (left) edge of the bin

  • upper (String)

    the upper (right) edge of the bin

  • count (String)

    the (logarithmic if requested) count of values in the bin

  • normalized_count (String)

    the (logarithmic if requested) normalized count of values in the bin if requested



# File 'lib/wukong/widget/reducers/bin.rb', line 193

def finalize
  bin! unless bins?
  counts.each_with_index do |count, index|
    bin  = bins[index]
    bin << log_count_if_necessary(count)
    if normalize && total_count > 0
      bin << log_count_if_necessary((count.to_f / total_count.to_f))
    end
    yield bin.map { |n| format(n) }
  end
end

#format(n) ⇒ String

Formats n so it's readable and compact.

If this widget is given an explicit format_string then it will be used here (the value of format_string should have a slot for a float).

Otherwise, zero is printed as 0.0, very large (|n| > 1000) or very small (|n| < 0.001) numbers are formatted in scientific notation, and "medium numbers" (0.001 ≤ |n| ≤ 1000) are printed in fixed-point notation, the latter two with the given precision.

Parameters:

  • n (Float)

Returns:

  • (String)



# File 'lib/wukong/widget/reducers/bin.rb', line 217

def format n
  case
  when format_string
    format_string % n
  when n == 0.0
    '0.0'
  when n.abs > 1000 || n.abs < 0.001
    "%#{precision}.#{precision}E" % n
  else
    "%#{precision}.#{precision}f" % n
  end
end
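
To see the branches in action outside the widget, the logic can be copied into a standalone function. In this sketch `format_number` is a hypothetical name, and `precision` and `format_string` are passed as keyword arguments instead of being read from the widget; the default precision of 3 is an assumption made for the example:

```ruby
# Standalone copy of the branching in #format.
# precision: 3 is an assumed default for illustration only.
def format_number(n, precision: 3, format_string: nil)
  if format_string
    format_string % n
  elsif n == 0.0
    '0.0'
  elsif n.abs > 1000 || n.abs < 0.001
    "%#{precision}.#{precision}E" % n
  else
    "%#{precision}.#{precision}f" % n
  end
end

puts format_number(0.5)                          # fixed-point branch
puts format_number(25_000.0)                     # scientific branch
puts format_number(0.0)                          # zero branch
puts format_number(7, format_string: '%05.1f')   # explicit format string
```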

#get_key(record) ⇒ :__first__group__

Keep all records in the same "group", at least from the Accumulator's perspective.

Parameters:

  • record (Object)

Returns:

  • (:__first__group__)


# File 'lib/wukong/widget/reducers/bin.rb', line 157

def get_key record
  :__first__group__
end

#log_count_if_necessary(val) ⇒ Float

Returns val, taking a logarithm to the appropriate base if required.

Parameters:

  • val (Float)

Returns:

  • (Float)

    the original value or its logarithm if required



# File 'lib/wukong/widget/reducers/bin.rb', line 264

def log_count_if_necessary val
  log_counts ? log_if_possible(val) : val
end

#log_if_possible(val) ⇒ Float

Returns the logarithm of the given val if possible.

Returns the original value unchanged if it is zero or negative.

Parameters:

  • val (Float)

Returns:

  • (Float)


# File 'lib/wukong/widget/reducers/bin.rb', line 274

def log_if_possible val
  val > 0 ? Math.log(val, base) : val
end
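
A quick standalone check of the guard, with `base` passed as an argument here rather than read from the widget (that signature change is made only for this example):

```ruby
# Same guard as #log_if_possible, with the base passed explicitly.
def log_if_possible(val, base)
  val > 0 ? Math.log(val, base) : val
end

puts log_if_possible(8.0, 2)    # logarithm of 8 to base 2, close to 3.0
puts log_if_possible(0.0, 2)    # zero is returned unchanged
puts log_if_possible(-5.0, 2)   # negative values are returned unchanged
```

Returning non-positive values unchanged avoids the `Math::DomainError` that `Math.log` raises for negative input, but it also means such entries stay on a linear scale in otherwise logarithmic output.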

#setup ⇒ Object

Initializes all storage. If we can calculate bins in advance, do so now.



# File 'lib/wukong/widget/reducers/bin.rb', line 139

def setup
  super()
  self.values      = []
  self.bins        = []
  self.counts      = []
  self.total_count = 0
  if edges.nil?
    set_edges_from_min_max_and_num_bins! if min && max && num_bins
  else
    set_bins_and_counts_from_edges!
  end
end

#value_from(record) ⇒ Float?

Get a value from a given record.

Parameters:

  • record (Object)

Returns:

  • (Float, nil)


# File 'lib/wukong/widget/reducers/bin.rb', line 253

def value_from record
  val = get(self.by, record)
  return unless val
  val.to_f rescue nil
end
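
The coercion at the end of this method has a few edge cases worth knowing. The sketch below replaces `DynamicGet#get` with a plain Hash lookup (an assumption made to keep the example self-contained) and takes the field name as a second argument:

```ruby
# Hypothetical stand-in: a Hash lookup replaces DynamicGet#get.
def value_from(record, field)
  val = record[field]
  return unless val
  val.to_f rescue nil
end

p value_from({ 'age' => '42' }, 'age')        # numeric string becomes a Float
p value_from({}, 'age')                       # missing field gives nil
p value_from({ 'age' => Object.new }, 'age')  # no #to_f, so rescued to nil
p value_from({ 'age' => 'abc' }, 'age')       # String#to_f yields 0.0, not nil
```

Because `String#to_f` returns 0.0 for non-numeric text instead of raising, such records are counted as zeros rather than skipped.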