Class: EncodingEstimator::ModelBuilder

Inherits:
Object
Defined in:
lib/encoding_estimator/builder/model_builder.rb

Overview

Builds a language model (character count statistics) from a single file.

Constructor Details

#initialize(filename) ⇒ ModelBuilder

Create a new object for a given file

Parameters:

  • filename (String)

    Path to the file to learn statistics from



# File 'lib/encoding_estimator/builder/model_builder.rb', line 12

def initialize( filename )
  @filename = filename
end

Instance Attribute Details

#filename ⇒ Object (readonly)

Returns the value of attribute filename.



# File 'lib/encoding_estimator/builder/model_builder.rb', line 7

def filename
  @filename
end

Class Method Details

.join_and_postprocess(stats_collection, min_char_threshold = 0.0001) ⇒ Hash

Combine multiple character count statistics into a single table. Characters whose combined count falls below a threshold, relative to the most frequent character, are dropped. The remaining counts are scaled linearly so that the most frequent character scores 10.

Parameters:

  • stats_collection (Array<Hash>)

    Array of character count statistics as returned by ModelBuilder#execute

  • min_char_threshold (Float) (defaults to: 0.0001)

    Threshold that decides which characters to include (a character is kept if count/max_count >= threshold)

Returns:

  • (Hash)

    Character scores on a linear scale, where the most frequent character scores 10



# File 'lib/encoding_estimator/builder/model_builder.rb', line 36

def self.join_and_postprocess( stats_collection, min_char_threshold = 0.0001 )
  stats     = {}
  log_stats = {}

  # Join all stats
  stats_collection.each do |stat|
    stat.each { |char, count| stats[char] = stats.fetch(char, 0) + count }
  end

  max_count = stats.values.max
  stats.each do |char, count|
    next if count < max_count * min_char_threshold

    log_stats[ char ] = ( 10.0 * count / max_count ).round( 6 )
  end

  log_stats
end
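
The joining, thresholding and scaling steps in the listing above can be followed with a small worked example. The sketch below is a standalone re-statement of that logic, not the library's own code: the name join_and_scale is hypothetical, and the early return for an empty collection is an addition made for the sketch.

```ruby
# Standalone sketch of the join / threshold / scale steps.
# join_and_scale is a hypothetical name used only for illustration.
def join_and_scale(stats_collection, min_char_threshold = 0.0001)
  # Step 1: sum the counts of each character across all tables.
  joined = Hash.new(0)
  stats_collection.each do |stat|
    stat.each { |char, count| joined[char] += count }
  end
  return {} if joined.empty? # assumption: nothing to scale

  # Step 2: drop rare characters, then scale so the maximum becomes 10.
  max_count = joined.values.max
  joined.each_with_object({}) do |(char, count), scaled|
    next if count < max_count * min_char_threshold

    scaled[char] = (10.0 * count / max_count).round(6)
  end
end

a = { 'e' => 60, 't' => 30 }
b = { 'e' => 40, 't' => 20, 'z' => 1 }

# Joined counts are e=100, t=50, z=1. With a threshold of 0.02 the
# cutoff is 2, so 'z' is dropped; 'e' scores 10.0 and 't' scores 5.0.
result = join_and_scale([a, b], 0.02)
```

Because the score is 10 * count / max_count, the smallest score that can survive is 10 * min_char_threshold, which is 0.001 with the default threshold.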

Instance Method Details

#execute ⇒ Hash

Count all characters in the file

Returns:

  • (Hash)

    Hash mapping each character found in the file to the number of occurrences



# File 'lib/encoding_estimator/builder/model_builder.rb', line 19

def execute
  content = load_content

  stats = {}
  content.each_char { |c| stats[c] = stats.fetch(c, 0) + 1 }

  stats
end
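
The counting loop above can be tried without the library. The sketch below is a minimal illustration only: count_chars is a hypothetical name, and reading the file as UTF-8 with File.read is an assumption standing in for the private load_content helper, which this page does not show.

```ruby
require 'tempfile'

# Hypothetical stand-in for #execute; the real method gets its content
# from load_content, whose implementation is not shown on this page.
def count_chars(path)
  content = File.read(path, encoding: 'UTF-8') # assumption: UTF-8 input

  # Same counting idea as the listing: one entry per distinct character.
  stats = Hash.new(0)
  content.each_char { |c| stats[c] += 1 }
  stats
end

# Write a tiny sample file and count its characters.
counts = Tempfile.create('sample') do |f|
  f.write("abba\n")
  f.flush
  count_chars(f.path)
end
```

For the sample text "abba\n" the table has three entries: 'a' and 'b' with a count of 2 each, and the newline with a count of 1. Whitespace and control characters are counted like any other character.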