Class: EncodingEstimator::ModelBuilder
- Inherits: Object
- Defined in: lib/encoding_estimator/builder/model_builder.rb
Overview
Builds a language model (character count statistics) from a single file.
Instance Attribute Summary
- #filename ⇒ Object (readonly)
  Returns the value of attribute filename.
Class Method Summary
- .join_and_postprocess(stats_collection, min_char_threshold = 0.0001) ⇒ Hash
  Combine multiple character count statistics into a single table.
Instance Method Summary
- #execute ⇒ Hash
  Count all characters in the file.
- #initialize(filename) ⇒ ModelBuilder (constructor)
  Create a new object for a given file.
Constructor Details
#initialize(filename) ⇒ ModelBuilder
Create a new object for a given file
# File 'lib/encoding_estimator/builder/model_builder.rb', line 12
def initialize( filename )
  @filename = filename
end
Instance Attribute Details
#filename ⇒ Object (readonly)
Returns the value of attribute filename.
# File 'lib/encoding_estimator/builder/model_builder.rb', line 7
def filename
  @filename
end
Class Method Details
.join_and_postprocess(stats_collection, min_char_threshold = 0.0001) ⇒ Hash
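Combine multiple character count statistics into a single table. Characters whose combined count is less than min_char_threshold times the count of the most frequent character are dropped. The remaining counts are scaled linearly so that the most frequent character scores 10, and each score is rounded to six decimal places.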
# File 'lib/encoding_estimator/builder/model_builder.rb', line 36
def self.join_and_postprocess( stats_collection, min_char_threshold = 0.0001 )
  stats = {}
  log_stats = {}

  # Join all stats
  stats_collection.each do |stat|
    stat.each { |char, count| stats[char] = stats.fetch(char, 0) + count }
  end

  max_count = stats.values.max

  stats.each do |char, count|
    next if count < max_count * min_char_threshold

    log_stats[ char ] = ( 10.0 * count / max_count ).round( 6 )
  end

  log_stats
end
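To make the merge, threshold, and scaling steps concrete, here is a minimal stand-alone sketch that follows the same logic as the listing above on two toy count tables. The helper name join_and_scale and the sample hashes are hypothetical and chosen only for illustration; the sketch does not load the gem.

```ruby
# Hypothetical stand-alone helper that mirrors the steps of
# ModelBuilder.join_and_postprocess; it is not part of the gem.
def join_and_scale(stats_collection, min_char_threshold = 0.0001)
  # Sum the counts of every character across all input tables.
  totals = Hash.new(0)
  stats_collection.each do |stat|
    stat.each { |char, count| totals[char] += count }
  end

  max_count = totals.values.max
  scores = {}
  totals.each do |char, count|
    # Drop characters that are rare relative to the most frequent one.
    next if count < max_count * min_char_threshold

    # Linear scaling: the most frequent character scores 10.
    scores[char] = (10.0 * count / max_count).round(6)
  end
  scores
end

# Toy per-file statistics (illustrative values only).
file_a = { 'a' => 600, 'b' => 200 }
file_b = { 'a' => 400, 'b' => 50, 'z' => 1 }

# A threshold of 0.01 removes every character with fewer than
# 1000 * 0.01 = 10 combined occurrences.
scores = join_and_scale([file_a, file_b], 0.01)
```

With these inputs the combined counts are a = 1000, b = 250, and z = 1, so z falls below the cutoff while a and b are scaled to 10.0 and 2.5.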
Instance Method Details
#execute ⇒ Hash
Count all characters in the file.
# File 'lib/encoding_estimator/builder/model_builder.rb', line 19
def execute
  content = load_content
  stats = {}

  content.each_char { |c| stats[c] = stats.fetch(c, 0) + 1 }

  stats
end
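The counting step can be tried in isolation with the sketch below. The helper count_chars is hypothetical, and it assumes the file is read as a single UTF-8 string; how the private load_content method actually reads the file is not shown in the listing above.

```ruby
require 'tempfile'

# Hypothetical helper: counts every character in a file.
# Assumption: the content is read as one UTF-8 string, which may
# differ from what the gem's load_content does.
def count_chars(path)
  content = File.read(path, encoding: 'UTF-8')
  stats = {}
  # Same counting idiom as ModelBuilder#execute.
  content.each_char { |c| stats[c] = stats.fetch(c, 0) + 1 }
  stats
end

# Write a small sample file, count its characters, then let
# Tempfile.create remove the file when the block ends.
counts = Tempfile.create('sample') do |file|
  file.write("abba \u00e9") # "abba é": multi-byte characters count once
  file.flush
  count_chars(file.path)
end
```

Because each_char iterates over characters rather than bytes, the two-byte UTF-8 sequence for é is counted as a single entry.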