Module: EncodingEstimator

Defined in: lib/encoding_estimator.rb,
lib/encoding_estimator/version.rb,
lib/encoding_estimator/detector.rb,
lib/encoding_estimator/detection.rb,
lib/encoding_estimator/conversion.rb,
lib/encoding_estimator/distribution.rb,
lib/encoding_estimator/language_model.rb,
lib/encoding_estimator/parallel_support.rb,
lib/encoding_estimator/builder/model_builder.rb,
lib/encoding_estimator/builder/parallel_model_builder.rb

Defined Under Namespace

Classes: CDCombination, Conversion, Detection, Detector, Distribution, LanguageModel, ModelBuilder, ParallelModelBuilder, ParallelSupport, RangeScale, SingleDetectionResult

Constant Summary

VERSION = '0.2.0'

Class Method Summary

.detect(data, config) ⇒ EncodingEstimator::Detection

Let the EncodingEstimator detect how the input string is encoded.
.ensure_utf8(data, config = {}) ⇒ String

Convert a string to a UTF-8 string by performing the conversion that is automatically detected by EncodingEstimator.

Class Method Details

.detect(data, config) ⇒ `EncodingEstimator::Detection`

Let the EncodingEstimator detect how the input string is encoded.

Parameters:

data (String) —

String whose encoding should be detected
languages (Array<Symbol>) —

List of languages the data might originate from, given as two-letter codes, e.g. [:de, :en]
encodings (Array<String>) —

List of encodings to test, e.g. ['UTF-8', 'ISO-8859-1']. The order defines the priority when choosing between encodings with the same detection score
operations (Array<Symbol>) —

Operations to test for each encoding: encoding from UTF-8 to it, or decoding from it to UTF-8
penalty (Float) —

Penalty threshold that defines when characters are weighted negatively
num_cores (Integer) —

Number of threads to use for detection. Pass nil to use the single-threaded implementation
include_default (Boolean) —

Include the “keep as is” conversion when testing, e.g. to check whether the string is already UTF-8 encoded

Returns:

(EncodingEstimator::Detection) —

Detection result with scores for all conversions
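
The encodings parameter states that list order decides between encodings with the same detection score. The rule can be sketched in plain Ruby as follows. This is an illustration only, not the gem's implementation: pick_encoding is a hypothetical helper and the scores are made up.

```ruby
# Hypothetical helper, NOT part of EncodingEstimator: choose the encoding with
# the highest score; on a tie, the one listed first by the caller wins.
def pick_encoding(encodings, scores)
  encodings.each_with_index
           .min_by { |name, index| [-scores.fetch(name), index] }
           .first
end

# Made-up scores for illustration; two encodings tie at 0.9.
encodings = %w[UTF-8 ISO-8859-1 windows-1251]
scores    = { 'UTF-8' => 0.4, 'ISO-8859-1' => 0.9, 'windows-1251' => 0.9 }

best = pick_encoding(encodings, scores)
puts best # prints ISO-8859-1, because it precedes windows-1251 in the list
```

Under this reading, putting the most plausible encoding first in the encodings list makes it the winner whenever scores are equal.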

# File 'lib/encoding_estimator.rb', line 50

def EncodingEstimator.detect( data, config )
  params = {
      languages:       [ :de, :en ],
      encodings:       %w(iso-8859-1 utf-16le windows-1251),
      operations:      [Conversion::Operation::DECODE],
      include_default: true,
      penalty:         0.01,
      num_cores:       nil
  }.merge config

  Detector.new(
      Conversion.generate( params[ :encodings ], params[ :operations ], params[ :include_default ] ),
      params[ :languages ].map { |l| EncodingEstimator::LanguageModel.new( l ) }, params[ :penalty ], params[:num_cores]
  ).detect data
end
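
The listing above builds its parameters with Hash#merge, so any key passed in config replaces the corresponding default and everything else keeps its default value. A minimal sketch of that behaviour, runnable without the gem: the defaults are copied from the listing, the symbol :decode is a placeholder standing in for Conversion::Operation::DECODE, and effective_params is a hypothetical name.

```ruby
# Defaults copied from the listing above. :decode is a placeholder for
# Conversion::Operation::DECODE so that this sketch runs without the gem.
DEFAULTS = {
  languages:       [:de, :en],
  encodings:       %w[iso-8859-1 utf-16le windows-1251],
  operations:      [:decode],
  include_default: true,
  penalty:         0.01,
  num_cores:       nil
}.freeze

# Hypothetical helper mirroring the `{ ... }.merge config` step.
def effective_params(config)
  DEFAULTS.merge(config)
end

params = effective_params(languages: [:fr], num_cores: 4)
puts params[:languages].inspect # [:fr]  (overridden)
puts params[:penalty]           # 0.01   (default kept)
```

Two consequences of using a plain merge: a value is replaced as a whole (passing encodings: ['UTF-8'] drops the three default encodings rather than adding to them), and keys must be symbols, since a string key such as 'penalty' would be added next to :penalty instead of overriding it.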

.ensure_utf8(data, config = {}) ⇒ `String`

Convert a string to a UTF-8 string by performing the conversion that is automatically detected by EncodingEstimator.
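
To make concrete what such a conversion amounts to, the sketch below performs a single "decode" step by hand using only core Ruby String methods. It assumes the bytes are already known to be ISO-8859-1, which is the part the detection would otherwise decide, and it is not the gem's implementation.

```ruby
# "Straße" as ISO-8859-1 bytes (0xDF is "ß"), tagged as binary data.
raw = "Stra\xDFe".b

# Read as UTF-8, these bytes are invalid: 0xDF starts a two-byte sequence
# but is followed by a plain ASCII "e".
puts raw.dup.force_encoding('UTF-8').valid_encoding? # false

# Decode step: reinterpret the bytes as ISO-8859-1, then transcode to UTF-8.
utf8 = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts utf8.encoding        # UTF-8
puts utf8.valid_encoding? # true
puts utf8.bytes.inspect   # [83, 116, 114, 97, 195, 159, 101]
```

The single byte 0xDF becomes the two-byte UTF-8 sequence 0xC3 0x9F, so the result is one byte longer than the input while containing the same six characters.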