Module: EncodingEstimator

Defined in:
lib/encoding_estimator.rb,
lib/encoding_estimator/version.rb,
lib/encoding_estimator/detector.rb,
lib/encoding_estimator/detection.rb,
lib/encoding_estimator/conversion.rb,
lib/encoding_estimator/distribution.rb,
lib/encoding_estimator/language_model.rb,
lib/encoding_estimator/parallel_support.rb,
lib/encoding_estimator/builder/model_builder.rb,
lib/encoding_estimator/builder/parallel_model_builder.rb

Defined Under Namespace

Classes: CDCombination, Conversion, Detection, Detector, Distribution, LanguageModel, ModelBuilder, ParallelModelBuilder, ParallelSupport, RangeScale, SingleDetectionResult

Constant Summary

VERSION =
'0.2.0'

Class Method Details

.detect(data, config) ⇒ EncodingEstimator::Detection

Detect how the input string is encoded. The options listed below are keys of the config hash; any key that is supplied overrides the corresponding default.

Parameters:

  • data (String)

    String whose encoding should be detected

  • languages (Array<Symbol>)

    List of languages the data might originate from, given as two-letter codes, e.g. [:de, :en]

  • encodings (Array<String>)

    List of encodings to test, e.g. ['UTF-8', 'ISO-8859-1']. The order defines the priority when choosing between encodings with the same detection score

  • operations (Array<Symbol>)

    Choose which operations (encoding to/decoding from an encoding to UTF-8) to test

  • penalty (Float)

    Penalty threshold that determines when characters are weighted negatively

  • num_cores (Integer)

    Number of threads to use for detection. Pass nil to use the single-threaded implementation

  • include_default (Boolean)

    Include “keep as is” conversion when testing, e.g. check if the string is already UTF-8 encoded

Returns:

  • (EncodingEstimator::Detection)

# File 'lib/encoding_estimator.rb', line 50

def EncodingEstimator.detect( data, config )
  params = {
      languages:       [ :de, :en ],
      encodings:       %w(iso-8859-1 utf-16le windows-1251),
      operations:      [Conversion::Operation::DECODE],
      include_default: true,
      penalty:         0.01,
      num_cores:       nil
  }.merge config

  Detector.new(
      Conversion.generate( params[ :encodings ], params[ :operations ], params[ :include_default ] ),
      params[ :languages ].map { |l| EncodingEstimator::LanguageModel.new( l ) }, params[ :penalty ], params[:num_cores]
  ).detect data
end
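
The `{ ... }.merge config` step above determines which settings the detector actually receives. The following is a minimal, standalone sketch of that pattern using only core Ruby. The `ConfigSketch` module and its `resolve` method are hypothetical names introduced for illustration, and `:decode` is a placeholder for `Conversion::Operation::DECODE`; none of this is part of the library.

```ruby
# Hypothetical illustration of the defaults-plus-overrides pattern.
module ConfigSketch
  # Mirrors the default values shown in EncodingEstimator.detect.
  DEFAULTS = {
    languages:       [:de, :en],
    encodings:       %w(iso-8859-1 utf-16le windows-1251),
    operations:      [:decode], # placeholder for Conversion::Operation::DECODE
    include_default: true,
    penalty:         0.01,
    num_cores:       nil
  }.freeze

  # Hash#merge returns a new hash in which keys from `config` win over
  # the defaults; keys that are not supplied keep their default value.
  def self.resolve(config = {})
    DEFAULTS.merge(config)
  end
end

params = ConfigSketch.resolve(languages: [:en], num_cores: 2)
params[:languages] # overridden by the caller
params[:penalty]   # still the default
```

Because `Hash#merge` replaces whole values, passing `languages: [:en]` replaces the default list rather than appending to it.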

.ensure_utf8(data, config = {}) ⇒ String

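Convert a string to UTF-8 by performing the conversion that EncodingEstimator detects automatically. The options listed below are keys of the config hash; any key that is supplied overrides the corresponding default.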

Parameters:

  • data (String)

    String to convert to UTF-8

  • languages (Array<Symbol|String>)

    List of languages the data might originate from, given as two-letter codes, e.g. [:de, :en]

  • encodings (Array<String>)

    List of encodings to test, e.g. ['UTF-8', 'ISO-8859-1']. The order defines the priority when choosing between encodings with the same detection score

  • operations (Array<Symbol>)

    Choose which operations (encoding to/decoding from an encoding to UTF-8) to test

  • penalty (Float)

    Penalty threshold that determines when characters are weighted negatively

  • num_cores (Integer)

    Number of threads to use for detection. Pass nil to use the single-threaded implementation

  • include_default (Boolean)

    Include “keep as is” conversion when testing, e.g. check if the string is already UTF-8 encoded

Returns:

  • (String)

    UTF-8 string



# File 'lib/encoding_estimator.rb', line 23

def EncodingEstimator.ensure_utf8( data, config = {} )

  params = {
    languages:        [ :de, :en ],
    encodings:        %w(iso-8859-1 utf-16le windows-1251),
    operations:       [Conversion::Operation::DECODE],
    include_default:  true,
    penalty:          0.01,
    num_cores:        nil
  }.merge config

  EncodingEstimator.detect( data, params ).result.perform( data )
end
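
To make the idea of "decoding from an encoding to UTF-8" concrete, here is a small sketch using only Ruby's built-in `String#force_encoding` and `String#encode`. The helper `decode_to_utf8` is a hypothetical name, and this is only one plausible reading of a decode operation; what `Conversion` does internally may differ.

```ruby
# Hypothetical helper, not part of EncodingEstimator: treat the raw bytes of
# `data` as text in `source_encoding` and transcode them to UTF-8.
def decode_to_utf8(data, source_encoding)
  # dup so the caller's string is not relabelled in place
  data.dup.force_encoding(source_encoding).encode('UTF-8')
end

# The German word "Grüße" as ISO-8859-1 bytes (0xFC is ü, 0xDF is ß),
# held in a binary string as it might arrive from a file or socket.
latin1_bytes = "Gr\xFC\xDFe".b

utf8 = decode_to_utf8(latin1_bytes, 'ISO-8859-1')
utf8.encoding        # Encoding::UTF_8
utf8.valid_encoding? # true
```

A detector is needed precisely because the same bytes transcode without error under several source encodings; only a language model can judge which result looks like plausible text.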