Module: Puppet::Util::Puppetdb::CharEncoding

Defined in: lib/puppet/util/puppetdb/char_encoding.rb

Constant Summary

Utf8CharLens =

Some of this code is modeled after:
https://github.com/brianmario/utf8/blob/ef10c033/ext/utf8/utf8proc.c
https://github.com/brianmario/utf8/blob/ef10c033/ext/utf8/string_utf8.c

[
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
]
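
The table above is indexed by a byte value and gives the length of the UTF-8 sequence that byte would start, with 0 for bytes that cannot start a sequence. As a minimal sketch of how such a lookup can be used (the names `CHAR_LENS` and `sequence_lengths` are hypothetical and not part of this module), the same table can be rebuilt compactly and applied to a string's bytes:

```ruby
# Hypothetical compact rebuild of the table above:
# index = lead byte, value = sequence length (0 = not a valid lead byte).
CHAR_LENS = Array.new(256) do |byte|
  case byte
  when 0x00..0x7F then 1
  when 0xC2..0xDF then 2
  when 0xE0..0xEF then 3
  when 0xF0..0xF4 then 4
  else 0
  end
end

# Hypothetical helper: collect the length claimed by each lead byte.
def sequence_lengths(str)
  bytes = str.bytes
  lengths = []
  i = 0
  while i < bytes.length
    len = CHAR_LENS[bytes[i]]
    lengths << len
    i += [len, 1].max # always advance, even past an invalid lead byte
  end
  lengths
end

# "a", e-acute, euro sign, and an emoji use 1, 2, 3 and 4 bytes respectively.
sequence_lengths("a\u00E9\u20AC\u{1F600}") # => [1, 2, 3, 4]
```

This sketch only reads lead bytes; it does not check that the continuation bytes that follow are well formed.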

Utf8ReplacementChar =

[ 0xEF, 0xBF, 0xBD ].pack("c*")

DEFAULT_INVALID_CHAR =

"\ufffd"

NOT_INVALID_REGEX =

Regexp.new( "[^" + DEFAULT_INVALID_CHAR + "]" )
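
To illustrate how these constants relate, here is a small sketch that rebuilds them as local variables (the variable names are illustrative only). It shows that the packed bytes are the UTF-8 encoding of U+FFFD, and that the regex finds the first character that is not the replacement character:

```ruby
replacement_bytes    = [0xEF, 0xBF, 0xBD].pack("c*") # raw bytes
default_invalid_char = "\ufffd"                      # U+FFFD as a UTF-8 string
not_invalid_regex    = Regexp.new("[^" + default_invalid_char + "]")

# The packed bytes are exactly the UTF-8 encoding of U+FFFD.
replacement_bytes.bytes == default_invalid_char.bytes # => true

# The regex skips over replacement characters and stops at the first other one.
"\ufffd\ufffdok".index(not_invalid_regex) # => 2
```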

Class Method Summary

.coerce_to_utf8(str, error_context_str = nil) ⇒ Object private

Attempts to coerce str to UTF-8; if that fails, outputs context information using error_context_str.

.error_char_context(str, bad_char_range) ⇒ Object private

Scans the string str with invalid characters found at bad_char_range and returns a message that gives some context around the bad characters.

.first_invalid_char_range(str) ⇒ Object private

Finds the beginning and ending index of the first block of invalid characters.

.utf8_string(str, error_context_str) ⇒ Object

.warn_if_changed(str, converted_str) ⇒ Object private

.warn_if_invalid_chars(str, error_context_str) ⇒ Object private

Warns the user if an invalid character was found.

Class Method Details

.coerce_to_utf8(str, error_context_str = nil) ⇒ `Object`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Attempts to coerce str to UTF-8. If that fails, outputs context information using error_context_str.

Parameters:

str —

A string coming from to_pson, likely a command to be submitted to PDB
error_context_str (defaults to: nil) —

information about where this string came from, for use in error messages. Defaults to nil, in which case no error is reported.

Returns:

String

# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 112

def self.coerce_to_utf8(str, error_context_str=nil)
  str_copy = str.dup
  # This code is passed in a string that was created by
  # to_pson. to_pson calls force_encoding('ASCII-8BIT') on the
  # string before it returns it. This leaves the actual UTF-8 bytes
  # alone. Below we check to see if this is the case (this should be
  # most common). In this case, the bytes are still UTF-8 and we can
  # just encode! and we're good to go. If they are not valid UTF-8
  # bytes, that means there is probably some binary data mixed in
  # the middle of the UTF-8 string. In this case we need to output a
  # warning and give the user more information.
  str_copy.force_encoding("UTF-8")
  if str_copy.valid_encoding?
    str_copy.encode!("UTF-8")
  else
    # This is force_encoded as US-ASCII to avoid any overlapping
    # byte related issues that could arise from mis-interpreting a
    # random extra byte as part of a multi-byte UTF-8 character
    str_copy.force_encoding("US-ASCII")

    str_lossy = str_copy.encode!("UTF-8",
                                 :invalid => :replace,
                                 :undef => :replace,
                                 :replace => DEFAULT_INVALID_CHAR)
    if !error_context_str.nil?
      warn_if_invalid_chars(str_lossy, error_context_str)
    else
      str_lossy
    end
  end
end
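
The two branches above can be tried outside Puppet. The following is a simplified sketch, not the module's method: `lossy_utf8` and `REPLACEMENT` are hypothetical names, and the warning step is left out so that only the encoding logic remains.

```ruby
REPLACEMENT = "\ufffd"

# Simplified stand-in for coerce_to_utf8 without the warning step.
def lossy_utf8(str)
  copy = str.dup.force_encoding("UTF-8")
  return copy if copy.valid_encoding?

  # Treat every byte as US-ASCII so each byte >= 0x80 is replaced on its own.
  copy.force_encoding("US-ASCII")
  copy.encode!("UTF-8", invalid: :replace, undef: :replace, replace: REPLACEMENT)
end

clean = "caf\xC3\xA9".b      # valid UTF-8 bytes tagged as binary
dirty = "ab\xFF\xFEcd".b     # two stray bytes in the middle
mixed = "caf\xC3\xA9\xFF".b  # valid multi-byte character plus a stray byte

lossy_utf8(clean) # => "café"
lossy_utf8(dirty) # => "ab\ufffd\ufffdcd"
lossy_utf8(mixed) # => "caf\ufffd\ufffd\ufffd"
```

The last case shows a consequence of the US-ASCII step: once a string contains any invalid byte, every non-ASCII byte in it is replaced, including bytes that formed valid multi-byte characters.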

.error_char_context(str, bad_char_range) ⇒ `Object`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Scans the string str with invalid characters found at bad_char_range and returns a message that gives some context around the bad characters. The message includes up to 100 characters before the bad characters and up to 100 after. It includes fewer if the bad characters are near the beginning or end of the string, or if another bad character appears before the 100-character limit is reached.

Parameters:

str —

string coming from to_pson, likely a command to be submitted to PDB
bad_char_range —

a range indicating a block of invalid characters

Returns:

String

# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 67

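def self.error_char_context(str, bad_char_range)

  # Number of replacement characters in the invalid block
  gap = bad_char_range.to_a.length

  # Show at most 100 characters of context on either side, stopping
  # early at the next invalid character
  start_char = [0, bad_char_range.begin-100].max
  end_char = [str.index(DEFAULT_INVALID_CHAR, bad_char_range.end+1) || str.length, bad_char_range.end+100].min
  # Exclusive ranges keep the prefix empty when the invalid block starts
  # at index 0 (an inclusive 0..-1 would select the whole string)
  prefix = str[start_char...bad_char_range.begin]
  suffix = str[(bad_char_range.end+1)...end_char]

  "'#{prefix}' followed by #{gap} invalid/undefined bytes then '#{suffix}'"
end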
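
As a worked example of the message format, here is a standalone adaptation of the logic above, written with exclusive ranges and runnable without Puppet. The input string is made up for illustration.

```ruby
DEFAULT_INVALID_CHAR = "\ufffd"

# Standalone adaptation of error_char_context for experimentation.
def error_char_context(str, bad_char_range)
  gap = bad_char_range.to_a.length
  start_char = [0, bad_char_range.begin - 100].max
  next_bad = str.index(DEFAULT_INVALID_CHAR, bad_char_range.end + 1)
  end_char = [next_bad || str.length, bad_char_range.end + 100].min
  prefix = str[start_char...bad_char_range.begin]
  suffix = str[(bad_char_range.end + 1)...end_char]
  "'#{prefix}' followed by #{gap} invalid/undefined bytes then '#{suffix}'"
end

# Two replacement characters at indexes 2..3, and another one at index 6.
sample = "ab\ufffd\ufffdcd\ufffdef"
error_char_context(sample, 2..3)
# => "'ab' followed by 2 invalid/undefined bytes then 'cd'"
```

The suffix stops at "cd" because the next replacement character at index 6 ends the context early.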

.first_invalid_char_range(str) ⇒ `Object`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Finds the beginning and ending index of the first block of invalid characters.

Parameters:

str —

string to scan for invalid characters

Returns:

Range

# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 44

def self.first_invalid_char_range(str)
  begin_bad_chars_idx = str.index(DEFAULT_INVALID_CHAR)

  if begin_bad_chars_idx
    first_good_char = str.index(NOT_INVALID_REGEX, begin_bad_chars_idx)
    Range.new(begin_bad_chars_idx, (first_good_char || str.length) - 1)
  else
    nil
  end
end
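
A short sketch of the range this method produces, using a standalone copy of the logic so it runs without Puppet (the sample strings are made up for illustration):

```ruby
DEFAULT_INVALID_CHAR = "\ufffd"
NOT_INVALID_REGEX = Regexp.new("[^" + DEFAULT_INVALID_CHAR + "]")

# Standalone copy of first_invalid_char_range for experimentation.
def first_invalid_char_range(str)
  begin_bad_chars_idx = str.index(DEFAULT_INVALID_CHAR)
  return nil unless begin_bad_chars_idx

  first_good_char = str.index(NOT_INVALID_REGEX, begin_bad_chars_idx)
  Range.new(begin_bad_chars_idx, (first_good_char || str.length) - 1)
end

first_invalid_char_range("ab\ufffd\ufffdcd\ufffd") # => 2..3 (only the first block)
first_invalid_char_range("ab\ufffd")               # => 2..2 (block runs to the end)
first_invalid_char_range("clean")                  # => nil
```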

.utf8_string(str, error_context_str) ⇒ `Object`

# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 144

def self.utf8_string(str, error_context_str)
  begin
    coerce_to_utf8(str, error_context_str)
  rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
    # If we got an exception, the string is either invalid or not
    # convertible to UTF-8, so replace the offending bytes.

    warn_if_changed(str, str.encode('UTF-8', :invalid => :replace, :undef => :replace))
  end
end

.warn_if_changed(str, converted_str) ⇒ `Object`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 156

def self.warn_if_changed(str, converted_str)
  if converted_str != str
    Puppet.warning "Ignoring invalid UTF-8 byte sequences in data to be sent to PuppetDB"
  end
  converted_str
end
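
The following sketch shows how the comparison behaves together with the lossy conversion used in the rescue branch of utf8_string. It is an illustration under assumptions: the `WARNINGS` array is a hypothetical stand-in for `Puppet.warning` so the code runs without Puppet, and `lossy` is a hypothetical helper name.

```ruby
# Hypothetical stand-in for Puppet.warning.
WARNINGS = []

def warn_if_changed(str, converted_str)
  if converted_str != str
    WARNINGS << "Ignoring invalid UTF-8 byte sequences in data to be sent to PuppetDB"
  end
  converted_str
end

# The same conversion the rescue branch of utf8_string applies.
def lossy(str)
  str.encode("UTF-8", invalid: :replace, undef: :replace)
end

ascii  = "abc".b      # binary string, ASCII-only content
binary = "ab\xFF".b   # binary string with a byte that has no UTF-8 mapping

warn_if_changed(ascii, lossy(ascii))   # content unchanged, no warning recorded
warn_if_changed(binary, lossy(binary)) # byte replaced, one warning recorded
```

Only the second call records a warning, because the ASCII-only string compares equal to its converted form even though the encodings differ.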

.warn_if_invalid_chars(str, error_context_str) ⇒ `Object`

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Warns the user if an invalid character was found. If debugging is enabled, also logs contextual information about where the bad character(s) were found.

Parameters:

str —

A string coming from to_pson, likely a command to be submitted to PDB
error_context_str —

information about where this string came from for use in error messages

Returns:

String

# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 88

def self.warn_if_invalid_chars(str, error_context_str)

  if str.index(DEFAULT_INVALID_CHAR).nil?
    str
  else
    Puppet.warning "#{error_context_str} ignoring invalid UTF-8 byte sequences in data to be sent to PuppetDB, see debug logging for more info"

    if Puppet.settings[:log_level] == "debug"
      Puppet.debug error_context_str + "\n" + error_char_context(str, first_invalid_char_range(str))
    end

    str
  end
end
