Module: Puppet::Util::CharacterEncoding

Defined in:
lib/puppet/util/character_encoding.rb

Overview

A module to centralize heuristics/practices for managing character encoding in Puppet

Class Method Summary

Class Method Details

.convert_to_utf_8!(string) ⇒ String

Warning! This is a destructive method - the string supplied is modified!

Parameters:

  • string (String)

    the string to transcode or force-encode to UTF-8

Returns:

  • (String)

    the string itself if it is already valid UTF-8; OR the same string with its external encoding set to UTF-8 if its bytes are valid UTF-8; OR the same string transcoded to UTF-8 if it is valid in its current encoding; OR nil if the external encoding cannot legitimately be set and the string cannot be transcoded
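
The warning and the return values above follow from the two String methods the implementation calls, force_encoding and encode!, both of which modify their receiver and return it. A minimal standalone sketch in plain Ruby (no Puppet required; the variable names are illustrative only) shows that behavior:

```ruby
# Bytes of "é" in UTF-8 (0xC3 0xA9), deliberately tagged as binary.
tagged_binary = "\xC3\xA9".b
result = tagged_binary.force_encoding(Encoding::UTF_8)
puts result.equal?(tagged_binary)  # the return value is the receiver itself
puts tagged_binary.encoding        # the caller's string now carries the new tag

# "é" in ISO-8859-1 is the single byte 0xE9.
latin1 = "\xE9".b.force_encoding(Encoding::ISO_8859_1)
result = latin1.encode!(Encoding::UTF_8)
puts result.equal?(latin1)         # again the same object
p latin1.bytes                     # the bytes were rewritten in place
```

A caller that still needs the original bytes can pass a copy (for example string.dup), and should check the result for nil before using it.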



# File 'lib/puppet/util/character_encoding.rb', line 12

def convert_to_utf_8!(string)
  currently_valid = string.valid_encoding?

  begin
    if string.encoding == Encoding::UTF_8
      if currently_valid
        return string
      else
        # If a string is currently believed to be UTF-8, but is also not
        # valid_encoding?, we have no recourse but to fail because we have no
        # idea what encoding this string originally came from where it *was*
        # valid - all we know is it's not currently valid UTF-8.
        raise EncodingError
      end
    elsif valid_utf_8_bytes?(string)
      # Before we try to transcode the string, check if it is valid UTF-8 as
      # currently constituted (in its non-UTF-8 encoding), and if it is, limit
      # ourselves to setting the external encoding of the string to UTF-8
      # rather than actually transcoding it. We do this to handle
      # a couple scenarios:

      # The first scenario is that the string was originally valid UTF-8 but
      # the current puppet run is not in a UTF-8 environment. In this case,
      # the string will likely have invalid byte sequences (i.e.,
      # string.valid_encoding? == false), and attempting to transcode will
      # fail with Encoding::InvalidByteSequenceError, referencing the
      # invalid byte sequence in the original, pre-transcode, string. We
      # might have gotten here, for example, if puppet is run first in a
      # user context with UTF-8 encoding (setting the "is" value to UTF-8)
      # and then later run via cron without UTF-8 specified, resulting in
      # EN_US (ISO-8859-1) on many systems. In this scenario we're
      # effectively best-guessing this string originated as UTF-8 and only
      # set external encoding to UTF-8 - transcoding would have failed
      # anyway.

      # The second scenario (more rare, I expect) is that this string does
      # NOT have invalid byte sequences (string.valid_encoding? == true),
      # but is *ALSO valid unicode*.
      # Our example case is "\u16A0" - "RUNIC LETTER FEHU FEOH FE"
      # http://www.fileformat.info/info/unicode/char/16A0/index.htm
      # 0xE1 0x9A 0xA0 / 225 154 160
      # These bytes are valid in ISO-8859-1 but the character they represent
      # transcodes cleanly in ruby to *different* characters in UTF-8.
      # That's not what we want if the user intended the original string as
      # UTF-8. We can only guess, so if the string is valid UTF-8 as
      # currently constituted, we default to assuming the string originated
      # in UTF-8 and do not transcode it - we only set external encoding.
      return string.force_encoding(Encoding::UTF_8)
    elsif currently_valid
      # If the string is not currently valid UTF-8 but it can be transcoded
      # (it is valid in its current encoding), we can guess this string was
      # not originally unicode. Transcode it to UTF-8. For strings with
      # original encodings like SHIFT_JIS, this should be the final result.
      return string.encode!(Encoding::UTF_8)
    else
      # If the string is neither valid UTF-8 as-is nor valid in its current
      # encoding, fail. It requires user remediation.
      raise EncodingError
    end
  rescue EncodingError => detail
    # Catch both our own self-determined failure to transcode as well as any
    # error on ruby's part, e.g. Encoding::UndefinedConversionError on a
    # failure to encode!.
    Puppet.debug(_("%{error}: %{value} is not valid UTF-8 and cannot be transcoded by Puppet.") %
      { error: detail.inspect, value: string.dump })
    return nil
  end
end
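
The three non-failing branches after the UTF-8 check can be reproduced in plain Ruby. The sketch below is an illustration, not Puppet's code: utf8_bytes? is a hypothetical stand-in for valid_utf_8_bytes?, which the listing calls but does not show, and it assumes that helper simply tests whether the raw bytes are valid UTF-8. US-ASCII stands in for the non-UTF-8 locale of the first scenario, since Ruby treats every byte as valid ISO-8859-1.

```ruby
# Hypothetical stand-in for valid_utf_8_bytes? (an assumption, see above):
# are the raw bytes valid UTF-8, whatever the string's current tag says?
def utf8_bytes?(string)
  string.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Scenario 1: UTF-8 bytes for "café" carrying a US-ASCII tag.
cafe = "caf\xC3\xA9".b.force_encoding(Encoding::US_ASCII)
p cafe.valid_encoding?   # false: 0xC3 0xA9 are not ASCII
p utf8_bytes?(cafe)      # true: the bytes themselves are fine as UTF-8
begin
  cafe.encode(Encoding::UTF_8)
rescue Encoding::InvalidByteSequenceError => e
  puts "transcoding failed: #{e.class}"
end
puts cafe.dup.force_encoding(Encoding::UTF_8)  # retagging recovers the text

# Scenario 2: the bytes of "\u16A0" carrying an ISO-8859-1 tag.
fehu = "\xE1\x9A\xA0".b.force_encoding(Encoding::ISO_8859_1)
transcoded = fehu.encode(Encoding::UTF_8)            # succeeds, but changes the text
retagged   = fehu.dup.force_encoding(Encoding::UTF_8)
p transcoded.codepoints.map { |c| format("U+%04X", c) }  # ["U+00E1", "U+009A", "U+00A0"]
p retagged.codepoints.map { |c| format("U+%04X", c) }    # ["U+16A0"]

# Transcoding branch: Shift_JIS bytes that are valid Shift_JIS but not UTF-8.
hiragana_a = "\x82\xA0".b.force_encoding(Encoding::Shift_JIS)
p utf8_bytes?(hiragana_a)                            # false, so retagging is skipped
p hiragana_a.encode(Encoding::UTF_8) == "\u3042"     # true: transcoded to HIRAGANA LETTER A
```

Checking the bytes before transcoding is what keeps the second scenario from silently turning one runic letter into three unrelated Latin-1 characters.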