Module: Puppet::Util::Puppetdb::CharEncoding
- Defined in:
- lib/puppet/util/puppetdb/char_encoding.rb
Constant Summary
- Utf8CharLens =
Some of this code is modeled after:
https://github.com/brianmario/utf8/blob/ef10c033/ext/utf8/utf8proc.c https://github.com/brianmario/utf8/blob/ef10c033/ext/utf8/string_utf8.c [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
- Utf8ReplacementChar =
[ 0xEF, 0xBF, 0xBD ].pack("c*")
- DEFAULT_INVALID_CHAR =
"\ufffd"
- NOT_INVALID_REGEX =
Regexp.new( "[^" + DEFAULT_INVALID_CHAR + "]" )
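How the last three constants relate can be checked in plain Ruby. The sketch below re-declares their values as local variables (the variable names are illustrative, not part of the module) and assumes a UTF-8 source file, which is Ruby's default.

```ruby
# Local copies of the constant values above, for illustration only.
replacement_bytes = [0xEF, 0xBF, 0xBD].pack("c*")   # binary string holding three bytes
invalid_char      = "\ufffd"                         # U+FFFD REPLACEMENT CHARACTER
not_invalid_regex = Regexp.new("[^" + invalid_char + "]")

# The packed bytes are the UTF-8 encoding of U+FFFD.
same_bytes = replacement_bytes.bytes == invalid_char.bytes

# The regex matches the first character that is not U+FFFD.
sample = "\ufffd\ufffdok"
first_good_index = sample.index(not_invalid_regex)
```

Here `same_bytes` is true and `first_good_index` is 2, the position of the first character after the run of replacement characters.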
Class Method Summary
-
.coerce_to_utf8(str, error_context_str = nil) ⇒ Object
private
Attempts to coerce str to UTF-8; if that fails, outputs context information using error_context_str.
-
.error_char_context(str, bad_char_range) ⇒ Object
private
Scans the string str with invalid characters found at bad_char_range and returns a message that gives some context around the bad characters.
-
.first_invalid_char_range(str) ⇒ Object
private
Finds the beginning and ending index of the first block of invalid characters.
-
.utf8_string(str, error_context_str) ⇒ Object
-
.warn_if_changed(str, converted_str) ⇒ Object
private
-
.warn_if_invalid_chars(str, error_context_str) ⇒ Object
private
Warns the user if an invalid character was found.
Class Method Details
.coerce_to_utf8(str, error_context_str = nil) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Attempts to coerce str to UTF-8; if that fails, outputs context information using error_context_str.
The error_context_str argument is used in error messages. It defaults to nil, in which case no error is reported.
# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 112

def self.coerce_to_utf8(str, error_context_str=nil)
  str_copy = str.dup
  # This code is passed in a string that was created by
  # to_pson. to_pson calls force_encoding('ASCII-8BIT') on the
  # string before it returns it. This leaves the actual UTF-8 bytes
  # alone. Below we check to see if this is the case (this should be
  # most common). In this case, the bytes are still UTF-8 and we can
  # just encode! and we're good to go. If They are not valid UTF-8
  # bytes, that means there is probably some binary data mixed in
  # the middle of the UTF-8 string. In this case we need to output a
  # warning and give the user more information
  str_copy.force_encoding("UTF-8")
  if str_copy.valid_encoding?
    str_copy.encode!("UTF-8")
  else
    # This is force_encoded as US-ASCII to avoid any overlapping
    # byte related issues that could arise from mis-interpreting a
    # random extra byte as part of a multi-byte UTF-8 character
    str_copy.force_encoding("US-ASCII")
    str_lossy = str_copy.encode!("UTF-8",
                                 :invalid => :replace,
                                 :undef => :replace,
                                 :replace => DEFAULT_INVALID_CHAR)
    if !error_context_str.nil?
      warn_if_invalid_chars(str_lossy, error_context_str)
    else
      str_lossy
    end
  end
end
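The two branches of the method above can be tried outside Puppet. This is a minimal sketch that leaves out the warning step; `lossy_utf8` is a hypothetical stand-alone helper, not part of the module.

```ruby
# Hypothetical helper mirroring the two branches of coerce_to_utf8,
# without the logging performed by the real method.
def lossy_utf8(bytes, replacement = "\ufffd")
  copy = bytes.dup.force_encoding("UTF-8")
  return copy if copy.valid_encoding?

  # Re-tag as US-ASCII so that bytes outside the ASCII range are
  # treated as invalid and replaced during transcoding.
  copy.force_encoding("US-ASCII")
  copy.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => replacement)
end

clean   = lossy_utf8("caf\xC3\xA9".b)   # valid UTF-8 bytes tagged as binary
mangled = lossy_utf8("abc\xFFdef".b)    # one stray byte inside ASCII text
```

In this sketch `clean` comes back as the UTF-8 string "café" with its bytes untouched, while `mangled` has the stray byte replaced by U+FFFD.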
.error_char_context(str, bad_char_range) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Scans the string str with invalid characters found at bad_char_range and returns a message that gives some context around the bad characters. The message includes up to 100 characters before the bad characters and up to 100 after. It includes fewer if the bad characters are near the beginning of the string or if another bad character appears before the 100 characters are reached.
# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 67

def self.error_char_context(str, bad_char_range)
  gap = bad_char_range.to_a.length

  start_char = [0, bad_char_range.begin-100].max
  end_char = [str.index(DEFAULT_INVALID_CHAR, bad_char_range.end+1) || str.length, bad_char_range.end+100].min
  prefix = str[start_char..bad_char_range.begin-1]
  suffix = str[bad_char_range.end+1..end_char-1]

  "'#{prefix}' followed by #{gap} invalid/undefined bytes then '#{suffix}'"
end
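The slicing above is easier to follow with concrete input. The sketch below copies the logic into a hypothetical stand-alone method, `context_for`, which takes the invalid character as a parameter instead of reading the module constant.

```ruby
# Stand-alone copy of the slicing logic; `context_for` is a hypothetical name.
def context_for(str, bad_range, invalid = "\ufffd")
  gap        = bad_range.to_a.length
  start_char = [0, bad_range.begin - 100].max
  # Stop at the next invalid character, or 100 characters on, whichever is first.
  end_char   = [str.index(invalid, bad_range.end + 1) || str.length,
                bad_range.end + 100].min
  prefix = str[start_char..bad_range.begin - 1]
  suffix = str[bad_range.end + 1..end_char - 1]
  "'#{prefix}' followed by #{gap} invalid/undefined bytes then '#{suffix}'"
end

simple  = context_for("abc\ufffd\ufffddef", 3..4)
clipped = context_for("ab\ufffdcd\ufffdef", 2..2)
```

Here `simple` is "'abc' followed by 2 invalid/undefined bytes then 'def'", and `clipped` ends its suffix at 'cd' because a second replacement character follows. One edge worth knowing: when the bad range starts at index 0, the prefix slice becomes `str[0..-1]`, which Ruby evaluates to the whole string rather than an empty prefix.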
.first_invalid_char_range(str) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Finds the beginning and ending index of the first block of invalid characters.
# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 44

def self.first_invalid_char_range(str)
  begin_bad_chars_idx = str.index(DEFAULT_INVALID_CHAR)

  if begin_bad_chars_idx
    first_good_char = str.index(NOT_INVALID_REGEX, begin_bad_chars_idx)
    Range.new(begin_bad_chars_idx, (first_good_char || str.length) - 1)
  else
    nil
  end
end
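The range returned for a few representative strings can be sketched as follows. `first_bad_range` is a hypothetical stand-alone copy of the method above, with the replacement character passed as a parameter rather than taken from the module constants.

```ruby
# Stand-alone copy of the range search; `first_bad_range` is a hypothetical name.
def first_bad_range(str, invalid = "\ufffd")
  not_invalid = Regexp.new("[^" + invalid + "]")
  start = str.index(invalid)
  return nil unless start

  # First character after the run that is not a replacement character, if any.
  first_good = str.index(not_invalid, start)
  Range.new(start, (first_good || str.length) - 1)
end

middle   = first_bad_range("ab\ufffd\ufffdcd")   # run in the middle of the string
trailing = first_bad_range("ab\ufffd")            # run reaches the end of the string
none     = first_bad_range("abcd")                # no replacement characters at all
```

In this sketch `middle` is `2..3`, `trailing` is `2..2`, and `none` is `nil`.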
.utf8_string(str, error_context_str) ⇒ Object
# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 144

def self.utf8_string(str, error_context_str)
  begin
    coerce_to_utf8(str, error_context_str)
  rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError => e
    # If we got an exception, the string is either invalid or not
    # convertible to UTF-8, so drop those bytes.
    warn_if_changed(str, str.encode('UTF-8', :invalid => :replace, :undef => :replace))
  end
end
.warn_if_changed(str, converted_str) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 156

def self.warn_if_changed(str, converted_str)
  if converted_str != str
    Puppet.warning "Ignoring invalid UTF-8 byte sequences in data to be sent to PuppetDB"
  end
  converted_str
end
.warn_if_invalid_chars(str, error_context_str) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Warns the user if an invalid character was found. If debugging is enabled, it also logs contextual information about where the bad character(s) were found.
# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 88

def self.warn_if_invalid_chars(str, error_context_str)
  if str.index(DEFAULT_INVALID_CHAR).nil?
    str
  else
    Puppet.warning "#{error_context_str} ignoring invalid UTF-8 byte sequences in data to be sent to PuppetDB, see debug logging for more info"
    if Puppet.settings[:log_level] == "debug"
      Puppet.debug error_context_str + "\n" + error_char_context(str, first_invalid_char_range(str))
    end

    str
  end
end