Module: Puppet::Util::Puppetdb::CharEncoding
- Defined in: lib/puppet/util/puppetdb/char_encoding.rb
Constant Summary
- Utf8CharLens =
Some of this code is modeled after:
https://github.com/brianmario/utf8/blob/ef10c033/ext/utf8/utf8proc.c https://github.com/brianmario/utf8/blob/ef10c033/ext/utf8/string_utf8.c [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
- Utf8ReplacementChar =
[ 0xEF, 0xBF, 0xBD ].pack("c*")
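As a quick check of the `Utf8ReplacementChar` value, the following sketch (plain Ruby 1.9+, with no Puppet code loaded) shows that the three packed bytes are the UTF-8 encoding of U+FFFD, the Unicode replacement character. The variable name `replacement` is illustrative only.

```ruby
# Pack the three bytes the same way the constant does. `pack` returns a
# binary (ASCII-8BIT) string, so tag it as UTF-8 before comparing.
replacement = [0xEF, 0xBF, 0xBD].pack("c*").force_encoding("UTF-8")

# Show the raw bytes in hex, then compare against the \u escape for U+FFFD.
puts replacement.bytes.map { |b| format("%02X", b) }.join(" ")
puts replacement == "\uFFFD"
```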
Class Method Summary
- .get_byte(str, index) ⇒ Object private
- .get_char_len(byte) ⇒ Object private
- .iconv_to_utf8(str) ⇒ Object private
- .is_valid_multibyte_suffix(byte, additional_bytes) ⇒ Object private
- .ruby18_clean_utf8(str) ⇒ Object private
- .ruby18_handle_multibyte_char(result_str, byte, str, i, char_len, strip = true) ⇒ Object private
- .ruby18_manually_clean_utf8(str, strip = true) ⇒ Object private
  Manually cleans a string by stripping any byte sequences that are not valid UTF-8 characters.
- .utf8_string(str) ⇒ Object
- .warn_if_changed(str, converted_str) ⇒ Object private
Class Method Details
.get_byte(str, index) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 197

def self.get_byte(str, index)
  # This method is a hack to allow this code to work with either ruby 1.8
  # or 1.9. In production this code path should never be exercised by
  # 1.9 because it has a much more sane way to accomplish our goal, but
  # for testing, it is useful to be able to run the 1.8 codepath in 1.9.
  if @has_get_byte
    str.getbyte(index)
  else
    str[index]
  end
end
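The reason for the branch in `get_byte` is that indexing a String with an Integer changed meaning between Ruby 1.8 and 1.9. A minimal sketch of the difference on a 1.9+ interpreter follows; `byte_at` is a hypothetical stand-in that checks `respond_to?` directly instead of relying on the cached `@has_get_byte` flag.

```ruby
# Hypothetical stand-in for get_byte (not the module's own method).
def byte_at(str, index)
  if str.respond_to?(:getbyte)
    str.getbyte(index)   # returns an Integer byte value
  else
    str[index]           # Ruby 1.8: String#[] with an Integer returns a byte value
  end
end

s = "h\u00E9"            # "hé": one byte for "h", two bytes for "é"
p byte_at(s, 1)          # first byte of the two-byte "é"
p s[1]                   # on 1.9+, indexing yields a one-character String instead
```

On Ruby 1.9 and later the first call returns the integer 195 (0xC3), while `s[1]` returns the string "é", which is why the byte-oriented scanner cannot use plain indexing there.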
.get_char_len(byte) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 107

def self.get_char_len(byte)
  Utf8CharLens[byte]
end
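The lookup maps a lead byte to the number of bytes its character should occupy, with 0 meaning "not a valid lead byte". The sketch below expresses what appears to be the same classification as ranges rather than a 256-entry table. `char_len_for` is a hypothetical helper, and the ranges assume the table in the Constant Summary is read as having zeros at 0xC0 and 0xC1 and from 0xF5 upward.

```ruby
# Hypothetical helper: lead-byte classes written as ranges.
def char_len_for(byte)
  case byte
  when 0x00..0x7F then 1   # ASCII
  when 0xC2..0xDF then 2   # lead byte of a 2-byte sequence
  when 0xE0..0xEF then 3   # lead byte of a 3-byte sequence
  when 0xF0..0xF4 then 4   # lead byte of a 4-byte sequence
  else 0                   # continuation bytes (0x80..0xBF) and bytes never valid as a lead
  end
end

# Rebuild a full table and count how many bytes fall in each class.
table = (0..255).map { |b| char_len_for(b) }
p [1, 2, 3, 4, 0].map { |len| [len, table.count(len)] }
```

Under those assumptions the counts come out as 128 one-byte leads, 30 two-byte leads, 16 three-byte leads, 5 four-byte leads, and 77 invalid lead bytes.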
.iconv_to_utf8(str) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
We’re not using this anymore, but I wanted to leave it around for a little while just to make sure that the new code pans out.
# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 99

def self.iconv_to_utf8(str)
  iconv = Iconv.new('UTF-8//IGNORE', 'UTF-8')

  # http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/
  iconv.iconv(str + " ")[0..-2]
end
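Iconv was removed from Ruby's standard library in 2.0, so `iconv_to_utf8` cannot run on current interpreters without the separate gem. As a rough modern analogue (this is not what the module does), `String#scrub`, available from Ruby 2.1, drops or replaces invalid byte sequences in a similar way:

```ruby
# 0xFF can never appear in well-formed UTF-8.
dirty = "abc\xFFdef".dup.force_encoding("UTF-8")

p dirty.valid_encoding?   # the stray byte makes the string invalid
p dirty.scrub("")         # strip invalid bytes, similar in spirit to //IGNORE
p dirty.scrub("\uFFFD")   # or substitute U+FFFD for each invalid sequence
```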
.is_valid_multibyte_suffix(byte, additional_bytes) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 182

def self.is_valid_multibyte_suffix(byte, additional_bytes)
  # This is heinous, but the UTF-8 spec says that codepoints greater than
  # 0x10FFFF are illegal. The first character that is over that limit is
  # 0xF490bfbf, so if the first byte is F4 then we have to check for
  # that condition.
  if byte == 0xF4
    val = additional_bytes.inject(0) { |result, b | (result << 8) + b}
    if val >= 0x90bfbf
      return false
    end
  end
  additional_bytes.all? { |b| ((b & 0xC0) == 0x80) }
end
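For reference, U+10FFFF encodes as F4 8F BF BF, so the first sequence past the limit is F4 90 80 80 (which would be U+110000). The comparison against `0x90bfbf` in the listing therefore appears to let F4 90 80 80 through F4 90 BF BE pass, since those suffixes pack to values below `0x90bfbf`. The sketch below shows a stricter check under that reading; `valid_suffix?` is a hypothetical name and this is not the module's code. It assumes `rest` is non-empty, as it is for any 2- to 4-byte character.

```ruby
# Hypothetical variant of the suffix check.
def valid_suffix?(lead, rest)
  # With lead byte F4, any second byte of 0x90 or above would decode to a
  # code point beyond U+10FFFF.
  return false if lead == 0xF4 && rest.first >= 0x90

  # Every continuation byte must have the bit pattern 10xxxxxx.
  rest.all? { |b| (b & 0xC0) == 0x80 }
end

p valid_suffix?(0xF4, [0x8F, 0xBF, 0xBF])  # U+10FFFF, the last legal code point
p valid_suffix?(0xF4, [0x90, 0x80, 0x80])  # would be U+110000
p valid_suffix?(0xE2, [0x82, 0x41])        # second suffix byte is not a continuation
```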
.ruby18_clean_utf8(str) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 79

def self.ruby18_clean_utf8(str)
  #iconv_to_utf8(str)
  #ruby18_manually_clean_utf8(str)

  # So, we've tried doing this UTF8 cleaning for ruby 1.8 a few different
  # ways. Doing it via IConv, we don't do a good job of handling characters
  # whose codepoints would exceed the legal maximum for UTF-8. Doing it via
  # our manual scrubbing process is slower and doesn't catch overlong
  # encodings. Since this code really shouldn't even exist in the first place
  # we've decided to simply compose the two scrubbing methods for now, rather
  # than trying to add detection of overlong encodings. It'd be a non-trivial
  # chunk of code, and it'd have to do a lot of bitwise arithmetic (which Ruby
  # is not blazingly fast at).
  ruby18_manually_clean_utf8(iconv_to_utf8(str))
end
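To illustrate what "overlong encoding" means in the comment above, here is a small sketch on Ruby 1.9+. It shows a sequence that passes a structural check (a lead byte followed by continuation bytes) but is still rejected as UTF-8 because it encodes a value that fits in fewer bytes. The variable names are illustrative.

```ruby
# "/" is U+002F and should be the single byte 0x2F. E0 80 AF spreads the
# same value over three bytes, which UTF-8 forbids (an overlong encoding).
overlong = "\xE0\x80\xAF".dup.force_encoding("UTF-8")

# Structurally it looks fine: a 3-byte lead followed by two 10xxxxxx bytes,
# which is all that a lead-byte table plus a suffix mask would inspect.
suffix_ok = overlong.bytes.drop(1).all? { |b| (b & 0xC0) == 0x80 }

p suffix_ok                 # the mask check alone does not object
p overlong.valid_encoding?  # Ruby's own validation rejects the sequence
```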
.ruby18_handle_multibyte_char(result_str, byte, str, i, char_len, strip = true) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 154

def self.ruby18_handle_multibyte_char(result_str, byte, str, i, char_len, strip = true)
  # keeping an array of bytes for now because we need to do some
  # bitwise math on them.
  char_additional_bytes = []

  # If we don't have enough bytes left to read the full character, we
  # put on a replacement character and bail.
  if i + (char_len - 1) > str.length
    result_str.concat(Utf8ReplacementChar) unless strip
    return
  end

  # we've already read the first byte, so we need to set up a range
  # from 0 to (n-2); e.g. if it's a 2-byte char, we will have a range
  # from 0 to 0 which will result in reading 1 more byte
  (0..char_len - 2).each do |x|
    char_additional_bytes << get_byte(str, i + x)
  end

  if (is_valid_multibyte_suffix(byte, char_additional_bytes))
    result_str << byte
    result_str.concat(char_additional_bytes.pack("c*"))
  else
    result_str.concat(Utf8ReplacementChar) unless strip
  end
end
.ruby18_manually_clean_utf8(str, strip = true) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Manually cleans a string by stripping any byte sequences that are not valid UTF-8 characters. If you’d prefer for the invalid bytes to be replaced with the Unicode replacement character rather than being stripped, you may pass `false` for the optional second parameter (`strip`, which defaults to `true`).
# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 118

def self.ruby18_manually_clean_utf8(str, strip = true)
  # This is a hack to allow this code to work with either ruby 1.8 or 1.9,
  # which is useful for debugging and benchmarking. For more info see the
  # comments in the #get_byte method below.
  @has_get_byte = str.respond_to?(:getbyte)

  i = 0
  len = str.length
  result = ""

  while i < len
    byte = get_byte(str, i)

    i += 1

    char_len = get_char_len(byte)
    case char_len
    when 0
      result.concat(Utf8ReplacementChar) unless strip
    when 1
      result << byte
    when 2..4
      ruby18_handle_multibyte_char(result, byte, str, i, char_len, strip)
      i += char_len - 1
    else
      raise Puppet::DevError, "Unhandled UTF8 char length: '#{char_len}'"
    end

  end

  result
end
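The scan above can be tried outside of Puppet with a condensed sketch that works on an array of bytes. Everything here is hypothetical (`lead_len`, `REPLACEMENT_BYTES`, `clean_bytes`): it mirrors the control flow of the listing, including skipping the whole claimed character length when the suffix is bad, but it omits the F4 upper-bound check and it targets Ruby 1.9+ rather than 1.8.

```ruby
# Lead-byte classes written as ranges; a stand-in for the Utf8CharLens table.
def lead_len(byte)
  case byte
  when 0x00..0x7F then 1
  when 0xC2..0xDF then 2
  when 0xE0..0xEF then 3
  when 0xF0..0xF4 then 4
  else 0
  end
end

REPLACEMENT_BYTES = [0xEF, 0xBF, 0xBD].freeze

# Hypothetical re-implementation of the scan on a byte array.
def clean_bytes(str, strip = true)
  bytes = str.bytes
  out = []
  i = 0
  while i < bytes.length
    len = lead_len(bytes[i])
    chunk = bytes[i, len]   # lead byte plus however many suffix bytes remain
    ok = len > 0 &&
         chunk.length == len &&
         chunk.drop(1).all? { |b| (b & 0xC0) == 0x80 }
    if ok
      out.concat(chunk)
    elsif !strip
      out.concat(REPLACEMENT_BYTES)
    end
    # Advance past the claimed length even on failure, and by at least one byte.
    i += [len, 1].max
  end
  out.pack("C*").force_encoding("UTF-8")
end

p clean_bytes("a\xFFb")          # invalid lead byte stripped
p clean_bytes("a\xFFb", false)   # invalid lead byte replaced with U+FFFD
p clean_bytes("caf\xC3")         # truncated 2-byte character stripped
```

One consequence of advancing by the claimed length is that a valid byte directly after a bad lead byte is consumed along with it; for example, `"\xC3a"` loses the `a`.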
.utf8_string(str) ⇒ Object
# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 35

def self.utf8_string(str)
  if RUBY_VERSION =~ /1.8/
    # Ruby 1.8 doesn't have String#encode and related methods, and there
    # appears to be a bug in iconv that will interpret some byte sequences
    # as 6-byte characters. Thus, we are forced to resort to some unfortunate
    # manual chicanery.
    warn_if_changed(str, ruby18_clean_utf8(str))
  elsif str.encoding == Encoding::UTF_8
    # If we get here, we're in ruby 1.9+, so we have the string encoding methods
    # available. However, just because a ruby String object is already
    # marked as UTF-8, that doesn't guarantee that its contents are actually
    # valid; and if you call ruby's ".encode" method with an encoding of
    # "utf-8" for a String that ruby already believes is UTF-8, ruby
    # seems to optimize that to be a no-op. So, we have to do some more
    # complex handling...

    # If the string already has valid encoding then we're fine.
    return str if str.valid_encoding?

    # If not, we basically have to walk over the characters and replace
    # them by hand.
    warn_if_changed(str, str.each_char.map { |c| c.valid_encoding? ? c : "\ufffd"}.join)
  else
    # if we get here, we're ruby 1.9 and the current string is *not* encoded
    # as UTF-8. Thus we can actually rely on ruby's "encode" method.
    begin
      str.encode('UTF-8')
    rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError => e
      # If we got an exception, the string is either invalid or not
      # convertible to UTF-8, so drop those bytes.
      warn_if_changed(str, str.encode('UTF-8', :invalid => :replace, :undef => :replace))
    end
  end
end
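The two Ruby 1.9+ branches can be exercised in isolation. The sketch below uses only core String methods and made-up sample strings; it shows why a same-encoding `encode` call is not enough for a string already tagged UTF-8, and how the `:invalid`/`:undef` options behave for a string in another encoding.

```ruby
# Case 1: tagged UTF-8 but containing a stray byte.
bad = "ok\xFFok".dup.force_encoding("UTF-8")
p bad.valid_encoding?

# Encoding to the same encoding leaves the invalid byte in place ...
p bad.encode("UTF-8").valid_encoding?

# ... so each character is checked and swapped individually.
fixed = bad.each_char.map { |c| c.valid_encoding? ? c : "\uFFFD" }.join
p fixed

# Case 2: a non-UTF-8 string, where encode does real work.
latin1 = "caf\xE9".dup.force_encoding("ISO-8859-1")
p latin1.encode("UTF-8")

# Binary data has no mapping for high bytes, so ask for replacement
# instead of letting Encoding::UndefinedConversionError propagate.
binary = "caf\xE9".dup.force_encoding("ASCII-8BIT")
p binary.encode("UTF-8", :invalid => :replace, :undef => :replace)
```

With `:replace` and a Unicode target encoding, the substituted character is U+FFFD, so the affected bytes are replaced rather than dropped.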
.warn_if_changed(str, converted_str) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
# File 'lib/puppet/util/puppetdb/char_encoding.rb', line 71

def self.warn_if_changed(str, converted_str)
  if converted_str != str
    Puppet.warning "Ignoring invalid UTF-8 byte sequences in data to be sent to PuppetDB"
  end
  converted_str
end