Method: Rex::Text.to_unicode

Defined in: lib/rex/text.rb

.to_unicode(str = '', type = 'utf-16le', mode = '', size = '') ⇒ `Object`

Converts standard ASCII text to a Unicode string.

Supported Unicode types include: utf-16le, utf-16be, utf-32le, utf-32be, utf-7, and utf-8. Pass the type string exactly as spelled here; an unrecognized type raises a TypeError.
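As a rough illustration of the fixed-width types, the sketch below reuses the unpack/pack idiom from the source listing on this page, without calling Rex::Text itself. The sample string and variable names are illustrative assumptions. Each input byte is widened independently, so multi-byte input is not decoded first.

```ruby
# Illustrative only: widen each byte of an ASCII string to 16 or 32 bits.
str   = 'AB'
codes = str.unpack('C*')        # one integer per input byte

utf16le = codes.pack('v*')      # 16-bit little-endian
utf16be = codes.pack('n*')      # 16-bit big-endian
utf32le = codes.pack('V*')      # 32-bit little-endian
utf32be = codes.pack('N*')      # 32-bit big-endian

puts utf16le.unpack1('H*')      # 41004200
puts utf16be.unpack1('H*')      # 00410042
puts utf32le.unpack1('H*')      # 4100000042000000
puts utf32be.unpack1('H*')      # 0000004100000042
```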

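The ‘mode’ argument gives the encoder a hint about how to encode the string.

Only utf-7 and utf-8 use “mode” as an encoding hint. The ‘uhwtfms’ and ‘uhwtfms-half’ types in the source below reuse it as a codepage number (1252 by default).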

By default, utf-7 does not encode alphanumerics and a few other characters. Specifying a mode of “all” encodes every character, not just the non-alphanumeric set: to_unicode(str, 'utf-7', 'all')
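The difference between the two utf-7 modes can be sketched in standalone Ruby. The helper names (utf7_escape, utf7_default, utf7_all) are hypothetical and not part of Rex::Text. The character class is copied from the source listing on this page. Treating '+' as the short form '+-' in both modes is an assumption of this sketch.

```ruby
# Hypothetical helper: wrap one character as +<base64 of UTF-16BE>-.
def utf7_escape(char)
  return '+-' if char == '+'                 # assumed short form for a literal '+'
  utf16be = char.unpack('C*').pack('n*')     # widen each byte to 16 bits, big-endian
  b64 = [utf16be].pack('m0').delete('=')     # base64 without newlines or padding
  "+#{b64}-"
end

# Default mode: only characters outside the "safe" set are escaped.
def utf7_default(str)
  str.gsub(/[^\n\r\t\ A-Za-z0-9\'\(\),-.\/\:\?]/) { |c| utf7_escape(c) }
end

# "all" mode: every character matched by /./ is escaped.
def utf7_all(str)
  str.gsub(/./) { |c| utf7_escape(c) }
end

puts utf7_default('a!')   # a+ACE-
puts utf7_all('a!')       # +AGE-+ACE-
```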

utf-8 specifies that alphanumeric characters are used directly, e.g. “a” is just “a”. However, there are 6 different overlong encodings of “a” that are technically not valid but parse just fine in most utf-8 parsers (0xC1A1, 0xE081A1, 0xF08081A1, 0xF8808081A1, 0xFC80808081A1, 0xFE8080808081A1). The ‘size’ argument specifies how many bytes to use for the overlong encoding: to_unicode(str, 'utf-8', 'overlong', 2)
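The six overlong forms listed above follow a simple pattern, which the sketch below reproduces arithmetically. The helper name overlong_utf8 is hypothetical and the approach is a simplification, not the bit-array method used in the source listing. It assumes a single input byte (0 to 255) and a size from 2 to 7.

```ruby
# Hypothetical helper: build the overlong UTF-8 form of one byte.
def overlong_utf8(byte, size)
  raise ArgumentError, 'size must be between 2 and 7' unless (2..7).cover?(size)

  bytes = Array.new(size, 0x80)            # continuation bytes start as 10000000
  bytes[0] = (0xFF << (8 - size)) & 0xFF   # lead byte: `size` one-bits, then a zero
  bytes[-1] |= byte & 0x3F                 # low six bits go in the last byte
  bytes[-2] |= byte >> 6                   # top two bits go in the byte before it
  bytes.pack('C*')
end

(2..7).each do |size|
  puts overlong_utf8('a'.ord, size).unpack1('H*')
end
# c1a1
# e081a1
# f08081a1
# f8808081a1
# fc80808081a1
# fe8080808081a1
```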

Many utf-8 parsers also allow invalid overlong encodings, where bits that are unused when encoding a single byte are modified. Many parsers ignore these bits, which makes simple string matching ineffective for dealing with UTF-8 strings. There are many more invalid overlong encodings possible for “a”. For example, three encodings are available for an invalid 2 byte encoding of “a” (0xC1E1, 0xC161, 0xC121).
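To see why a parser that ignores those bits accepts all of these forms, consider the deliberately lenient decoder below. It is a hypothetical sketch, not modeled on any particular real parser: it masks off the marker bits of each byte instead of validating them, so the overlong form and the three invalid forms all decode to the same character.

```ruby
# Hypothetical lenient decoder for a 2-byte sequence: keeps only the payload bits.
def lenient_decode2(bytes)
  b0, b1 = bytes.unpack('C2')
  ((b0 & 0x1F) << 6) | (b1 & 0x3F)   # 5 payload bits from byte 0, 6 from byte 1
end

sequences = [
  [0xC1, 0xA1],  # overlong form with a well-formed continuation byte
  [0xC1, 0xE1],  # continuation marker bits changed to 11
  [0xC1, 0x61],  # continuation marker bits changed to 01
  [0xC1, 0x21]   # continuation marker bits changed to 00
].map { |pair| pair.pack('C*') }

sequences.each do |seq|
  puts "#{seq.unpack1('H*')} -> #{lenient_decode2(seq).chr}"
end
# c1a1 -> a
# c1e1 -> a
# c161 -> a
# c121 -> a
```

A byte-for-byte search for “a” finds none of the four sequences except 0xC161, which is why matching on raw bytes is unreliable against such a parser.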

Specifying “invalid” chooses a random invalid encoding for the given byte size: to_unicode(str, 'utf-8', 'invalid', 2)

utf-7 defaults to ‘normal’ utf-7 encoding. utf-8 defaults to 2 byte ‘normal’ encoding.

# File 'lib/rex/text.rb', line 592

def self.to_unicode(str='', type = 'utf-16le', mode = '', size = '')
  return '' if not str
  case type
  when 'utf-16le'
    return str.unpack('C*').pack('v*')
  when 'utf-16be'
    return str.unpack('C*').pack('n*')
  when 'utf-32le'
    return str.unpack('C*').pack('V*')
  when 'utf-32be'
    return str.unpack('C*').pack('N*')
  when 'utf-7'
    case mode
    when 'all'
      return str.gsub(/./){ |a|
        out = ''
        if a != '+'
          out = encode_base64(to_unicode(a, 'utf-16be')).gsub(/[=\r\n]/, '')
        end
        '+' + out + '-'
      }
    else
      return str.gsub(/[^\n\r\t\ A-Za-z0-9\'\(\),-.\/\:\?]/){ |a|
        out = ''
        if a != '+'
          out = encode_base64(to_unicode(a, 'utf-16be')).gsub(/[=\r\n]/, '')
        end
        '+' + out + '-'
      }
    end
  when 'utf-8'
    if size == ''
      size = 2
    end

    if size >= 2 and size <= 7
      string = ''
      str.each_byte { |a|
        if (a < 21 || a > 0x7f) || mode != ''
          # ugh. turn a single byte into the binary representation of it, in array form
          bin = [a].pack('C').unpack('B8')[0].split(//)

          # even more ugh.
          bin.collect!{|a_| a_.to_i}

          out = Array.new(8 * size, 0)

          0.upto(size - 1) { |i|
            out[i] = 1
            out[i * 8] = 1
          }

          i = 0
          byte = 0
          bin.reverse.each { |bit|
            if i < 6
              mod = (((size * 8) - 1) - byte * 8) - i
              out[mod] = bit
            else
              byte = byte + 1
              i = 0
              redo
            end
            i = i + 1
          }

          if mode != ''
            case mode
            when 'overlong'
              # do nothing, since we already handle this as above...
            when 'invalid'
              done = 0
              while done == 0
                # the ghetto...
                bits = [7, 8, 15, 16, 23, 24, 31, 32, 41]
                bits.each { |bit|
                  bit = (size * 8) - bit
                  if bit > 1
                    set = rand(2)
                    if out[bit] != set
                      out[bit] = set
                      done = 1
                    end
                  end
                }
              end
            else
              raise TypeError, 'Invalid mode.  Only "overlong" and "invalid" are acceptable modes for utf-8'
            end
          end
          string << [out.join('')].pack('B*')
        else
          string << [a].pack('C')
        end
      }
      return string
    else
      raise TypeError, 'invalid utf-8 size'
    end
  when 'uhwtfms' # suggested name from HD :P
    load_codepage()

    string = ''
    # overloading mode as codepage
    if mode == ''
      mode = 1252 # ANSI - Latin 1, default for US installs of MS products
    else
      mode = mode.to_i
    end
    if @@codepage_map_cache[mode].nil?
      raise TypeError, "Invalid codepage #{mode}"
    end
    str.each_byte {|byte|
      char = [byte].pack('C*')
      possible = @@codepage_map_cache[mode]['data'][char]
      if possible.nil?
        raise TypeError, "codepage #{mode} does not provide an encoding for 0x#{char.unpack('H*')[0]}"
      end
      string << possible[ rand(possible.length) ]
    }
    return string
  when 'uhwtfms-half' # suggested name from HD :P
    load_codepage()
    string = ''
    # overloading mode as codepage
    if mode == ''
      mode = 1252 # ANSI - Latin 1, default for US installs of MS products
    else
      mode = mode.to_i
    end
    if mode != 1252
      raise TypeError, "Invalid codepage #{mode}, only 1252 supported for uhwtfms_half"
    end
    str.each_byte {|byte|
      if ((byte >= 33 && byte <= 63) || (byte >= 96 && byte <= 126))
        string << "\xFF" + [byte ^ 32].pack('C')
      elsif (byte >= 64 && byte <= 95)
        string << "\xFF" + [byte ^ 96].pack('C')
      else
        char = [byte].pack('C')
        possible = @@codepage_map_cache[mode]['data'][char]
        if possible.nil?
          raise TypeError, "codepage #{mode} does not provide an encoding for 0x#{char.unpack('H*')[0]}"
        end
        string << possible[ rand(possible.length) ]
      end
    }
    return string
  else
    raise TypeError, 'invalid utf type'
  end
end
