Method: Rex::Text.to_unicode

Defined in:
lib/rex/text.rb

.to_unicode(str = '', type = 'utf-16le', mode = '', size = '') ⇒ Object

Converts standard ASCII text to a unicode string.

Supported unicode types include: utf-16le, utf-16be, utf-32le, utf-32be, utf-7, and utf-8
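The fixed-width types can be illustrated without Rex itself. The following is a minimal sketch of the unpack/pack idiom, assuming the input contains only single-byte characters; the variable names are illustrative and not part of the library.

```ruby
# Widen each input byte to one 16- or 32-bit code unit.
# Assumption: the input holds only single-byte characters (ASCII / Latin-1).
ascii = 'AB'
code_points = ascii.unpack('C*')   # one integer per input byte

utf16le = code_points.pack('v*')   # 16-bit little-endian units
utf16be = code_points.pack('n*')   # 16-bit big-endian units
utf32le = code_points.pack('V*')   # 32-bit little-endian units
utf32be = code_points.pack('N*')   # 32-bit big-endian units

puts utf16le.unpack('H*')[0]  # 41004200
puts utf16be.unpack('H*')[0]  # 00410042
puts utf32le.unpack('H*')[0]  # 4100000042000000
puts utf32be.unpack('H*')[0]  # 0000004100000042
```

Because every byte is widened independently, multi-byte input (for example a UTF-8 string with non-ASCII characters) would not come out as correct UTF-16 or UTF-32 with this approach.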

The ‘mode’ argument gives the encoder a hint as to how it should encode the string.

Only UTF-7 and UTF-8 use “mode”.

By default, utf-7 does not encode alphanumeric and a few other characters. Specifying the mode “all” encodes every character, not just the non-alphanumeric set: to_unicode(str, ‘utf-7’, ‘all’)
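As a rough illustration of what “all” mode produces, the sketch below encodes every character as its own UTF-7 shifted sequence: the character is widened to UTF-16BE, base64-encoded with the padding removed, and wrapped in “+” and “-”. The helper names utf7_encode_char and utf7_encode_all are hypothetical, and the sketch assumes ASCII-only input.

```ruby
# Encode one ASCII character as a UTF-7 shifted sequence.
# Hypothetical helper, not part of Rex::Text.
def utf7_encode_char(char)
  return '+-' if char == '+'                 # UTF-7 spells a literal plus as "+-"
  utf16be = char.unpack('C*').pack('n*')     # widen the byte to UTF-16BE
  b64 = [utf16be].pack('m0').delete('=')     # base64 without newlines or padding
  "+#{b64}-"
end

# Encode every character, including alphanumerics.
def utf7_encode_all(str)
  str.each_char.map { |c| utf7_encode_char(c) }.join
end

puts utf7_encode_all('a')   # +AGE-
puts utf7_encode_all('ab')  # +AGE-+AGI-
```

A UTF-7 aware parser should decode “+AGE-” back to “a”, while a byte-level signature looking for “a” would not see it.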

utf-8 specifies that alphanumeric characters are used directly, eg “a” is just “a”. However, there exist 6 different overlong encodings of “a” that are technically not valid, but parse just fine in most utf-8 parsers. (0xC1A1, 0xE081A1, 0xF08081A1, 0xF8808081A1, 0xFC80808081A1, 0xFE8080808081A1). The number of bytes to use for the overlong encoding is specified with ‘size’. to_unicode(str, ‘utf-8’, ‘overlong’, 2)
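The six overlong forms listed above follow mechanically from the UTF-8 bit layout. The sketch below is an independent illustration using integer arithmetic rather than the bit-array approach in the method source; overlong_utf8 is a hypothetical name, and it assumes a 7-bit input byte.

```ruby
# Build an overlong UTF-8 encoding of a 7-bit byte using `size` bytes (2..7).
# Hypothetical helper, not part of Rex::Text.
def overlong_utf8(byte, size)
  raise ArgumentError, 'size must be 2..7' unless (2..7).cover?(size)
  raise ArgumentError, 'byte must be 0..0x7f' unless (0..0x7f).cover?(byte)

  lead_marker = (0xff << (8 - size)) & 0xff   # 0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE
  value = byte
  bytes = []
  (size - 1).times do
    bytes.unshift(0x80 | (value & 0x3f))      # continuation byte: 10xxxxxx
    value >>= 6
  end
  bytes.unshift(lead_marker | value)          # lead byte carries any remaining bits
  bytes.pack('C*')
end

(2..7).each do |size|
  puts overlong_utf8('a'.ord, size).unpack('H*')[0]
end
```

For “a” this prints c1a1, e081a1, f08081a1, f8808081a1, fc80808081a1 and fe8080808081a1, matching the values given above.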

Many utf-8 parsers also allow invalid overlong encodings, where bits that are unused when encoding a single byte are modified. Many parsers will ignore these bits, making simple string matching ineffective for dealing with UTF-8 strings. There are many more invalid overlong encodings possible for “a”. For example, three encodings are available for an invalid 2 byte encoding of “a”. (0xC1E1 0xC161 0xC121).
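To see why those three sequences can still come out as “a”, consider a parser that only masks out the payload bits and never checks the marker bits of the second byte. The decoder below is a hypothetical stand-in for such a parser, not any particular implementation.

```ruby
# Hypothetical lenient decoder for a 2-byte sequence: it keeps the low 5 bits
# of the first byte and the low 6 bits of the second, and never verifies that
# the second byte starts with the bits "10".
def lenient_decode_2byte(seq)
  lead, cont = seq.unpack('C2')
  ((lead & 0x1f) << 6) | (cont & 0x3f)
end

overlong = [0xC1, 0xA1].pack('C2')
invalid  = [0xE1, 0x61, 0x21].map { |cont| [0xC1, cont].pack('C2') }

([overlong] + invalid).each do |seq|
  strict_ok = seq.dup.force_encoding('UTF-8').valid_encoding?
  printf("%s lenient=%s strict_valid=%s\n",
         seq.unpack('H*')[0], lenient_decode_2byte(seq).chr, strict_ok)
end
```

All four sequences decode to “a” under the lenient rule, while Ruby's own UTF-8 validation rejects each of them.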

By specifying “invalid”, a random invalid encoding is chosen for the given byte size. to_unicode(str, ‘utf-8’, ‘invalid’, 2)

utf-7 defaults to ‘normal’ utf-7 encoding.

utf-8 defaults to 2 byte ‘normal’ encoding.



# File 'lib/rex/text.rb', line 592

def self.to_unicode(str='', type = 'utf-16le', mode = '', size = '')
  return '' if not str
  case type
  when 'utf-16le'
    return str.unpack('C*').pack('v*')
  when 'utf-16be'
    return str.unpack('C*').pack('n*')
  when 'utf-32le'
    return str.unpack('C*').pack('V*')
  when 'utf-32be'
    return str.unpack('C*').pack('N*')
  when 'utf-7'
    case mode
    when 'all'
      return str.gsub(/./){ |a|
        out = ''
        if a != '+'
          out = encode_base64(to_unicode(a, 'utf-16be')).gsub(/[=\r\n]/, '')
        end
        '+' + out + '-'
      }
    else
      return str.gsub(/[^\n\r\t\ A-Za-z0-9\'\(\),-.\/\:\?]/){ |a|
        out = ''
        if a != '+'
          out = encode_base64(to_unicode(a, 'utf-16be')).gsub(/[=\r\n]/, '')
        end
        '+' + out + '-'
      }
    end
  when 'utf-8'
    if size == ''
      size = 2
    end

    if size >= 2 and size <= 7
      string = ''
      str.each_byte { |a|
        if (a < 21 || a > 0x7f) || mode != ''
          # ugh. turn a single byte into the binary representation of it, in array form
          bin = [a].pack('C').unpack('B8')[0].split(//)

          # even more ugh. convert each '0'/'1' character into an integer bit
          bin.collect!{|a_| a_.to_i}

          out = Array.new(8 * size, 0)

          0.upto(size - 1) { |i|
            out[i] = 1
            out[i * 8] = 1
          }

          i = 0
          byte = 0
          bin.reverse.each { |bit|
            if i < 6
              mod = (((size * 8) - 1) - byte * 8) - i
              out[mod] = bit
            else
              byte = byte + 1
              i = 0
              redo
            end
            i = i + 1
          }

          if mode != ''
            case mode
            when 'overlong'
              # do nothing, since we already handle this as above...
            when 'invalid'
              done = 0
              while done == 0
                # the ghetto...
                bits = [7, 8, 15, 16, 23, 24, 31, 32, 41]
                bits.each { |bit|
                  bit = (size * 8) - bit
                  if bit > 1
                    set = rand(2)
                    if out[bit] != set
                      out[bit] = set
                      done = 1
                    end
                  end
                }
              end
            else
              raise TypeError, 'Invalid mode.  Only "overlong" and "invalid" are acceptable modes for utf-8'
            end
          end
          string << [out.join('')].pack('B*')
        else
          string << [a].pack('C')
        end
      }
      return string
    else
      raise TypeError, 'invalid utf-8 size'
    end
  when 'uhwtfms' # suggested name from HD :P
    load_codepage()

    string = ''
    # overloading mode as codepage
    if mode == ''
      mode = 1252 # ANSI - Latin 1, default for US installs of MS products
    else
      mode = mode.to_i
    end
    if @@codepage_map_cache[mode].nil?
      raise TypeError, "Invalid codepage #{mode}"
    end
    str.each_byte {|byte|
      char = [byte].pack('C*')
      possible = @@codepage_map_cache[mode]['data'][char]
      if possible.nil?
        raise TypeError, "codepage #{mode} does not provide an encoding for 0x#{char.unpack('H*')[0]}"
      end
      string << possible[ rand(possible.length) ]
    }
    return string
  when 'uhwtfms-half' # suggested name from HD :P
    load_codepage()
    string = ''
    # overloading mode as codepage
    if mode == ''
      mode = 1252 # ANSI - Latin 1, default for US installs of MS products
    else
      mode = mode.to_i
    end
    if mode != 1252
      raise TypeError, "Invalid codepage #{mode}, only 1252 supported for uhwtfms-half"
    end
    str.each_byte {|byte|
      if ((byte >= 33 && byte <= 63) || (byte >= 96 && byte <= 126))
        string << "\xFF" + [byte ^ 32].pack('C')
      elsif (byte >= 64 && byte <= 95)
        string << "\xFF" + [byte ^ 96].pack('C')
      else
        char = [byte].pack('C')
        possible = @@codepage_map_cache[mode]['data'][char]
        if possible.nil?
          raise TypeError, "codepage #{mode} does not provide an encoding for 0x#{char.unpack('H*')[0]}"
        end
        string << possible[ rand(possible.length) ]
      end
    }
    return string
  else
    raise TypeError, 'invalid utf type'
  end
end
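The XOR arithmetic in the 'uhwtfms-half' branch appears to map printable ASCII onto the Unicode fullwidth forms (U+FF01 through U+FF5E), written as big-endian 16-bit units. The sketch below reproduces only that arithmetic; to_fullwidth_be is a hypothetical name, and bytes outside 33..126 raise an error here instead of going through the codepage lookup, which is a simplification.

```ruby
# Map printable ASCII (33..126) to fullwidth forms as UTF-16BE byte pairs.
# Hypothetical helper; it skips the codepage lookup used by the real method.
def to_fullwidth_be(str)
  out = ''.b
  str.each_byte do |byte|
    if (33..63).cover?(byte) || (96..126).cover?(byte)
      out << [0xFF, byte ^ 32].pack('C2')
    elsif (64..95).cover?(byte)
      out << [0xFF, byte ^ 96].pack('C2')
    else
      raise ArgumentError, format('no fullwidth form for 0x%02x in this sketch', byte)
    end
  end
  out
end

encoded = to_fullwidth_be('Aa!')
puts encoded.unpack('H*')[0]                                # ff21ff41ff01
puts encoded.dup.force_encoding('UTF-16BE').encode('UTF-8') # fullwidth "Aa!"
```

In both branches the XOR works out to subtracting 0x20 from the byte, so each result is U+FF00 plus (byte - 0x20).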