Module: MojiBake::EncodingSupport

Included in: Mapper
Defined in: lib/mojibake/encoding.rb

Overview

Mixin providing the actual (Ruby 1.9 backed) encoding support: it defines the mojibake mapping table and regex.

Constant Summary

W252 = Encoding::WINDOWS_1252
ISO8 = Encoding::ISO_8859_1
UTF8 = Encoding::UTF_8
HIGH_ORDER_CHARS =

The 8-bit high-order characters assigned in Windows-1252, as UTF-8. This is a superset of the ISO-8859-1 printable high-order set (0xA0..0xFF); in particular, it adds punctuation characters such as EM DASH and RIGHT DOUBLE QUOTATION MARK. These are the most common problem characters in English, and probably in most Latin-script languages.

( Array( 0x80..0xFF ) - [ 0x81, 0x8D, 0x8F, 0x90, 0x9D ] ).
  map { |i| i.chr( W252 ).encode( UTF8 ) }.
  sort
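
As a quick check of the definition above, the following sketch rebuilds the same set using only Ruby's built-in encodings. The constant names `UNASSIGNED` and `HIGH_ORDER` are illustrative and not part of MojiBake.

```ruby
# Byte values left unassigned by Windows-1252 (the list excluded above).
UNASSIGNED = [ 0x81, 0x8D, 0x8F, 0x90, 0x9D ]

# Decode each remaining high-order byte as Windows-1252, re-encode as UTF-8.
HIGH_ORDER = ( Array( 0x80..0xFF ) - UNASSIGNED ).
  map { |i| i.chr( Encoding::WINDOWS_1252 ).encode( Encoding::UTF_8 ) }.
  sort

puts HIGH_ORDER.size                  # 128 high-order bytes minus the 5 unassigned
puts HIGH_ORDER.include?( "\u2014" )  # EM DASH (Windows-1252 byte 0x97)
puts HIGH_ORDER.include?( "\u201D" )  # RIGHT DOUBLE QUOTATION MARK (byte 0x94)
```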
INTEREST_CODEPOINTS =

Additional Unicode codepoints with mojibake potential, such as alternative whitespace, C1 control characters, and BOMs.

[ 0x0080..0x009F, # ISO/Unicode C1 control codes.
  0x00A0,         # NO-BREAK SPACE
  0x2000..0x200B, # EN QUAD ... ZERO WIDTH SPACE
  0x2060,         # WORD JOINER
  0xfeff,         # ZERO WIDTH NO-BREAK SPACE, BYTE-ORDER-MARK (BOM)
  0xfffd,         # REPLACEMENT CHARACTER
  0xfffe ].       # NONCHARACTER, BAD (byte-swapped) BOM
  map { |i| Array( i ) }.
  flatten.
  sort
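
The `Array( i )` call is what lets ranges and single codepoints share one list. A minimal sketch of that expansion follows, with the illustrative names `SPEC` and `CODEPOINTS` standing in for the constant above.

```ruby
# A mix of ranges and single codepoints, mirroring INTEREST_CODEPOINTS.
SPEC = [ 0x0080..0x009F, 0x00A0, 0x2000..0x200B,
         0x2060, 0xFEFF, 0xFFFD, 0xFFFE ]

# Array( range ) expands to the range's members; Array( integer ) wraps
# the integer as a one-element array. flatten then merges them all.
CODEPOINTS = SPEC.map { |i| Array( i ) }.flatten.sort

puts CODEPOINTS.size                  # 32 + 1 + 12 + 4 single codepoints
puts "%04X" % CODEPOINTS.first
puts "%04X" % CODEPOINTS.last
```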
INTEREST_CHARS = INTEREST_CODEPOINTS.map { |c| c.chr( UTF8 ) }
CANDIDATE_CHARS =

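Mojibake candidate characters, in reverse order. Because #hash assigns entries in iteration order and a later assignment to the same key overwrites an earlier one, reversing the list gives HIGH_ORDER_CHARS and the lowest codepoints the highest precedence.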

( HIGH_ORDER_CHARS + INTEREST_CHARS ).reverse

Instance Attribute Details

#map_iso_8859_1 ⇒ Object

Include ISO-8859-1 transcodes in map (default: true)



# File 'lib/mojibake/encoding.rb', line 61

def map_iso_8859_1
  @map_iso_8859_1
end

#map_permutations ⇒ Object

Include permutations between ISO-8859-1 and Windows-1252 (default: true). This covers ambiguities of C1 control codes.



# File 'lib/mojibake/encoding.rb', line 65

def map_permutations
  @map_permutations
end

#map_windows_1252 ⇒ Object

Include Windows-1252 transcodes in map (default: true)



# File 'lib/mojibake/encoding.rb', line 58

def map_windows_1252
  @map_windows_1252
end

Instance Method Details

#char_tree(seqs) ⇒ Object



# File 'lib/mojibake/encoding.rb', line 125

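def char_tree( seqs )
  # Build a nested Hash (trie) keyed by character: each sequence adds
  # one path from the root, and sequences sharing a prefix share nodes.
  seqs.inject( {} ) do |h,seq|
    seq.chars.inject( h ) do |hs,c|
      hs[c] ||= {}   # descend into, or create, the child node for c
    end
    h
  end
end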
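
To illustrate the shape of the result, here is a standalone copy of the method applied to three short, made-up sequences. The input strings are illustrative only; real input comes from `hash.keys`.

```ruby
# Standalone copy of char_tree, so the example runs outside the mixin.
def char_tree( seqs )
  seqs.inject( {} ) do |h, seq|
    seq.chars.inject( h ) do |hs, c|
      hs[c] ||= {}
    end
    h
  end
end

# "ab" and "ac" share the prefix "a", so they share the "a" node.
tree = char_tree( %w[ ab ac b ] )
p tree  # nested Hash: "a" has children "b" and "c"; "b" is a leaf
```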

#codepoints_hex(s) ⇒ Object

Returns a Unicode hex dump of the codepoints in s: uppercase hex, zero-padded to at least four digits, separated by spaces.



# File 'lib/mojibake/encoding.rb', line 159

def codepoints_hex( s )
  s.codepoints.map { |i| sprintf( "%04X", i ) }.join( ' ' )
end
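
A short usage sketch, using a standalone copy of the method. The input is the three-character sequence produced when EM DASH is misread as Windows-1252, written with `\u` escapes so the example does not depend on source file encoding.

```ruby
# Standalone copy of codepoints_hex.
def codepoints_hex( s )
  s.codepoints.map { |i| sprintf( "%04X", i ) }.join( ' ' )
end

puts codepoints_hex( "\u00E2\u20AC\u201D" )  # => 00E2 20AC 201D
puts codepoints_hex( "\u2014" )              # => 2014
```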

#hash ⇒ Object

Returns a Hash mapping mojibake sequences (2-3 UTF-8 characters each) to the original (recovered) UTF-8 characters.



# File 'lib/mojibake/encoding.rb', line 76

def hash
  @hash ||= CANDIDATE_CHARS.inject( {} ) do |h,c|

    # Mis-interpret as ISO-8859-1, and encode back to UTF-8
    moji_8 = c.encode( UTF8, ISO8 )
    h[moji_8] = c if @map_iso_8859_1

    # Mis-interpret as Windows-1252, and encode back to UTF-8
    moji_w = c.encode( UTF8, W252, :undef => :replace )
    h[moji_w] = c if @map_windows_1252

    if @map_permutations
      # Also add permutations of unassigned Windows-1252 chars to
      # the 8bit equivalent.
      i = 0
      moji_w.each_codepoint do |cp|
        if cp == 0xFFFD
          moji_n = moji_w.dup
          moji_n[i] = moji_8[i]
          h[moji_n] = c
        end
        i += 1
      end
    end

    h
  end
end
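
The three kinds of entry can be seen on individual characters. The sketch below uses a hypothetical helper, `mojibake_forms`, which is not part of MojiBake; it repeats the per-character steps of the method above and returns the results instead of storing them.

```ruby
UTF8 = Encoding::UTF_8
W252 = Encoding::WINDOWS_1252
ISO8 = Encoding::ISO_8859_1

# Hypothetical helper: the mojibake forms of one UTF-8 character.
def mojibake_forms( c )
  moji_8 = c.encode( UTF8, ISO8 )                      # bytes read as ISO-8859-1
  moji_w = c.encode( UTF8, W252, :undef => :replace )  # bytes read as Windows-1252
  perms = []
  moji_w.each_char.with_index do |ch, i|
    next unless ch == "\uFFFD"     # byte unassigned in Windows-1252
    moji_n = moji_w.dup
    moji_n[i] = moji_8[i]          # substitute the ISO-8859-1 reading
    perms << moji_n
  end
  { :iso => moji_8, :w1252 => moji_w, :perms => perms }
end

hex = lambda { |s| s.codepoints.map { |i| "%04X" % i }.join( ' ' ) }

# EM DASH (UTF-8 bytes E2 80 94): every byte is assigned in Windows-1252.
dash = mojibake_forms( "\u2014" )
puts hex.call( dash[:w1252] )        # => 00E2 20AC 201D
puts hex.call( dash[:iso] )          # => 00E2 0080 0094

# RIGHT DOUBLE QUOTATION MARK (E2 80 9D): 0x9D is unassigned in
# Windows-1252, so one permutation mixes the two readings.
quote = mojibake_forms( "\u201D" )
puts hex.call( quote[:w1252] )       # => 00E2 20AC FFFD
puts hex.call( quote[:perms].first ) # => 00E2 20AC 009D
```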

#initialize ⇒ Object



# File 'lib/mojibake/encoding.rb', line 67

def initialize
  super
  @map_windows_1252 = true
  @map_iso_8859_1   = true
  @map_permutations = true
end

#regex_encode(c) ⇒ Object



# File 'lib/mojibake/encoding.rb', line 163

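def regex_encode( c )
  i = c.each_codepoint.next # c holds exactly one character
  if INTEREST_CODEPOINTS.include?( i )
    # Emit codepoints of interest as a \uXXXX escape instead of the
    # raw character.
    sprintf( '\u%04X', i )
  else
    Regexp.escape( c )
  end
end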
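
A small sketch of the two branches, using a standalone copy of the method. `INTEREST` is a reduced, illustrative stand-in for INTEREST_CODEPOINTS.

```ruby
# Reduced stand-in for INTEREST_CODEPOINTS (illustrative only).
INTEREST = [ 0x00A0, 0xFEFF ]

# Standalone copy of regex_encode, reading from INTEREST.
def regex_encode( c )
  i = c.each_codepoint.next
  if INTEREST.include?( i )
    sprintf( '\u%04X', i )   # six literal characters: backslash, u, hex
  else
    Regexp.escape( c )
  end
end

puts regex_encode( "\u00A0" )  # => \u00A0
puts regex_encode( "." )       # => \.
puts regex_encode( "a" )       # => a

# The escape form still matches the raw character once compiled.
puts Regexp.new( regex_encode( "\u00A0" ) ) =~ "x\u00A0y"  # => 1
```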

#regexp ⇒ Object

A Regexp that will match any of the mojibake sequences, as found in hash.keys.



# File 'lib/mojibake/encoding.rb', line 121

def regexp
  @regexp ||= Regexp.new( tree_flatten( char_tree( hash.keys ) ) )
end
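
One plausible use of such a regex together with the map is a single `gsub` pass that replaces each matched sequence with its recovered character. The sketch below is an assumption about usage, not the library's documented API: `MOJI_MAP`, `MOJI_REGEX`, and `recover` are hypothetical names, the map holds two hand-written entries, and `Regexp.union` stands in for the tree-based regex construction.

```ruby
# Two hand-written entries of the kind #hash produces (illustrative).
MOJI_MAP = {
  "\u00E2\u20AC\u201D" => "\u2014",  # EM DASH misread as Windows-1252
  "\u00E2\u20AC\u009D" => "\u201D"   # RIGHT DOUBLE QUOTATION MARK, mixed reading
}

# Regexp.union is a swapped-in shortcut for matching any of the keys.
MOJI_REGEX = Regexp.union( MOJI_MAP.keys )

# Hypothetical helper: replace each mojibake sequence with its original.
def recover( text )
  text.gsub( MOJI_REGEX ) { |m| MOJI_MAP[m] }
end

fixed = recover( "Yes \u00E2\u20AC\u201D no" )
puts fixed == "Yes \u2014 no"  # => true
```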

#table ⇒ Object

Returns the hash formatted as a human-readable table (an Array of lines).



# File 'lib/mojibake/encoding.rb', line 106

def table
  lines = [ "# -*- coding: utf-8 -*- mojibake: #{MojiBake::VERSION}" ]
  lines << regexp.inspect
  lines << ""
  lines << "Moji\tUNICODE  \tOrg\tCODE"
  lines << "+----\t---- ---- ----\t-----\t---+"
  lines += hash.sort.map do |moji,c|
    "[%s]\t%s\t[%s]\t%s" %
      [ moji, codepoints_hex( moji ), c, codepoints_hex( c ) ]
  end
  lines
end

#tree_flatten(tree) ⇒ Object



# File 'lib/mojibake/encoding.rb', line 134

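def tree_flatten( tree )
  # Render each branch as its encoded character followed by the
  # flattened pattern of its subtree, if any.
  cs = tree.sort.map do |k,v|
    o = regex_encode( k )
    unless v.empty?
      c = tree_flatten( v )
      # A subtree that renders as a bracketed class, or that has a
      # single child, is appended without a group.
      o << if c =~ /^\[.*\]$/ || v.length == 1
             c
           else
             '(' + c + ')'
           end
    end
    o
  end
  # Join with '|' once any branch contains group or class syntax;
  # otherwise fold the branches into one character class. The class
  # form assumes each branch here is a single encoded character.
  if cs.find { |o| o =~ /[()|\[\]]/ }
    cs.join( '|' )
  else
    if cs.length > 1
      '[' + cs.inject(:+) + ']'
    else
      cs.first
    end
  end
end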
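
For a small trie, the flattening can be traced by hand. The sketch below uses a standalone copy of the method with a simplified `regex_encode` that only calls `Regexp.escape` (an assumption made to keep the example short), applied to a made-up tree of the shape `char_tree` returns.

```ruby
# Simplified stand-in: the real regex_encode also emits \uXXXX escapes.
def regex_encode( c )
  Regexp.escape( c )
end

# Standalone copy of tree_flatten.
def tree_flatten( tree )
  cs = tree.sort.map do |k, v|
    o = regex_encode( k )
    unless v.empty?
      c = tree_flatten( v )
      o << if c =~ /^\[.*\]$/ || v.length == 1
             c
           else
             '(' + c + ')'
           end
    end
    o
  end
  if cs.find { |o| o =~ /[()|\[\]]/ }
    cs.join( '|' )
  else
    cs.length > 1 ? '[' + cs.inject(:+) + ']' : cs.first
  end
end

# Trie for the sequences "ab", "ac" and "b".
tree = { "a" => { "b" => {}, "c" => {} }, "b" => {} }

source = tree_flatten( tree )
puts source                                    # => a[bc]|b
puts Regexp.new( source ).match( "xacx" )[0]   # => ac
```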