Module: MojiBake::EncodingSupport
- Included in:
- Mapper
- Defined in:
- lib/mojibake/encoding.rb
Overview
Mixin providing the actual (Ruby 1.9 backed) encoding support; it defines the mojibake mapping table and regex.
Constant Summary
- W252 =
Encoding::WINDOWS_1252
- ISO8 =
Encoding::ISO_8859_1
- UTF8 =
Encoding::UTF_8
- HIGH_ORDER_CHARS =
The 8-bit high-order characters assigned in Windows-1252, as UTF-8. This is a superset of the ISO-8859-1 high-order set and includes, in particular, punctuation characters like EM DASH and RIGHT DOUBLE QUOTATION MARK. These are the most common problem characters in English, and probably in most Latin-script languages.
( Array( 0x80..0xFF ) - [ 0x81, 0x8D, 0x8F, 0x90, 0x9D ] ).
  map { |i| i.chr( W252 ).encode( UTF8 ) }.
  sort
- INTEREST_CODEPOINTS =
Additional Unicode codepoints with mojibake potential, such as alternative whitespace, C1 control characters, and BOMs.
[ 0x0080..0x009F, # ISO/Unicode C1 control codes.
  0x00A0,         # NO-BREAK SPACE
  0x2000..0x200B, # EN QUAD ... ZERO WIDTH SPACE
  0x2060,         # WORD JOINER
  0xfeff,         # ZERO WIDTH SPACE, BYTE-ORDER-MARK (BOM)
  0xfffd,         # REPLACEMENT CHARACTER
  0xfffe ].       # UNASSIGNED, BAD BOM
  map { |i| Array( i ) }.
  flatten.
  sort
- INTEREST_CHARS =
INTEREST_CODEPOINTS.map { |c| c.chr( UTF8 ) }
- CANDIDATE_CHARS =
Mojibake candidate characters in reverse order, so that HIGH_ORDER_CHARS and the lowest codepoints are processed last and take precedence.
( HIGH_ORDER_CHARS + INTEREST_CHARS ).reverse
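As a rough illustration of how these constants are derived, the sketch below rebuilds the high-order set and part of the codepoint list outside the module. The names UNASSIGNED_W252, HIGH_ORDER_SKETCH and INTEREST_SKETCH are illustrative only and are not part of MojiBake, and only a subset of the INTEREST_CODEPOINTS entries is used.

```ruby
W252 = Encoding::WINDOWS_1252
UTF8 = Encoding::UTF_8

# Bytes with no character assigned in Windows-1252.
UNASSIGNED_W252 = [0x81, 0x8D, 0x8F, 0x90, 0x9D]

# Every assigned high-order byte, decoded as Windows-1252 and re-encoded as UTF-8.
HIGH_ORDER_SKETCH = ((0x80..0xFF).to_a - UNASSIGNED_W252)
                    .map { |i| i.chr(W252).encode(UTF8) }
                    .sort

# Mixed ranges and single codepoints, flattened into one sorted list.
# Array() turns a Range into its members and wraps a lone Integer.
INTEREST_SKETCH = [0x0080..0x009F, 0x00A0, 0x2000..0x200B]
                  .flat_map { |i| Array(i) }
                  .sort

puts HIGH_ORDER_SKETCH.size               # 128 high-order bytes minus 5 unassigned
puts HIGH_ORDER_SKETCH.include?("\u2014") # EM DASH (byte 0x97 in Windows-1252)
```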
Instance Attribute Summary
-
#map_iso_8859_1 ⇒ Object
Include ISO-8859-1 transcodes in map (default: true).
-
#map_permutations ⇒ Object
Include permutations between ISO-8859-1 and Windows-1252 (default: true).
-
#map_windows_1252 ⇒ Object
Include Windows-1252 transcodes in map (default: true).
Instance Method Summary
- #char_tree(seqs) ⇒ Object
-
#codepoints_hex(s) ⇒ Object
Unicode hex dump of codepoints.
-
#hash ⇒ Object
Return a Hash mapping mojibake UTF-8 sequences of 2-3 characters to the original (recovered) UTF-8 characters.
- #initialize ⇒ Object
- #regex_encode(c) ⇒ Object
-
#regexp ⇒ Object
A Regexp that will match any of the mojibake sequences, as found in hash.keys.
-
#table ⇒ Object
Return the hash formatted as a pretty-printed table (an array of lines).
- #tree_flatten(tree) ⇒ Object
Instance Attribute Details
#map_iso_8859_1 ⇒ Object
Include ISO-8859-1 transcodes in map (default: true)
# File 'lib/mojibake/encoding.rb', line 61
def map_iso_8859_1
  @map_iso_8859_1
end
#map_permutations ⇒ Object
Include permutations between ISO-8859-1 and Windows-1252 (default: true). This covers ambiguities of C1 control codes.
# File 'lib/mojibake/encoding.rb', line 65
def map_permutations
  @map_permutations
end
#map_windows_1252 ⇒ Object
Include Windows-1252 transcodes in map (default: true)
# File 'lib/mojibake/encoding.rb', line 58
def map_windows_1252
  @map_windows_1252
end
Instance Method Details
#char_tree(seqs) ⇒ Object
# File 'lib/mojibake/encoding.rb', line 125
def char_tree( seqs )
  seqs.inject( {} ) do |h,seq|
    seq.chars.inject( h ) do |hs,c|
      hs[c] ||= {}
    end
    h
  end
end
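The method appears to build a nested Hash (a character trie) from the given sequences. The sketch below is a standalone equivalent run on a toy input; char_tree_sketch is a name chosen for illustration and is not part of the module.

```ruby
# Build a nested Hash in which each level is keyed by the next character.
def char_tree_sketch(seqs)
  seqs.each_with_object({}) do |seq, root|
    # Walk (and create as needed) one node per character of the sequence.
    seq.chars.inject(root) { |node, c| node[c] ||= {} }
  end
end

tree = char_tree_sketch(%w[ab ac b])
p tree # => {"a"=>{"b"=>{}, "c"=>{}}, "b"=>{}}
```

Sequences that share a prefix share the corresponding branch, which is what allows the later flattening step to emit a compact pattern.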
#codepoints_hex(s) ⇒ Object
Unicode hex dump of codepoints
# File 'lib/mojibake/encoding.rb', line 159
def codepoints_hex( s )
  s.codepoints.map { |i| sprintf( "%04X", i ) }.join( ' ' )
end
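A small standalone example of the dump format, using the three-character mojibake form of EM DASH as input. The method is copied here under the name codepoints_hex_sketch so the snippet runs without the module.

```ruby
# Hex dump of each codepoint, zero-padded to at least four digits.
def codepoints_hex_sketch(s)
  s.codepoints.map { |i| format("%04X", i) }.join(" ")
end

puts codepoints_hex_sketch("\u00E2\u20AC\u201D") # prints "00E2 20AC 201D"
puts codepoints_hex_sketch("\u2014")             # prints "2014"
```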
#hash ⇒ Object
Return a Hash mapping mojibake UTF-8 sequences of 2-3 characters to the original (recovered) UTF-8 characters.
# File 'lib/mojibake/encoding.rb', line 76
def hash
  @hash ||= CANDIDATE_CHARS.inject( {} ) do |h,c|

    # Mis-interpret as ISO-8859-1, and encode back to UTF-8
    moji_8 = c.encode( UTF8, ISO8 )
    h[moji_8] = c if @map_iso_8859_1

    # Mis-interpret as Windows-1252, and encode back to UTF-8
    moji_w = c.encode( UTF8, W252, :undef => :replace )
    h[moji_w] = c if @map_windows_1252

    if @map_permutations
      # Also add permutations of unassigned Windows-1252 chars to
      # the 8bit equivalent.
      i = 0
      moji_w.each_codepoint do |cp|
        if cp == 0xFFFD
          moji_n = moji_w.dup
          moji_n[i] = moji_8[i]
          h[moji_n] = c
        end
        i += 1
      end
    end

    h
  end
end
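To make the three kinds of entries concrete, the sketch below applies the same transcoding steps to a single character and returns the variants separately. mojibake_variants is a hypothetical helper written for this illustration; the module itself stores the results directly in the hash.

```ruby
ISO8 = Encoding::ISO_8859_1
W252 = Encoding::WINDOWS_1252
UTF8 = Encoding::UTF_8

# Hypothetical helper: compute the mojibake forms of one UTF-8 character.
def mojibake_variants(c)
  # Treat the UTF-8 bytes as ISO-8859-1 and transcode to UTF-8.
  moji_8 = c.encode(UTF8, ISO8)
  # Same for Windows-1252; unassigned bytes become U+FFFD.
  moji_w = c.encode(UTF8, W252, undef: :replace)

  # For each U+FFFD, substitute the ISO-8859-1 (C1 control) character instead.
  perms = []
  moji_w.each_codepoint.with_index do |cp, i|
    next unless cp == 0xFFFD
    moji_n = moji_w.dup
    moji_n[i] = moji_8[i]
    perms << moji_n
  end

  { iso_8859_1: moji_8, windows_1252: moji_w, permutations: perms }
end

# EM QUAD (U+2001) is encoded as bytes E2 80 81; 0x81 is unassigned in Windows-1252.
v = mojibake_variants("\u2001")
puts v[:windows_1252].codepoints.map { |i| format("%04X", i) }.join(" ")
puts v[:permutations].first.codepoints.map { |i| format("%04X", i) }.join(" ")
```

In this example the permutation keeps the Windows-1252 reading of byte 0x80 (EURO SIGN) while falling back to the C1 control code for 0x81, which is the kind of mixed sequence that map_permutations adds.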
#initialize ⇒ Object
# File 'lib/mojibake/encoding.rb', line 67
def initialize
  super
  @map_windows_1252 = true
  @map_iso_8859_1 = true
  @map_permutations = true
end
#regex_encode(c) ⇒ Object
# File 'lib/mojibake/encoding.rb', line 163
def regex_encode( c )
  i = c.each_codepoint.next #only one
  if INTEREST_CODEPOINTS.include?( i )
    sprintf( '\u%04X', i )
  else
    Regexp.escape( c )
  end
end
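The effect seems to be that invisible or control characters are written into the pattern as \uXXXX escapes, keeping the regex source readable, while everything else is escaped literally. A minimal standalone sketch follows; INVISIBLE_SKETCH is a shortened stand-in for INTEREST_CODEPOINTS and regex_encode_sketch is an illustrative name.

```ruby
# Shortened stand-in for INTEREST_CODEPOINTS (assumption for this example).
INVISIBLE_SKETCH = [0x00A0, 0x200B, 0xFEFF]

def regex_encode_sketch(c)
  cp = c.ord # the input is expected to be a single character
  if INVISIBLE_SKETCH.include?(cp)
    format('\u%04X', cp) # single quotes: a literal backslash-u escape
  else
    Regexp.escape(c)
  end
end

puts regex_encode_sketch("\u00A0") # prints \u00A0
puts regex_encode_sketch("(")      # prints \(

# The escaped form still matches the real character once compiled.
nbsp = Regexp.new(regex_encode_sketch("\u00A0"))
puts nbsp.match?("a\u00A0b")
```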
#regexp ⇒ Object
A Regexp that will match any of the mojibake sequences, as found in hash.keys.
# File 'lib/mojibake/encoding.rb', line 121
def regexp
  @regexp ||= Regexp.new( tree_flatten( char_tree( hash.keys ) ) )
end
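One plausible use of the regexp and hash pair is a single-pass substitution, sketched below with a two-entry table. MOJI_SKETCH, MOJI_RE_SKETCH and recover_sketch are hypothetical names, and Regexp.union is used here in place of the module's trie-based pattern purely to keep the example short.

```ruby
# Two hand-picked entries of the kind the mapping contains:
# mojibake sequence => original character.
MOJI_SKETCH = {
  "\u00E2\u20AC\u201D" => "\u2014", # EM DASH read as Windows-1252
  "\u00C3\u00A9"       => "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE
}.freeze

# Simple alternation of the keys (not the optimized pattern the module builds).
MOJI_RE_SKETCH = Regexp.union(MOJI_SKETCH.keys)

# Replace every matched mojibake sequence with its recovered character.
def recover_sketch(text)
  text.gsub(MOJI_RE_SKETCH) { |m| MOJI_SKETCH[m] }
end

fixed = recover_sketch("caf\u00C3\u00A9 \u00E2\u20AC\u201D ok")
puts fixed == "caf\u00E9 \u2014 ok"
```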
#table ⇒ Object
Return the hash formatted as a pretty-printed table (an array of lines).
# File 'lib/mojibake/encoding.rb', line 106
def table
  lines = [ "# -*- coding: utf-8 -*- mojibake: #{MojiBake::VERSION}" ]
  lines << regexp.inspect
  lines << ""
  lines << "Moji\tUNICODE \tOrg\tCODE"
  lines << "+----\t---- ---- ----\t-----\t---+"
  lines += hash.sort.map do |moji,c|
    "[%s]\t%s\t[%s]\t%s" %
      [ moji, codepoints_hex( moji ), c, codepoints_hex( c ) ]
  end
  lines
end
#tree_flatten(tree) ⇒ Object
# File 'lib/mojibake/encoding.rb', line 134
def tree_flatten( tree )
  cs = tree.sort.map do |k,v|
    o = regex_encode( k )
    unless v.empty?
      c = tree_flatten( v )
      o << if c =~ /^\[.*\]$/ || v.length == 1
             c
           else
             '(' + c + ')'
           end
    end
    o
  end
  if cs.find { |o| o =~ /[()|\[\]]/ }
    cs.join( '|' )
  else
    if cs.length > 1
      '[' + cs.inject(:+) + ']'
    else
      cs.first
    end
  end
end
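The flattening step can be tried on a small hand-written tree. The sketch below is a simplified standalone variant: it uses Regexp.escape directly instead of regex_encode and builds new strings rather than appending in place. tree_flatten_sketch is an illustrative name, not part of the module.

```ruby
# Turn a nested character Hash into a regex source string.
def tree_flatten_sketch(tree)
  parts = tree.sort.map do |k, sub|
    head = Regexp.escape(k)
    next head if sub.empty?

    tail = tree_flatten_sketch(sub)
    # A lone character class or a single child needs no grouping parentheses.
    bare = tail =~ /^\[.*\]$/ || sub.length == 1
    head + (bare ? tail : "(#{tail})")
  end

  if parts.any? { |part| part =~ /[()|\[\]]/ }
    parts.join("|")      # at least one complex branch: use alternation
  elsif parts.length > 1
    "[#{parts.join}]"    # only single characters: use a character class
  else
    parts.first
  end
end

tree = { "a" => { "b" => {}, "c" => {} }, "b" => {} }
pattern = tree_flatten_sketch(tree)
puts pattern # prints a[bc]|b
puts Regexp.new(pattern).match?("xacx")
```

Sibling leaves collapse into a character class and shared prefixes are emitted once, so the resulting pattern stays considerably shorter than a plain alternation of every sequence.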