Class: Encoding
- Inherits: Object
- Defined in: lib/distorted/monkey_business/encoding.rb
Constant Summary
- CODE_PAGE_ENCODING_NAME =
Defines a Regexp to match and extract Ruby’s built-in numeric codepage IDs from their Encodings’ names.
It uses IGNORECASE to handle the duplicate differing-capitalization constants, e.g. Encoding::WINDOWS_31J and Encoding::Windows_31J both exist and are equivalent.
Worth mentioning since this file deals with Encoding: the Regexp itself also has an internal Encoding that I could change if I had any reason to (I don’t): ruby-doc.org/core/Regexp.html#class-Regexp-label-Encoding
Regexp.new('^(CP|IBM|Windows[-_])(?<code_page>\d{3,}$)', Regexp::IGNORECASE)
- ADDITIONAL_ENCODING_CODE_PAGE_IDS =
Data sources:
www.aivosto.com/articles/charsets-codepages.html
developer.apple.com/documentation/coreservices/1400434-ms-dos_and_windows_text_encodings
docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers
en.wikipedia.org/wiki/CCSID
github.com/SheetJS/js-codepage/blob/master/codepage.md
{
  # Burgerland :911:
  Encoding::US_ASCII => 20127,

  # Unicode
  Encoding::UTF_16LE => 1200,
  Encoding::UTF_16BE => 1201,
  Encoding::UTF_32LE => 12000,
  Encoding::UTF_32BE => 12001,

  ## 245
  #
  # Code Page 932 is Windows-31J, but I want to provide fallback mapping
  # between 932 and Shift_JIS to handle detected-text or `encoding` arguments
  # that return Shift_JIS since that naming is much much more well-known than 31J.
  Encoding::SHIFT_JIS => 932,
  # https://referencesource.microsoft.com/#mscorlib/system/text/eucjpencoding.cs
  # https://www.redmine.org/issues/29442
  # https://www.sljfaq.org/afaq/encodings.html
  # https://uic.jp/charset/
  # http://www.monyo.com/technical/samba/docs/Japanese-HOWTO-3.0.en.txt
  Encoding::EUC_JP_MS => 20932,
  Encoding::EUC_JP => 51932,
  # Encoding:EUC-JIS-2004 dunno
  #
  # https://www.debian.org/doc/manuals/intro-i18n/ch-coding.en.html 3.2: Stateless and Stateful
  # TL;DR: Stateful uses an escape sequence to switch charset;
  # Stateless have all-unique codepoints.
  # Normal ISO-2022-JP is stateful.
  # "For example, in ISO 2022-JP, two bytes of 0x24 0x2c may mean a Japanese Hiragana character 'が'
  # or two ASCII character of '$' and ',' according to the shift state."
  # Encoding::STATELESS_ISO_2022_JP
  #
  # Mobile operator specific encodings that I have no numeric IDs for rn:
  # Encoding:UTF8-DoCoMo
  # Encoding:SJIS-DoCoMo
  # Encoding:UTF8-KDDI
  # Encoding:SJIS-KDDI
  # Encoding:stateless-ISO-2022-JP-KDDI
  # Encoding:UTF8-SoftBank
  # Encoding:SJIS-SoftBank

  ## CHY-NAH
  #
  # https://en.wikipedia.org/wiki/Code_page_903
  Encoding::GB1988 => 903,
  #
  ## Hong Kong Supplementary Character Set
  # The Windows version of this seems to be the built-in CP951:
  # https://web.archive.org/web/20160402215421/https://blogs.msdn.microsoft.com/shawnste/2007/03/12/cp-951-hkscs/
  # https://web.archive.org/web/20141129233053/http://www-01.ibm.com/software/globalization/ccsid/ccsid5471.html
  Encoding::BIG5_HKSCS => 5417,
  #
  # The 936 postfix is a reference to the standard Windows Chinese encoding being CP936 / GBK.
  # "GB2312 is the registered internet name for EUC-CN, which is its usual encoded form."
  Encoding::GB2312 => 20936,
  Encoding::GB12345 => 51936,
  #Encoding:GB2312_HZ => 52936, # Doesn't exist in Ruby
  Encoding::GB18030 => 54936,

  ## Asia At Odd Hours
  #
  # I always wondered if the "Gravitational Pull of Pepsi" logo came from
  # them wanting it to look less like the Korean flag.
  # The traditional Korean Windows Code Page is CP949, available in Ruby
  # but not under any other name aliases.
  # IBM uses CP1363, not in Ruby.
  Encoding::EUC_KR => 51949,
  #
  # ROC me now
  Encoding::EUC_TW => 51950,
  # Unicode 補完計畫 / Unicode-At-On is a Big5 variant once popular in Taiwan:
  # https://lists.gnu.org/archive/html/bug-gnu-libiconv/2010-11/msg00007.html
  # https://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0061.html
  # Encoding::BIG5_UAO
  #
  # CP950 (available in Ruby) is the code page used on Windows under the name "big5",
  # but I want to map the generic Big5 Encoding to 950 as well to handle
  # detected and specified encodings by that name.
  # "The major difference between Windows code page 950 and "common" (non-vendor-specific) Big5
  # is the incorporation of a subset of the ETEN extensions to Big5 at 0xF9D6 through 0xF9FE
  # (comprising the seven Chinese characters 碁, 銹, 裏, 墻, 恒, 粧, and 嫺,
  # followed by 34 box drawing characters and block elements)."
  Encoding::Big5 => 950,
  #
  # Encoding::TIS_620 is the base Thai 8-bit encoding standard that is apparently
  # never actually used in the wild.
  # ISO-8859-11 is identical to it with the sole exception "that ISO/IEC 8859-11
  # allocates non-breaking space to code 0xA0, while TIS-620 leaves it undefined."
  # "The Microsoft Windows code page 874 as well as the code page used in the
  # Thai version of the Apple Macintosh, MacThai,
  # are variants of TIS-620 — incompatible with each other, however."

  # Eastern Yurp
  #Encoding::KOI8_R => 20866,
  Encoding::KOI8_U => 21866,

  ## ISO/IEC 8859 (8-bit) encoding family
  #
  Encoding::ISO_8859_1 => 28591, # West European languages (Latin-1)
  Encoding::ISO_8859_2 => 28592, # Central and East European languages (Latin-2)
  Encoding::ISO_8859_3 => 28593, # Southeast European and miscellaneous languages (Latin-3)
  Encoding::ISO_8859_4 => 28594, # Scandinavian/Baltic languages (Latin-4)
  Encoding::ISO_8859_5 => 28595, # Latin/Cyrillic
  Encoding::ISO_8859_6 => 28596, # Latin/Arabic
  Encoding::ISO_8859_7 => 28597, # Latin/Greek
  Encoding::ISO_8859_8 => 28598, # Latin/Hebrew
  Encoding::ISO_8859_9 => 28599, # Latin-1 modification for Turkish (Latin-5)
  #
  # ISO-8859-10 covers Nordic languages better than ISO_8859_4.
  # Wikipedia says this has been assigned in Windows as 28600 even though Microsoft's
  # page doesn't list it now in 2020, but w/e.
  # IBM assigned it as CP919.
  Encoding::ISO_8859_10 => 28600, # Lappish/Nordic/Eskimo languages (Latin-6)
  #
  # Wikipedia says this is assigned, but same deal.
  Encoding::ISO_8859_11 => 28601, # Latin/Thai
  #
  # Intended Celtic encoding abandoned in 1997 in favor of ISO_8859_14:
  # Encoding::ISO_8859_12 => 28602,
  #
  Encoding::ISO_8859_13 => 28603, # Baltic Rim languages (Latin-7)
  Encoding::ISO_8859_14 => 28604, # Celtic (Latin-8)
  Encoding::ISO_8859_15 => 28605, # West European languages (Latin-9)
  Encoding::ISO_8859_16 => 28606, # Romanian (Latin-10)

  # Apple encodings
  #
  # UTF8_MAC is the encoding Mac OS X uses on HFS+ filesystems and is a variant of UTF-8-NFD.
  # https://web.archive.org/web/20140812023313/http://developer.apple.com/library/ios/documentation/MacOSX/Conceptual/BPInternational/Articles/FileEncodings.html
  # "Mac OS Extended (HFS+) uses canonically decomposed Unicode 3.2 in UTF-16 format,
  # which consists of a sequence of 16-bit codes.
  # (Characters in the ranges U2000-U2FFF, UF900-UFA6A, and U2F800-U2FA1D are not decomposed.)"
  #
  # There isn't a good Microsoft-style ID I can assign to it, so this is just FYI.

  # Classic Mac encodings
  #
  # https://en.wikipedia.org/wiki/Category:Mac_OS_character_encodings
  # http://mirror.informatimago.com/next/developer.apple.com/documentation/macos8/TextIntlSvcs/TextEncodingConversionManager/TEC1.5/TEC.1b.html
  #
  # MacRoman pre-OS-8.5 has the "Universal currency symbol" at 0xDB,
  # while 8.5 and later replace it with the (then-new) Euro symbol:
  # https://en.wikipedia.org/wiki/Currency_sign_(typography)
  Encoding::MACROMAN => 10000,
  #
  # "Shift-JIS with JIS Roman modifications, extra 1-byte characters, 2-byte Apple extensions,
  # and some vertical presentation forms in the range 0xEB40--0xEDFE ("ku plus 84")."
  # Ruby also defines Encoding::MACJAPAN but it's the same Encoding.
  Encoding::MACJAPANESE => 10001,
  #
  # The following encodings are not defined in Ruby's Encoding class,
  # but I'm listing them here for completeness' sake.
  # MACCHINESETRAD => 10002,
  # MACKOREAN => 10003,
  # MACARABIC => 10004,
  # MACHEBREW => 10005,
  # MACGREEK => 10006,
  # MACCYRILLIC => 10007,
  # MACCHINESESIMP => 10008,
  #
  # Unlike MacJapan/MacJapanese, MacRomania is something different than MacRoman.
  Encoding::MACROMANIA => 10010,
  #
  Encoding::MACUKRAINE => 10017,
  Encoding::MACTHAI => 10021,
  Encoding::MACCENTEURO => 10029,
  Encoding::MACICELAND => 10079,
  Encoding::MACTURKISH => 10081,
  Encoding::MACCROATIAN => 10082,
}
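As a rough illustration of what CODE_PAGE_ENCODING_NAME accepts, the sketch below applies a local copy of the pattern to a few Encoding names. The helper `code_page_from_name` is hypothetical and exists only for this example; it is not part of the library.

```ruby
# Local copy of the documented pattern so the sketch is self-contained.
CODE_PAGE_ENCODING_NAME = Regexp.new('^(CP|IBM|Windows[-_])(?<code_page>\d{3,}$)', Regexp::IGNORECASE)

# Hypothetical helper (assumption, not library API):
# returns the numeric ID embedded in one Encoding name, or nil.
def code_page_from_name(name)
  match = CODE_PAGE_ENCODING_NAME.match(name)
  match ? match[:code_page].to_i : nil
end

puts code_page_from_name('Windows-1252').inspect  # three or more digits after the prefix
puts code_page_from_name('cp932').inspect         # IGNORECASE accepts lowercase prefixes
puts code_page_from_name('Windows-31J').inspect   # only two digits, so no match
puts code_page_from_name('UTF-8').inspect         # no CP/IBM/Windows prefix, so no match
```

Names such as Windows-31J fall through because the pattern requires at least three digits running to the end of the name, which is why the CP932 alias is what gives that Encoding its ID.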
Class Method Summary
-
.adopted_encoding_code_page_ids ⇒ Object
Returns a Hash of the built-in-orphan Encodings we now have codepage IDs for.
-
.code_page_orphans ⇒ Object
Returns a Set of built-in Encodings whose :names /!\ DO NOT /!\ contain a usable numeric codepage ID, as matched by our Regexp.
-
.page_code(code_page_id) ⇒ Object
Returns the Encoding instance for an Integer codepage ID, or nil if Ruby has no matching CP-named Encoding.
Instance Method Summary
-
#code_page ⇒ Object
Returns the Integer codepage ID of an Encoding instance, or nil if none is known.
Class Method Details
.adopted_encoding_code_page_ids ⇒ Object
Returns a Hash of the built-in-orphan Encodings we now have codepage IDs for, e.g. #<Encoding:UTF-16BE>=>1201, #<Encoding:UTF-16LE>=>1200
# File 'lib/distorted/monkey_business/encoding.rb', line 264
def self.adopted_encoding_code_page_ids
  @@adopted_encoding_code_page_ids ||= self::code_page_orphans.select{ |e|
    if self::ADDITIONAL_ENCODING_CODE_PAGE_IDS.has_key?(e)
      # irb> Encoding.const_defined?('CP932')
      # => true
      not Encoding::const_defined?("CP#{self::ADDITIONAL_ENCODING_CODE_PAGE_IDS[e]}")
    else
      false
    end
  }.map{ |e|
    [e, self::ADDITIONAL_ENCODING_CODE_PAGE_IDS[e]]
  }.to_h
end
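The filter above only adopts an ID when Ruby has no built-in `CP<id>` constant that already owns it. A minimal sketch of that one check follows, using a two-entry excerpt of the table as an assumption and skipping the orphan test that the real method applies first:

```ruby
# Assumption: a small excerpt of ADDITIONAL_ENCODING_CODE_PAGE_IDS.
candidates = {
  Encoding::UTF_16LE  => 1200,
  Encoding::SHIFT_JIS => 932,
}

# Drop any ID that Ruby already exposes as a CP-named constant.
# Encoding::CP932 exists (it is Windows-31J), so Shift_JIS is not adopted here.
adopted = candidates.reject { |_encoding, id| Encoding.const_defined?("CP#{id}") }

adopted.each { |encoding, id| puts "#{encoding.name} => #{id}" }
```

This keeps a built-in code page from being claimed twice: 932 stays with Windows-31J, while UTF-16LE picks up 1200 because nothing else holds it.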
.code_page_orphans ⇒ Object
Returns a Set of built-in Encodings whose :names /!\ DO NOT /!\ contain a usable numeric codepage ID, as matched by our Regexp.
# File 'lib/distorted/monkey_business/encoding.rb', line 280
def self.code_page_orphans
  Encoding.list.select{ |c|
    c.respond_to?(:names) ? (not c.names.any?{|n| CODE_PAGE_ENCODING_NAME.match(n)}) : false
  }.to_set
end
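For a feel of which Encodings end up as orphans, here is a standalone sketch of the same selection. It uses a local copy of the pattern (named `code_page_name` only for this example) and `reject` in place of the negated `select`, which should be equivalent for this purpose.

```ruby
require 'set'

# Local copy of the documented pattern; the variable name is an assumption.
code_page_name = Regexp.new('^(CP|IBM|Windows[-_])(?<code_page>\d{3,}$)', Regexp::IGNORECASE)

# An Encoding is an orphan when none of its names or aliases carries a numeric ID.
orphans = Encoding.list.reject { |encoding|
  encoding.names.any? { |name| code_page_name.match?(name) }
}.to_set

puts orphans.include?(Encoding::UTF_16LE)      # names are just ["UTF-16LE"]
puts orphans.include?(Encoding::Windows_1252)  # has the "CP1252" alias
```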
.page_code(code_page_id) ⇒ Object
Returns the Encoding instance for an Integer codepage ID, or nil if Ruby has no matching CP-named Encoding.
# File 'lib/distorted/monkey_business/encoding.rb', line 287
def self.page_code(code_page_id)
  # Every canonically-Windows*/IBM*-named Encoding seems to also have a 'CP<whatever>' equivalent.
  Encoding::find("CP#{code_page_id}") rescue nil
end
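The lookup relies on `Encoding.find` raising for unknown names. A minimal standalone sketch of the same idea, with a hypothetical method name and an explicit `rescue ArgumentError` in place of the inline `rescue nil`:

```ruby
# Hypothetical stand-in for the documented class method (assumption, not library API).
def find_by_code_page(code_page_id)
  Encoding.find("CP#{code_page_id}")
rescue ArgumentError
  # Encoding.find raises ArgumentError for names Ruby does not know.
  nil
end

puts find_by_code_page(932).inspect   # resolves through the CP932 alias
puts find_by_code_page(1200).inspect  # no CP1200 in Ruby, so nil
```

Going by the method body shown above, the lookup consults only Ruby's own `CP<id>` names, so an ID that exists solely in ADDITIONAL_ENCODING_CODE_PAGE_IDS (1200, for instance) appears to come back as nil.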
Instance Method Details
#code_page ⇒ Object
Returns the Integer codepage ID of an Encoding instance, or nil if none is known.
# File 'lib/distorted/monkey_business/encoding.rb', line 293
def code_page
  # Parenthesized so an adopted ID is returned as-is; `||` binds tighter than `?:`.
  Encoding::adopted_encoding_code_page_ids.dig(self) || (
    self.names.any?{ |n| CODE_PAGE_ENCODING_NAME.match(n) } ? Regexp.last_match['code_page'.freeze].to_i : nil
  )
end
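Grouping matters in an expression of this shape because Ruby's `||` binds tighter than the ternary operator. The sketch below uses placeholder values (an assumed adopted ID of 1200 and a failed name match) to show the two readings:

```ruby
adopted = 1200   # stands in for a value found in the adopted-IDs Hash
matched = false  # stands in for "no name matched the pattern"

# Without parentheses Ruby reads this as (adopted || matched) ? :from_name : nil
ungrouped = adopted || matched ? :from_name : nil

# With parentheses the adopted value short-circuits the ternary entirely.
grouped = adopted || (matched ? :from_name : nil)

puts ungrouped.inspect
puts grouped.inspect
```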