Class: Encoding

Inherits: Object

Defined in: lib/distorted/monkey_business/encoding.rb

Constant Summary

CODE_PAGE_ENCODING_NAME = A Regexp that matches and extracts Ruby’s built-in numeric codepage IDs from their Encodings’ names. It uses IGNORECASE to handle the duplicate differing-capitalization constants, e.g. Encoding::WINDOWS_31J and Encoding::Windows_31J both exist and are equivalent. Worth mentioning since this file deals with Encoding: the Regexp itself also has an internal Encoding that could be changed if I had any reason to (I don’t): ruby-doc.org/core/Regexp.html#class-Regexp-label-Encoding

Regexp.new('^(CP|IBM|Windows[-_])(?<code_page>\d{3,}$)', Regexp::IGNORECASE)
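A minimal sketch of how this pattern behaves on a few name strings. The constant `CODE_PAGE_NAME` and the helper `extract_code_page` are hypothetical names used only for illustration; they are not part of the documented class.

```ruby
# Same pattern as the documented constant, under an illustrative name.
CODE_PAGE_NAME = Regexp.new('^(CP|IBM|Windows[-_])(?<code_page>\d{3,}$)', Regexp::IGNORECASE)

# Hypothetical helper: returns the numeric ID embedded in a name, or nil.
def extract_code_page(name)
  match = CODE_PAGE_NAME.match(name)
  match && match[:code_page].to_i
end

extract_code_page('CP932')        # three or more trailing digits: matches
extract_code_page('windows_1252') # IGNORECASE plus the [-_] class: matches
extract_code_page('Windows-31J')  # "31J" is not three digits: no match
```

The name has to start with one of the three prefixes and end in at least three digits, which is why `Windows-31J` is rejected even though it names a code-page encoding.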

ADDITIONAL_ENCODING_CODE_PAGE_IDS = Data sources:
www.aivosto.com/articles/charsets-codepages.html
developer.apple.com/documentation/coreservices/1400434-ms-dos_and_windows_text_encodings
docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers
en.wikipedia.org/wiki/CCSID
github.com/SheetJS/js-codepage/blob/master/codepage.md

{

  # Burgerland :911:
  Encoding::US_ASCII => 20127,

  # Unicode
  Encoding::UTF_16LE => 1200,
  Encoding::UTF_16BE => 1201,
  Encoding::UTF_32LE => 12000,
  Encoding::UTF_32BE => 12001,

  ## 245
  #
  # Code Page 932 is Windows-31J, but I want to provide fallback mapping
  # between 932 and Shift_JIS to handle detected-text or `encoding` arguments
  # that return Shift_JIS since that naming is much much more well-known than 31J.
  Encoding::SHIFT_JIS => 932,
  # https://referencesource.microsoft.com/#mscorlib/system/text/eucjpencoding.cs
  # https://www.redmine.org/issues/29442
  # https://www.sljfaq.org/afaq/encodings.html
  # https://uic.jp/charset/
  # http://www.monyo.com/technical/samba/docs/Japanese-HOWTO-3.0.en.txt
  Encoding::EUC_JP_MS => 20932,
  Encoding::EUC_JP => 51932,
  # Encoding:EUC-JIS-2004 dunno
  #
  # https://www.debian.org/doc/manuals/intro-i18n/ch-coding.en.html  3.2: Stateless and Stateful
  # TL;DR: Stateful uses an escape sequence to switch charset;
  #        Stateless have all-unique codepoints.
  #        Normal ISO-2022-JP is stateful.
  # "For example, in ISO 2022-JP, two bytes of 0x24 0x2c may mean a Japanese Hiragana character 'が'
  #  or two ASCII character of '$' and ',' according to the shift state."
  # Encoding::STATELESS_ISO_2022_JP
  #
  # Mobile operator specific encodings that I have no numeric IDs for rn:
  # Encoding:UTF8-DoCoMo
  # Encoding:SJIS-DoCoMo
  # Encoding:UTF8-KDDI
  # Encoding:SJIS-KDDI
  # Encoding:stateless-ISO-2022-JP-KDDI
  # Encoding:UTF8-SoftBank
  # Encoding:SJIS-SoftBank

  ## CHY-NAH
  #
  # https://en.wikipedia.org/wiki/Code_page_903
  Encoding::GB1988 => 903,
  #
  ## Hong Kong Supplementary Character Set
  # The Windows version of this seems to be the built-in CP951:
  # https://web.archive.org/web/20160402215421/https://blogs.msdn.microsoft.com/shawnste/2007/03/12/cp-951-hkscs/
  # https://web.archive.org/web/20141129233053/http://www-01.ibm.com/software/globalization/ccsid/ccsid5471.html
  Encoding::BIG5_HKSCS => 5417,
  #
  # The 936 postfix is a reference to the standard Windows Chinese encoding being CP936 / GBK.
  # "GB2312 is the registered internet name for EUC-CN, which is its usual encoded form."
  Encoding::GB2312 => 20936,
  Encoding::GB12345 => 51936,
  #Encoding:GB2312_HZ => 52936,  # Doesn't exist in Ruby
  Encoding::GB18030 => 54936,


  ## Asia At Odd Hours
  #
  # I always wondered if the "Gravitational Pull of Pepsi" logo came from
  # them wanting it to look less like the Korean flag.
  # The traditional Korean Windows Code Page is CP949, available in Ruby
  # but not under any other name aliases.
  # IBM uses CP1363, not in Ruby.
  Encoding::EUC_KR => 51949,
  #
  # ROC me now
  Encoding::EUC_TW => 51950,
  # Unicode 補完計畫 / Unicode-At-On is a Big5 variant once popular in Taiwan:
  # https://lists.gnu.org/archive/html/bug-gnu-libiconv/2010-11/msg00007.html
  # https://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0061.html
  # Encoding::BIG5_UAO
  #
  # CP950 (available in Ruby) is the code page used on Windows under the name "big5",
  # but I want to map the generic Big5 Encoding to 950 as well to handle
  # detected and specified encodings by that name.
  # "The major difference between Windows code page 950 and "common" (non-vendor-specific) Big5
  #  is the incorporation of a subset of the ETEN extensions to Big5 at 0xF9D6 through 0xF9FE
  #  (comprising the seven Chinese characters 碁, 銹, 裏, 墻, 恒, 粧, and 嫺,
  #  followed by 34 box drawing characters and block elements)."
  Encoding::Big5 => 950,
  #
  # Encoding::TIS_620 is the base Thai 8-bit encoding standard that is apparently
  # never actually used in the wild.
  # ISO-8859-11 is identical to it with the sole exception "that ISO/IEC 8859-11
  # allocates non-breaking space to code 0xA0, while TIS-620 leaves it undefined."
  # "The Microsoft Windows code page 874 as well as the code page used in the
  #  Thai version of the Apple Macintosh, MacThai,
  #  are variants of TIS-620 — incompatible with each other, however."


  # Eastern Yurp
  #Encoding::KOI8_R => 20866,
  Encoding::KOI8_U => 21866,

  ## ISO/IEC 8859 (8-bit) encoding family
  #
  Encoding::ISO_8859_1 => 28591,  # West European languages (Latin-1)
  Encoding::ISO_8859_2 => 28592,  # Central and East European languages (Latin-2)
  Encoding::ISO_8859_3 => 28593,  # Southeast European and miscellaneous languages (Latin-3)
  Encoding::ISO_8859_4 => 28594,  # Scandinavian/Baltic languages (Latin-4)
  Encoding::ISO_8859_5 => 28595,  # Latin/Cyrillic
  Encoding::ISO_8859_6 => 28596,  # Latin/Arabic
  Encoding::ISO_8859_7 => 28597,  # Latin/Greek
  Encoding::ISO_8859_8 => 28598,  # Latin/Hebrew
  Encoding::ISO_8859_9 => 28599,  # Latin-1 modification for Turkish (Latin-5)
  #
  # ISO-8859-10 covers Nordic languages better than ISO_8859_4.
  # Wikipedia says this has been assigned in Windows as 28600 even though Microsoft's
  # page doesn't list it now in 2020, but w/e.
  # IBM assigned it as CP919.
  Encoding::ISO_8859_10 => 28600,  # Lappish/Nordic/Eskimo languages (Latin-6)
  #
  # Wikipedia says this is assigned, but same deal.
  Encoding::ISO_8859_11 => 28601,  # Latin/Thai
  #
  # Intended Celtic encoding abandoned in 1997 in favor of ISO_8859_14:
  # Encoding::ISO_8859_12 => 28602,
  #
  Encoding::ISO_8859_13 => 28603,  # Baltic Rim languages (Latin-7)
  Encoding::ISO_8859_14 => 28604,  # Celtic (Latin-8)
  Encoding::ISO_8859_15 => 28605,  # West European languages (Latin-9)
  Encoding::ISO_8859_16 => 28606,  # Romanian (Latin-10)

  # Apple encodings
  #
  # UTF8_MAC is the encoding Mac OS X uses on HFS+ filesystems and is a variant of UTF-8-NFD.
  # https://web.archive.org/web/20140812023313/http://developer.apple.com/library/ios/documentation/MacOSX/Conceptual/BPInternational/Articles/FileEncodings.html
  # "Mac OS Extended (HFS+) uses canonically decomposed Unicode 3.2 in UTF-16 format,
  #  which consists of a sequence of 16-bit codes.
  #  (Characters in the ranges U2000-U2FFF, UF900-UFA6A, and U2F800-U2FA1D are not decomposed.)"
  #
  # There isn't a good Microsoft-style ID I can assign to it, so this is just FYI.

  # Classic Mac encodings
  #
  # https://en.wikipedia.org/wiki/Category:Mac_OS_character_encodings
  # http://mirror.informatimago.com/next/developer.apple.com/documentation/macos8/TextIntlSvcs/TextEncodingConversionManager/TEC1.5/TEC.1b.html
  #
  # MacRoman pre-OS-8.5 has the "Universal currency symbol" at 0xDB,
  # while 8.5 and later replace it with the (then-new) Euro symbol:
  #   https://en.wikipedia.org/wiki/Currency_sign_(typography)
  Encoding::MACROMAN => 10000,
  #
  # "Shift-JIS with JIS Roman modifications, extra 1-byte characters, 2-byte Apple extensions,
  #  and some vertical presentation forms in the range 0xEB40--0xEDFE ("ku plus 84")."
  #  Ruby also defines Encoding::MACJAPAN but it's the same Encoding.
  Encoding::MACJAPANESE => 10001,
  #
  # The following encodings are not defined in Ruby's Encoding class,
  # but I'm listing them here for completeness' sake.
  # MACCHINESETRAD => 10002,
  # MACKOREAN => 10003,
  # MACARABIC => 10004,
  # MACHEBREW => 10005,
  # MACGREEK => 10006,
  # MACCYRILLIC => 10007,
  # MACCHINESESIMP => 10008,
  #
  # Unlike MacJapan/MacJapanese, MacRomania is something different than MacRoman.
  Encoding::MACROMANIA => 10010,
  #
  Encoding::MACUKRAINE => 10017,
  Encoding::MACTHAI => 10021,
  Encoding::MACCENTEURO => 10029,
  Encoding::MACICELAND => 10079,
  Encoding::MACTURKISH => 10081,
  Encoding::MACCROATIAN => 10082,

}

Class Method Summary

.adopted_encoding_code_page_ids ⇒ Object

Returns a Hash of the built-in-orphan Encodings we now have codepage IDs for.

.code_page_orphans ⇒ Object

Returns a Set of built-in Encodings whose :names /!\ DO NOT /!\ contain a usable numeric codepage ID, as matched by our Regexp.

.page_code(code_page_id) ⇒ Object

Returns the Encoding instance for an Integer codepage ID, or nil if Ruby has no matching CP-named Encoding.

Instance Method Summary

#code_page ⇒ Object

Returns the Integer codepage ID of an Encoding instance, or nil if none is known.

Class Method Details

.adopted_encoding_code_page_ids ⇒ `Object`

Returns a Hash of the built-in-orphan Encodings we now have codepage IDs for, e.g. #<Encoding:UTF-16BE>=>1201, #<Encoding:UTF-16LE>=>1200

# File 'lib/distorted/monkey_business/encoding.rb', line 264

def self.adopted_encoding_code_page_ids
  @@adopted_encoding_code_page_ids ||= self::code_page_orphans.select{ |e|
    if self::ADDITIONAL_ENCODING_CODE_PAGE_IDS.has_key?(e)
      # irb> Encoding.const_defined?('CP932')
      # => true  
      not Encoding::const_defined?("CP#{self::ADDITIONAL_ENCODING_CODE_PAGE_IDS[e]}")
    else
      false
    end
  }.map{ |e|
    [e, self::ADDITIONAL_ENCODING_CODE_PAGE_IDS[e]]
  }.to_h
end
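The filter above only adopts an ID when Ruby has no `CP<id>` constant of its own. A trimmed-down sketch of that check follows; `CANDIDATE_IDS` and `adoptable_ids` are hypothetical names, and the orphan test is left out for brevity.

```ruby
# Hypothetical two-entry stand-in for the much larger documented table.
CANDIDATE_IDS = {
  Encoding::UTF_16LE  => 1200,
  Encoding::Shift_JIS => 932,
}.freeze

# Keep only the entries whose ID does not collide with a built-in CP constant.
def adoptable_ids(candidates)
  candidates.reject { |_encoding, id| Encoding.const_defined?("CP#{id}") }
end

adoptable_ids(CANDIDATE_IDS)
```

Under this check an entry such as `Shift_JIS => 932` is dropped, since `Encoding::CP932` already exists, while `UTF_16LE => 1200` is kept.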

.code_page_orphans ⇒ `Object`

Returns a Set of built-in Encodings whose :names /!\ DO NOT /!\ contain a usable numeric codepage ID, as matched by our Regexp.

# File 'lib/distorted/monkey_business/encoding.rb', line 280

def self.code_page_orphans
  Encoding.list.select{ |c|
    c.respond_to?(:names) ? (not c.names.any?{|n| CODE_PAGE_ENCODING_NAME.match(n)}) : false
  }.to_set
end
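As a rough standalone illustration of what counts as an orphan, the sketch below recomputes the set outside the class. `ORPHAN_PATTERN` and `code_page_orphans_sketch` are hypothetical names, and `reject` with `match?` is an assumed equivalent of the `select`/`not` form shown above.

```ruby
require 'set'

# Same pattern as CODE_PAGE_ENCODING_NAME, under an illustrative name.
ORPHAN_PATTERN = Regexp.new('^(CP|IBM|Windows[-_])(?<code_page>\d{3,}$)', Regexp::IGNORECASE)

# Hypothetical helper: every built-in Encoding with no code-page-style name.
def code_page_orphans_sketch
  Encoding.list.reject { |encoding|
    encoding.names.any? { |name| ORPHAN_PATTERN.match?(name) }
  }.to_set
end

orphans = code_page_orphans_sketch
orphans.include?(Encoding::UTF_16LE) # no CP/IBM/Windows alias, so an orphan
orphans.include?(Encoding::IBM437)   # named "IBM437", so not an orphan
```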

.page_code(code_page_id) ⇒ `Object`

Returns the Encoding instance for an Integer codepage ID, or nil if Ruby has no matching CP-named Encoding.

# File 'lib/distorted/monkey_business/encoding.rb', line 287

def self.page_code(code_page_id)
  # Every canonically-Windows*/IBM*-named Encoding seems to also have a 'CP<whatever>' equivalent.
  Encoding::find("CP#{code_page_id}") rescue nil
end
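A sketch of the same lookup as a free-standing method. `encoding_for_code_page` is a hypothetical name, and rescuing only `ArgumentError` (which `Encoding.find` raises for unknown names) is a narrower variation on the bare `rescue nil` above, not the documented behavior.

```ruby
# Hypothetical helper: look up a built-in Encoding by its "CP<id>" alias.
def encoding_for_code_page(code_page_id)
  Encoding.find("CP#{code_page_id}")
rescue ArgumentError
  # Encoding.find raises ArgumentError for names Ruby does not know.
  nil
end

encoding_for_code_page(437)   # resolves through the CP437 alias
encoding_for_code_page(99999) # no such alias, so nil
```

Because the lookup goes through `Encoding.find` alone, an ID that appears only in ADDITIONAL_ENCODING_CODE_PAGE_IDS would presumably also come back as nil.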

Instance Method Details

#code_page ⇒ `Object`

Returns the Integer codepage ID of an Encoding instance, or nil if none is known.

# File 'lib/distorted/monkey_business/encoding.rb', line 293

def code_page
  # The fallback is parenthesized because `a || b ? c : d` parses as `(a || b) ? c : d`.
  Encoding::adopted_encoding_code_page_ids.dig(self) || (
    self.names.any?{ |n| CODE_PAGE_ENCODING_NAME.match(n) } ?
      Regexp.last_match['code_page'.freeze].to_i : nil
  )
end