Method: Janeway::Lexer#convert_surrogate_pair_to_codepoint

Defined in:
lib/janeway/lexer.rb

#convert_surrogate_pair_to_codepoint(high_surrogate_hex, low_surrogate_hex) ⇒ String

Convert a valid UTF-16 surrogate pair into a UTF-8 string containing a single code point.

Parameters:

  • high_surrogate_hex (String)

    string of 4 hex digits, e.g. “D83D”

  • low_surrogate_hex (String)

    string of 4 hex digits, e.g. “DE09”

Returns:

  • (String)

    UTF-8 string containing a single multi-byte Unicode character, e.g. “😉”



# File 'lib/janeway/lexer.rb', line 286

def convert_surrogate_pair_to_codepoint(high_surrogate_hex, low_surrogate_hex)
  [high_surrogate_hex, low_surrogate_hex].each do |hex_str|
    raise ArgumentError, "expect 4 hex digits, got #{hex_str.inspect}" unless hex_str.size == 4
  end

  # Calculate the code point from the surrogate pair values
  # algorithm from https://russellcottrell.com/greek/utilities/SurrogatePairCalculator.htm
  high = high_surrogate_hex.hex
  low = low_surrogate_hex.hex
  codepoint = ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000
  [codepoint].pack('U') # convert integer codepoint to single character string
end
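
As a worked example of the arithmetic above, the following minimal sketch reproduces the calculation outside the lexer. The helper name `surrogate_pair_to_utf8` is hypothetical and not part of Janeway. The sketch omits the length check and cross-checks the result against Ruby's built-in UTF-16 transcoding. For “D83D” and “DE09” the steps are 0xD83D - 0xD800 = 0x3D, 0x3D * 0x400 = 0xF400, 0xDE09 - 0xDC00 = 0x209, and 0xF400 + 0x209 + 0x10000 = 0x1F609.

```ruby
# Standalone sketch of the surrogate-pair arithmetic.
# `surrogate_pair_to_utf8` is a hypothetical name used only for illustration.
def surrogate_pair_to_utf8(high_hex, low_hex)
  high = high_hex.hex # "D83D" => 0xD83D
  low = low_hex.hex   # "DE09" => 0xDE09
  # Each surrogate carries 10 bits; the pair encodes (code point - 0x10000).
  codepoint = ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000
  [codepoint].pack('U') # integer code point => UTF-8 string
end

result = surrogate_pair_to_utf8('D83D', 'DE09')
puts result.unpack1('U').to_s(16) # prints "1f609"

# Cross-check: build the same pair as UTF-16BE bytes and let Ruby transcode it.
reference = [0xD83D, 0xDE09].pack('n*').force_encoding('UTF-16BE').encode('UTF-8')
puts result == reference # prints "true"
```

The cross-check is only a sanity test that the hand-written formula and Ruby's transcoder agree for this input. It does not validate that the arguments fall inside the surrogate ranges.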