charesc version 0.1.0

Overview

Many programming languages and data formats provide character escapes based on Unicode. This gem, ‘charesc’, does so for Ruby.

Syntax

Character escapes are defined as constants starting with the letter ‘U’, followed by at least four hexadecimal digits. Four hexadecimal digits represent characters in the basic multilingual plane (BMP) of Unicode/ISO 10646. Five hexadecimal digits represent characters in planes 1 to 15. Six hexadecimal digits, starting with ‘10’, represent characters in plane 16. Up to and including four digits, leading zeros are mandatory; beyond four digits, they are forbidden. In this respect, the syntax is the same as for the U+ notation from the Unicode book.
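The digit rules above can be sketched as a single regular expression. This is only an illustration: the names SINGLE_ESCAPE and valid_escape_name? are hypothetical and not part of charesc, and the pattern checks the shape of one escape only (it does not exclude surrogates, which are covered under Returned Values).

```ruby
# Hypothetical helper, not part of charesc: checks one escape name against
# the digit rules (4 digits for the BMP, 5 digits without a leading zero for
# planes 1 to 15, 6 digits starting with "10" for plane 16).
SINGLE_ESCAPE = /\AU(?:10[0-9A-Fa-f]{4}|[1-9A-Fa-f][0-9A-Fa-f]{4}|[0-9A-Fa-f]{4})\z/

def valid_escape_name?(name)
  SINGLE_ESCAPE.match?(name)
end

valid_escape_name?("U0041")    # four digits, leading zeros required
valid_escape_name?("U1F600")   # five digits, plane 1
valid_escape_name?("U10FFFF")  # six digits, plane 16
valid_escape_name?("U01F600")  # rejected: leading zero beyond four digits
```

Anchoring the pattern with \A and \z means a name with too few digits (such as U41) or a value beyond plane 16 (such as U110000) is rejected as a whole.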

Usage

Character escapes can be used inside strings, with the interpolation syntax, e.g., “abcd#{U6789u789A}”. They can also be used on their own, as free-standing constants, e.g., “abcd” + U6789u789A.

Returned Values

All codepoints including non-characters (e.g. U+FFFF) are available, but surrogates (U+D800-U+DFFF) are not available, guaranteeing that no ill-formed UTF-8 sequences are produced. Character escapes can either be used as individual characters (e.g., U6789) or in strings (e.g., U6789U789A). Starting from the second ‘U’, it is possible to use ‘u’ instead for easier visual parsing (e.g., U6789u789A). The hexadecimal characters A-F can always also be written lower-case. The value of a character escape is never a character (e.g., ?a), always a string.
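The rules in this section can be summarised in a short sketch. The function expand_escape and the constant HEX_GROUP are hypothetical names introduced for illustration; they are not the gem's API, and the gem may well be implemented differently.

```ruby
# Hypothetical helper (not the gem's API): expands a name such as
# "U6789u789A" into a UTF-8 string, following the rules described above.
HEX_GROUP = /\A(?:10\h{4}|[1-9A-Fa-f]\h{4}|\h{4})\z/

def expand_escape(name)
  raise ArgumentError, "must start with 'U': #{name}" unless name.start_with?("U")

  # 'U' and 'u' are not hex digits, so splitting on them is unambiguous.
  groups = name.split(/[Uu]/, -1).drop(1)

  codepoints = groups.map do |hex|
    raise ArgumentError, "bad escape: #{hex.inspect}" unless HEX_GROUP.match?(hex)
    codepoint = hex.hex
    # Surrogates are rejected so that no ill-formed UTF-8 is produced.
    raise ArgumentError, "surrogate: U+#{hex}" if (0xD800..0xDFFF).cover?(codepoint)
    codepoint
  end

  codepoints.pack("U*")   # always a String, never a single character object
end

single   = expand_escape("U6789")        # one character, still a String
sequence = expand_escape("U6789u789A")   # same result as "U6789U789A"
nonchar  = expand_escape("Uffff")        # non-character, lower-case hex digits
```

In this sketch a surrogate such as UD800 raises ArgumentError; how the gem itself reports an unavailable escape is not stated here.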

Character Escapes and Character Encodings

The charesc gem takes the value of $KCODE into account automatically. If $KCODE is set to Shift_JIS or EUC-JP, the character escapes are converted to the respective encoding (as far as allowed by these encodings). If $KCODE indicates UTF-8 or ‘none’, character escapes return their values in UTF-8. By redefining the method charesc_non_utf8_conversion_hook, it is possible to change this behavior if necessary.
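To make the effect of the conversion concrete, the sketch below converts one character from UTF-8 to both encodings. It uses String#encode from Ruby 1.9 and later, where $KCODE is no longer effective, so it is only a rough modern analogue and not the gem's own mechanism. The helper to_legacy_encoding is a hypothetical name, and returning nil for an unconvertible character is a choice made for this example.

```ruby
# Hypothetical helper, not part of charesc: converts a UTF-8 string to a
# legacy encoding, returning nil when a character has no mapping there.
def to_legacy_encoding(utf8_string, encoding_name)
  utf8_string.encode(encoding_name)
rescue Encoding::UndefinedConversionError
  nil
end

hiragana_a = [0x3042].pack("U")   # U+3042 HIRAGANA LETTER A, as UTF-8

sjis  = to_legacy_encoding(hiragana_a, "Shift_JIS")   # bytes 0x82 0xA0
eucjp = to_legacy_encoding(hiragana_a, "EUC-JP")      # bytes 0xA4 0xA2

# U+1F600 has no Shift_JIS mapping, so this sketch returns nil for it.
unmapped = to_legacy_encoding([0x1F600].pack("U"), "Shift_JIS")
```

The same code point thus yields three different byte sequences, depending on whether the result is UTF-8, Shift_JIS, or EUC-JP.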

Future Work

  • Adapt syntax if there is community consensus for something different (warning: discussing syntactic details can become a rathole).

  • Make this part of the standard Ruby distribution, or even better, integrate it into Ruby itself. In the latter case, the syntax can be reconsidered, because we can then e.g. use u.… or so.

Copyright © 2007 Martin J. Dürst ([email protected]) Licensed under the same terms as Ruby. Absolutely no warranty. (see www.ruby-lang.org/en/LICENSE.txt)