ICU - A Ruby gem for Unicode processing, binding to ICU

Beta stage.

Requires Ruby 2.3.1.

Usage

On OS X, install ICU with Homebrew and build the gem against the system libraries:

brew install icu4c
gem install icu -- --use-system-libraries

On other platforms:

gem install icu

Then load the gem:

require 'icu'

Design

Almost all arguments are expected to be Ruby Strings, in any of various encodings. In some places a Symbol is also allowed. One specific restriction: ICU::Locale accepts only strings in an ASCII-compatible encoding.
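As a rough illustration of the argument rules above, the following sketch normalizes a locale identifier given as a String or Symbol and rejects encodings that are not ASCII-compatible. The helper name normalize_locale_id is hypothetical and is not taken from the gem; it only shows the kind of check the design implies.

```ruby
# Hypothetical helper (not part of the gem): accept a String or Symbol
# and require an ASCII-compatible encoding, as ICU::Locale is said to do.
def normalize_locale_id(arg)
  str = arg.is_a?(Symbol) ? arg.to_s : arg
  unless str.is_a?(String)
    raise TypeError, "expected String or Symbol, got #{arg.class}"
  end
  unless str.encoding.ascii_compatible?
    raise ArgumentError,
          "locale ID must use an ASCII-compatible encoding, got #{str.encoding}"
  end
  str
end

normalize_locale_id(:en_US)                 # a String containing "en_US"
normalize_locale_id("zh_TW")                # returned unchanged
# normalize_locale_id("zh_TW".encode("UTF-16LE"))  # would raise ArgumentError
```

UTF-16 and UTF-32 are the common encodings that fail the ascii_compatible? check; UTF-8, US-ASCII, Shift_JIS and ISO-8859-x pass it.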

Ruby API

The Ruby API should be high-level and easily configurable. The underlying encoding conversion should be transparent to the user.

Encoding

Ruby uses a Code Set Independent (CSI) model for its string implementation, a choice driven by the needs of its community. Strings should honour Encoding.default_internal when it is set; otherwise the source encoding (__ENCODING__) from the environment is used, including for newly created strings. A string holds only a byte array tagged with an encoding, so the tag may not match the bytes, and the bytes may be invalid.
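The points above can be observed with plain Ruby, without the gem. The sketch below uses only core String and Encoding methods; the variable names are illustrative.

```ruby
# A plain literal is tagged with the source encoding.
literal = "abc"
literal_uses_source_encoding = (literal.encoding == __ENCODING__)

# Preferred result encoding: default_internal if set (it is nil unless
# configured), otherwise the source encoding.
preferred = Encoding.default_internal || __ENCODING__

# Retagging keeps the bytes and changes only the label.
text   = "\u00E9"                                   # "é", two bytes in UTF-8
latin1 = text.dup.force_encoding(Encoding::ISO_8859_1)
same_bytes    = (latin1.bytes == text.bytes)        # bytes are untouched
latin1_length = latin1.length                       # each byte is now one character

# Transcoding changes the bytes themselves.
utf16_bytes = text.encode(Encoding::UTF_16LE).bytes

# A string can carry bytes that are invalid for its tag.
broken = [0xFF].pack("C").force_encoding(Encoding::UTF_8)
broken_is_valid = broken.valid_encoding?
```

Because force_encoding never validates, any code that hands string bytes to ICU has to check valid_encoding? (or transcode) itself.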

ICU uses UTF-16 internally, but it also has a dedicated fast UTF-8 code path. In most environments used by the Ruby community, MRI represents strings as UTF-8, so the UTF-8 code path should be considered. Where possible, and when it matches the MRI encoding, ICU should also be compiled with -DU_CHARSET_IS_UTF8=1.

Given the above, instances should generally follow the encoding settings when returning strings. An input string that can be treated as UTF-8 can use the UTF-8 code path; otherwise, it should be converted by ICU. Output strings should honor the encoding settings, and the conversion should be transparent to Ruby users.
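A minimal sketch of that decision in pure Ruby follows. The module and method names are hypothetical and do not describe the gem's actual internals; String#encode stands in for the conversion that ICU would perform, and UTF-16LE is assumed as the UTF-16 form.

```ruby
# Hypothetical sketch, not the gem's implementation.
module EncodingSketch
  module_function

  # Encoding the caller expects back.
  def result_encoding
    Encoding.default_internal || __ENCODING__
  end

  # True when the bytes could go to a UTF-8 code path unchanged.
  def utf8_ready?(str)
    utf8_like = [Encoding::UTF_8, Encoding::US_ASCII].include?(str.encoding)
    utf8_like && str.valid_encoding?
  end

  # Input side: pick a code path and return [path, string].
  # String#encode raises if the input cannot be converted.
  def prepare_input(str)
    if utf8_ready?(str)
      [:utf8, str]
    else
      [:utf16, str.encode(Encoding::UTF_16LE)]
    end
  end

  # Output side: hand back a string in the encoding the user expects.
  def finish_output(str)
    str.encode(result_encoding)
  end
end

path, prepared = EncodingSketch.prepare_input("abc")                     # UTF-8 path
sjis = "\u3042".encode("Shift_JIS")
path_sjis, prepared_sjis = EncodingSketch.prepare_input(sjis)            # converted
result = EncodingSketch.finish_output(prepared_sjis)
```

The caller only ever sees strings in its own encoding; which code path was taken stays an internal detail.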

Some details about MRI and encoding:

  • string.unpack("U*") returns Unicode scalar values, whereas the n* directive only reads 16-bit integers and knows nothing about UTF-16.
  • The ENCODING_GET macro retrieves an object's encoding index. Depending on the index, the encoding is stored either in the object's RBasic or in an instance variable of that object.
  • rb_default_internal_encoding() and rb_enc_default_internal() return the C encoding struct and the Ruby encoding object, respectively.
  • rb_locale_encindex() gets the encoding index from the locale.
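The first point in the list can be checked from Ruby with a character outside the Basic Multilingual Plane. This is an illustrative sketch using only core methods: "U*" yields one scalar value per character, while "n*" applied to UTF-16BE bytes yields raw 16-bit code units, surrogates included.

```ruby
str = "a\u{1F600}"                           # 'a' plus U+1F600, outside the BMP

# "U*" decodes UTF-8 into Unicode scalar values.
scalars = str.unpack("U*")                   # [0x61, 0x1F600]

# "n*" reads big-endian 16-bit integers; the surrogate pair stays split.
units = str.encode("UTF-16BE").unpack("n*")  # [0x0061, 0xD83D, 0xDE00]

# Array#pack("U*") is the inverse and builds a UTF-8 string.
rebuilt = scalars.pack("U*")
```

Code that needs UTF-16 code units for ICU therefore cannot rely on "U*", and code that needs scalar values cannot rely on "n*".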

Contributing

Feel free to fork and submit a pull request.

TODO

  • Support Ruby 2.2+. Rails 5 requires Ruby 2.2.2.
  • Merge ffi-icu. This branch can be a start.
  • Merge some resources from this branch (old icu gem).
  • Port the time/number_formatting module from ffi-icu.
  • Binary distribution of ICU and system library support.
  • Documentation.