Efficient Pure Ruby Unicode Normalization (eprun)

(pronounced e-prune)

The Talk

Please see the Internationalization & Unicode Conference 37 talk on Implementing Normalization in Pure Ruby - the Fast and Easy Way.

Directories and Files

  • lib/normalize.rb: The core normalization code.
  • lib/string_normalize.rb: Defines String#normalize.
  • lib/generate.rb: Generation script; produces lib/normalize_tables.rb from data/UnicodeData.txt and data/CompositionExclusions.txt. It needs to be run only once, when updating to a new Unicode version.
  • lib/normalize_tables.rb: Data used for normalization, automatically generated by lib/generate.rb.
  • data/: All three files in this directory are downloaded from the Unicode Character Database. They are currently at Unicode version 6.3 and need to be updated whenever a newer Unicode version is released (about once a year).
  • test/test_normalize.rb: Tests for lib/string_normalize.rb, using data/NormalizationTest.txt.
  • benchmark/benchmark.rb: Runs the benchmark with example text files. Automatically checks for existing gems/libraries; if e.g. the unicode_utils gem is not available, that part of the benchmark is skipped. This also applies to eprun, which will not be run on Ruby 1.8.
  • benchmark/Deutsch_.txt, Japanese_.txt, Korean_.txt, Vietnamese_.txt: Example texts extracted from random Wikipedia pages (see http://en.wikipedia.org/wiki/Wikipedia:Random). The languages were chosen based on the number of characters affected by normalization (Deutsch < Japanese < Vietnamese < Korean). The files differ somewhat in length, so results cannot be compared directly across languages. Any other file whose name ends in "_.txt" and is added to this directory will be included in the benchmark.
  • benchmark/benchmark_results.rb: Benchmark results for eprun, unicode_utils, ActiveSupport::Multibyte (version 3.0.0), twitter_cldr, and the unicode gem. The eprun, unicode_utils, and unicode normalizations are run 100 times each, ActiveSupport::Multibyte is run 10 times, and twitter_cldr is run only once (didn't want to wait any longer).
  • benchmark/benchmark_results_jruby.txt: Benchmark results when using JRuby (excludes the unicode gem), version 1.7.4 (1.9.3p392, 2013-05-16 2390d3b on Java HotSpot(TM) Client VM 1.7.0_07-b10 [Windows 7-x86]).
  • benchmark/benchmark.pl: Runs the benchmark using Perl, both with xsub (i.e. C) version (run 100 times) and pure Perl version (run 10 times).
  • benchmark/benchmark_results_pl.txt: Results of Perl benchmarks.
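
The "skip what is not installed" behaviour described for benchmark/benchmark.rb can be sketched roughly as follows. This is not the actual code from benchmark.rb, only a minimal illustration of the usual Ruby idiom for optional dependencies; the helper name library_available? and the candidate list are assumptions made for this example.

```ruby
# Hypothetical helper (not taken from benchmark/benchmark.rb):
# try to load a library and report whether that worked.
def library_available?(name)
  require name
  true
rescue LoadError
  # The gem/library is not installed; the caller can skip it.
  false
end

# Illustrative candidate list; the require names are assumptions.
candidates = %w[unicode_utils unicode twitter_cldr]
available  = candidates.select { |lib| library_available?(lib) }
skipped    = candidates - available

puts "benchmarking: #{available.join(', ')}" unless available.empty?
puts "skipped (not installed): #{skipped.join(', ')}" unless skipped.empty?
```

Rescuing only LoadError keeps genuine errors inside a library visible instead of silently treating them as "not installed".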

TODOs and Ideas

  • Publish as a gem, or several gems.
  • Deal better with encodings other than UTF-8.
  • Add methods such as String#nfc, String#nfd,...
  • Add methods for normalization variants.
  • See talk for more.
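
As an illustration of the String#nfc / String#nfd idea above, here is a minimal sketch of how such shorthand methods might look. To keep it runnable on its own, it delegates to Ruby's built-in String#unicode_normalize (available since Ruby 2.2) rather than to eprun; with eprun loaded, the delegate would presumably be String#normalize, assuming that method takes the normalization form as its argument, which this README does not state.

```ruby
# Sketch only: shorthand methods, one per normalization form.
# Delegates to the built-in String#unicode_normalize (Ruby 2.2+) as a
# stand-in; swapping in eprun's String#normalize is an assumption.
class String
  %i[nfc nfd nfkc nfkd].each do |form|
    define_method(form) { unicode_normalize(form) }
  end
end

composed   = "\u00E9"    # e with acute accent, precomposed
decomposed = "e\u0301"   # e followed by combining acute accent

p decomposed.nfc == composed    # prints true
p composed.nfd == decomposed    # prints true
p "\uFB01".nfkc                 # prints "fi" (ligature split by compatibility mapping)
```

Defining the methods in a loop keeps the four forms in one place; a refinement could be used instead of reopening String if global monkey-patching is a concern.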