Tinycus

This is a Ruby library for doing some string operations efficiently that would otherwise be slow or require a huge footprint. For example, it can remove accents from strings, or alphabetize strings in polytonic Greek.

The current implementation is about 2-3 times faster at these tasks than my initial, naive implementation. Its footprint is about 1000 times smaller than that of the ICU library (30 MB), which also doesn't have bindings for Ruby. The name Tinycus is meant to evoke "tiny ICU." Tinycus supports polytonic Greek, which GNU libc doesn't.

If you're using Tinycus and have comments or suggestions, please contact me.

Installation

On Linux, using make

sudo make install
make test

Using RubyGems

gem install tinycus

Use

Examples:

require './tinycus.rb';
puts Tinycus::Tr.remove_accents_from_euro('ἄγε, vámonos',n:true)
--> αγε, vamonos
puts Tinycus.sort_greek("Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος οὐλομένην".split(/\s+/)).join(' ')
--> ἄειδε, Ἀχιλῆος θεά, Μῆνιν οὐλομένην Πηληϊάδεω
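
For comparison, here is a plain-stdlib baseline for the same two tasks. It is only a sketch, not how Tinycus does it, and the helper names naive_remove_accents and naive_sort_greek are made up for this illustration. It decomposes each string to NFD, drops the combining marks, and recomposes, which is simple but does a lot of work per string.

```ruby
# Plain-stdlib baseline (not Tinycus's implementation):
# decompose to NFD, drop combining marks, recompose to NFC.
def naive_remove_accents(s)
  s.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").unicode_normalize(:nfc)
end

# Sort by an accent-free, lowercased key instead of by raw code points.
def naive_sort_greek(words)
  words.sort_by { |w| naive_remove_accents(w).downcase }
end

line = "Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος οὐλομένην"
puts naive_remove_accents("ἄγε, vámonos")
puts naive_sort_greek(line.split(/\s+/)).join(" ")
puts line.split(/\s+/).sort.join(" ") # raw code-point order, for contrast
```

On the inputs from the examples above, the first two lines should match the Tinycus output shown there. The last line shows why a bare sort is not enough: words beginning with an accented letter are ordered by their code points in the Greek Extended block rather than by their base letter.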

All input strings are expected to be UTF-8, normalized to NFC form, and all returned values are in the same form. Many functions take an optional argument n, which defaults to false. If you set n:true, as in the first example above, then your inputs will be normalized to NFC for you. This is safer but slower. Since the whole point of the library is speed, it is set up to be most convenient if you massage all strings into the required form once, at the time when they're created or read in, and then do all your manipulations. If your inputs to Tinycus are not NFC-normalized and you don't set n:true, then the results will be incorrect. If your inputs are in some other encoding, such as ISO-8859-1, then the library may either give incorrect results or raise an exception.
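
One way to do that massaging up front, using only Ruby's standard library, is sketched below. The helper name read_nfc is hypothetical and not part of Tinycus; the idea is just to validate and normalize each string once, at the point where it enters your program, so that later calls can leave n at its default.

```ruby
# Hypothetical helper (not part of Tinycus): validate and normalize
# a string once, when it is first read in.
def read_nfc(raw)
  s = raw.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "input is not valid UTF-8" unless s.valid_encoding?
  s.unicode_normalize(:nfc)
end

decomposed = "\u03B1\u0301" # alpha followed by a combining acute accent
composed = read_nfc(decomposed)
puts composed.length                    # number of code points after NFC
puts composed.unicode_normalized?(:nfc) # true once normalized

begin
  read_nfc("v\xE1monos".b) # ISO-8859-1 bytes, not valid UTF-8
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```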

When there is an error in a constructor, the object that is created has a .err property containing an error message. If there is no error, .err is nil.
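
The convention looks roughly like this from the caller's side. The class below is hypothetical and exists only to illustrate the pattern; it is not one of Tinycus's classes, and its argument is invented.

```ruby
# Hypothetical class, only to illustrate the .err convention;
# it is not part of Tinycus.
class Example
  attr_reader :err

  def initialize(script)
    # Record a message instead of raising; nil means success.
    @err = %w[greek latin].include?(script) ? nil : "unknown script: #{script}"
  end
end

obj = Example.new("klingon")
if obj.err.nil?
  puts "ok"
else
  warn "construction failed: #{obj.err}"
end
```

Since the constructor does not raise, the check on .err has to happen before the object is used.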

Real-world sources of polytonic Greek text are usually incredibly messy, containing all kinds of weird crap that someone typed on a keyboard and that looks OK by eye, but is actually wrong and not suitable for machine processing. This type of stuff is legal Unicode, but it's the wrong way of representing the word. For instance, I've seen the vowel that's supposed to look like ά written with two accents on the same character: both an accented alpha Unicode character and, superimposed on that, a combining accent. It looks OK on the screen because the two marks are on top of each other. I've collected a large number of these awfulnesses in the wild. The function Cleanup.clean_up_grotty_greek() is meant to correct them all. It's slow, and it has a bunch of options. There are also various more fine-tuned or specialized functions, such as Cleanup.standardize_greek_punctuation().
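
The doubled-accent case can be reproduced with the standard library alone, which also shows why NFC normalization by itself doesn't fix it. The detector repeated_marks? below is a hypothetical helper written for this illustration; it is not what Cleanup.clean_up_grotty_greek() does, and it only catches this one kind of problem.

```ruby
precomposed = "\u03AC"       # accented alpha as a single code point
doubled     = "\u03AC\u0301" # the same letter plus a redundant combining acute

# NFC leaves the extra mark in place, so the two strings stay different.
nfc = doubled.unicode_normalize(:nfc)
puts nfc == precomposed
puts nfc.codepoints.map { |c| format("U+%04X", c) }.join(" ")

# Hypothetical detector (not part of Tinycus): after decomposing to NFD,
# flag any letter that carries the same combining mark more than once.
def repeated_marks?(s)
  s.unicode_normalize(:nfd).scan(/\p{L}\p{Mn}*/).any? do |cluster|
    marks = cluster.chars.drop(1)
    marks.uniq.length < marks.length
  end
end

puts repeated_marks?(doubled)
puts repeated_marks?(precomposed)
```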

Beta code conversion

Beta code is an obsolete way of encoding Greek characters: https://en.wikipedia.org/wiki/Beta_Code . Tinycus can handle conversion of a subset of beta code using the functions Tinycus.greek_unicode_to_beta_code and Tinycus.greek_beta_code_to_unicode. There are other libraries that can do this, such as the Ruby library https://github.com/perseids-tools/beta-code-rb as well as standalone software such as the Debian package unibetacode. I only rolled my own implementation because it seemed pretty easy to do, and I wanted to be sure that it would generate UTF-8 encoded according to modern standards and in a way that would be compatible with the rest of Tinycus.
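
To give a feel for what such a conversion involves, here is a toy converter for lowercase beta code. It is a sketch under simplifying assumptions and not Tinycus's implementation: the name toy_beta_to_unicode and both tables are invented for this example, capitals (the * prefix) are not handled, and the final-sigma rule is the simplest one that works on ordinary words. Diacritics are emitted as combining marks and then composed by NFC normalization.

```ruby
# Toy converter (not Tinycus's implementation), lowercase beta code only.
BETA_LETTERS = {
  "a" => "α", "b" => "β", "g" => "γ", "d" => "δ", "e" => "ε", "z" => "ζ",
  "h" => "η", "q" => "θ", "i" => "ι", "k" => "κ", "l" => "λ", "m" => "μ",
  "n" => "ν", "c" => "ξ", "o" => "ο", "p" => "π", "r" => "ρ", "s" => "σ",
  "t" => "τ", "u" => "υ", "f" => "φ", "x" => "χ", "y" => "ψ", "w" => "ω"
}.freeze

BETA_MARKS = {
  ")"  => "\u0313", # smooth breathing
  "("  => "\u0314", # rough breathing
  "/"  => "\u0301", # acute
  "\\" => "\u0300", # grave
  "="  => "\u0342", # circumflex (perispomeni)
  "+"  => "\u0308", # diaeresis
  "|"  => "\u0345"  # iota subscript
}.freeze

def toy_beta_to_unicode(beta)
  # Map each character to a base letter or a combining mark;
  # anything unrecognized (spaces, punctuation) passes through unchanged.
  out = beta.downcase.each_char.map { |ch| BETA_LETTERS[ch] || BETA_MARKS[ch] || ch }.join
  # A sigma at the end of a word becomes a final sigma.
  out = out.gsub(/σ(?=\P{L}|\z)/, "ς")
  # Compose base letters and marks into precomposed characters where possible.
  out.unicode_normalize(:nfc)
end

puts toy_beta_to_unicode("mh=nin a)/eide qea/")
puts toy_beta_to_unicode("lo/gos")
```

Going through combining marks and NFC keeps the tables small, and the output is NFC-normalized by construction, which is the form the rest of Tinycus expects.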

There is also a function Cleanup.clean_up_greek_beta_code that is meant to clean up stray beta code in documents that were supposed to have been converted to Unicode but still have some beta code lingering in them.

Performance

See comments at the top of scripts/benchmark.rb for some notes on algorithms I tried and their performance.