ffi-locale

A small gem to aid with locale-sensitive string comparison (collation), which Ruby lacks by default. Roughly based on Matz's rather ancient code. However, instead of creating a wrapper around these functions, I call them using FFI.

Scope

Everything this library does could be accomplished by adding two functions to ffi-libc. However, I didn't need any of the extra bindings ffi-libc would bring, and decided to separate the functionality.

The library offers only four functions, all of them thin wrappers over libc functionality. Three of them (setlocale, strcoll and strxfrm) appear in the Usage section.

Audience

You don't need ffi-locale if you:

  • are using your ORM & RDBMS to sort strings; both major open-source DBs have had good or decent support for years
  • will only ever be using ASCII
  • think i18n is only about translating some messages

You need ffi-locale if you:

Alternatives

  • twitter_cldr offers the same functionality, and much, much more.
  • ICU has collation, encoding detection and more.
  • sort-alphabetical does a kind of collation that sorts accented letters the same as their non-accented counterparts. It's not proper locale-sensitive collation, but might fit your needs.

Installation

Add this line to your Gemfile:

gem 'ffi-locale', github: 'k3rni/ffi-locale'

You need to install the GitHub version of this gem, because it was never pushed to RubyGems due to a naming conflict: RubyGems has seanohalpin's very similar gem under this name. Before reporting errors, check that you have not installed that gem by mistake.

Usage

strcoll approach (individual string comparison: transformation and comparison in one step):
irb> FFILocale.setlocale FFILocale::LC_COLLATE, 'pl_PL.UTF8'
irb> FFILocale.strcoll "łyk", "myk"
-1 # Correct collation order. In Polish alphabet, 'ł' comes between 'l' and 'm'.
irb> "łyk" <=> "myk"
1 # Incorrect collation. Correct with respect to Ruby semantics, which compares bytewise.
irb> %w(m l ł).sort { |a, b| FFILocale.strcoll a, b }
["l", "ł", "m"]
strxfrm approach (mass string sorting: bulk-transform first, then rely on Ruby built-in string comparison):
irb> strings = %w(Ágnes Andor Cecil Cvi Csaba Elemér Éva Géza Gizella György Győző Lóránd Lotár Lőrinc Lukács Orsolya Ödön Ulrika Üllő)
irb> FFILocale.setlocale FFILocale::LC_COLLATE, 'hu_HU.UTF8'
irb> sorted = strings.shuffle.sort_by{|s| FFILocale.strxfrm(s)}
=> ["Ágnes", "Andor", "Cecil", "Cvi", "Csaba", "Elemér", "Éva", "Géza", "Gizella", "György", "Győző", "Lóránd", "Lotár", "Lőrinc", "Lukács", "Orsolya", "Ödön", "Ulrika", "Üllő"]
irb> sorted == strings
true

One advantage of using strxfrm with sort_by is performance: the collation transform is computed only once for each item. Another is that sort_by makes it easier to sort by a compound value (e.g. multiple columns):

irb> FFILocale.setlocale FFILocale::LC_COLLATE, 'hu_HU.UTF8'
irb> [{name: "Ágnes", id: 789}, {name: "Andor", id: 456}, {name: "Ágnes", id: 123}].sort_by{|u| [FFILocale.strxfrm(u[:name]), u[:id]] }
=> [{:name=>"Ágnes", :id=>123}, {:name=>"Ágnes", :id=>789}, {:name=>"Andor", :id=>456}]
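
The performance point can be illustrated without touching libc at all. The sketch below is not part of ffi-locale: fake_xfrm is a hypothetical stand-in for an expensive transform such as FFILocale.strxfrm (it merely downcases the string and counts its own invocations), so the script runs on any Ruby without a particular locale installed. It compares how often the transform runs under sort with a block versus sort_by.

```ruby
# Hypothetical stand-in for an expensive collation transform such as
# FFILocale.strxfrm. It only downcases the string and counts calls.
calls = 0
fake_xfrm = lambda do |s|
  calls += 1
  s.downcase
end

words = %w[delta Alpha charlie Bravo echo]

# sort with a block: the transform runs twice for every comparison made.
calls = 0
by_sort = words.sort { |a, b| fake_xfrm.call(a) <=> fake_xfrm.call(b) }
sort_calls = calls

# sort_by: the transform runs exactly once per element, and the cached
# keys are then compared with the built-in String#<=>.
calls = 0
by_sort_by = words.sort_by { |s| fake_xfrm.call(s) }
sort_by_calls = calls

puts "sort:    #{sort_calls} transform calls"
puts "sort_by: #{sort_by_calls} transform calls"
```

Both calls produce the same order; only the number of transform invocations differs, and the gap widens as the list grows. When comparing with strcoll inside a sort block, the cost is one libc call per comparison rather than two transforms, but the count still grows with the number of comparisons rather than the number of items.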

Not implemented

  • Extensions to String class, to facilitate collation.
  • Altering default String sort order. Bad idea - won't be implemented.
  • Extensions to Array or Enumerable, to add or alter sort methods. Unnecessary, because passing blocks to sort and sort_by solves the issue (see example above).
  • Not tested beyond Linux. Patches are welcome.

Copyright

Copyright © 2011-2015 Krzysztof Zych. See LICENSE.txt for further details.