ICU Tournament

Canonicalises and matches person names with Western European characters.

Note: version 1.1.0 dropped support for characters beyond codepoint 255 and became independent of activesupport and i18n.

Installation

Tested with ruby 1.9.2, 1.9.3, 2.0.0 and 2.2.0.

gem install icu_name

Names

This class exists for two main purposes:

  • to normalise to a common format the different ways Irish person names are typed in practice

  • to be able to match two names even if they are not exactly the same in their original form

To create a name object, supply both the first and second names separately to the constructor.

robert = ICU::Name.new(' robert  j ', ' FISHER ')

Capitalisation, white space and punctuation will all be automatically corrected:

robert.name                                                 # => 'Robert J. Fischer'
robert.rname                                                # => 'Fischer, Robert J.'  (reversed name)

The input text, without any changes apart from white-space cleanup and the insertion of a comma (to separate the two names), is returned by the original method:

robert.original                                             # => 'FISCHER, robert j'

To avoid ambiguity when either the first or second names consist of multiple words, it is better to supply the two separately. If the full name is supplied alone to the constructor, without any indication of where the first names end, then the last distinct name is assumed to be the last name.

bobby = ICU::Name.new(' bobby  fischer ')
bobby.first                                                 # => 'Bobby'
bobby.last                                                  # => 'Fischer'

In this case, since the names were not supplied separately, the original text will not contain a comma:

bobby.original                                              # => 'bobby fischer'

Names will match even if one is missing middle initials or if a nickname is used for one of the first names.

bobby.match('Robert J.', 'Fischer')                         # => true

The method alternatives can be used to list alternatives to a given first or last name:

Name.new('Stephen', 'Orr').alternatives(:first)             # => ["Steve"]
Name.new('Michael Stephen', 'Orr').alternatives(:first)     # => ["Steve", "Mike", "Mick", "Mikey"],
Name.new('Oissine', 'Murphy').alternatives(:last)           # => ["Murchadha"],
Name.new('Mark', 'Orr').alternatives(:first)                # => []

By default the class uses a set of first and last name alternatives curated for the ICU. However, this can be customized (see below).

Supplying the match method with strings is equivalent to instantiating an instance with the same strings and then matching it. So, for example the following are equivalent:

robert.match('R.', 'Fischer')                               # => true
robert.match(ICU::Name.new('R.', 'Fischer'))                # => true

Here the inital R matches the first letter of Robert. However, nickname matches will not always work with initials. In the next example, the initial R does not match the first letter B of the nickname Bobby.

bobby.match('R. J.', 'Fischer')                             # => false

Some other ways last names are canonicalised are illustrated below:

ICU::Name.new('John', 'O Reilly').last                      # => "O'Reilly, John"
ICU::Name.new('dave', 'mcmanus').last                       # => "McManus, Dave"

Characters and Encoding

The class can only cope with Latin characters, including those with diacritics (accents). Hyphens, single quotes (which represent apostophes) and letters in the ISO-8859-1 range (e.g. “a”, “è”, “Ö”) are preserved, while everything else is removed (unsupported).

ICU::Name.new('éric', 'PRIÉ').name                          # => "Éric Prié"
ICU::Name.new('BARTŁOMIEJ', 'śliwa').name                   # => "Bartomiej Liwa"
ICU::Name.new('Սմբատ', 'Լպուտյան').name                     # => ""

The various accessors (first, last, name, rname, to_s, original) always return strings encoded in UTF-8, no matter what the input encoding.

eric = ICU::Name.new('éric'.encode("ISO-8859-1"), 'PRIÉ'.force_encoding("ASCII-8BIT"))
eric.rname                                                  # => "Prié, Éric"
eric.rname.encoding.name                                    # => "UTF-8"
eric.original                                               # => "PRIÉ, éric"
eric.original.encoding.name                                 # => "UTF-8"

Accented letters can be transliterated into their US-ASCII counterparts by setting the :chars option, which is available in all accessors. For example:

eric.rname(:chars => "US-ASCII")                            # => "Prie, Eric"
eric.original(:chars => "US-ASCII")                         # => "PRIE, eric"

Note that the character encoding of the strings returned is still UTF-8 in all cases. The same option also relaxes the need for accented characters to match exactly:

eric.match('Eric', 'Prie')                                  # => false
eric.match('Eric', 'Prie', :chars => "US-ASCII")            # => true

Customization of Alternative Names

We saw above how Bobby and Robert were able to match because, by default, the matcher is aware of some common English nicknames. These name alternatives can be customised to handle additional nicknames and other types of alternative names such as common spelling errors and player name changes.

The alternative names consist of two arrays, one for first names and one for last names. Each array element is itself an array of strings representing a set of equivalent names. Here, for example, are some of the default first name alternatives:

["Anthony", "Tony"]
["James", "Jim", "Jimmy", "Jamie"]
["Robert", "Bob", "Bobby"]
["Stephen", "Steve", "Steven"]
["Thomas", "Tom", "Tommy"]

The first of these means that Anthony and Tony are considered equivalent and can match.

ICU::Name.new("Tony", "Miles").match("Anthony", "Miles")    # => true

To change alternative name behaviour, you can replace the default alternatives with a customized set perhaps stored in a database or a YAML file, as illustrated below:

ICU::Name.reset_alternatives
data = YAML.load(File open "my_last_name_alternatives.yaml")
ICU::Name.load_alternatives(:last, data)
data = YAML.load(File open "my_first_name_alternatives.yaml")
ICU::Name.load_alternatives(:first, data)

Note that without the call to reset_alternatives, the new loaded alternatives add to, rather than replace, the defaults.

Other uses of alternatives is to cater for English and Irish versions of the same name, for example (last names):

[Murphy, Murchadha]

or for variations including spelling variations, for example (first names):

[Patrick, Pat, Paddy, Padraig, Padraic, Padhraig, Padhraic]

Conditional Alternatives

Normally, entries in the two arrays are just lists of alternative names. There is one exception to this however, when one of the entries (it doesn’t matter which one but, by convention, the last one) is a regular expression. Here is an example that might be added to the last name alternatives:

["Quinn", "Benjamin", /^(Debbie|Deborah)$/]

What this means is that the last names Quinn and Benjamin match but only when the first name matches the given regular expression. In this case it caters for a female whose last name changed after marriage.

Name.new("Debbie", "Quinn").match("Debbie", "Benjamin")     # => true
Name.new("Mark", "Quinn").match("Mark", "Benjamin")         # => false

Another example, this time for first names, is:

["Sean", "John", /^Bradley$/]

This caters for an individual who is known by two normally unrelated first names. The two first names only match when the last name is Bradley.

Name.new("John", "Bradley").match("Sean", "Bradley")        # => true
Name.new("John", "Alfred").match("Sean", "Alfred")          # => false

Author

Mark Orr, rating officer for the Irish Chess Union (ICU).