RubyDoc.info: File: README – Documentation for interscript (0.1.3)

Introduction

This repository contains interoperable transliteration schemes from:

ALA-LC
BGN/PCGN
ICAO
ISO
UN (by UNGEGN)
Many, many other script conversion system authorities.

The goal is to achieve interoperable transliteration schemes allowing quality comparisons.

Demonstration

These transliteration systems are used in the demo:

bgnpcgn-rus-Cyrl-Latn-1947: BGN/PCGN Romanization of Russian
iso-rus-Cyrl-Latn-iso9: ISO 9 Romanization of Russian
icao-rus-Cyrl-Latn-9303: ICAO MRZ Romanization of Russian
bas-rus-Cyrl-Latn-bss: Bulgaria Academy of Science Streamlined System for Russian

interscript screencast

Installation

Prerequisites

Linux:

apt-get install swig python3-setuptools

Windows:

choco install --no-progress swig

Interscript depends on Python and the sequitur-g2p module

pip3 install setuptools numpy
curl -sSL -o sequitur-g2p.zip https://github.com/sequitur-g2p/sequitur-g2p/archive/806273f.zip
pip3 install sequitur-g2p.zip

Interscript depends on Ruby. Once you manage to install Ruby, it’s easy.

gem install interscript

Usage

Assume you have a file ready in the source script like this:

cat <<EOT > rus-Cyrl.txt
Эх, тройка! птица тройка, кто тебя выдумал? знать, у бойкого народа ты
могла только родиться, в той земле, что не любит шутить, а
ровнем-гладнем разметнулась на полсвета, да и ступай считать версты,
пока не зарябит тебе в очи. И не хитрый, кажись, дорожный снаряд, не
железным схвачен винтом, а наскоро живьём с одним топором да долотом
снарядил и собрал тебя ярославский расторопный мужик. Не в немецких
ботфортах ямщик: борода да рукавицы, и сидит чёрт знает на чём; а
привстал, да замахнулся, да затянул песню — кони вихрем, спицы в
колесах смешались в один гладкий круг, только дрогнула дорога, да
вскрикнул в испуге остановившийся пешеход — и вон она понеслась,
понеслась, понеслась!

Н.В. Гоголь
EOT

You can run interscript on this text using different transliteration systems.

interscript rus-Cyrl.txt \
  --system=bgnpcgn-rus-Cyrl-Latn-1947 \
  --output=bgnpcgn-rus-Latn.txt

interscript rus-Cyrl.txt \
  --system=iso-rus-Cyrl-Latn-iso9 \
  --output=iso-rus-Latn.txt

interscript rus-Cyrl.txt \
  --system=icao-rus-Cyrl-Latn-9303 \
  --output=icao-rus-Latn.txt

interscript rus-Cyrl.txt \
  --system=bas-rus-Cyrl-Latn-bss \
  --output=bas-rus-Latn.txt

It is then easy to see the exact differences in rendering between the systems.

diff bgnpcgn-rus-Latn.txt bas-rus-Latn.txt

Adding transliteration system

Transliteration systems stored in a maps/ directory as YAML files. You can create a new file and add it to the directory.

The file should be named as <system-code>.yaml, where system-code is in accordance with ISO/CC 24229.

File structure

authority_id: bgnpcgn
id: 1947
language: rus
source_script: Cyrl
destination_script: Latn
name: ROMANIZATION OF RUSSIAN, BGN/PCGN 1947 System
url: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/807920/ROMANIZATION_OF_RUSSIAN.pdf
creation_date: 1947
confirmation_date: 2019-06
description: The BGN/PCGN system for Russian was adopted ...

notes:
  - The character e should be romanized ye initially, after the vowel ...

tests:
  - source: ДЛИННОЕ ПОКРЫВАЛО
    expected: DLINNOYE POKRYVALO
  - source: Еловая шишка
    expected: Yelovaya shishka

map:
  rules:
    - pattern: (?<=[АаЕеЁёИиОоУуЫыЭэЮюЯяЙйЪъЬь])\u0415 # Е after a, e, ё, и, о, у, ы, э, ю, я, й, ъ, ь
      result: Ye
    - pattern: \b\u0415 # Е initially
      result: Ye

  characters:
    "\u0410": "A"
    "\u0411": "B"
    "\u0412": "V"

Rules

The subsection rules is placed under the map key. All rules are applied in order they are placed before the subsection characters applying. Rules apply to an original text, not to a result of previous rules applying.

Each rule has pattern and result elements.

Pattern is a regex expression. It should be representing as a string without // or %r{} parentheses. For example \b\u0415. In case a rule is depend on previous or next content, lookahead or lookbehind could be used. For example a rule with the pattern (?⇐[АаЕеЁёИиОоУуЫыЭэЮюЯяЙйЪъЬь])\u0415 find every Е after upper or lower case symbols a, e, ё, и, о, у, ы, э, ю, я, й, ъ, ь.

Result is a replacement a for pattern’s match. It can contain a string, an Unicode characters specified by a hexadecimal number, a captured group reference. String with hexadecimal number or captured group reference should be double quoted. For example "Y\u00eb" or "\\1\u00b7\\2". Captured group are referred by double backslash and group’s number.

Because rules are applied in order, multiple rules applicable to the same segment of a string can be addressed by rule ordering, and rules can be used as priority over characters. For example:

map:
  rules:
    - pattern: \u03B3\u03B3    # γ (before Γ, Ξ, Χ)
      result: ng
    - pattern: (?<![Γγ])\u03B3(?=[ΕεέΗηήΙιίΥυύ])    # γ (before front vowels)
      result: y

(γι maps to yi; but γγ maps to ng. In the case of γγι, the first rule takes priority, and the transliteration is ngi: it makes the second rule impossible.)

map:
  rules:
    - pattern: (?<=\b)\u03BC[πΠ]  # μπ (initially)
      result: b
    - pattern: \u03BC[πΠ]         # μπ (medially)
      result: mb

(The first rule applies at the start of a word; the second rule does not specify a context, as it applies in all other cases not covered by the first rule.)

map:
  rules:
    - pattern: ";"
      result: "?"

  characters
    "\u00B7": ";

(This guarantees that any ; are converted to ? before any new ; are introduced; because all three are Latin script, they could be mixed up in ordering.)

Normally rules “bleed” each other: once a rule applies to a segment, that segment cannot trigger other rules, because it is already converted to Roman. Exceptionally, it will be necessary to have a rule add or remove characters in the original script, rather than transliterate them, so that the same context can be invoked by two rules in succession:

map:
  rules:
    - pattern: (?<=[АаЕеЁёИиОоУуЫыЭэЮюЯя])\u042b # Ы after any vowel character
      result: "\u00b7Ы"
    - pattern: \u042b(?=[АаУуЫыЭэ])              # Ы before а, у, ы, or э
      result: "Ы\u00b7"

(If the result were \u00B7Y, the second rule could not be applied afterwards; but we want ОЫУ to transliterate as O·Y·U. In order to make that happen, we preserve the Ы during the rules phase, resulting in О·Ы·У; we only convert the letters to Roman script in the characters phase.)

Testing transliteration systems

To test all transliteration systems in the maps/ directory, run:

bundle exec rspec

The command takes source texts from the test section, transforms them using rules and charmaps from the map key, and compares the results with expected: text from the source: section.

To test a specific transliteration system, set the environment variable TRANSLIT_SYSTEM to the system code of the desired system (i.e. the “basename” of the system’s YAML file):

TRANSLIT_SYSTEM=bgnpcgn-rus-Cyrl-Latn-1947 bundle exec rspec

ISCS system codes

In accordance with ISO/CC 24229, the system code identifying a script conversion system has the following components:

e.g. bgnpcgn-rus-Cyrl-Latn-1947:

bgnpcgn: the authority identifier
rus: an ISO 639-2 3-letter language code that this system applies to
Cyrl: an ISO 15924 script code, identifying the source script
Latn: an ISO 15924 script code, identifying the target script
1947: an identifier unit within the authority to identify this system

Covered languages

Currently the schemes cover Cyrillic, Armenian, Greek, Arabic and Hebrew.

Samples to play with

rus-Cyrl-1.txt: Copied from the XLS output from http://www.primorsk.vybory.izbirkom.ru/region/primorsk?action=show&global=true&root=254017025&tvd=4254017212287&vrn=100100067795849&prver=0&pronetvd=0&region=25&sub_region=25&type=242&vibid=4254017212287
rus-Cyrl-2.txt: Copied from the XLS output from http://www.yaroslavl.vybory.izbirkom.ru/region/yaroslavl?action=show&root=764013001&tvd=4764013188704&vrn=4764013188693&prver=0&pronetvd=0&region=76&sub_region=76&type=426&vibid=4764013188704

References

Reference documents are located at the interscript-references repository. Some specifications that have distribution limitations may not be reproduced there.