Langa - A language analyzer

Langa was created within a few weeks, after it was suggested to me that language recognition would be a fine extension to Lingo (see www.lex-lingo.de). The basic idea of how language recognition could be done came up within a few minutes, so a proof of concept was needed. Langa is the proof that this concept works, with few limitations.

Concept of Language Recognition

Every language has its own characteristic usage of characters: the set of characters used, the frequency of each character, the proportion of consonants, vowels and special characters, and the appearance of language-specific high-frequency words. For now, Langa concentrates on the first two aspects, character set and frequency. To that end, Langa processes a text file and extracts a language-specific fingerprint. Fingerprints are comparable: you can measure the distance between several fingerprints and declare the one with the shortest distance a match.
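
The idea can be sketched in a few lines of Ruby. This is only an illustration, not Langa's actual implementation: the method names fingerprint, distance and closest are hypothetical, only letters are counted, and the distance (sum of absolute differences of relative frequencies) is an assumed metric chosen for simplicity.

```ruby
# Hypothetical sketch, not Langa's real code.

# Build a fingerprint: relative frequency of each letter in the text.
def fingerprint(text)
  letters = text.downcase.scan(/\p{L}/)
  return {} if letters.empty?
  counts = Hash.new(0)
  letters.each { |ch| counts[ch] += 1 }
  counts.transform_values { |n| n.to_f / letters.size }
end

# Assumed metric: sum of absolute frequency differences over all characters
# that occur in either fingerprint (0.0 = identical, 2.0 = no shared letters).
def distance(a, b)
  (a.keys | b.keys).sum { |ch| (a.fetch(ch, 0.0) - b.fetch(ch, 0.0)).abs }
end

# Return the name of the reference with the shortest distance to the sample.
def closest(sample, references)
  references.min_by { |_name, ref| distance(sample, ref) }.first
end

# Toy references; real ones would come from large text files.
references = {
  'english' => fingerprint('the quick brown fox jumps over the lazy dog'),
  'german'  => fingerprint('der schnelle braune fuchs springt über den faulen hund')
}
puts closest(fingerprint('the dog and the fox'), references)
```

With references this small the result is not reliable; the sketch only shows the mechanics of comparing fingerprints.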

Langa.dna - The language fingerprints

To compare the fingerprint of a given file with the fingerprints of several languages, we need those language fingerprints first. So how do we get them? The easiest way is to take a large file in a given language, compute the fingerprint of that file and use it as the reference for the language. The first source for large language files was the ‘Wortschatz’ project of the University of Leipzig, Germany (see corpora.informatik.uni-leipzig.de/). It provides text files for 18 languages, of good quality and large enough for our purposes. The second source is the Unbound Bible (see www.unboundbible.org/), which offers the Bible translated into several languages (see examples/).

Quick Start

User's View

To see how Langa works from a user's point of view, run the following from the langa directory:

% bin/langa examples/*
examples/afrikaans_1953_utf8.txt............................Language is Afrikaans (afk)
examples/albanian_utf8.txt..................................Language is Albanian (sqi)
...
examples/wolof_utf8.txt.....................................Language is Wolof (wol)
examples/xhosa_utf8.txt.....................................Language is Xhosa (xho)

Developer's View

As a developer, to find out what language a file contains, call

require 'langa'

# => locate langa.dna
this_path = File.dirname(__FILE__)
langa_dna = File.join(this_path, '..', 'lib', 'langa', 'langa.dna')

# => process (file: path of the text file to analyze, codepage: its encoding)
la = LanguageAnalyzer.new(langa_dna)
lang = la.analyze(file, codepage)
puts 'Language is %s (%s)' % [la.config(lang)['name'], lang]

See the documentation for details.

Add a new language

If you want to add a new language, proceed as follows:

- Find a text file that contains lots of written sentences in the desired language.
  The bigger the file, the better the results. Let's call it language.txt, for example.
- Call langa from the command line with
  % bin/langa --dna language.txt
  please be patient, analyzing takes some time...
  <iso 639-3 code>:
      name:   <full language name>
      iso1:   <iso 639-1 code (optional)>
      source: examples/asv_utf8.txt
      size:   142256
      utf8:   eathondsirlmfuwbcygvpkjzxq
      fingerprint:    101-12616+97-10560+116-8889+104-8721+111-7544+110-7313+100-6081+115-5491+105-5355+114-4844+108-3465+109-2661+102-2396+117-2164+119-2016+98-1755+99-1664+121-1663+103-1587+118-1117+112-954+107-665+106-353+122-52+120-46+113-14
  % 

  Now paste the output (without the 'please be patient' message) into the langa.dna file.
  Replace the '<...>' strings with the correct values from the ISO 639-3 standard (see http://www.sil.org/iso639-3/codes.asp?order=reference_name&letter=a).
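
Judging from the sample output above, the fingerprint appears to be a '+'-separated list of <Unicode code point>-<count> pairs sorted by descending count, and the utf8 line shows the same characters in readable form. The following sketch decodes such a string under that assumption; parse_fingerprint is a hypothetical helper and not part of Langa.

```ruby
# Hypothetical helper, not part of Langa: decode a fingerprint string,
# assuming the format "<code point>-<count>+<code point>-<count>+...".
def parse_fingerprint(str)
  str.split('+').map do |pair|
    codepoint, count = pair.split('-').map { |s| Integer(s) }
    # Turn the numeric code point into a UTF-8 character.
    [[codepoint].pack('U'), count]
  end
end

# First four pairs of the sample fingerprint shown above.
pairs = parse_fingerprint('101-12616+97-10560+116-8889+104-8721')
pairs.each { |char, count| puts "#{char} #{count}" }
puts pairs.map(&:first).join   # prints "eath"
```

The joined characters match the beginning of the utf8 line in the sample, which is a quick way to check a pasted entry by eye.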