Class: LanguageAnalyzer

Inherits: Object

Defined in: lib/langa/languageanalyzer.rb

Overview

The class LanguageAnalyzer is the heart of Langa. It has two main use
cases:

= Recognize the language of a textfile
In this mode the LanguageAnalyzer identifies the language of a textfile by
comparing the fingerprint of the textfile against the ones documented in
the language configuration file 'language.dna'. Call

  la = LanguageAnalyzer.new
  la.analyze('german-file') -> 'deu'

If the input file needs codepage conversion, pass the codepage as the second
argument:

  la.analyze('german-file-iso-8859-1', '8859-1') -> 'deu'
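
How Langa performs the codepage conversion internally is not shown on this page. As a rough sketch of what such a conversion looks like with Ruby's built-in String#force_encoding and String#encode, the following uses a hypothetical helper named convert_codepage (not part of Langa) and Ruby's own encoding name 'ISO-8859-1'. Whether Langa maps the short form '8859-1' to that name is an assumption.

```ruby
# Hypothetical helper, not part of Langa: reinterpret raw bytes in the
# given source encoding and transcode them to UTF-8.
def convert_codepage(raw_bytes, source_encoding)
  raw_bytes.dup.force_encoding(source_encoding).encode('UTF-8')
end

# The German word "Gruesse" spelled with umlaut and sharp s, as
# ISO-8859-1 bytes: u-umlaut is 0xFC, sharp s is 0xDF.
latin1 = "Gr\xFC\xDFe".b
utf8 = convert_codepage(latin1, 'ISO-8859-1')

puts utf8.encoding   # the result is tagged as UTF-8
puts utf8.bytesize   # the two non-ASCII letters now take two bytes each
```

After the conversion, the text can be compared against fingerprints that were built from UTF-8 input.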

= Create a new language fingerprint
If you have a big textfile of a previously unknown language, you can 
calculate the fingerprint of this language and add it to the language
configuration file 'language.dna'. Call

  la = LanguageAnalyzer.new
  la.scan_language_dna('landir/*')

to scan all files from the landir directory. To automatically identify
the ISO 639 language codes and the codepage that should be used for reading,
name the input files in the form '<iso-code>.<Language>.<codepage>.txt', e.g.
'landir/deu.German.utf-8.txt'.

Instance Method Summary

#analyze(filename, codepage = 'utf-8', full_detail = false) ⇒ Object

Analyze the language of a file.
#config(key) ⇒ Object

Get the configuration of the language with the given key.
#initialize(language_file = 'language.dna') ⇒ LanguageAnalyzer constructor

Create a new instance of the LanguageAnalyzer.
#keys ⇒ Object

Get the keys of all known languages.
#scan_language_dna(pattern = '*', codepage = 'utf-8') ⇒ Object

Create a new dna fingerprint for a big language file.
#sources ⇒ Object

Get the source files of all known languages.

Constructor Details

#initialize(language_file = 'language.dna') ⇒ `LanguageAnalyzer`

Create a new instance of the LanguageAnalyzer

la = LanguageAnalyzer.new

# File 'lib/langa/languageanalyzer.rb', line 64

def initialize(language_file='language.dna')
  @languages = Languages.new(language_file)
end

Instance Method Details

#analyze(filename, codepage = 'utf-8', full_detail = false) ⇒ `Object`

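Analyze the language of a file. By default the method returns only the key of the closest language. With full_detail set to true it returns the complete protocol of the analysis: one [distance, key, name] entry per known language.

la.analyze('german-file-utf8') -> 'deu'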
la.analyze('german-file-iso-8859-1', '8859-1') -> 'deu'

# File 'lib/langa/languageanalyzer.rb', line 90

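def analyze( filename, codepage='utf-8', full_detail=false )
  # Build the fingerprint of the input file.
  dna = DNA.new
  dna.feed(filename, codepage)
  fp = dna.fingerprint

  # Score every known language as [distance, key, name].
  lang_score = Array.new
  @languages.keys.each do |key|
    lang = @languages.config(key)
    lang_score << [dna.distance(lang['dna']), key, lang['name']]
  end
  # Full detail: all scores ordered by language key.
  # Otherwise: the key of the language with the smallest distance.
  full_detail ? lang_score.sort {|a,b| a[1]<=>b[1]}  : lang_score.sort[0][1]
end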
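
The DNA class and its distance measure are not documented on this page. To illustrate only the selection step used above (score every language, then take the entry with the smallest distance), here is a self-contained sketch. The letter-frequency fingerprint, the distance function and the reference texts are illustrative stand-ins and assumptions, not Langa's actual algorithm or data.

```ruby
# Toy fingerprint: relative frequency of each letter in a text.
# Illustrative stand-in only, not Langa's DNA class.
def toy_fingerprint(text)
  letters = text.downcase.scan(/\p{L}/)
  counts = letters.each_with_object(Hash.new(0)) { |ch, h| h[ch] += 1 }
  counts.transform_values { |n| n.to_f / letters.size }
end

# Toy distance: sum of absolute frequency differences over all letters.
def toy_distance(fp_a, fp_b)
  (fp_a.keys | fp_b.keys).sum do |ch|
    (fp_a.fetch(ch, 0.0) - fp_b.fetch(ch, 0.0)).abs
  end
end

# Hypothetical reference fingerprints keyed by ISO 639-3 code.
references = {
  'deu' => toy_fingerprint("der die das und ist nicht ein ich sie mit auf f\u00FCr"),
  'eng' => toy_fingerprint('the and that have for not with you this but from they')
}

sample = toy_fingerprint("das ist nicht f\u00FCr sie")

# Same shape as lang_score in #analyze, minus the name: [distance, key].
scores = references.map { |key, fp| [toy_distance(sample, fp), key] }

# Arrays compare element by element, so min picks the smallest distance.
best = scores.min[1]
puts best

# Equivalent of the full_detail branch: all scores ordered by key.
detail = scores.sort_by { |_distance, key| key }
detail.each { |distance, key| puts format('%s %.3f', key, distance) }
```

Because the tuples start with the distance, `scores.sort[0]` and `scores.min` select the same entry; `min` simply avoids sorting the whole list.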

#config(key) ⇒ `Object`

Get the configuration of the language with the given key.

la.config('deu') -> {'name'=>'German', 'iso1'=>'de', ...}

# File 'lib/langa/languageanalyzer.rb', line 76

def config(key)
  @languages.config(key)
end

#keys ⇒ `Object`

Get the keys of all known languages.

la.keys -> ['deu', 'eng', ...]

# File 'lib/langa/languageanalyzer.rb', line 70

def keys
  @languages.keys.sort
end

#scan_language_dna(pattern = '*', codepage = 'utf-8') ⇒ `Object`

Create a new dna fingerprint for a big language file. The file should contain at least 100,000 letters; the more text, the better the fingerprint and therefore the better the language recognition. To scan all files from a directory, use a wildcard. To automatically identify the ISO 639 language codes and the codepage that should be used for reading, name the input files in the form ‘<iso-code>.<Language>.<codepage>.txt’, e.g. ‘landir/deu.German.utf-8.txt’. Note that in the source listed for this method the two lines that parse the file name are commented out, so the codepage argument is applied to every file and the printed entry contains placeholders for the language code and name.

la.scan_language_dna('landir/*')

Copy the output to the language configuration file ‘language.dna’.

# File 'lib/langa/languageanalyzer.rb', line 112

def scan_language_dna( pattern = '*', codepage = 'utf-8' )
  lang, language, cp = nil, nil, codepage
  Dir[ pattern ].each do |filename|
#      filename =~ %r|/([^\.]+)\.([^\.]+)\.([^\.]+)|
#      lang, language, cp = $1, $2, $3

    dna = DNA.new
    dna.feed(filename, cp)

    puts Languages.to_paste('<iso 639-3 code>', {
      'name' => '<full language name>',
      'iso1' => '<iso 639-1 code (optional)>',
      'source' => filename,
      'size' => dna.size,
      'utf8' => dna.to_utf8,
      'fingerprint' => dna.to_s })
  end
end
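
The commented-out lines in the listing above hint at how the file naming convention could be evaluated. The following sketch wraps the same regular expression in a hypothetical helper named corpus_metadata (not part of Langa) and falls back to the default codepage when a name does not follow the convention. The fallback behaviour is an assumption, not something the source confirms.

```ruby
# Hypothetical helper, not part of Langa: derive the ISO code, the
# language name and the codepage from a path such as
# 'landir/deu.German.utf-8.txt'. The pattern is the one from the
# commented-out lines and therefore expects a directory separator.
def corpus_metadata(filename, default_codepage = 'utf-8')
  match = filename.match(%r{/([^.]+)\.([^.]+)\.([^.]+)})
  # Assumed fallback: unknown language, caller-supplied codepage.
  match ? match.captures : [nil, nil, default_codepage]
end

lang, language, codepage = corpus_metadata('landir/deu.German.utf-8.txt')
puts "#{lang} #{language} #{codepage}"

# A name that does not follow the convention keeps the default codepage.
p corpus_metadata('landir/notes.txt', 'iso-8859-1')
```

The values returned by such a helper could replace the placeholders that scan_language_dna currently prints for the language code and name.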

#sources ⇒ `Object`

Get the source files of all known languages.

la.sources -> ["corpora/ger.german.utf-8.txt", ...]

# File 'lib/langa/languageanalyzer.rb', line 82

def sources
  @languages.values_for('source').keys
end
