Class: LanguageAnalyzer

Inherits:
Object
Defined in:
lib/langa/languageanalyzer.rb

Overview

The class LanguageAnalyzer is the heart of Langa. It has two main use cases:

= Recognize the language of a textfile
In this mode the LanguageAnalyzer identifies the language of a textfile by
comparing the fingerprint of the textfile against the ones documented in
the language configuration file 'language.dna'. Call

  la = LanguageAnalyzer.new
  la.analyze('german-file') -> 'deu'

If the input file needs codepage conversion, pass the codepage as the
second argument:

  la.analyze('german-file-iso-8859-1', '8859-1') -> 'deu'
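
What the codepage argument has to achieve can be sketched with Ruby's built-in transcoding. This is only an illustration under assumptions: the helper name to_utf8 is hypothetical, and how Langa itself converts its input is not shown on this page. The sketch uses Ruby's own encoding name 'ISO-8859-1' rather than the short form '8859-1' from the example above.

```ruby
# Hypothetical helper, not part of Langa: interpret raw bytes in the given
# codepage and return the same text as UTF-8.
def to_utf8(raw_bytes, codepage = 'utf-8')
  # force_encoding only relabels the bytes; encode performs the conversion.
  raw_bytes.dup.force_encoding(codepage).encode('UTF-8')
end

latin1 = "Gr\xFC\xDFe".b   # the word "Gruesse" with u-umlaut and sharp s, as ISO-8859-1 bytes
utf8 = to_utf8(latin1, 'ISO-8859-1')
puts utf8
```

Reading a file with the wrong codepage changes which characters are counted, so the fingerprint of the same text would differ.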

= Create a new language fingerprint
If you have a big textfile of a previously unknown language, you can 
calculate the fingerprint of this language and add it to the language
configuration file 'language.dna'. Call

  la = LanguageAnalyzer.new
  la.scan_language_dna('landir/*')

to scan all files in the landir directory. To automatically identify
the ISO 639 language code and the codepage to use for reading,
name the input files in the form '<iso-code>.<Language>.<codepage>.txt', e.g.
'landir/deu.German.utf-8.txt'.
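
The naming convention above can be taken apart with a single regular expression. The following is a minimal sketch, not Langa's own code: the constant CORPUS_NAME and the helper parse_corpus_name are hypothetical names introduced for illustration.

```ruby
# Hypothetical helper, not part of Langa: split a corpus filename of the form
# '<iso-code>.<Language>.<codepage>.txt' into its three parts.
CORPUS_NAME = /\A([^.]+)\.([^.]+)\.(.+)\.txt\z/

def parse_corpus_name(path)
  match = CORPUS_NAME.match(File.basename(path))
  return nil unless match   # the name does not follow the convention

  { code: match[1], name: match[2], codepage: match[3] }
end

p parse_corpus_name('landir/deu.German.utf-8.txt')
p parse_corpus_name('landir/notes.txt')
```

The first call returns a hash with the language code, the language name and the codepage; the second returns nil because the name has too few parts.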

Constructor Details

#initialize(language_file = 'language.dna') ⇒ LanguageAnalyzer

Create a new instance of the LanguageAnalyzer

la = LanguageAnalyzer.new


# File 'lib/langa/languageanalyzer.rb', line 64

def initialize(language_file='language.dna')
  @languages = Languages.new(language_file)
end

Instance Method Details

#analyze(filename, codepage = 'utf-8', full_detail = false) ⇒ Object

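Analyze the language of a file. By default the key of the best-matching language is returned. Set full_detail to true to get the complete protocol of the analysis instead: one [distance, key, name] entry per known language, sorted by language key.

la.analyze('german-file-utf8') -> 'deu'
la.analyze('german-file-iso-8859-1', '8859-1') -> 'deu'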


# File 'lib/langa/languageanalyzer.rb', line 90

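def analyze( filename, codepage='utf-8', full_detail=false )
  # Build the fingerprint of the input file.
  dna = DNA.new
  dna.feed(filename, codepage)
  fp = dna.fingerprint  # not used below
  
  # Collect one [distance, key, name] entry per known language.
  lang_score = Array.new
  @languages.keys.each do |key|
    lang = @languages.config(key)
    lang_score << [dna.distance(lang['dna']), key, lang['name']]
  end
  # Full detail: all entries, sorted by language key.
  # Otherwise: the key of the entry with the smallest distance.
  full_detail ? lang_score.sort {|a,b| a[1]<=>b[1]}  : lang_score.sort[0][1]
end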
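
The nearest-fingerprint selection in the listing above can be tried out without Langa. The sketch below swaps in a toy fingerprint (relative letter frequencies) and a toy distance (sum of absolute differences). Both are assumptions for illustration only, since the DNA class is not shown here and may work differently. The names letter_fingerprint, fingerprint_distance and rank_languages are hypothetical.

```ruby
# Toy fingerprint (an assumption, not Langa's DNA class):
# relative frequency of each letter in the text.
def letter_fingerprint(text)
  letters = text.downcase.scan(/\p{L}/)
  letters.tally.transform_values { |n| n.to_f / letters.size }
end

# Toy distance: sum of absolute frequency differences over all letters.
def fingerprint_distance(a, b)
  (a.keys | b.keys).sum { |ch| (a.fetch(ch, 0.0) - b.fetch(ch, 0.0)).abs }
end

# Build [distance, key] pairs and sort them, smallest distance first.
# Arrays compare element by element, so a plain sort orders by distance.
def rank_languages(text, known)
  sample = letter_fingerprint(text)
  known.map { |key, fp| [fingerprint_distance(sample, fp), key] }.sort
end

# Tiny made-up reference texts; real fingerprints need far more text.
known = {
  'deu' => letter_fingerprint('der die das und ist nicht ein eine ich sie wir'),
  'eng' => letter_fingerprint('the and that this is not with have you they we')
}

ranking = rank_languages('das ist nicht die eine', known)
puts ranking.first[1]   # key of the closest language
```

Putting the distance first in each entry is what lets a plain sort pick the best match. The listing above relies on the same ordering when it returns lang_score.sort[0][1].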

#config(key) ⇒ Object

Get the configuration of the language with the given key.

la.config('deu') -> {'name'=>'German', 'iso1'=>'de', ...}


# File 'lib/langa/languageanalyzer.rb', line 76

def config(key)
  @languages.config(key)
end

#keys ⇒ Object

Get the keys of all known languages.

la.keys -> ['deu', 'eng', ...]


# File 'lib/langa/languageanalyzer.rb', line 70

def keys
  @languages.keys.sort
end

#scan_language_dna(pattern = '*', codepage = 'utf-8') ⇒ Object

Create a new DNA fingerprint from a big language file. The file should have at least 100,000 letters; the more text, the better the fingerprint and therefore the language recognition. To scan all files in a directory, use a wildcard. To automatically identify the ISO 639 language code and the codepage to use for reading, name the input files in the form ‘<iso-code>.<Language>.<codepage>.txt’, e.g. ‘landir/deu.German.utf-8.txt’. Note that in the source listed below the filename parsing is commented out, so the codepage argument is used for every file and the language code and name are printed as placeholders.

la.scan_language_dna('landir/*')

Copy the output to the language configuration file ‘language.dna’.



# File 'lib/langa/languageanalyzer.rb', line 112

def scan_language_dna( pattern = '*', codepage = 'utf-8' )
  lang, language, cp = nil, nil, codepage
  Dir[ pattern ].each do |filename|
    # Filename parsing is disabled, so cp stays at the codepage argument:
    # filename =~ %r|/([^\.]+)\.([^\.]+)\.([^\.]+)|
    # lang, language, cp = $1, $2, $3

    # Build the fingerprint of this corpus file.
    dna = DNA.new
    dna.feed(filename, cp)

    # Print an entry with placeholders, ready to paste into 'language.dna'.
    puts Languages.to_paste('<iso 639-3 code>', {
      'name' => '<full language name>',
      'iso1' => '<iso 639-1 code (optional)>',
      'source' => filename,
      'size' => dna.size,
      'utf8' => dna.to_utf8,
      'fingerprint' => dna.to_s })
  end
end

#sources ⇒ Object

Get the source files of all known languages.

la.sources -> ["corpora/ger.german.utf-8.txt", ...]


# File 'lib/langa/languageanalyzer.rb', line 82

def sources
  @languages.values_for('source').keys
end