Class: LanguageAnalyzer
- Inherits: Object
- Defined in: lib/langa/languageanalyzer.rb
Overview
The class LanguageAnalyzer is the heart of Langa. It has two main use
cases:
= Recognize the language of a textfile
In this mode the LanguageAnalyzer identifies the language of a textfile by
comparing the fingerprint of the textfile against the ones documented in
the language configuration file 'language.dna'. Call
la = LanguageAnalyzer.new
la.analyze('german-file') -> 'deu'
If the input file needs codepage conversion, pass the codepage as the second argument:
la.analyze('german-file-iso-8859-1', '8859-1') -> 'deu'
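The idea of comparing fingerprints can be illustrated with a small stand-alone sketch. It assumes a plain letter-frequency profile and a sum-of-differences distance; Langa's own DNA class may compute its fingerprint differently, and the helper names fingerprint, distance and guess_language as well as the two reference sentences are made up for this example.

```ruby
# Hypothetical sketch: a letter-frequency fingerprint, NOT Langa's DNA class.

# Relative frequency of each letter in the text.
def fingerprint(text)
  letters = text.downcase.scan(/\p{L}/)
  return {} if letters.empty?
  letters.tally.transform_values { |count| count.to_f / letters.size }
end

# Sum of absolute frequency differences; 0.0 means identical profiles.
def distance(fp_a, fp_b)
  (fp_a.keys | fp_b.keys).sum do |ch|
    (fp_a.fetch(ch, 0.0) - fp_b.fetch(ch, 0.0)).abs
  end
end

# Return the key of the reference fingerprint closest to the text.
def guess_language(text, references)
  sample = fingerprint(text)
  references.min_by { |_code, ref| distance(sample, ref) }.first
end

# Tiny made-up reference texts standing in for the entries of 'language.dna'.
german  = 'der schnelle braune Fuchs springt über den faulen Hund und läuft weiter'
english = 'the quick brown fox jumps over the lazy dog and keeps on running'
references = { 'deu' => fingerprint(german), 'eng' => fingerprint(english) }

puts guess_language(german, references)
```

Reference texts this short give unreliable results for unseen input; the sketch only shows the mechanics of picking the smallest distance.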
= Create a new language fingerprint
If you have a big textfile of a previously unknown language, you can
calculate the fingerprint of this language and add it to the language
configuration file 'language.dna'. Call
la = LanguageAnalyzer.new
la.scan_language_dna('landir/*')
to scan all files from the landir directory. To automatically identify
the ISO 639 language code and the codepage that should be used for reading,
name the input files in the form '<iso-code>.<Language>.<codepage>.txt', e.g.
'landir/deu.German.utf-8.txt'.
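The naming convention can be split into its three parts with a short regular expression. This is a minimal sketch; the helper parse_corpus_filename is hypothetical and not part of Langa, and it assumes that none of the three parts contains a dot.

```ruby
# Hypothetical helper, not part of Langa: split a corpus filename of the form
# '<iso-code>.<Language>.<codepage>.txt' into its three parts.
def parse_corpus_filename(path)
  match = File.basename(path).match(/\A([^.]+)\.([^.]+)\.([^.]+)\.txt\z/)
  return nil unless match # filename does not follow the convention
  { iso: match[1], language: match[2], codepage: match[3] }
end

parts = parse_corpus_filename('landir/deu.German.utf-8.txt')
puts parts[:iso]      # language code
puts parts[:language] # full language name
puts parts[:codepage] # codepage to read the file with
```

Returning nil for non-matching names lets a caller fall back to a default codepage instead of failing.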
Instance Method Summary
-
#analyze(filename, codepage = 'utf-8', full_detail = false) ⇒ Object
Analyze the language of a file.
-
#config(key) ⇒ Object
Get the configuration of a known language.
-
#initialize(language_file = 'language.dna') ⇒ LanguageAnalyzer
constructor
Create a new instance of the LanguageAnalyzer: la = LanguageAnalyzer.new
-
#keys ⇒ Object
Get the keys of all known languages.
-
#scan_language_dna(pattern = '*', codepage = 'utf-8') ⇒ Object
Create a new dna fingerprint for a big language file.
-
#sources ⇒ Object
Get the source files of all known languages.
Constructor Details
#initialize(language_file = 'language.dna') ⇒ LanguageAnalyzer
Create a new instance of the LanguageAnalyzer
la = LanguageAnalyzer.new
# File 'lib/langa/languageanalyzer.rb', line 64
def initialize(language_file='language.dna')
  @languages = Languages.new(language_file)
end
Instance Method Details
#analyze(filename, codepage = 'utf-8', full_detail = false) ⇒ Object
Analyze the language of a file. With the full_detail toggle you get the
complete list of scores for all known languages instead of only the best match.
la.analyze('german-file-utf8') -> 'deu'
la.analyze('german-file-iso-8859-1', '8859-1') -> 'deu'
# File 'lib/langa/languageanalyzer.rb', line 90
def analyze( filename, codepage='utf-8', full_detail=false )
  dna = DNA.new
  dna.feed(filename, codepage)
  fp = dna.fingerprint
  lang_score = Array.new
  @languages.keys.each do |key|
    lang = @languages.config(key)
    lang_score << [dna.distance(lang['dna']), key, lang['name']]
  end
  full_detail ? lang_score.sort {|a,b| a[1]<=>b[1]} : lang_score.sort[0][1]
end
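The last line of the listing relies on Ruby comparing arrays element by element: a plain sort orders the [distance, key, name] triples by distance first, while the full_detail branch sorts by language key. The following sketch reproduces both orderings with made-up distances; real values would come from DNA#distance.

```ruby
# Made-up scores for illustration only: [distance, key, name].
lang_score = [
  [0.42, 'eng', 'English'],
  [0.07, 'deu', 'German'],
  [0.35, 'nld', 'Dutch']
]

# Default branch: the smallest distance sorts first, so its key is the result.
best_key = lang_score.sort[0][1]

# full_detail branch: the same block as in the listing, ordered by language key.
by_key = lang_score.sort { |a, b| a[1] <=> b[1] }

# A caller who wants a ranking can re-sort the detailed result by distance.
ranking = by_key.sort_by(&:first).map { |_distance, key, _name| key }

puts best_key
puts ranking.join(', ')
```

Because the detailed result is ordered by key rather than by score, re-sorting by the first element is needed to read it as a ranking.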
#config(key) ⇒ Object
Get the configuration of a known language.
la.config('deu') -> {'name'=>'German', 'iso1'=>'de', ...}
# File 'lib/langa/languageanalyzer.rb', line 76
def config(key)
  @languages.config(key)
end
#keys ⇒ Object
Get the keys of all known languages.
la.keys -> ['deu', 'eng', ...]
# File 'lib/langa/languageanalyzer.rb', line 70
def keys
  @languages.keys.sort
end
#scan_language_dna(pattern = '*', codepage = 'utf-8') ⇒ Object
Create a new dna fingerprint for a big language file. The file should have at least 100,000 letters. The more letters, the better the quality of the fingerprint and therefore the quality of language recognition. To scan all files from a directory, use a wildcard. To automatically identify the ISO 639 language code and the codepage that should be used for reading, name the input files in the form ‘<iso-code>.<Language>.<codepage>.txt’, e.g. ‘landir/deu.German.utf-8.txt’.
la.scan_language_dna('landir/*')
Copy the output to the language configuration file ‘language.dna’.
# File 'lib/langa/languageanalyzer.rb', line 112
def scan_language_dna( pattern = '*', codepage = 'utf-8' )
  lang, language, cp = nil, nil, codepage
  Dir[ pattern ].each do |filename|
    # filename =~ %r|/([^\.]+)\.([^\.]+)\.([^\.]+)|
    # lang, language, cp = $1, $2, $3
    dna = DNA.new
    dna.feed(filename, cp)
    puts Languages.to_paste('<iso 639-3 code>', {
      'name' => '<full language name>',
      'iso1' => '<iso 639-1 code (optional)>',
      'source' => filename,
      'size' => dna.size,
      'utf8' => dna.to_utf8,
      'fingerprint' => dna.to_s
    })
  end
end
#sources ⇒ Object
Get the source files of all known languages.
la.sources -> ["corpora/ger.german.utf-8.txt", ...]
# File 'lib/langa/languageanalyzer.rb', line 82
def sources
  @languages.values_for('source').keys
end