Class: Treat::Workers::Extractors::Language::WhatLanguage
- Inherits: Object
- Object
- Treat::Workers::Extractors::Language::WhatLanguage
- Defined in:
- lib/treat/workers/extractors/language/what_language.rb
Overview
Language detection using a probabilistic algorithm that checks for the presence of words with Bloom filters built from dictionaries for each language.
Original paper: Grothoff. 2007. A Quick Introduction to Bloom Filters. Department of Computer Sciences, Purdue University.
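The approach described in the overview can be sketched in a few lines of Ruby. This is a minimal illustration only, not the code of the 'whatlanguage' gem: the TinyBloomFilter class, the toy word lists, the filter size, and the CRC32-based hashing are all assumptions chosen to keep the example self-contained.

```ruby
require 'zlib'

# Hypothetical, minimal Bloom filter: a bit array plus k salted hash functions.
class TinyBloomFilter
  def initialize(size = 1024, hashes = 3)
    @size = size
    @hashes = hashes
    @bits = Array.new(size, false)
  end

  # Set every bit position derived from the word.
  def add(word)
    positions(word).each { |pos| @bits[pos] = true }
  end

  # May report a false positive, but never a false negative.
  def include?(word)
    positions(word).all? { |pos| @bits[pos] }
  end

  private

  # Derive k positions by salting a CRC32 checksum with the hash index.
  def positions(word)
    (0...@hashes).map { |i| Zlib.crc32("#{i}:#{word.downcase}") % @size }
  end
end

# Toy dictionaries, for illustration only.
DICTIONARIES = {
  english: %w[the and is of to in that it],
  french:  %w[le la et est de un que il]
}

# One filter per language, filled from that language's dictionary.
FILTERS = DICTIONARIES.each_with_object({}) do |(language, words), filters|
  filter = TinyBloomFilter.new
  words.each { |word| filter.add(word) }
  filters[language] = filter
end

# Score each language by the number of words its filter reports as present.
def score_languages(text)
  words = text.downcase.scan(/[[:alpha:]]+/)
  FILTERS.each_with_object({}) do |(language, filter), scores|
    scores[language] = words.count { |word| filter.include?(word) }
  end
end
```

Calling score_languages("the cat is in the house") returns a hash of per-language counts. Because a Bloom filter can report false positives, the counts are upper bounds on the true number of dictionary hits, which is why the detection is probabilistic.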
Constant Summary
- DefaultOptions =
By default, bias towards common languages.
{ :bias_toward => [:english, :french, :chinese, :german, :arabic, :spanish] }
- @@detector =
Keep only one instance of the gem class.
nil
Class Method Summary
- .language(entity, options = {}) ⇒ Object
Detect the language of an entity using the ‘whatlanguage’ gem.
Class Method Details
.language(entity, options = {}) ⇒ Object
Detect the language of an entity using the ‘whatlanguage’ gem. Return an identifier corresponding to the ISO-639-2 code for the language.
Options:
- (Array of Symbols) :bias_toward => Languages to bias toward when more than one language is detected with equal probability.
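Because the method merges the caller's options over DefaultOptions with Hash#merge, a caller-supplied :bias_toward list replaces the default list rather than extending it. The sketch below shows that behaviour in isolation; DEFAULT_OPTIONS and effective_options are hypothetical stand-ins, with the hash contents copied from the constant summary.

```ruby
# Stand-in for the DefaultOptions constant documented above.
DEFAULT_OPTIONS = {
  :bias_toward => [:english, :french, :chinese, :german, :arabic, :spanish]
}.freeze

# Hypothetical helper: returns the options the method would work with.
def effective_options(options = {})
  DEFAULT_OPTIONS.merge(options)
end

effective_options[:bias_toward].first               # => :english
effective_options(:bias_toward => [:dutch])[:bias_toward]  # => [:dutch]
```

To keep the defaults and add a language, a caller would have to pass the combined list explicitly.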
# File 'lib/treat/workers/extractors/language/what_language.rb', line 34

def self.language(entity, options = {})
  options = DefaultOptions.merge(options)
  @@detector ||= ::WhatLanguage.new(:all)
  possibilities = @@detector.process_text(entity.to_s)
  lang = {}
  possibilities.each do |k,v|
    lang[k.intern] = v
  end
  max = lang.values.max
  ordered = lang.select { |i,j| j == max }.keys
  ordered.each do |l|
    if options[:bias_toward].include?(l)
      return l
    end
  end
  return ordered.first
end
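The tie-breaking step at the end of the method can be tried without the gem installed. The following is a sketch under stated assumptions: pick_language is a hypothetical helper that mirrors the selection logic in the listing, and the score hash is made-up data standing in for the detector's output.

```ruby
# Hypothetical helper mirroring the selection logic of .language:
# keep the top-scoring languages, prefer one that is in the bias list,
# and otherwise fall back to the first top-scoring language.
def pick_language(scores, bias_toward)
  max = scores.values.max
  tied = scores.select { |_, score| score == max }.keys
  tied.find { |language| bias_toward.include?(language) } || tied.first
end

# Made-up scores in which two languages tie for first place.
scores = { dutch: 3, english: 3, german: 1 }

pick_language(scores, [:english, :french])  # => :english
pick_language(scores, [])                   # => :dutch
```

Note that the bias list only matters among languages that share the highest score; a biased language with a lower score is never chosen over the top scorer.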