Class: Treat::Workers::Extractors::Language::WhatLanguage

Inherits:
Object
Defined in:
lib/treat/workers/extractors/language/what_language.rb

Overview

Language detection using a probabilistic algorithm that checks for the presence of words with Bloom filters built from dictionaries for each language.

Original paper: Grothoff. 2007. A Quick Introduction to Bloom Filters. Department of Computer Sciences, Purdue University.
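The idea described above can be illustrated with a toy Bloom filter per language. This is only a sketch under stated assumptions: `TinyBloomFilter`, `DICTIONARIES`, `FILTERS` and `score_text` are hypothetical names, the word lists are made up, and the hashing scheme (seeded MD5) is chosen for simplicity. It is not the implementation used by the whatlanguage gem.

```ruby
require 'digest'

# Hypothetical minimal Bloom filter: a bit array plus several hash positions per word.
class TinyBloomFilter
  def initialize(size_in_bits = 1024, hash_count = 3)
    @size = size_in_bits
    @hash_count = hash_count
    @bits = Array.new(size_in_bits, false)
  end

  # Set every bit position derived from the word.
  def add(word)
    positions(word).each { |i| @bits[i] = true }
  end

  # May report true for a word that was never added (false positive),
  # but never reports false for a word that was added.
  def include?(word)
    positions(word).all? { |i| @bits[i] }
  end

  private

  # Derive hash_count positions by salting the word with a seed (an assumption
  # made for this sketch, not the gem's hashing scheme).
  def positions(word)
    (0...@hash_count).map do |seed|
      Digest::MD5.hexdigest("#{seed}:#{word}").to_i(16) % @size
    end
  end
end

# Made-up toy dictionaries; real dictionaries hold many thousands of words.
DICTIONARIES = {
  english: %w[the and house water],
  french:  %w[le et maison eau]
}

# One filter per language, filled from that language's dictionary.
FILTERS = DICTIONARIES.transform_values do |words|
  filter = TinyBloomFilter.new
  words.each { |w| filter.add(w) }
  filter
end

# Count, per language, how many words of the text the filter claims to contain.
def score_text(text)
  words = text.downcase.scan(/\p{L}+/)
  FILTERS.transform_values { |filter| words.count { |w| filter.include?(w) } }
end

scores = score_text("The house and the water")
puts scores.inspect
```

Because a Bloom filter can return false positives, the per-language counts are estimates; this is why the detection is described as probabilistic.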

Constant Summary

DefaultOptions =

By default, bias towards common languages.

{
  :bias_toward => [:english, :french, :chinese, :german, :arabic, :spanish]
}
@@detector =

Keep only one instance of the gem class.

nil

Class Method Summary

Class Method Details

.language(entity, options = {}) ⇒ Object

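Detect the language of an entity using the ‘whatlanguage’ gem. Returns the gem's identifier for the best-scoring language as a Symbol, the same kind of identifier used in :bias_toward (for example :english), or nil if no language was scored.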

Options:

  • (Array of Symbols) :bias_toward => Languages to prefer when more than one language is detected with equal probability.
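The tie-breaking rule can be sketched without the gem by feeding in a plain hash of scores. `pick_language` is a hypothetical helper written for illustration and is not part of Treat; the scores are invented.

```ruby
# Hypothetical helper mirroring the tie-breaking rule; not part of Treat.
def pick_language(scores, bias_toward)
  max = scores.values.max
  # Languages tied for the highest score, in the order they were reported.
  tied = scores.select { |_, score| score == max }.keys
  # Prefer the first tied language found in the bias list, else the first tied one.
  tied.find { |lang| bias_toward.include?(lang) } || tied.first
end

bias = [:english, :french, :chinese, :german, :arabic, :spanish]

puts pick_language({ dutch: 4, german: 4, pinyin: 1 }, bias).inspect  # => :german
puts pick_language({ dutch: 4, swedish: 4 }, bias).inspect            # => :dutch
```

Note that the bias list only matters when there is a tie: a language with a strictly higher score always wins, whether or not it is in the list.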



# File 'lib/treat/workers/extractors/language/what_language.rb', line 34

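def self.language(entity, options = {})
  options = DefaultOptions.merge(options)

  # Build the detector once and reuse it across calls.
  @@detector ||= ::WhatLanguage.new(:all)

  # Get a score per language and normalize the keys to Symbols.
  possibilities = @@detector.process_text(entity.to_s)
  lang = {}
  possibilities.each do |k, v|
    lang[k.intern] = v
  end

  # Keep only the languages tied for the highest score.
  max = lang.values.max
  ordered = lang.select { |_, score| score == max }.keys

  # Prefer the first tied language that appears in the bias list.
  ordered.each do |l|
    return l if options[:bias_toward].include?(l)
  end

  # Otherwise fall back to the first tied language.
  ordered.first
end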