Class: Opener::LanguageIdentifier::Detector

Inherits:
Object
Defined in:
lib/opener/language_identifier/detector.rb

Overview

Ruby wrapper around the Cybozu DetectorFactory and Detector classes. This class automatically handles switching of profiles based on input sizes, assigning priorities to languages, etc.

Constant Summary

DEFAULT_PROFILES_PATH =

Path to the directory containing the default profiles.

Returns:

  • (String)
File.expand_path(
  '../../../../core/target/classes/profiles',
  __FILE__
)
DEFAULT_SHORT_PROFILES_PATH =

Path to the directory containing the default short profiles.

Returns:

  • (String)
File.expand_path(
  '../../../../core/target/classes/short_profiles',
  __FILE__
)
SHORT_THRESHOLD =

The input length, in characters, up to and including which the short profiles are used. Longer inputs make the detector switch to the regular (longer) profiles set.

Returns:

  • (Fixnum)
15
PRIORITIES =

Prioritize OpeNER languages over the rest. Languages not covered by this list are automatically given a default priority.

Returns:

  • (Hash)
{
  'en' => 1.0,
  'es' => 0.9,
  'it' => 0.9,
  'fr' => 0.9,
  'de' => 0.9,
  'nl' => 0.9,

  # These languages are disabled (for the time being) due to conflicting
  # with other (OpeNER) languages too often.
  'af' => 0.0, # conflicts with Dutch
}
DEFAULT_PRIORITY =

The default priority for non-OpeNER languages.

Returns:

  • (Float)
0.5


Constructor Details

#initialize(options = {}) ⇒ Detector

Returns a new instance of Detector.

Parameters:

  • options (Hash) (defaults to: {})

Options Hash (options):

  • :profiles_path (String)
  • :short_profiles_path (String)


# File 'lib/opener/language_identifier/detector.rb', line 71

def initialize(options = {})
  options.each do |key, value|
    instance_variable_set("@#{key}", value) if respond_to?(key)
  end

  @profiles_path       ||= DEFAULT_PROFILES_PATH
  @short_profiles_path ||= DEFAULT_SHORT_PROFILES_PATH
end
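
The option handling above can be tried without JRuby using a small stand-in class. This is only a sketch: `OptionsSketch` and the placeholder paths are hypothetical and not part of the library. It shows that keys matching a public reader are stored, unknown keys are silently ignored, and missing paths fall back to the defaults.

```ruby
# Hypothetical stand-in class, not the real Detector. It mirrors only the
# option handling shown above so the behaviour can be run on plain Ruby.
class OptionsSketch
  attr_reader :profiles_path, :short_profiles_path

  # Placeholder values (assumptions), not the library's real default paths.
  DEFAULT_PROFILES_PATH       = '/tmp/profiles'
  DEFAULT_SHORT_PROFILES_PATH = '/tmp/short_profiles'

  def initialize(options = {})
    options.each do |key, value|
      # Only keys for which the object has a public method are stored.
      instance_variable_set("@#{key}", value) if respond_to?(key)
    end

    @profiles_path       ||= DEFAULT_PROFILES_PATH
    @short_profiles_path ||= DEFAULT_SHORT_PROFILES_PATH
  end
end

sketch = OptionsSketch.new(:profiles_path => '/custom', :unknown => 1)

puts sketch.profiles_path                         # the custom path
puts sketch.short_profiles_path                   # the placeholder default
puts sketch.instance_variable_defined?(:@unknown) # the unknown key was skipped
```

Because the guard is `respond_to?(key)`, a misspelled option name does not raise an error; it is simply dropped and the default is used instead.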

Instance Attribute Details

#profiles_path ⇒ Object (readonly)

Returns the value of attribute profiles_path.



# File 'lib/opener/language_identifier/detector.rb', line 9

def profiles_path
  @profiles_path
end

#short_profiles_path ⇒ Object (readonly)

Returns the value of attribute short_profiles_path.



# File 'lib/opener/language_identifier/detector.rb', line 9

def short_profiles_path
  @short_profiles_path
end

Instance Method Details

#build_priorities(input, languages) ⇒ java.util.HashMap

Builds a Java HashMap that maps every language to its priority, covering both OpeNER and non-OpeNER languages.

If the input is no longer than the short profiles threshold (SHORT_THRESHOLD), non-OpeNER languages are given a priority of 0.0, which disables them. This is to ensure that OpeNER languages are detected properly when analysing only 1-2 words.

Parameters:

  • input (String)
  • languages (Array<String>)

Returns:

  • (java.util.HashMap)


# File 'lib/opener/language_identifier/detector.rb', line 136

def build_priorities(input, languages)
  priorities = java.util.HashMap.new
  priority   = short_input?(input) ? 0.0 : DEFAULT_PRIORITY

  PRIORITIES.each do |lang, val|
    priorities.put(lang, val)
  end

  languages.each do |language|
    unless priorities.contains_key(language)
      priorities.put(language, priority)
    end
  end

  return priorities
end
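
The same priority logic can be sketched with a plain Ruby Hash in place of `java.util.HashMap`, which makes it runnable outside JRuby. The module name `PrioritySketch` is hypothetical; the constant values are copied from the constants documented above.

```ruby
# Plain-Ruby sketch of the priority map. A Hash stands in for
# java.util.HashMap; PrioritySketch is a hypothetical name.
module PrioritySketch
  SHORT_THRESHOLD  = 15
  DEFAULT_PRIORITY = 0.5
  PRIORITIES = {
    'en' => 1.0, 'es' => 0.9, 'it' => 0.9,
    'fr' => 0.9, 'de' => 0.9, 'nl' => 0.9,
    'af' => 0.0 # disabled, conflicts with Dutch
  }.freeze

  module_function

  def build_priorities(input, languages)
    # Short inputs disable every language without an explicit priority.
    fallback   = input.length <= SHORT_THRESHOLD ? 0.0 : DEFAULT_PRIORITY
    priorities = PRIORITIES.dup

    languages.each do |language|
      priorities[language] = fallback unless priorities.key?(language)
    end

    priorities
  end
end

languages = %w[en nl ja]

short = PrioritySketch.build_priorities('hola', languages)
long  = PrioritySketch.build_priorities('this sentence is clearly longer', languages)

puts short['ja'] # 0.0, disabled for short input
puts long['ja']  # 0.5, the default priority
puts long['en']  # 1.0, explicit priorities are never overwritten
```

Explicit entries always win: a language listed in `PRIORITIES` keeps its value regardless of input length, so `'af'` stays at 0.0 even for long inputs.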

#detect(input) ⇒ String

Returns the language code detected for the input, or "unknown" when the underlying Java detector raises a LangDetectException.

Returns:

  • (String)


# File 'lib/opener/language_identifier/detector.rb', line 83

def detect(input)
  return new_detector(input).detect

# The core Java code raises an exception when it can't detect a language.
# Since this isn't actually something fatal we'll capture this and return
# "unknown" instead.
rescue com.cybozu.labs.langdetect.LangDetectException
  return 'unknown'
end
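
The method-level `rescue` fallback can be illustrated without the Java classes. In this sketch `FakeLangDetectError` and `FakeDetector` are hypothetical stand-ins for `com.cybozu.labs.langdetect.LangDetectException` and the Cybozu detector; the fake always answers "en" for non-empty input and is not a real language detector.

```ruby
# Hypothetical stand-in for com.cybozu.labs.langdetect.LangDetectException.
class FakeLangDetectError < StandardError; end

# Hypothetical detector: fails on blank input, otherwise always says "en".
class FakeDetector
  def initialize(input)
    @input = input
  end

  def detect
    raise FakeLangDetectError, 'no features in text' if @input.strip.empty?

    'en'
  end
end

def detect_sketch(input)
  FakeDetector.new(input).detect
# A detection failure is not fatal, so it is mapped to a sentinel value.
rescue FakeLangDetectError
  'unknown'
end

puts detect_sketch('hello world') # en
puts detect_sketch('   ')         # unknown
```

Callers therefore never see the exception and only need to compare the result against the string "unknown".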

#determine_profiles(input) ⇒ String

Parameters:

  • input (String)

Returns:

  • (String)


# File 'lib/opener/language_identifier/detector.rb', line 157

def determine_profiles(input)
  return short_input?(input) ? short_profiles_path : profiles_path
end
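
A minimal sketch of the profile selection, including the boundary at the threshold. `ProfileSketch` is a hypothetical module and the two path strings are placeholders; the comparison mirrors `short_input?`, which uses `<=`, so an input of exactly 15 characters still counts as short.

```ruby
# Hypothetical module mirroring short_input? and determine_profiles.
module ProfileSketch
  SHORT_THRESHOLD = 15

  module_function

  def short_input?(input)
    input.length <= SHORT_THRESHOLD
  end

  # The paths are passed in here (an assumption made for the sketch);
  # the real method reads them from the instance's attributes.
  def determine_profiles(input, short_path, long_path)
    short_input?(input) ? short_path : long_path
  end
end

puts ProfileSketch.determine_profiles('a' * 15, 'short_profiles', 'profiles') # short_profiles
puts ProfileSketch.determine_profiles('a' * 16, 'short_profiles', 'profiles') # profiles
```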

#new_detector(input) ⇒ CybozuDetector

Returns a new detector with the profiles set based on the input.

This method analyses a lowercased version of the input as this yields better results for short text.

Parameters:

  • input (String)

Returns:

  • (CybozuDetector)


# File 'lib/opener/language_identifier/detector.rb', line 109

def new_detector(input)
  factory = com.cybozu.labs.langdetect.DetectorFactory.new

  factory.load_profile(determine_profiles(input))
  factory.set_seed(1)

  priorities = build_priorities(input, factory.langlist)
  detector   = com.cybozu.labs.langdetect.Detector.new(factory)

  detector.set_prior_map(priorities)
  detector.append(input.downcase)

  return detector
end

#probabilities(input) ⇒ Array

Returns:

  • (Array)


# File 'lib/opener/language_identifier/detector.rb', line 96

def probabilities(input)
  return new_detector(input).get_probabilities.to_array
end

#short_input?(input) ⇒ TrueClass|FalseClass

Parameters:

  • input (String)

Returns:

  • (TrueClass|FalseClass)


# File 'lib/opener/language_identifier/detector.rb', line 165

def short_input?(input)
  return input.length <= SHORT_THRESHOLD
end