Class: Opener::LanguageIdentifier::Detector

Inherits:
Object
Defined in:
lib/opener/language_identifier/detector.rb

Overview

Ruby wrapper around the Cybozu DetectorFactory and Detector classes. This class automatically switches between the regular and the short profiles based on the input length and assigns a priority to each language.

Constant Summary

DEFAULT_PROFILES_PATH =

Path to the directory containing the default profiles.

Returns:

  • (String)
File.expand_path(
  '../../../../core/target/classes/profiles',
  __FILE__
)
DEFAULT_SHORT_PROFILES_PATH =

Path to the directory containing the default short profiles.

Returns:

  • (String)
File.expand_path(
  '../../../../core/target/classes/short_profiles',
  __FILE__
)
SHORT_THRESHOLD =

The maximum input length, in characters, that is still treated as short input. Longer input is analysed with the regular (longer) profiles set.

Returns:

  • (Fixnum)
15
PRIORITIES =

Prioritize OpeNER languages over the rest. Languages not covered by this list are automatically given a default priority.

Returns:

  • (Hash)
{
  'en' => 1.0,
  'es' => 0.9,
  'it' => 0.9,
  'fr' => 0.9,
  'de' => 0.9,
  'nl' => 0.9
}
DEFAULT_PRIORITY =

The default priority for non-OpeNER languages.

Returns:

  • (Float)
0.5

Constructor Details

#initialize(options = {}) ⇒ Detector

Returns a new instance of Detector.

Parameters:

  • options (Hash) (defaults to: {})

Options Hash (options):

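  • :profiles_path (String): directory containing the regular profiles, defaults to DEFAULT_PROFILES_PATH
  • :short_profiles_path (String): directory containing the short profiles, defaults to DEFAULT_SHORT_PROFILES_PATH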


# File 'lib/opener/language_identifier/detector.rb', line 67

def initialize(options = {})
  options.each do |key, value|
    instance_variable_set("@#{key}", value) if respond_to?(key)
  end

  @profiles_path       ||= DEFAULT_PROFILES_PATH
  @short_profiles_path ||= DEFAULT_SHORT_PROFILES_PATH
end
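
The option handling above can be tried in isolation. The following is a minimal sketch using a hypothetical stand-in class, `SketchDetector`, with placeholder paths instead of the real default constants. It copies only the constructor logic and does not load any profiles.

```ruby
# Hypothetical stand-in that mirrors only the constructor's option handling.
class SketchDetector
  attr_reader :profiles_path, :short_profiles_path

  # Placeholder values, not the real default paths.
  DEFAULT_PROFILES_PATH       = '/tmp/profiles'
  DEFAULT_SHORT_PROFILES_PATH = '/tmp/short_profiles'

  def initialize(options = {})
    options.each do |key, value|
      # Only keys that match a public method (here: the readers) are stored.
      instance_variable_set("@#{key}", value) if respond_to?(key)
    end

    # Anything not supplied falls back to the defaults.
    @profiles_path       ||= DEFAULT_PROFILES_PATH
    @short_profiles_path ||= DEFAULT_SHORT_PROFILES_PATH
  end
end

detector = SketchDetector.new(:profiles_path => '/custom/profiles', :unknown => 1)

detector.profiles_path       # the custom value passed in the options
detector.short_profiles_path # the placeholder default
```

Keys that do not correspond to a public method, such as `:unknown` here, are silently ignored rather than raising an error.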

Instance Attribute Details

#profiles_path ⇒ Object (readonly)

Returns the value of attribute profiles_path.



# File 'lib/opener/language_identifier/detector.rb', line 9

def profiles_path
  @profiles_path
end

#short_profiles_path ⇒ Object (readonly)

Returns the value of attribute short_profiles_path.



# File 'lib/opener/language_identifier/detector.rb', line 9

def short_profiles_path
  @short_profiles_path
end

Instance Method Details

#build_priorities(input, languages) ⇒ java.util.HashMap

Builds a java.util.HashMap that maps every OpeNER and non-OpeNER language to its priority.

If the input is no longer than the short profiles threshold (SHORT_THRESHOLD), non-OpeNER languages are given a priority of 0.0, which effectively disables them. This ensures that the OpeNER languages are detected properly when analysing only 1-2 words.

Parameters:

  • input (String)
  • languages (Array<String>)

Returns:

  • (java.util.HashMap)


# File 'lib/opener/language_identifier/detector.rb', line 126

def build_priorities(input, languages)
  priorities = java.util.HashMap.new
  priority   = short_input?(input) ? 0.0 : DEFAULT_PRIORITY

  PRIORITIES.each do |lang, val|
    priorities.put(lang, val)
  end

  languages.each do |language|
    unless priorities.contains_key(language)
      priorities.put(language, priority)
    end
  end

  return priorities
end
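
The priority assignment can be illustrated without the JVM. The sketch below is a pure-Ruby approximation that uses a plain Hash in place of java.util.HashMap; the module name `PrioritySketch` and the sample language list are assumptions made for illustration, while the constant values are copied from this page.

```ruby
# Pure-Ruby approximation of build_priorities (Hash instead of java.util.HashMap).
module PrioritySketch
  SHORT_THRESHOLD  = 15
  DEFAULT_PRIORITY = 0.5
  PRIORITIES = {
    'en' => 1.0, 'es' => 0.9, 'it' => 0.9,
    'fr' => 0.9, 'de' => 0.9, 'nl' => 0.9
  }

  module_function

  def build(input, languages)
    # Short input disables every language that is not listed in PRIORITIES.
    fallback   = input.length <= SHORT_THRESHOLD ? 0.0 : DEFAULT_PRIORITY
    priorities = PRIORITIES.dup

    languages.each do |language|
      priorities[language] = fallback unless priorities.key?(language)
    end

    priorities
  end
end

languages = %w[en nl ja] # sample list; the real one comes from the loaded profiles

short_map = PrioritySketch.build('hello world', languages)
long_map  = PrioritySketch.build('this sentence is longer than fifteen characters', languages)

short_map['ja'] # 0.0, disabled for short input
long_map['ja']  # 0.5, the default priority
long_map['en']  # 1.0, unaffected by the input length
```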

#detect(input) ⇒ String

Returns the language code detected for the given input.

Parameters:

  • input (String)

Returns:

  • (String)


# File 'lib/opener/language_identifier/detector.rb', line 79

def detect(input)
  return new_detector(input).detect
end

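#determine_profiles(input) ⇒ String

Returns the path of the profiles directory to use for the input: the short profiles for short input, the regular profiles otherwise.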

Parameters:

  • input (String)

Returns:

  • (String)


# File 'lib/opener/language_identifier/detector.rb', line 147

def determine_profiles(input)
  return short_input?(input) ? short_profiles_path : profiles_path
end
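
To make the boundary behaviour concrete, here is a small sketch of the profile selection in plain Ruby. The module name `ProfileSketch` and the two path values are placeholders chosen for illustration; the threshold and the `<=` comparison follow the source shown on this page.

```ruby
# Sketch of the profile selection; the paths are placeholders, not the real defaults.
module ProfileSketch
  SHORT_THRESHOLD     = 15
  PROFILES_PATH       = 'profiles'
  SHORT_PROFILES_PATH = 'short_profiles'

  module_function

  # Input of exactly SHORT_THRESHOLD characters still counts as short.
  def short_input?(input)
    input.length <= SHORT_THRESHOLD
  end

  def determine_profiles(input)
    short_input?(input) ? SHORT_PROFILES_PATH : PROFILES_PATH
  end
end

at_threshold   = ProfileSketch.determine_profiles('a' * 15) # short profiles
over_threshold = ProfileSketch.determine_profiles('a' * 16) # regular profiles
```

Because the comparison uses `String#length`, the threshold counts characters rather than bytes.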

#new_detector(input) ⇒ CybozuDetector

Returns a new detector with the profiles set based on the input.

This method analyses a lowercased version of the input as this yields better results for short text.

Parameters:

  • input (String)

Returns:

  • (CybozuDetector)


# File 'lib/opener/language_identifier/detector.rb', line 99

def new_detector(input)
  factory = com.cybozu.labs.langdetect.DetectorFactory.new

  factory.load_profile(determine_profiles(input))
  factory.set_seed(1)

  priorities = build_priorities(input, factory.langlist)
  detector   = com.cybozu.labs.langdetect.Detector.new(factory)

  detector.set_prior_map(priorities)
  detector.append(input.downcase)

  return detector
end

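#probabilities(input) ⇒ Array

Returns the candidate languages for the given input together with their probabilities.

Parameters:

  • input (String)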

Returns:

  • (Array)


# File 'lib/opener/language_identifier/detector.rb', line 86

def probabilities(input)
  return new_detector(input).get_probabilities.to_array
end
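
A caller will typically want to rank the returned candidates. The sketch below assumes that each element exposes `lang` and `prob`, as the public fields of Cybozu's Language class do; the `SketchLanguage` Struct and the sample values are stand-ins so the example runs without the JVM.

```ruby
# Stand-in for the objects returned by #probabilities (assumed to expose lang and prob).
SketchLanguage = Struct.new(:lang, :prob)

# Made-up sample values, not real detector output.
results = [
  SketchLanguage.new('nl', 0.28),
  SketchLanguage.new('en', 0.71)
]

# Pick the most likely language and build a ranked list of [code, probability] pairs.
best   = results.max_by(&:prob)
ranked = results.sort_by { |result| -result.prob }.map { |result| [result.lang, result.prob] }
```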

#short_input?(input) ⇒ TrueClass|FalseClass

Returns true when the input is at most SHORT_THRESHOLD characters long.

Parameters:

  • input (String)

Returns:

  • (TrueClass|FalseClass)


# File 'lib/opener/language_identifier/detector.rb', line 155

def short_input?(input)
  return input.length <= SHORT_THRESHOLD
end