Class: Opener::LanguageIdentifier::Detector

Inherits:
Object
Defined in:
lib/opener/language_identifier/detector.rb

Overview

Ruby wrapper around the Cybozu DetectorFactory and Detector classes. This class automatically switches profiles based on the input size and assigns priorities to languages.

Constant Summary

DEFAULT_PROFILES_PATH =

Path to the directory containing the default profiles.

File.expand_path(
  '../../../../core/target/classes/profiles',
  __FILE__
)
DEFAULT_SHORT_PROFILES_PATH =

Path to the directory containing the default short profiles.

File.expand_path(
  '../../../../core/target/classes/short_profiles',
  __FILE__
)
SHORT_THRESHOLD =

The number of characters above which the detector switches to the longer profile set. Inputs of this length or shorter use the short profiles.

15
PRIORITIES =

Prioritize OpeNER languages over the rest. Languages not covered by this list are automatically given a default priority.

{
  'en' => 1.0,
  'es' => 0.9,
  'it' => 0.9,
  'fr' => 0.9,
  'de' => 0.9,
  'nl' => 0.9,

  # These languages are disabled (for the time being) due to conflicting
  # with other (OpeNER) languages too often.
  'af' => 0.0, # conflicts with Dutch
}
DEFAULT_PRIORITY =

The default priority for non-OpeNER languages.

0.5

Constructor Details

#initialize(options = {}) ⇒ Detector

Returns a new instance of Detector.

Options Hash (options):

  • :profiles_path (String)
  • :short_profiles_path (String)


# File 'lib/opener/language_identifier/detector.rb', line 71

def initialize(options = {})
  options.each do |key, value|
    instance_variable_set("@#{key}", value) if respond_to?(key)
  end

  @profiles_path       ||= DEFAULT_PROFILES_PATH
  @short_profiles_path ||= DEFAULT_SHORT_PROFILES_PATH
end
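
The option handling in the constructor can be tried out without JRuby or the Cybozu classes. The following is a minimal sketch, not the library's code: `PathHolder` is a hypothetical stand-in class and both default paths are placeholder values. It shows that only keys matching a public reader are stored and that missing paths fall back to the defaults.

```ruby
# Hypothetical stand-in that mirrors the constructor's option handling.
# It loads no Cybozu classes and the default paths are placeholders.
class PathHolder
  DEFAULT_PROFILES_PATH       = '/default/profiles'
  DEFAULT_SHORT_PROFILES_PATH = '/default/short_profiles'

  attr_reader :profiles_path, :short_profiles_path

  def initialize(options = {})
    options.each do |key, value|
      # Keys without a matching public reader are silently ignored.
      instance_variable_set("@#{key}", value) if respond_to?(key)
    end

    # Any path that was not supplied falls back to its default.
    @profiles_path       ||= DEFAULT_PROFILES_PATH
    @short_profiles_path ||= DEFAULT_SHORT_PROFILES_PATH
  end
end

holder = PathHolder.new(:profiles_path => '/custom/profiles', :unknown => 'x')

puts holder.profiles_path
puts holder.short_profiles_path
```

Because unknown keys are skipped rather than rejected, a misspelled option name in this pattern results in the default path being used without any error.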

Instance Attribute Details

#profiles_path ⇒ Object (readonly)

Returns the value of attribute profiles_path.



# File 'lib/opener/language_identifier/detector.rb', line 9

def profiles_path
  @profiles_path
end

#short_profiles_path ⇒ Object (readonly)

Returns the value of attribute short_profiles_path.



# File 'lib/opener/language_identifier/detector.rb', line 9

def short_profiles_path
  @short_profiles_path
end

Instance Method Details

#build_priorities(input, languages) ⇒ java.util.HashMap

Builds a java.util.HashMap that maps every OpeNER and non-OpeNER language to its priority.

If the input is no longer than the short profiles threshold (SHORT_THRESHOLD), non-OpeNER languages are disabled by giving them a priority of 0.0. This ensures that OpeNER languages are detected properly when analysing only 1-2 words.



# File 'lib/opener/language_identifier/detector.rb', line 130

def build_priorities(input, languages)
  priorities = java.util.HashMap.new
  priority   = short_input?(input) ? 0.0 : DEFAULT_PRIORITY

  PRIORITIES.each do |lang, val|
    priorities.put(lang, val)
  end

  languages.each do |language|
    unless priorities.contains_key(language)
      priorities.put(language, priority)
    end
  end

  return priorities
end
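
The effect of the input length on the priority map can be sketched in plain Ruby. This is an illustration under assumptions rather than the library's implementation: the `PrioritySketch` module and its `build` method are hypothetical names, a Ruby Hash stands in for java.util.HashMap, and the language list is made up for the example instead of coming from a loaded profile set.

```ruby
# Hypothetical pure-Ruby illustration of the priority logic; a Ruby Hash
# replaces java.util.HashMap so no JVM is required.
module PrioritySketch
  SHORT_THRESHOLD  = 15
  DEFAULT_PRIORITY = 0.5

  PRIORITIES = {
    'en' => 1.0, 'es' => 0.9, 'it' => 0.9,
    'fr' => 0.9, 'de' => 0.9, 'nl' => 0.9,
    'af' => 0.0
  }.freeze

  def self.build(input, languages)
    # Short input disables every language that has no explicit priority.
    fallback   = input.length <= SHORT_THRESHOLD ? 0.0 : DEFAULT_PRIORITY
    priorities = PRIORITIES.dup

    languages.each do |language|
      priorities[language] = fallback unless priorities.key?(language)
    end

    priorities
  end
end

languages = %w[en nl af ja ru] # made-up list for the example

short_map = PrioritySketch.build('hola mundo', languages)
long_map  = PrioritySketch.build('this is a considerably longer sentence', languages)

puts short_map['ja']
puts long_map['ja']
```

Languages with an explicit entry in PRIORITIES keep that value for any input length; only the fallback applied to the remaining languages changes.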

#detect(input) ⇒ String



# File 'lib/opener/language_identifier/detector.rb', line 83

def detect(input)
  return new_detector(input).detect
end

#determine_profiles(input) ⇒ String



# File 'lib/opener/language_identifier/detector.rb', line 151

def determine_profiles(input)
  return short_input?(input) ? short_profiles_path : profiles_path
end
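
The boundary behaviour of the profile selection is easy to misread, since the check in short_input? is inclusive. The sketch below is a hypothetical stand-alone version: the `ProfileSketch` module, its `choose` method and both path strings are assumptions made for illustration. It shows that an input of exactly 15 characters still selects the short profiles.

```ruby
# Hypothetical stand-alone version of the profile selection; the two path
# values are placeholders rather than the real profile directories.
module ProfileSketch
  SHORT_THRESHOLD     = 15
  PROFILES_PATH       = 'profiles'
  SHORT_PROFILES_PATH = 'short_profiles'

  def self.short_input?(input)
    # Inclusive comparison: the threshold length itself counts as short.
    input.length <= SHORT_THRESHOLD
  end

  def self.choose(input)
    short_input?(input) ? SHORT_PROFILES_PATH : PROFILES_PATH
  end
end

at_threshold   = ProfileSketch.choose('a' * 15)
over_threshold = ProfileSketch.choose('a' * 16)

puts at_threshold
puts over_threshold
```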

#new_detector(input) ⇒ CybozuDetector

Returns a new detector with the profiles set based on the input.

This method analyses a lowercased version of the input as this yields better results for short text.



# File 'lib/opener/language_identifier/detector.rb', line 103

def new_detector(input)
  factory = com.cybozu.labs.langdetect.DetectorFactory.new

  factory.load_profile(determine_profiles(input))
  factory.set_seed(1)

  priorities = build_priorities(input, factory.langlist)
  detector   = com.cybozu.labs.langdetect.Detector.new(factory)

  detector.set_prior_map(priorities)
  detector.append(input.downcase)

  return detector
end

#probabilities(input) ⇒ Array



# File 'lib/opener/language_identifier/detector.rb', line 90

def probabilities(input)
  return new_detector(input).get_probabilities.to_array
end

#short_input?(input) ⇒ TrueClass|FalseClass



# File 'lib/opener/language_identifier/detector.rb', line 159

def short_input?(input)
  return input.length <= SHORT_THRESHOLD
end