Class: Opener::LanguageIdentifier::Detector
- Inherits: Object
  - Object
  - Opener::LanguageIdentifier::Detector
- Defined in: lib/opener/language_identifier/detector.rb
Overview
Ruby wrapper around the Cybozu DetectorFactory and Detector classes. This class automatically handles switching of profiles based on input sizes, assigning priorities to languages, etc.
Constant Summary
- DEFAULT_PROFILES_PATH =
Path to the directory containing the default profiles.
File.expand_path('../../../../core/target/classes/profiles', __FILE__)
- DEFAULT_SHORT_PROFILES_PATH =
Path to the directory containing the default short profiles.
File.expand_path('../../../../core/target/classes/short_profiles', __FILE__)
- SHORT_THRESHOLD =
The number of characters after which the detector switches to the longer profile set. Input of up to and including this length uses the short profiles.
15
- PRIORITIES =
Prioritize OpeNER languages over the rest. Languages not covered by this list are automatically given a default priority.
{
  'en' => 1.0,
  'es' => 0.9,
  'it' => 0.9,
  'fr' => 0.9,
  'de' => 0.9,
  'nl' => 0.9,

  # These languages are disabled (for the time being) due to conflicting
  # with other (OpeNER) languages too often.
  'af' => 0.0, # conflicts with Dutch
}
- DEFAULT_PRIORITY =
The default priority for non-OpeNER languages.
0.5
Instance Attribute Summary
- #profiles_path ⇒ Object (readonly)
Returns the value of attribute profiles_path.
- #short_profiles_path ⇒ Object (readonly)
Returns the value of attribute short_profiles_path.
Instance Method Summary
- #build_priorities(input, languages) ⇒ java.util.HashMap
Builds a java.util.HashMap mapping the priorities for all OpeNER and non-OpeNER languages.
- #detect(input) ⇒ String
- #determine_profiles(input) ⇒ String
- #initialize(options = {}) ⇒ Detector (constructor)
A new instance of Detector.
- #new_detector(input) ⇒ CybozuDetector
Returns a new detector with the profiles set based on the input.
- #probabilities(input) ⇒ Array
- #short_input?(input) ⇒ TrueClass|FalseClass
Constructor Details
#initialize(options = {}) ⇒ Detector
Returns a new instance of Detector.
# File 'lib/opener/language_identifier/detector.rb', line 71
def initialize(options = {})
  options.each do |key, value|
    instance_variable_set("@#{key}", value) if respond_to?(key)
  end

  @profiles_path       ||= DEFAULT_PROFILES_PATH
  @short_profiles_path ||= DEFAULT_SHORT_PROFILES_PATH
end
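The constructor only assigns options for which the object exposes a reader method, so unrecognised keys are ignored and missing paths fall back to the defaults. A minimal standalone sketch of that pattern follows; the class name `OptionsDemo` and the placeholder default paths are assumptions for illustration and are not part of the library.

```ruby
# Hypothetical stand-in for Detector. The class name and the default paths
# are illustrative assumptions, not the library's real values.
class OptionsDemo
  DEFAULT_PROFILES_PATH       = '/placeholder/profiles'
  DEFAULT_SHORT_PROFILES_PATH = '/placeholder/short_profiles'

  attr_reader :profiles_path, :short_profiles_path

  def initialize(options = {})
    options.each do |key, value|
      # Only set instance variables that are exposed through a reader.
      instance_variable_set("@#{key}", value) if respond_to?(key)
    end

    # Anything not supplied falls back to the defaults.
    @profiles_path       ||= DEFAULT_PROFILES_PATH
    @short_profiles_path ||= DEFAULT_SHORT_PROFILES_PATH
  end
end

demo = OptionsDemo.new(:profiles_path => '/custom/profiles', :unknown => 1)
```

Here `demo.profiles_path` holds the custom path, `demo.short_profiles_path` holds the placeholder default, and no `@unknown` variable is created.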
Instance Attribute Details
#profiles_path ⇒ Object (readonly)
Returns the value of attribute profiles_path.
# File 'lib/opener/language_identifier/detector.rb', line 9
def profiles_path
  @profiles_path
end
#short_profiles_path ⇒ Object (readonly)
Returns the value of attribute short_profiles_path.
# File 'lib/opener/language_identifier/detector.rb', line 9
def short_profiles_path
  @short_profiles_path
end
Instance Method Details
#build_priorities(input, languages) ⇒ java.util.HashMap
Builds a java.util.HashMap mapping the priorities for all OpeNER and non-OpeNER languages.
If the input is no longer than SHORT_THRESHOLD, every language without an entry in PRIORITIES is given a priority of 0.0, which disables it. This ensures that the OpeNER languages are detected properly when analysing only 1-2 words.
# File 'lib/opener/language_identifier/detector.rb', line 136
def build_priorities(input, languages)
  priorities = java.util.HashMap.new
  priority   = short_input?(input) ? 0.0 : DEFAULT_PRIORITY

  PRIORITIES.each do |lang, val|
    priorities.put(lang, val)
  end

  languages.each do |language|
    unless priorities.contains_key(language)
      priorities.put(language, priority)
    end
  end

  return priorities
end
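The effect of the short-input rule is easier to see with concrete values. The sketch below mirrors the logic in plain Ruby, substituting a regular Hash for java.util.HashMap so it runs without JRuby; the module name `PrioritySketch` and the sample language list are assumptions made for the example.

```ruby
# Plain-Ruby sketch of the priority map. The real method fills a
# java.util.HashMap and receives the language list from the Cybozu factory.
module PrioritySketch
  SHORT_THRESHOLD  = 15
  DEFAULT_PRIORITY = 0.5
  PRIORITIES = {
    'en' => 1.0, 'es' => 0.9, 'it' => 0.9,
    'fr' => 0.9, 'de' => 0.9, 'nl' => 0.9,
    'af' => 0.0
  }

  module_function

  def build(input, languages)
    # Short input disables every language without an explicit priority.
    fallback   = input.length <= SHORT_THRESHOLD ? 0.0 : DEFAULT_PRIORITY
    priorities = PRIORITIES.dup

    languages.each do |language|
      priorities[language] = fallback unless priorities.key?(language)
    end

    priorities
  end
end

languages = %w[en nl af pt] # hypothetical list of loaded profiles

short = PrioritySketch.build('hallo wereld', languages)               # 12 characters
long  = PrioritySketch.build('dit is een wat langere zin', languages) # 26 characters
```

With these inputs `'pt'` receives 0.0 for the short text and 0.5 for the long one, while `'af'` stays at 0.0 in both cases because it is listed explicitly in the priorities.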
#detect(input) ⇒ String
# File 'lib/opener/language_identifier/detector.rb', line 83
def detect(input)
  return new_detector(input).detect

# The core Java code raises an exception when it can't detect a language.
# Since this isn't actually something fatal we'll capture this and return
# "unknown" instead.
rescue com.cybozu.labs.langdetect.LangDetectException
  return 'unknown'
end
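The method-level `rescue` turns a detection failure into an ordinary return value, so callers never have to handle the Java exception. A rough sketch of the same pattern in plain Ruby follows. `FakeDetectError` and `FakeBackend` are hypothetical stand-ins for the Java classes, and the condition that triggers the error (blank input) is an assumption chosen only to make the example runnable.

```ruby
# Hypothetical stand-ins for the Java exception and detector.
class FakeDetectError < StandardError; end

class FakeBackend
  def detect(text)
    # Assumed failure condition, for illustration only.
    raise FakeDetectError, 'no features in text' if text.strip.empty?

    'en'
  end
end

def safe_detect(backend, text)
  backend.detect(text)
rescue FakeDetectError
  # Not fatal: report an unknown language instead of propagating the error.
  'unknown'
end

failed    = safe_detect(FakeBackend.new, '   ')
succeeded = safe_detect(FakeBackend.new, 'hello there')
```

Callers can then compare the result against `'unknown'` rather than wrapping every call in their own `begin`/`rescue`.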
#determine_profiles(input) ⇒ String
# File 'lib/opener/language_identifier/detector.rb', line 157
def determine_profiles(input)
  return short_input?(input) ? short_profiles_path : profiles_path
end
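Because `short_input?` compares with `<=`, input of exactly SHORT_THRESHOLD characters still selects the short profiles. The following sketch illustrates that boundary in isolation; the module name `ProfileSketch` and the path arguments are assumptions made for the example.

```ruby
# Standalone sketch of profile selection around the threshold.
module ProfileSketch
  SHORT_THRESHOLD = 15

  module_function

  def short_input?(input)
    input.length <= SHORT_THRESHOLD
  end

  # The paths are passed in here; the real method reads them from the
  # profiles_path and short_profiles_path attributes.
  def determine_profiles(input, short_path, long_path)
    short_input?(input) ? short_path : long_path
  end
end

at_threshold   = ProfileSketch.determine_profiles('a' * 15, 'short_profiles', 'profiles')
over_threshold = ProfileSketch.determine_profiles('a' * 16, 'short_profiles', 'profiles')
```

The 15-character string maps to the short profiles and the 16-character string to the regular ones.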
#new_detector(input) ⇒ CybozuDetector
Returns a new detector with the profiles set based on the input.
This method analyses a lowercased version of the input as this yields better results for short text.
# File 'lib/opener/language_identifier/detector.rb', line 109
def new_detector(input)
  factory = com.cybozu.labs.langdetect.DetectorFactory.new

  factory.load_profile(determine_profiles(input))
  factory.set_seed(1)

  priorities = build_priorities(input, factory.langlist)
  detector   = com.cybozu.labs.langdetect.Detector.new(factory)

  detector.set_prior_map(priorities)
  detector.append(input.downcase)

  return detector
end
#probabilities(input) ⇒ Array
# File 'lib/opener/language_identifier/detector.rb', line 96
def probabilities(input)
  return new_detector(input).get_probabilities.to_array
end
#short_input?(input) ⇒ TrueClass|FalseClass
# File 'lib/opener/language_identifier/detector.rb', line 165
def short_input?(input)
  return input.length <= SHORT_THRESHOLD
end