Class: Opener::LanguageIdentifier::Detector
- Inherits:
- Object
- Defined in:
- lib/opener/language_identifier/detector.rb
Overview
Ruby wrapper around the Cybozu DetectorFactory and Detector classes. This class automatically switches between profile sets based on the input size and assigns priorities to languages.
Constant Summary
- DEFAULT_PROFILES_PATH =
Path to the directory containing the default profiles.
File.expand_path('../../../../core/target/classes/profiles', __FILE__)
- DEFAULT_SHORT_PROFILES_PATH =
Path to the directory containing the default short profiles.
File.expand_path('../../../../core/target/classes/short_profiles', __FILE__)
- SHORT_THRESHOLD =
The number of characters after which the detector should switch to using the longer profile set. Input of up to this many characters (inclusive) is treated as short.
15
- PRIORITIES =
Prioritize OpeNER languages over the rest. Languages not covered by this list are automatically given a default priority.
{
  'en' => 1.0,
  'es' => 0.9,
  'it' => 0.9,
  'fr' => 0.9,
  'de' => 0.9,
  'nl' => 0.9,

  # These languages are disabled (for the time being) due to conflicting
  # with other (OpeNER) languages too often.
  'af' => 0.0, # conflicts with Dutch
}
- DEFAULT_PRIORITY =
The default priority for non-OpeNER languages.
0.5
Instance Attribute Summary
- #profiles_path ⇒ Object (readonly)
Returns the value of attribute profiles_path.
- #short_profiles_path ⇒ Object (readonly)
Returns the value of attribute short_profiles_path.
Instance Method Summary
- #build_priorities(input, languages) ⇒ java.util.HashMap
Builds a Java Hash mapping the priorities for all OpeNER and non OpeNER languages.
- #detect(input) ⇒ String
- #determine_profiles(input) ⇒ String
- #initialize(options = {}) ⇒ Detector (constructor)
A new instance of Detector.
- #new_detector(input) ⇒ CybozuDetector
Returns a new detector with the profiles set based on the input.
- #probabilities(input) ⇒ Array
- #short_input?(input) ⇒ TrueClass|FalseClass
Constructor Details
#initialize(options = {}) ⇒ Detector
Returns a new instance of Detector.
# File 'lib/opener/language_identifier/detector.rb', line 71
def initialize(options = {})
  options.each do |key, value|
    instance_variable_set("@#{key}", value) if respond_to?(key)
  end

  @profiles_path       ||= DEFAULT_PROFILES_PATH
  @short_profiles_path ||= DEFAULT_SHORT_PROFILES_PATH
end
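The constructor only stores an option when the instance responds to a method of the same name, so unknown keys are silently ignored and missing ones fall back to the defaults. A minimal standalone sketch of that pattern follows; the `OptionsDemo` class and both paths are hypothetical placeholders, not part of the library.

```ruby
# Hypothetical stand-in for the Detector constructor. The class name and
# the paths below are placeholders used for illustration only.
class OptionsDemo
  DEFAULT_PROFILES_PATH = '/tmp/profiles' # placeholder default

  attr_reader :profiles_path

  def initialize(options = {})
    options.each do |key, value|
      # Only keys that match a public method (here: the reader) are stored.
      instance_variable_set("@#{key}", value) if respond_to?(key)
    end

    # Fall back to the default when no value was supplied.
    @profiles_path ||= DEFAULT_PROFILES_PATH
  end
end

custom  = OptionsDemo.new(:profiles_path => '/opt/profiles', :unknown => 1)
default = OptionsDemo.new
```

Here `custom.profiles_path` is the supplied path, `default.profiles_path` is the default, and the `:unknown` key never becomes an instance variable.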
Instance Attribute Details
#profiles_path ⇒ Object (readonly)
Returns the value of attribute profiles_path.
# File 'lib/opener/language_identifier/detector.rb', line 9
def profiles_path
  @profiles_path
end
#short_profiles_path ⇒ Object (readonly)
Returns the value of attribute short_profiles_path.
# File 'lib/opener/language_identifier/detector.rb', line 9
def short_profiles_path
  @short_profiles_path
end
Instance Method Details
#build_priorities(input, languages) ⇒ java.util.HashMap
Builds a Java HashMap mapping the priorities for all OpeNER and non-OpeNER languages.
If the input length does not exceed the short profiles threshold, non-OpeNER languages are disabled by giving them a priority of 0.0. This ensures that the OpeNER languages are detected properly when analysing only 1-2 words.
# File 'lib/opener/language_identifier/detector.rb', line 130
def build_priorities(input, languages)
  priorities = java.util.HashMap.new
  priority   = short_input?(input) ? 0.0 : DEFAULT_PRIORITY

  PRIORITIES.each do |lang, val|
    priorities.put(lang, val)
  end

  languages.each do |language|
    unless priorities.contains_key(language)
      priorities.put(language, priority)
    end
  end

  return priorities
end
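To see what the resulting map looks like without JRuby, the priority logic above can be approximated in plain Ruby. This is only a sketch: it swaps `java.util.HashMap` for a Ruby Hash, the `PrioritySketch` module is a hypothetical name, and `'pt'` simply stands in for any language that is not listed in PRIORITIES.

```ruby
# Plain-Ruby analogue of build_priorities. A Hash replaces the
# java.util.HashMap so the sketch runs without JRuby.
module PrioritySketch
  SHORT_THRESHOLD  = 15
  DEFAULT_PRIORITY = 0.5
  PRIORITIES = {
    'en' => 1.0, 'es' => 0.9, 'it' => 0.9,
    'fr' => 0.9, 'de' => 0.9, 'nl' => 0.9,
    'af' => 0.0
  }.freeze

  module_function

  def build(input, languages)
    # Short input disables every language that has no explicit priority.
    fallback   = input.length <= SHORT_THRESHOLD ? 0.0 : DEFAULT_PRIORITY
    priorities = PRIORITIES.dup # dup returns an unfrozen copy

    languages.each do |language|
      priorities[language] = fallback unless priorities.key?(language)
    end

    priorities
  end
end

languages = %w[en nl af pt] # 'pt' stands in for a non-OpeNER language

short_map = PrioritySketch.build('hola', languages)
long_map  = PrioritySketch.build('a sentence well past the threshold', languages)
```

For the short input the unlisted language ends up at 0.0, for the long input it gets the default of 0.5. Explicit entries such as `'en'` and `'af'` keep their configured values in both cases.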
#detect(input) ⇒ String
# File 'lib/opener/language_identifier/detector.rb', line 83
def detect(input)
  return new_detector(input).detect
end
#determine_profiles(input) ⇒ String
# File 'lib/opener/language_identifier/detector.rb', line 151
def determine_profiles(input)
  return short_input?(input) ? short_profiles_path : profiles_path
end
#new_detector(input) ⇒ CybozuDetector
Returns a new detector with the profiles set based on the input.
This method analyses a lowercased version of the input as this yields better results for short text.
# File 'lib/opener/language_identifier/detector.rb', line 103
def new_detector(input)
  factory = com.cybozu.labs.langdetect.DetectorFactory.new

  factory.load_profile(determine_profiles(input))
  factory.set_seed(1)

  priorities = build_priorities(input, factory.langlist)
  detector   = com.cybozu.labs.langdetect.Detector.new(factory)

  detector.set_prior_map(priorities)
  detector.append(input.downcase)

  return detector
end
#probabilities(input) ⇒ Array
# File 'lib/opener/language_identifier/detector.rb', line 90
def probabilities(input)
  return new_detector(input).get_probabilities.to_array
end
#short_input?(input) ⇒ TrueClass|FalseClass
# File 'lib/opener/language_identifier/detector.rb', line 159
def short_input?(input)
  return input.length <= SHORT_THRESHOLD
end
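The threshold check and the profile selection built on it can be tried in isolation. The sketch below mirrors both methods in a hypothetical `ProfileSketch` module; the two path strings are placeholders rather than the real profile directories.

```ruby
# Standalone mirror of short_input? and determine_profiles. The module
# name and the path strings are placeholders for illustration.
module ProfileSketch
  SHORT_THRESHOLD     = 15
  PROFILES_PATH       = 'profiles'       # placeholder
  SHORT_PROFILES_PATH = 'short_profiles' # placeholder

  module_function

  def short_input?(input)
    # String#length counts characters, not bytes.
    input.length <= SHORT_THRESHOLD
  end

  def determine_profiles(input)
    short_input?(input) ? SHORT_PROFILES_PATH : PROFILES_PATH
  end
end

at_threshold   = ProfileSketch.determine_profiles('a' * 15)
over_threshold = ProfileSketch.determine_profiles('a' * 16)
multibyte      = ProfileSketch.short_input?("\u00f1" * 15)
```

The comparison is inclusive, so an input of exactly 15 characters still selects the short profiles and only the 16th character switches to the regular set. Multibyte characters count once each.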