Class: Opener::LanguageIdentifier::Detector
- Inherits:
- Object
- Defined in:
- lib/opener/language_identifier/detector.rb
Overview
Ruby wrapper around the Cybozu DetectorFactory and Detector classes. This class automatically handles switching of profiles based on input sizes, assigning priorities to languages, etc.
Constant Summary
- DEFAULT_PROFILES_PATH =
Path to the directory containing the default profiles.
File.expand_path('../../../../core/target/classes/profiles', __FILE__)
- DEFAULT_SHORT_PROFILES_PATH =
Path to the directory containing the default short profiles.
File.expand_path('../../../../core/target/classes/short_profiles', __FILE__)
- SHORT_THRESHOLD =
The number of characters after which the detector should switch to using the longer profiles set.
15
- PRIORITIES =
Prioritize OpeNER languages over the rest. Languages not covered by this list are automatically given a default priority.
{ 'en' => 1.0, 'es' => 0.9, 'it' => 0.9, 'fr' => 0.9, 'de' => 0.9, 'nl' => 0.9 }
- DEFAULT_PRIORITY =
The default priority for non-OpeNER languages.
0.5
Instance Attribute Summary
- #profiles_path ⇒ Object (readonly)
Returns the value of attribute profiles_path.
- #short_profiles_path ⇒ Object (readonly)
Returns the value of attribute short_profiles_path.
Instance Method Summary
- #build_priorities(input, languages) ⇒ java.util.HashMap
Builds a Java Hash mapping the priorities for all OpeNER and non-OpeNER languages.
- #detect(input) ⇒ String
- #determine_profiles(input) ⇒ String
- #initialize(options = {}) ⇒ Detector (constructor)
A new instance of Detector.
- #new_detector(input) ⇒ CybozuDetector
Returns a new detector with the profiles set based on the input.
- #probabilities(input) ⇒ Array
- #short_input?(input) ⇒ TrueClass|FalseClass
Constructor Details
#initialize(options = {}) ⇒ Detector
Returns a new instance of Detector.
    # File 'lib/opener/language_identifier/detector.rb', line 67
    def initialize(options = {})
      options.each do |key, value|
        instance_variable_set("@#{key}", value) if respond_to?(key)
      end

      @profiles_path       ||= DEFAULT_PROFILES_PATH
      @short_profiles_path ||= DEFAULT_SHORT_PROFILES_PATH
    end
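The constructor copies only those options that match a public reader on the object and falls back to the default paths for anything left unset. A minimal sketch of that pattern follows; `SketchDetector` and its placeholder paths are hypothetical stand-ins so the example runs without JRuby or the bundled profiles.

```ruby
# Hypothetical stand-in for the documented class. The default paths are
# placeholders, not the real profile directories.
class SketchDetector
  attr_reader :profiles_path, :short_profiles_path

  DEFAULT_PROFILES_PATH       = '/tmp/profiles'
  DEFAULT_SHORT_PROFILES_PATH = '/tmp/short_profiles'

  def initialize(options = {})
    options.each do |key, value|
      # Keys without a matching public method are silently ignored.
      instance_variable_set("@#{key}", value) if respond_to?(key)
    end

    # Anything not supplied by the caller falls back to the defaults.
    @profiles_path       ||= DEFAULT_PROFILES_PATH
    @short_profiles_path ||= DEFAULT_SHORT_PROFILES_PATH
  end
end

detector = SketchDetector.new(:profiles_path => '/custom/profiles', :unknown => 1)

puts detector.profiles_path
puts detector.short_profiles_path
puts detector.instance_variable_defined?(:@unknown)
```

Because the filter is `respond_to?`, an unrecognised key such as `:unknown` does not raise an error; it is simply dropped.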
Instance Attribute Details
#profiles_path ⇒ Object (readonly)
Returns the value of attribute profiles_path.
    # File 'lib/opener/language_identifier/detector.rb', line 9
    def profiles_path
      @profiles_path
    end
#short_profiles_path ⇒ Object (readonly)
Returns the value of attribute short_profiles_path.
    # File 'lib/opener/language_identifier/detector.rb', line 9
    def short_profiles_path
      @short_profiles_path
    end
Instance Method Details
#build_priorities(input, languages) ⇒ java.util.HashMap
Builds a Java Hash mapping the priorities for all OpeNER and non OpeNER languages.
If the input is short (its length is at or below the short profiles threshold, see #short_input?), non-OpeNER languages are given a priority of 0.0, which disables them. This is to ensure that the OpeNER languages are detected properly when analysing only 1-2 words.
    # File 'lib/opener/language_identifier/detector.rb', line 126
    def build_priorities(input, languages)
      priorities = java.util.HashMap.new
      priority   = short_input?(input) ? 0.0 : DEFAULT_PRIORITY

      PRIORITIES.each do |lang, val|
        priorities.put(lang, val)
      end

      languages.each do |language|
        unless priorities.contains_key(language)
          priorities.put(language, priority)
        end
      end

      return priorities
    end
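The same priority logic can be sketched in plain Ruby, which makes the effect of short input easier to see. This is an illustration only: it uses a Ruby Hash instead of `java.util.HashMap`, and the `PrioritySketch` module, the `build` method, and the sample language list are hypothetical. The constant values are the ones listed in the Constant Summary.

```ruby
# Plain-Ruby illustration of the priority mapping (assumption: a Ruby Hash
# stands in for the java.util.HashMap used by the real method).
module PrioritySketch
  PRIORITIES       = { 'en' => 1.0, 'es' => 0.9, 'it' => 0.9,
                       'fr' => 0.9, 'de' => 0.9, 'nl' => 0.9 }
  DEFAULT_PRIORITY = 0.5
  SHORT_THRESHOLD  = 15

  def self.build(input, languages)
    # Short input disables every language that has no explicit priority.
    fallback   = input.length <= SHORT_THRESHOLD ? 0.0 : DEFAULT_PRIORITY
    priorities = PRIORITIES.dup

    languages.each do |language|
      priorities[language] = fallback unless priorities.key?(language)
    end

    priorities
  end
end

languages = %w[en es pt]

short = PrioritySketch.build('hola mundo', languages)
long  = PrioritySketch.build('this sentence is longer than fifteen characters', languages)

puts short['pt'] # 0.0, since 'pt' has no explicit priority and the input is short
puts long['pt']  # 0.5, the default priority for longer input
puts short['es'] # 0.9, explicit priorities are unaffected by input length
```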
#detect(input) ⇒ String
    # File 'lib/opener/language_identifier/detector.rb', line 79
    def detect(input)
      return new_detector(input).detect
    end
#determine_profiles(input) ⇒ String
    # File 'lib/opener/language_identifier/detector.rb', line 147
    def determine_profiles(input)
      return short_input?(input) ? short_profiles_path : profiles_path
    end
#new_detector(input) ⇒ CybozuDetector
Returns a new detector with the profiles set based on the input.
This method analyses a lowercased version of the input as this yields better results for short text.
    # File 'lib/opener/language_identifier/detector.rb', line 99
    def new_detector(input)
      factory = com.cybozu.labs.langdetect.DetectorFactory.new

      factory.load_profile(determine_profiles(input))
      factory.set_seed(1)

      priorities = build_priorities(input, factory.langlist)
      detector   = com.cybozu.labs.langdetect.Detector.new(factory)

      detector.set_prior_map(priorities)
      detector.append(input.downcase)

      return detector
    end
#probabilities(input) ⇒ Array
    # File 'lib/opener/language_identifier/detector.rb', line 86
    def probabilities(input)
      return new_detector(input).get_probabilities.to_array
    end
#short_input?(input) ⇒ TrueClass|FalseClass
    # File 'lib/opener/language_identifier/detector.rb', line 155
    def short_input?(input)
      return input.length <= SHORT_THRESHOLD
    end
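The threshold check is inclusive, so an input of exactly 15 characters still counts as short and selects the short profiles. The sketch below shows the boundary behaviour of #short_input? together with the path choice made by #determine_profiles; the `ProfileSketch` module and the two path strings are hypothetical, and the paths are passed as arguments here rather than read from instance attributes.

```ruby
# Boundary illustration for the short-input check and profile selection.
# The module name and the path strings are placeholders for this example.
module ProfileSketch
  SHORT_THRESHOLD = 15

  def self.short_input?(input)
    # String#length counts characters, not bytes.
    input.length <= SHORT_THRESHOLD
  end

  def self.determine_profiles(input, profiles_path, short_profiles_path)
    short_input?(input) ? short_profiles_path : profiles_path
  end
end

at_threshold   = 'a' * 15
over_threshold = 'a' * 16

puts ProfileSketch.short_input?(at_threshold)   # true
puts ProfileSketch.short_input?(over_threshold) # false

puts ProfileSketch.determine_profiles(at_threshold, 'profiles', 'short_profiles')
puts ProfileSketch.determine_profiles(over_threshold, 'profiles', 'short_profiles')
```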