Class: CLD3::NNetLanguageIdentifier
- Inherits: Object
  - Object
    - CLD3::NNetLanguageIdentifier
- Defined in: lib/cld3.rb
Overview
Class for detecting the language of a document.
Defined Under Namespace
Classes: Result
Constant Summary
- MIN_NUM_BYTES_TO_CONSIDER =
Minimum number of bytes needed to make a prediction, used when the constructor is called without the corresponding parameter. This is a Numeric object.
140
- MAX_NUM_BYTES_TO_CONSIDER =
Maximum number of bytes considered when making a prediction, used when the constructor is called without the corresponding parameter. This is a Numeric object.
700
- MAX_NUM_INPUT_BYTES_TO_CONSIDER =
Maximum number of input bytes to process. This is a Numeric object.
10000
- RELIABILITY_THRESHOLD =
Predictions with probability greater than or equal to this threshold are marked as reliable. This threshold was optimized on a set of text segments extracted from Wikipedia, and results in an overall precision, recall, and F1 of 0.9760, 0.9624, and 0.9692, respectively. This is a Numeric object.
0.7
- RELIABILITY_HR_BS_THRESHOLD =
Reliability threshold for the languages hr (Croatian) and bs (Bosnian). This is a Numeric object.
0.5
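The way the two reliability thresholds combine can be illustrated with a short, plain-Ruby sketch. It does not call the library: `reliable_prediction?` is a hypothetical helper, and the rule it encodes (hr and bs use the lower threshold, every other language uses the default) is an assumption drawn from the constant descriptions above, not code taken from lib/cld3.rb.

```ruby
# Values copied from the constant summary above.
RELIABILITY_THRESHOLD = 0.7
RELIABILITY_HR_BS_THRESHOLD = 0.5

# Hypothetical helper, not part of the cld3 API.
# Assumed rule: hr and bs are judged against the lower threshold,
# all other languages against the default threshold.
def reliable_prediction?(language, probability)
  threshold =
    if %i[hr bs].include?(language)
      RELIABILITY_HR_BS_THRESHOLD
    else
      RELIABILITY_THRESHOLD
    end
  probability >= threshold
end

p reliable_prediction?(:en, 0.72) # true
p reliable_prediction?(:hr, 0.55) # true
p reliable_prediction?(:en, 0.55) # false
```

In practice the reliability flag comes back from the identifier itself, so a caller would normally read it from the Result rather than recompute it.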
Instance Method Summary
- #find_language(text) ⇒ Object
Finds the most likely language for the given text, along with additional information (e.g., probability).
- #initialize(minNumBytes = MIN_NUM_BYTES_TO_CONSIDER, maxNumBytes = MAX_NUM_BYTES_TO_CONSIDER) ⇒ NNetLanguageIdentifier (constructor)
The arguments are two Numeric objects.
Constructor Details
#initialize(minNumBytes = MIN_NUM_BYTES_TO_CONSIDER, maxNumBytes = MAX_NUM_BYTES_TO_CONSIDER) ⇒ NNetLanguageIdentifier
The arguments are two Numeric objects: the minimum and the maximum number of bytes to consider when making a prediction.
# File 'lib/cld3.rb', line 67
def initialize(minNumBytes = MIN_NUM_BYTES_TO_CONSIDER, maxNumBytes = MAX_NUM_BYTES_TO_CONSIDER)
  @cc = Unstable::NNetLanguageIdentifier::Pointer.new(Unstable.new_NNetLanguageIdentifier(minNumBytes, maxNumBytes))
end
Instance Method Details
#find_language(text) ⇒ Object
Finds the most likely language for the given text, along with additional information (e.g., probability). The prediction is based on the first N bytes, where N is the minimum of the number of interchange-valid UTF-8 bytes and max_num_bytes_. If N is less than min_num_bytes_, this function returns nil as the language. The argument is a String object. The return value is an instance of Result.
# File 'lib/cld3.rb', line 78
def find_language(text)
  text_utf8 = text.encode(Encoding::UTF_8)
  pointer = FFI::MemoryPointer.new(:char, text_utf8.bytesize)
  pointer.put_bytes(0, text_utf8)

  cc_result = Unstable.NNetLanguageIdentifier_find_language(@cc, pointer, text_utf8.bytesize)
  language = cc_result[:language_data].read_bytes(cc_result[:language_size])

  Result.new(
    language == "und" ? nil : language.to_sym,
    cc_result[:probability],
    cc_result[:reliable?],
    cc_result[:proportion])
end
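The byte-window rule described for find_language can be sketched in plain Ruby, without the native library. `bytes_to_consider` is a hypothetical helper introduced only for illustration. It treats every byte of the UTF-8 encoding as valid, which is a simplifying assumption: the real identifier applies its own interchange-validity check before counting.

```ruby
# Defaults copied from the constant summary.
MIN_NUM_BYTES_TO_CONSIDER = 140
MAX_NUM_BYTES_TO_CONSIDER = 700

# Hypothetical helper, not part of the cld3 API.
# Returns N, the number of bytes a prediction would be based on,
# or nil when the text is too short for a prediction.
def bytes_to_consider(text,
                      min_bytes = MIN_NUM_BYTES_TO_CONSIDER,
                      max_bytes = MAX_NUM_BYTES_TO_CONSIDER)
  # Assumption: all bytes of the UTF-8 encoding count as valid.
  valid_bytes = text.encode(Encoding::UTF_8).bytesize
  # N is the smaller of the valid byte count and the upper limit.
  n = [valid_bytes, max_bytes].min
  # Below the lower limit, no language is reported.
  n < min_bytes ? nil : n
end

p bytes_to_consider("short text") # nil
p bytes_to_consider("a" * 1000)   # 700
```

The limits are measured in bytes, not characters, so text in scripts that need several bytes per character reaches the minimum with fewer characters than ASCII text does.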