Class: CLD3::NNetLanguageIdentifier
- Inherits: Object
  - Object
  - CLD3::NNetLanguageIdentifier
- Defined in: lib/cld3.rb
Overview
Class for detecting the language of a document.
Defined Under Namespace
Constant Summary
- MIN_NUM_BYTES_TO_CONSIDER = 140
  Minimum number of bytes needed to make a prediction if the constructor is called without the corresponding parameter. This is a Numeric object.
- MAX_NUM_BYTES_TO_CONSIDER = 700
  Maximum number of bytes considered when making a prediction if the constructor is called without the corresponding parameter. This is a Numeric object.
- MAX_NUM_INPUT_BYTES_TO_CONSIDER = 10000
  Maximum number of input bytes to process. This is a Numeric object.
- RELIABILITY_THRESHOLD = 0.7
  Predictions with probability greater than or equal to this threshold are marked as reliable. This threshold was optimized on a set of text segments extracted from Wikipedia, and results in an overall precision, recall, and F1 of 0.9760, 0.9624, and 0.9692, respectively. This is a Numeric object.
- RELIABILITY_HR_BS_THRESHOLD = 0.5
  Reliability threshold for the languages hr and bs. This is a Numeric object.
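The two reliability thresholds can be read as a simple rule. The following is a minimal sketch, assuming the lower threshold replaces the general one only for the language codes `hr` and `bs`, as the constant descriptions suggest. The `reliable?` helper is hypothetical and not part of the gem; the constants are redefined locally so the snippet runs on its own.

```ruby
# Values copied from the constant summary above.
RELIABILITY_THRESHOLD = 0.7
RELIABILITY_HR_BS_THRESHOLD = 0.5

# Hypothetical helper (not part of CLD3): decides whether a prediction
# counts as reliable, given its language code and probability.
def reliable?(language, probability)
  threshold =
    if %w[hr bs].include?(language.to_s)
      RELIABILITY_HR_BS_THRESHOLD
    else
      RELIABILITY_THRESHOLD
    end
  probability >= threshold
end

puts reliable?("en", 0.69) # false: below the general 0.7 threshold
puts reliable?("hr", 0.55) # true: hr uses the 0.5 threshold
```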
Instance Method Summary
- #find_language(text) ⇒ Object
  Finds the most likely language for the given text, along with additional information (e.g., probability).
- #find_top_n_most_freq_langs(text, num_langs) ⇒ Object
  Splits the input text (up to the first byte, if any, that is not interchange-valid UTF-8) into spans based on the script, predicts a language for each span, and returns a vector storing the top num_langs most frequent languages along with additional information (e.g., proportions).
- #initialize(min_num_bytes = MIN_NUM_BYTES_TO_CONSIDER, max_num_bytes = MAX_NUM_BYTES_TO_CONSIDER) ⇒ NNetLanguageIdentifier (constructor)
  The arguments are two Numeric objects.
Constructor Details
#initialize(min_num_bytes = MIN_NUM_BYTES_TO_CONSIDER, max_num_bytes = MAX_NUM_BYTES_TO_CONSIDER) ⇒ NNetLanguageIdentifier
The arguments are two Numeric objects.
    # File 'lib/cld3.rb', line 75
    def initialize(min_num_bytes = MIN_NUM_BYTES_TO_CONSIDER, max_num_bytes = MAX_NUM_BYTES_TO_CONSIDER)
      @cc = Unstable::NNetLanguageIdentifier::Pointer.new(Unstable.new_NNetLanguageIdentifier(min_num_bytes, max_num_bytes))
    end
Instance Method Details
#find_language(text) ⇒ Object
Finds the most likely language for the given text, along with additional information (e.g., probability). The prediction is based on the first N bytes, where N is the minimum of the number of interchange-valid UTF-8 bytes and max_num_bytes_. If N is less than min_num_bytes_, this function returns nil. The argument is a String object. The return value is an instance of Result.
    # File 'lib/cld3.rb', line 86
    def find_language(text)
      text_utf8 = text.encode(Encoding::UTF_8)
      pointer = FFI::MemoryPointer.new(:char, text_utf8.bytesize)
      begin
        pointer.put_bytes(0, text_utf8)
        result = Unstable.NNetLanguageIdentifier_find_language(@cc, pointer, text_utf8.bytesize)
        begin
          convert_result Unstable::NNetLanguageIdentifier::Result.new(result)
        ensure
          Unstable.delete_result result
        end
      ensure
        pointer.free
      end
    end
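The rule for how many bytes a prediction is based on can be sketched in plain Ruby. This is an illustration only, not the library's code: `bytes_considered` is a hypothetical helper, the defaults mirror the documented constants, and ordinary UTF-8 validity is used as an approximation of "interchange valid", which may be a stricter check.

```ruby
# Hypothetical helper (not part of CLD3): returns the number of bytes a
# prediction would be based on, or nil when the text is too short.
# Approximation: plain UTF-8 validity stands in for "interchange valid".
def bytes_considered(text, min_num_bytes = 140, max_num_bytes = 700)
  utf8 = text.dup.force_encoding(Encoding::UTF_8)
  # Size in bytes of the leading run of well-formed characters.
  valid_bytes = utf8.each_char.take_while(&:valid_encoding?).sum(&:bytesize)
  n = [valid_bytes, max_num_bytes].min
  n < min_num_bytes ? nil : n
end

p bytes_considered("too short") # nil: fewer than 140 bytes
p bytes_considered("a" * 1000)  # 700: capped at max_num_bytes
```

Callers of `find_language` should therefore expect nil for short inputs and handle it explicitly.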
#find_top_n_most_freq_langs(text, num_langs) ⇒ Object
Splits the input text (up to the first byte, if any, that is not interchange-valid UTF-8) into spans based on the script, predicts a language for each span, and returns a vector storing the top num_langs most frequent languages along with additional information (e.g., proportions). The number of bytes considered for each span is the minimum of the size of the span and max_num_bytes_. If more languages are requested than are available in the input, only the available languages are returned. If a span is shorter than min_num_bytes_, the span is skipped. If the input text is too long, only the first MAX_NUM_INPUT_BYTES_TO_CONSIDER bytes are processed. The first argument is a String object. The second argument is a Numeric object. The return value is an Array of Result instances.
    # File 'lib/cld3.rb', line 117
    def find_top_n_most_freq_langs(text, num_langs)
      text_utf8 = text.encode(Encoding::UTF_8)
      pointer = FFI::MemoryPointer.new(:char, text_utf8.bytesize)
      begin
        pointer.put_bytes(0, text_utf8)
        results = Unstable.NNetLanguageIdentifier_find_top_n_most_freq_langs(@cc, pointer, text_utf8.bytesize, num_langs)
        begin
          num_langs.times
            .lazy
            .map { |index| convert_result Unstable.refer_to_nth_result(results, index) }
            .take_while { |result| !result.nil? }
            .to_a
        ensure
          Unstable.delete_results results
        end
      ensure
        pointer.free
      end
    end
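The selection rules described for this method (skip short spans, return at most num_langs entries, report proportions) can be illustrated with a small sketch. It is not the CLD3 implementation: `top_languages` is a hypothetical helper, the `[language, span_bytes]` pairs stand in for per-span predictions, and computing each proportion as a share of the bytes in the kept spans is an assumption.

```ruby
# Hypothetical sketch (not the CLD3 implementation): ranks languages by the
# number of bytes attributed to them. `spans` is an array of
# [language, span_bytes] pairs standing in for per-span predictions.
def top_languages(spans, num_langs, min_num_bytes = 140)
  totals = Hash.new(0)
  spans.each do |language, span_bytes|
    next if span_bytes < min_num_bytes # spans that are too short are skipped
    totals[language] += span_bytes
  end
  total_bytes = totals.values.sum

  totals
    .sort_by { |_language, bytes| -bytes } # most frequent language first
    .first(num_langs)                      # fewer entries if fewer languages exist
    .map do |language, bytes|
      # Assumption: proportion is the share of bytes among the kept spans.
      { language: language, proportion: bytes.to_f / total_bytes }
    end
end

spans = [["en", 300], ["ja", 200], ["en", 100], ["fr", 500]]
p top_languages(spans, 2).map { |r| r[:language] } # ["fr", "en"]
p top_languages(spans, 5).size                     # 3: only three languages are available
```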