Class: CLD3::NNetLanguageIdentifier

Inherits:
Object
Defined in:
lib/cld3.rb

Overview

Class for detecting the language of a document.

Defined Under Namespace

Classes: Result, SpanInfo

Constant Summary

MIN_NUM_BYTES_TO_CONSIDER =

Minimum number of bytes needed to make a prediction when the constructor is called without the corresponding parameter. This is a Numeric object.

140
MAX_NUM_BYTES_TO_CONSIDER =

Maximum number of bytes considered when making a prediction if the constructor is called without the corresponding parameter. This is a Numeric object.

700
MAX_NUM_INPUT_BYTES_TO_CONSIDER =

Maximum number of input bytes to process. This is a Numeric object.

10000
RELIABILITY_THRESHOLD =

Predictions with a probability greater than or equal to this threshold are marked as reliable. This threshold was optimized on a set of text segments extracted from Wikipedia, and results in an overall precision, recall, and F1 of 0.9760, 0.9624, and 0.9692, respectively. This is a Numeric object.

0.7
RELIABILITY_HR_BS_THRESHOLD =

Reliability threshold for the languages hr and bs, used in place of RELIABILITY_THRESHOLD for those two languages. This is a Numeric object.

0.5
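
As a rough illustration of how the two reliability thresholds relate, the sketch below applies them to a language code and a probability. The module `ReliabilitySketch` and its `reliable?` method are hypothetical names written only for this example, and the constants are local copies of the values documented above; the gem itself obtains reliability from the underlying library rather than from Ruby code like this.

```ruby
# Hypothetical, standalone sketch; not part of the cld3 gem.
module ReliabilitySketch
  # Local copies of the documented constant values.
  RELIABILITY_THRESHOLD = 0.7
  RELIABILITY_HR_BS_THRESHOLD = 0.5

  # Returns true when the probability meets the threshold that applies
  # to the given language code (hr and bs use the lower threshold).
  def self.reliable?(language, probability)
    threshold =
      if %w[hr bs].include?(language.to_s)
        RELIABILITY_HR_BS_THRESHOLD
      else
        RELIABILITY_THRESHOLD
      end
    probability >= threshold
  end
end

ReliabilitySketch.reliable?("en", 0.65) # below 0.7
ReliabilitySketch.reliable?("hr", 0.65) # at or above 0.5
```

The comparison is inclusive, matching the "greater than or equal to" wording in the constant description.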

Instance Method Summary

Constructor Details

#initialize(min_num_bytes = MIN_NUM_BYTES_TO_CONSIDER, max_num_bytes = MAX_NUM_BYTES_TO_CONSIDER) ⇒ NNetLanguageIdentifier

The arguments are two Numeric objects: the minimum and the maximum number of bytes to consider when making a prediction.

Raises:

  • (ArgumentError) if min_num_bytes is negative, or if min_num_bytes is greater than or equal to max_num_bytes


# File 'lib/cld3.rb', line 78

def initialize(min_num_bytes = MIN_NUM_BYTES_TO_CONSIDER, max_num_bytes = MAX_NUM_BYTES_TO_CONSIDER)
  raise ArgumentError if min_num_bytes < 0 || min_num_bytes >= max_num_bytes
  @cc = Unstable::NNetLanguageIdentifier::Pointer.new(Unstable.new_NNetLanguageIdentifier(min_num_bytes, max_num_bytes))
end
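
The guard in the constructor accepts only pairs where 0 <= min_num_bytes < max_num_bytes. The sketch below restates that condition as a standalone predicate so the accepted and rejected cases are easy to see; `valid_byte_bounds?` is a hypothetical helper name and is not provided by the gem.

```ruby
# Hypothetical helper mirroring the constructor's guard clause:
# the constructor raises ArgumentError exactly when this returns false.
def valid_byte_bounds?(min_num_bytes, max_num_bytes)
  !(min_num_bytes < 0 || min_num_bytes >= max_num_bytes)
end

valid_byte_bounds?(140, 700) # the documented defaults
valid_byte_bounds?(0, 1000)  # zero is an allowed minimum
valid_byte_bounds?(-1, 700)  # negative minimum is rejected
valid_byte_bounds?(700, 700) # equal bounds are rejected
```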

Instance Method Details

#find_language(text) ⇒ Object

Finds the most likely language for the given text, along with additional information (e.g., probability). The prediction is based on the first N bytes, where N is the smaller of the number of interchange-valid UTF-8 bytes and the max_num_bytes value passed to the constructor. If N is less than the constructor's min_num_bytes, this method returns nil. The argument is a String object. The return value is otherwise an instance of Result.



# File 'lib/cld3.rb', line 90

def find_language(text)
  # @type const FFI: untyped

  text_utf8 = text.encode(Encoding::UTF_8)
  pointer = FFI::MemoryPointer.new(:char, text_utf8.bytesize)

  begin
    pointer.put_bytes(0, text_utf8)

    result = Unstable.NNetLanguageIdentifier_find_language(@cc, pointer, text_utf8.bytesize)
    begin
      convert_result Unstable::NNetLanguageIdentifier::Result.new(result)
    ensure
      Unstable.delete_result result
    end
  ensure
    pointer.free
  end
end
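
To make the byte-window rule for find_language concrete, here is a minimal pure-Ruby sketch of how N could be computed. It is an approximation under stated assumptions: "interchange valid" is a notion defined by the underlying library, and this sketch substitutes plain UTF-8 validity for it. The helper names `valid_utf8_prefix_bytesize` and `bytes_considered` are hypothetical and are not part of the gem.

```ruby
# Hypothetical helper: number of bytes in the longest valid UTF-8 prefix.
# Assumption: plain UTF-8 validity stands in for "interchange valid".
def valid_utf8_prefix_bytesize(text)
  count = 0
  text.dup.force_encoding(Encoding::UTF_8).each_char do |char|
    break unless char.valid_encoding? # stop at the first invalid byte
    count += char.bytesize
  end
  count
end

# Hypothetical helper: N as described above, or nil when the usable
# prefix is shorter than min_num_bytes (the case where no prediction
# is made). Defaults copy the documented constant values.
def bytes_considered(text, min_num_bytes = 140, max_num_bytes = 700)
  n = [valid_utf8_prefix_bytesize(text), max_num_bytes].min
  n < min_num_bytes ? nil : n
end

bytes_considered("a" * 200)  # whole text fits in the window
bytes_considered("a" * 1000) # capped at max_num_bytes
bytes_considered("a" * 100)  # too short for a prediction
```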

#find_top_n_most_freq_langs(text, num_langs) ⇒ Object

Splits the input text (up to the first byte, if any, that is not interchange-valid UTF-8) into spans based on script, predicts a language for each span, and returns an Array holding the top num_langs most frequent languages along with additional information (e.g., proportions). The number of bytes considered for each span is the smaller of the span size and the max_num_bytes value passed to the constructor. A span shorter than the constructor's min_num_bytes is skipped. If more languages are requested than are available in the input, only the available ones are returned. If the input text is too long, only the first MAX_NUM_INPUT_BYTES_TO_CONSIDER bytes are processed.

The first argument is a String object. The second argument is a Numeric object. The return value is an Array of Result instances.



# File 'lib/cld3.rb', line 123

def find_top_n_most_freq_langs(text, num_langs)
  # @type const FFI: untyped
  # @type var a: untyped

  text_utf8 = text.encode(Encoding::UTF_8)
  pointer = FFI::MemoryPointer.new(:char, text_utf8.bytesize)

  begin
    pointer.put_bytes(0, text_utf8)

    results = Unstable.NNetLanguageIdentifier_find_top_n_most_freq_langs(@cc, pointer, text_utf8.bytesize, num_langs)
    begin
      a = num_langs.times
        .lazy
        .map { |index| convert_result Unstable.refer_to_nth_result(results, index) }
        .take_while { |result| !result.nil? }
        .to_a

      a
    ensure
      Unstable.delete_results results
    end
  ensure
    pointer.free
  end
end
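
The method above gathers results with a lazy enumerator that stops at the first nil, which is how fewer than num_langs elements can come back. The sketch below isolates that pattern with a plain Array standing in for the native result list; `nth_or_nil` and `collect_top_n` are hypothetical names used only for illustration.

```ruby
# Hypothetical stand-in for the native lookup: returns the item at the
# given index, or nil when the index is past the available results.
def nth_or_nil(available, index)
  available[index]
end

# Collects up to num_langs items, stopping at the first nil so that
# requesting more than is available yields only what exists.
def collect_top_n(available, num_langs)
  num_langs.times
    .lazy
    .map { |index| nth_or_nil(available, index) }
    .take_while { |result| !result.nil? }
    .to_a
end

collect_top_n(%i[en ja], 5)    # only two items are available
collect_top_n(%i[en ja fr], 2) # limited by num_langs
```

Because the enumerator is lazy, the lookup is not called again once take_while sees the first nil.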