Class: CLD3::NNetLanguageIdentifier

Inherits:
Object
Defined in:
lib/cld3.rb

Overview

Class for detecting the language of a document.

Defined Under Namespace

Classes: Result

Constant Summary

MIN_NUM_BYTES_TO_CONSIDER =

Minimum number of bytes needed to make a prediction when the constructor is called without the corresponding parameter. This is a Numeric object.

140
MAX_NUM_BYTES_TO_CONSIDER =

Maximum number of bytes considered when making a prediction, used when the constructor is called without the corresponding parameter. This is a Numeric object.

700
MAX_NUM_INPUT_BYTES_TO_CONSIDER =

Maximum number of input bytes to process. This is a Numeric object.

10000
RELIABILITY_THRESHOLD =

Predictions with a probability greater than or equal to this threshold are marked as reliable. This threshold was optimized on a set of text segments extracted from Wikipedia, and results in an overall precision, recall, and F1 of 0.9760, 0.9624, and 0.9692, respectively. This is a Numeric object.

0.7
RELIABILITY_HR_BS_THRESHOLD =

Reliability threshold for the languages hr (Croatian) and bs (Bosnian). This is a Numeric object.

0.5
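
The two reliability thresholds can be read as a simple per-language comparison. The following is a minimal sketch of that rule, not code from the library: `reliable_prediction?` is a hypothetical helper, its keyword defaults copy the documented values of RELIABILITY_THRESHOLD and RELIABILITY_HR_BS_THRESHOLD, and it assumes the hr/bs threshold uses the same "greater than or equal to" comparison as the general one.

```ruby
# Hypothetical helper, not part of CLD3::NNetLanguageIdentifier.
# The defaults mirror the documented constants:
#   RELIABILITY_THRESHOLD       = 0.7
#   RELIABILITY_HR_BS_THRESHOLD = 0.5
def reliable_prediction?(language, probability,
                         default_threshold: 0.7, hr_bs_threshold: 0.5)
  # Assumption: hr and bs use ">=" against their own, lower threshold.
  threshold = %i[hr bs].include?(language) ? hr_bs_threshold : default_threshold
  probability >= threshold
end

puts reliable_prediction?(:en, 0.72) # true
puts reliable_prediction?(:en, 0.65) # false
puts reliable_prediction?(:hr, 0.55) # true
```

In the library itself the flag arrives precomputed in the result, so a helper like this is only useful for reasoning about why a given prediction was or was not marked reliable.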

Instance Method Summary

Constructor Details

#initialize(minNumBytes = MIN_NUM_BYTES_TO_CONSIDER, maxNumBytes = MAX_NUM_BYTES_TO_CONSIDER) ⇒ NNetLanguageIdentifier

The arguments are two Numeric objects: the minimum and maximum number of bytes to consider when making a prediction.



# File 'lib/cld3.rb', line 67

def initialize(minNumBytes = MIN_NUM_BYTES_TO_CONSIDER, maxNumBytes = MAX_NUM_BYTES_TO_CONSIDER)
  @cc = Unstable::NNetLanguageIdentifier::Pointer.new(Unstable.new_NNetLanguageIdentifier(minNumBytes, maxNumBytes))
end

Instance Method Details

#find_language(text) ⇒ Result

Finds the most likely language for the given text, along with additional information (e.g., probability). The prediction is based on the first N bytes, where N is the minimum of the number of interchange-valid UTF-8 bytes in the text and the maximum byte count given to the constructor (maxNumBytes). If N is less than the minimum byte count (minNumBytes), the language in the result is nil. The argument is a String object. The return value is an instance of Result.
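
The byte-window rule described above can be sketched in plain Ruby. This is an illustration under stated assumptions, not the library's implementation: `bytes_considered` is a hypothetical helper, its defaults copy MIN_NUM_BYTES_TO_CONSIDER and MAX_NUM_BYTES_TO_CONSIDER, and it treats the whole string as valid UTF-8, whereas the native code counts only the interchange-valid bytes.

```ruby
# Hypothetical helper, not part of the cld3 gem.
# Returns the number of bytes a prediction would be based on, or nil when
# the text is too short for a prediction (the case where the language is nil).
def bytes_considered(text, min_num_bytes: 140, max_num_bytes: 700)
  # Assumption: every byte of the UTF-8 encoded text is interchange-valid.
  valid_bytes = text.encode(Encoding::UTF_8).bytesize
  n = [valid_bytes, max_num_bytes].min
  n < min_num_bytes ? nil : n
end

p bytes_considered("a" * 100)  # nil (below the 140-byte minimum)
p bytes_considered("a" * 200)  # 200
p bytes_considered("a" * 1000) # 700 (capped at the maximum)
```

Because the limits are counted in bytes rather than characters, text in scripts that need several bytes per character reaches the minimum with fewer characters than ASCII text does.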



# File 'lib/cld3.rb', line 78

def find_language(text)
  # Copy the text, re-encoded as UTF-8, into native memory for the FFI call.
  text_utf8 = text.encode(Encoding::UTF_8)
  pointer = FFI::MemoryPointer.new(:char, text_utf8.bytesize)
  pointer.put_bytes(0, text_utf8)

  cc_result = Unstable.NNetLanguageIdentifier_find_language(@cc, pointer, text_utf8.bytesize)
  # The language code comes back as a pointer plus a byte length.
  language = cc_result[:language_data].read_bytes(cc_result[:language_size])

  # The "und" language code is reported to callers as nil.
  Result.new(
    language == "und" ? nil : language.to_sym,
    cc_result[:probability],
    cc_result[:reliable?],
    cc_result[:proportion]
  )
end