Class: CLD3::NNetLanguageIdentifier

Inherits: Object

Defined in: lib/cld3.rb

Overview

Class for detecting the language of a document.

Defined Under Namespace

Classes: Result, SpanInfo

Instance Method Summary

#find_language(text) ⇒ Object

Finds the most likely language for the given text, along with additional information (e.g., probability).

#find_top_n_most_freq_langs(text, num_langs) ⇒ Object

Splits the input text (up to the first byte, if any, that is not interchange-valid UTF-8) into spans based on script, predicts a language for each span, and returns an Array of the top num_langs most frequent languages along with additional information (e.g., proportions).

#initialize(min_num_bytes = MIN_NUM_BYTES_TO_CONSIDER, max_num_bytes = MAX_NUM_BYTES_TO_CONSIDER) ⇒ NNetLanguageIdentifier constructor

The arguments are two Numeric objects.

Constructor Details

#initialize(min_num_bytes = MIN_NUM_BYTES_TO_CONSIDER, max_num_bytes = MAX_NUM_BYTES_TO_CONSIDER) ⇒ `NNetLanguageIdentifier`

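Both arguments are Numeric objects. min_num_bytes is rounded up and max_num_bytes is rounded down to an integer before use.

Raises:

(ArgumentError) if min_num_bytes is negative or is not less than max_num_bytes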

# File 'lib/cld3.rb', line 49

def initialize(min_num_bytes = MIN_NUM_BYTES_TO_CONSIDER, max_num_bytes = MAX_NUM_BYTES_TO_CONSIDER)
  min_num_bytes = min_num_bytes.ceil
  max_num_bytes = max_num_bytes.floor
  raise ArgumentError if min_num_bytes < 0 || min_num_bytes >= max_num_bytes
  @cc = Unstable.make(min_num_bytes, max_num_bytes)
end
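
The argument checks in the constructor can be tried in isolation. The following is a minimal sketch, not part of the library: `normalize_byte_limits` is a hypothetical helper that copies the rounding and the guard from `#initialize` without calling the native extension, so it only shows which argument pairs would be accepted.

```ruby
# Hypothetical helper (not part of CLD3): mirrors the rounding and the
# range check performed by #initialize, without building an identifier.
def normalize_byte_limits(min_num_bytes, max_num_bytes)
  min_num_bytes = min_num_bytes.ceil   # lower bound is rounded up
  max_num_bytes = max_num_bytes.floor  # upper bound is rounded down
  # Same guard as the constructor: no negative minimum, no empty range.
  raise ArgumentError if min_num_bytes < 0 || min_num_bytes >= max_num_bytes
  [min_num_bytes, max_num_bytes]
end

normalize_byte_limits(0, 1000)     # => [0, 1000]
normalize_byte_limits(10.2, 99.9)  # => [11, 99]

begin
  normalize_byte_limits(100, 100)  # equal bounds are rejected
rescue ArgumentError
  # raised because min_num_bytes >= max_num_bytes
end
```

Because the lower bound is rounded up and the upper bound down, fractional arguments can only narrow the accepted range, never widen it.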

Instance Method Details

#find_language(text) ⇒ `Object`

Finds the most likely language for the given text, along with additional information (e.g., probability). The prediction is based on the first N bytes, where N is the smaller of the number of interchange-valid UTF-8 bytes in the text and max_num_bytes. If N is less than min_num_bytes, this method returns nil.

The argument is a String object. The return value is an instance of Result, or nil in the case described above.



# File 'lib/cld3.rb', line 63

def find_language(text)
  @cc.find_language(Result, SpanInfo, text.encode(Encoding::UTF_8))
end
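
The rule for how many bytes feed the prediction can be illustrated without the native extension. The following is only a sketch under stated assumptions: `bytes_considered` is a hypothetical helper, and ordinary UTF-8 validity is used as a stand-in for CLD3's stricter "interchange valid" check, which this page does not define.

```ruby
# Hypothetical helper (not part of CLD3): returns N, the number of leading
# bytes a prediction would be based on, or nil when N < min_num_bytes.
# Assumption: plain UTF-8 validity approximates "interchange valid".
def bytes_considered(text, min_num_bytes, max_num_bytes)
  valid = 0
  text.each_char do |ch|
    break unless ch.valid_encoding?  # stop at the first invalid byte
    valid += ch.bytesize
  end
  n = [valid, max_num_bytes].min     # never look past max_num_bytes
  n < min_num_bytes ? nil : n        # too little text: no prediction
end

bytes_considered("hello", 0, 1000)     # 5 valid bytes, all considered
bytes_considered("hi", 10, 1000)       # nil, shorter than min_num_bytes
bytes_considered("a" * 2000, 0, 1000)  # capped at max_num_bytes
bytes_considered("ab\xFFcd", 0, 1000)  # only the 2 bytes before the invalid one
```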

#find_top_n_most_freq_langs(text, num_langs) ⇒ `Object`

Splits the input text (up to the first byte, if any, that is not interchange-valid UTF-8) into spans based on script, predicts a language for each span, and returns an Array of the top num_langs most frequent languages along with additional information (e.g., proportions).

The number of bytes considered for each span is the smaller of the span's size and max_num_bytes. A span shorter than min_num_bytes is skipped. If more languages are requested than are found in the input, only the languages found are returned. If the input text is too long, only the first MAX_NUM_INPUT_BYTES_TO_CONSIDER bytes are processed.

The first argument is a String object. The second argument is a Numeric object. The return value is an Array of Result instances.

# File 'lib/cld3.rb', line 80

def find_top_n_most_freq_langs(text, num_langs)
  @cc.find_top_n_most_freq_langs(Result, SpanInfo,
                                 text.encode(Encoding::UTF_8),
                                 num_langs)
end
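
Two of the rules above, skipping short spans and returning fewer entries than requested, can be sketched in plain Ruby. This is an illustration only, not the library's algorithm: `top_n_by_bytes` and the `[language, span_bytesize]` pairs are hypothetical, and both ranking languages by summed span bytes and taking proportions over the kept spans are assumptions made for the example.

```ruby
# Hypothetical helper (not part of CLD3). Each span is a pair of
# [predicted_language, span_bytesize]; real span detection and language
# prediction happen in the native extension and are not reproduced here.
def top_n_by_bytes(spans, num_langs, min_num_bytes)
  # Spans shorter than min_num_bytes are skipped entirely.
  kept = spans.reject { |_lang, size| size < min_num_bytes }
  total = kept.sum { |_lang, size| size }
  return [] if total.zero?

  # Assumption: "most frequent" is modelled as the largest byte share.
  totals = Hash.new(0)
  kept.each { |lang, size| totals[lang] += size }

  totals.sort_by { |_lang, bytes| -bytes }
        .first(num_langs)  # yields fewer entries if fewer languages exist
        .map { |lang, bytes| [lang, bytes.to_f / total] }
end

spans = [[:en, 60], [:ja, 30], [:fr, 5]]
top_n_by_bytes(spans, 5, 10)  # two entries: :fr is skipped, only :en and :ja remain
top_n_by_bytes(spans, 1, 10)  # one entry, for :en
```

Asking for five languages here yields two entries, which matches the documented behaviour that the result is never padded to num_langs.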