Module: Stemmers

Defined in: lib/stemmers.rb, lib/stemmers/version.rb

Constant Summary

VERSION = "0.0.1"

Class Method Summary

.detect_language(text) ⇒ String?

Detects the language of the given text.
.normalize_word(word) ⇒ String

Normalizes a word by removing accents and diacritics.
.stem(phrase, language:, clean: false, normalize: false) ⇒ Array<String>

Stems the given phrase in the specified language.
.stem_word(word, language:, normalize: false, lowercase: false) ⇒ String

Stems the given word in the specified language.
.stop_words(language) ⇒ Array<String>

Returns the stop words for the specified language.
.stop_words_cache ⇒ Hash<String, Array<String>>

Returns a cache of stop words loaded from a JSON file.
.supported_language?(language) ⇒ Boolean

Checks whether the language is supported by the stemmers.

Class Method Details

.detect_language(text) ⇒ `String`?

Detects the language of the given text. Returns nil if the language cannot be detected.

Parameters:

text (String) —

The text to be analyzed.

Returns:

(String, nil) —

The detected language code or nil if undetectable.



# File 'lib/stemmers.rb', line 12

def self.detect_language(text)
  Bindings.detect_language(text)
end

.normalize_word(word) ⇒ `String`

Normalizes a word by removing accents and diacritics. This is useful for languages where accents do not change the meaning of the word, such as Portuguese.

Parameters:

word (String) —

The word to be normalized.

Returns:

(String) —

The normalized word with accents removed.



# File 'lib/stemmers.rb', line 77

def self.normalize_word(word)
  word.unicode_normalize(:nfkd).gsub(/\p{M}/, "")
end
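
To illustrate the approach above, here is a minimal standalone sketch of the same two steps (NFKD decomposition, then removal of combining marks) using only core Ruby. The helper name `strip_accents` is made up for this example and is not part of the gem.

```ruby
# Hypothetical helper mirroring the body of Stemmers.normalize_word.
def strip_accents(word)
  # NFKD splits "ç" into "c" plus U+0327 (combining cedilla), and so on.
  decomposed = word.unicode_normalize(:nfkd)
  # \p{M} matches the combining marks exposed by the decomposition.
  decomposed.gsub(/\p{M}/, "")
end

puts strip_accents("coração")  # prints "coracao"
puts strip_accents("ﬁm")       # prints "fim": NFKD also expands compatibility ligatures
puts strip_accents("smørbrød") # prints "smørbrød": "ø" has no decomposition, so it is kept
```

Because the compatibility form (`:nfkd`) is used rather than the canonical one (`:nfd`), characters such as ligatures are rewritten as well, not only accented letters.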

.stem(phrase, language:, clean: false, normalize: false) ⇒ `Array<String>`

Stems the given phrase in the specified language. If the language is not supported, it raises an `ArgumentError`.

Parameters:

phrase (String) —

The phrase to be stemmed.
language (String) —

The language of the phrase.
clean (Boolean) (defaults to: false) —

If true, removes stop words before stemming.
normalize (Boolean) (defaults to: false) —

If true, removes accents from each word after stemming.

Returns:

(Array<String>) —

An array of stemmed words.

# File 'lib/stemmers.rb', line 51

def self.stem(phrase, language:, clean: false, normalize: false)
  words = phrase.downcase.strip.split(/\s+/)

  if clean
    stop_words = stop_words(language)
    words = words.reject {|word| stop_words.include?(word) }
  end

  words.map {|word| stem_word(word, language:, normalize:) }
end
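
The pipeline in the listing above can be tried without the native bindings. The following is a sketch under stated assumptions: `TOY_STOP_WORDS`, `TOY_STEMMER`, and `toy_stem` are invented stand-ins for the gem's stop-word files and stemmer, and the suffix-stripping rule is deliberately naive.

```ruby
# Stand-ins for this sketch only; the gem loads stop words from JSON files
# and stems through its native bindings.
TOY_STOP_WORDS = %w[the a of].freeze
TOY_STEMMER = ->(word) { word.sub(/(ing|s)\z/, "") }

# Same steps as Stemmers.stem: downcase, trim, split on whitespace,
# optionally drop stop words, then stem each remaining token.
def toy_stem(phrase, clean: false)
  words = phrase.downcase.strip.split(/\s+/)
  words = words.reject { |word| TOY_STOP_WORDS.include?(word) } if clean
  words.map { |word| TOY_STEMMER.call(word) }
end

p toy_stem("  The Running dogs ")              # => ["the", "runn", "dog"]
p toy_stem("  The Running dogs ", clean: true) # => ["runn", "dog"]
p toy_stem("the, dogs", clean: true)           # => ["the,", "dog"]
```

The last call shows a consequence of splitting on whitespace only: punctuation stays attached to its token, so "the," does not match the stop word "the". Callers who need that match would have to strip punctuation before calling `stem`.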

.stem_word(word, language:, normalize: false, lowercase: false) ⇒ `String`

Stems the given word in the specified language. If the language is not supported, it raises an `ArgumentError`.

Parameters:

word (String) —

The word to be stemmed.
language (String) —

The language of the word.
lowercase (Boolean) (defaults to: false) —

If true, converts the word to lowercase before stemming.
normalize (Boolean) (defaults to: false) —

If true, removes accents from the word after stemming.

Returns:

(String) —

The stemmed word.

# File 'lib/stemmers.rb', line 34

def self.stem_word(word, language:, normalize: false, lowercase: false)
  word = word.downcase if lowercase
  stem = Bindings.stem_word(word, language)
  stem = normalize_word(stem) if normalize

  stem
end
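
The order of operations in the listing above (lowercase first, stem second, remove accents last) can be sketched with a stand-in stemmer. `FAKE_STEMMER` and `toy_stem_word` are hypothetical names for this example; the real stemming is done by the gem's bindings and will give different stems.

```ruby
# Hypothetical case-sensitive stemmer used only for this sketch:
# it strips a trailing lowercase "s" and nothing else.
FAKE_STEMMER = ->(word) { word.sub(/s\z/, "") }

# Same order of operations as Stemmers.stem_word.
def toy_stem_word(word, normalize: false, lowercase: false)
  word = word.downcase if lowercase # 1. optional lowercasing
  stem = FAKE_STEMMER.call(word)    # 2. stemming sees the accented form
  # 3. optional accent removal, applied to the stem only
  stem = stem.unicode_normalize(:nfkd).gsub(/\p{M}/, "") if normalize
  stem
end

p toy_stem_word("CAFÉS")                                   # => "CAFÉS"
p toy_stem_word("CAFÉS", lowercase: true)                  # => "café"
p toy_stem_word("CAFÉS", lowercase: true, normalize: true) # => "cafe"
```

With this stand-in, the uppercase input passes through unchanged unless `lowercase: true` is given, which is one reason the option exists. Accent removal happens after stemming, so the stemmer always receives the word with its diacritics intact.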

.stop_words(language) ⇒ `Array<String>`

Returns the stop words for the specified language. If the language is not supported, an empty list is returned.

Parameters:

language (String) —

The language for which to retrieve stop words.

Returns:

(Array<String>) —

An array of stop words.



# File 'lib/stemmers.rb', line 67

def self.stop_words(language)
  stop_words_cache[language]
end

.stop_words_cache ⇒ `Hash<String, Array<String>>`

Returns a cache of stop words loaded from a JSON file. The cache is initialized only once and reused for subsequent calls.

Returns:

(Hash<String, Array<String>>) —

A hash mapping language codes to arrays of stop words.

# File 'lib/stemmers.rb', line 85

def self.stop_words_cache
  @stop_words_cache ||= Hash.new do |hash, key|
    path = File.join(__dir__, "stemmers/stopwords/#{key}.json")
    hash[key] = File.file?(path) ? JSON.load_file(path) : []
  end
end
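
The lazy-loading pattern in the listing above relies on the block form of `Hash.new`, which runs only when a missing key is read. Below is a minimal sketch of the same pattern against a temporary directory. The function `build_stop_words_cache` and its `dir` argument are assumptions of this example, not part of the gem.

```ruby
require "json"
require "tmpdir"

# Builds a lazily populated cache over a directory of "<language>.json" files.
# The directory argument is an assumption of this sketch; the gem resolves
# its own path relative to lib/.
def build_stop_words_cache(dir)
  Hash.new do |hash, key|
    path = File.join(dir, "#{key}.json")
    # For a plain JSON array this yields the same result as JSON.load_file.
    hash[key] = File.file?(path) ? JSON.parse(File.read(path)) : []
  end
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, "en.json"), JSON.generate(%w[the a of]))
  cache = build_stop_words_cache(dir)

  p cache.key?("en") # => false, nothing is loaded until first access
  p cache["en"]      # => ["the", "a", "of"]
  p cache["xx"]      # => [], no file exists for this key
  p cache.keys       # => ["en", "xx"], both results are now stored
end
```

Note that a miss is stored too: once an unknown language has been looked up, its empty array stays in the hash, so the file system is checked at most once per key.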

.supported_language?(language) ⇒ `Boolean`

Checks whether the language is supported by the stemmers.

Parameters:

language (String) —

The language to check.

Returns:

(Boolean) —

True if the language is supported, false otherwise.



# File 'lib/stemmers.rb', line 20

def self.supported_language?(language)
  Bindings.supported_language?(language)
end
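
Since `detect_language` may return nil and stemming raises `ArgumentError` for an unsupported language, a caller may want to choose the language defensively before stemming. The sketch below shows one way to do that. `FakeStemmers` is an invented stand-in that only borrows the documented method names so the example runs without the gem, and `pick_language` and its `fallback` parameter are hypothetical.

```ruby
# Stand-in with the same method names as the documented API, so the sketch
# runs without the gem. Its behavior is invented for illustration.
module FakeStemmers
  SUPPORTED = %w[en pt].freeze

  def self.detect_language(text)
    text.include?("ção") ? "pt" : nil
  end

  def self.supported_language?(language)
    SUPPORTED.include?(language)
  end
end

# Picks a language to stem with: the detected one if it is supported,
# otherwise a caller-provided fallback.
def pick_language(text, api:, fallback: "en")
  detected = api.detect_language(text) # may be nil
  return detected if detected && api.supported_language?(detected)

  fallback
end

p pick_language("informação", api: FakeStemmers) # => "pt"
p pick_language("hello", api: FakeStemmers)      # => "en"
```

Passing the API object as an argument keeps the helper testable; in real use the `Stemmers` module itself would presumably be passed instead of the stand-in.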
