Module: Stemmers

Defined in:
lib/stemmers.rb,
lib/stemmers/version.rb

Constant Summary

VERSION = "0.0.1"

Class Method Details

.detect_language(text) ⇒ String?

Detects the language of the given text. If the language cannot be detected, it returns nil.

Parameters:

  • text (String)

    The text to be analyzed.

Returns:

  • (String, nil)

    The detected language code or nil if undetectable.



# File 'lib/stemmers.rb', line 12

def self.detect_language(text)
  Bindings.detect_language(text)
end

.normalize_word(word) ⇒ String

Normalizes a word by removing accents and diacritics. This is useful for languages where accents do not change the meaning of the word, such as Portuguese.

Parameters:

  • word (String)

    The word to be normalized.

Returns:

  • (String)

    The normalized word with accents removed.



# File 'lib/stemmers.rb', line 77

def self.normalize_word(word)
  word.unicode_normalize(:nfkd).gsub(/\p{M}/, "")
end
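
The normalization step can be tried without the gem, since it relies only on Ruby's built-in `String#unicode_normalize`. The following is a minimal sketch; the method name `strip_marks` is illustrative and not part of the `Stemmers` API.

```ruby
# Standalone re-creation of the normalization step shown above.
# The name strip_marks is hypothetical; the gem exposes Stemmers.normalize_word.
def strip_marks(word)
  # NFKD splits "ç" into "c" plus U+0327 (combining cedilla);
  # \p{M} then matches the combining mark so gsub can drop it.
  word.unicode_normalize(:nfkd).gsub(/\p{M}/, "")
end

strip_marks("coração")  # => "coracao"
strip_marks("naïve")    # => "naive"
strip_marks("\uFB01m")  # => "fim" (NFKD also expands the "fi" ligature)
strip_marks("smørbrød") # => "smørbrød" ("ø" has no decomposition, so it stays)
```

Because the compatibility form (NFKD) is used rather than NFD, ligatures and similar compatibility characters are expanded as well. Letters that Unicode does not decompose, such as "ø", pass through unchanged.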

.stem(phrase, language:, clean: false, normalize: false) ⇒ Array<String>

Stems the given phrase in the specified language. If the language is not supported, it raises an `ArgumentError`.

Parameters:

  • phrase (String)

    The phrase to be stemmed.

  • language (String)

    The language of the phrase.

  • clean (Boolean) (defaults to: false)

    If true, removes stop words before stemming.

  • normalize (Boolean) (defaults to: false)

    If true, removes accents from each word after it has been stemmed.

Returns:

  • (Array<String>)

    An array of stemmed words.



# File 'lib/stemmers.rb', line 51

def self.stem(phrase, language:, clean: false, normalize: false)
  words = phrase.downcase.strip.split(/\s+/)

  if clean
    stop_words = stop_words(language)
    words = words.reject {|word| stop_words.include?(word) }
  end

  words.map {|word| stem_word(word, language:, normalize:) }
end
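
The tokenizing and stop-word filtering in the method above can be sketched without the native bindings. In this sketch `stem_phrase` and `NAIVE_STEMMER` are hypothetical stand-ins: the real stemmer is provided by `Bindings.stem_word`, and the trailing-"s" rule here exists only to make the example runnable.

```ruby
# Hypothetical stand-in for the native stemmer: strips one trailing "s".
NAIVE_STEMMER = ->(word) { word.sub(/s\z/, "") }

# Mirrors the pipeline above: downcase, trim, split on whitespace,
# drop stop words, then stem each remaining word.
def stem_phrase(phrase, stop_words: [], stemmer: NAIVE_STEMMER)
  words = phrase.downcase.strip.split(/\s+/)
  words = words.reject { |word| stop_words.include?(word) }
  words.map { |word| stemmer.call(word) }
end

stem_phrase("  The Cats chase birds ", stop_words: ["the"])
# => ["cat", "chase", "bird"]

# Only whitespace separates tokens, so "the," is not recognized as "the".
stem_phrase("The, cats", stop_words: ["the"])
# => ["the,", "cat"]
```

Since the phrase is split on whitespace only, punctuation stays attached to its word. Callers who need stop-word matching on punctuated text may want to strip punctuation before calling `stem`.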

.stem_word(word, language:, normalize: false, lowercase: false) ⇒ String

Stems the given word in the specified language. If the language is not supported, it raises an `ArgumentError`.

Parameters:

  • word (String)

    The word to be stemmed.

  • language (String)

    The language of the word.

  • lowercase (Boolean) (defaults to: false)

    If true, converts the word to lowercase before stemming.

  • normalize (Boolean) (defaults to: false)

    If true, removes accents from the word after stemming.

Returns:

  • (String)

    The stemmed word.



# File 'lib/stemmers.rb', line 34

def self.stem_word(word, language:, normalize: false, lowercase: false)
  word = word.downcase if lowercase
  stem = Bindings.stem_word(word, language)
  stem = normalize_word(stem) if normalize

  stem
end
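
The order of operations matters here: the word is stemmed first and accents are removed afterwards, so the stemmer still sees the accented form. The sketch below illustrates why, using a hypothetical accent-aware rule (`ACCENT_AWARE_STEMMER`) in place of the native stemmer; the rule itself is an assumption made only for this example.

```ruby
# Hypothetical stand-in for a stemmer whose rules depend on accents:
# it rewrites the Portuguese plural ending "ções" to the singular "ção".
ACCENT_AWARE_STEMMER = ->(word) { word.sub(/ções\z/, "ção") }

def remove_accents(word)
  word.unicode_normalize(:nfkd).gsub(/\p{M}/, "")
end

# Same order as stem_word above: lowercase, stem, then normalize.
def stem_then_normalize(word, stemmer: ACCENT_AWARE_STEMMER)
  remove_accents(stemmer.call(word.downcase))
end

# Reversed order, shown for contrast: the rule no longer matches.
def normalize_then_stem(word, stemmer: ACCENT_AWARE_STEMMER)
  stemmer.call(remove_accents(word.downcase))
end

stem_then_normalize("Ações") # => "acao"
normalize_then_stem("Ações") # => "acoes"
```

Normalizing last keeps the accent information available to the stemming rules while still producing accent-free output.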

.stop_words(language) ⇒ Array<String>

Returns the stop words for the specified language. If the language is not supported, an empty list is returned.

Parameters:

  • language (String)

    The language for which to retrieve stop words.

Returns:

  • (Array<String>)

    An array of stop words.



# File 'lib/stemmers.rb', line 67

def self.stop_words(language)
  stop_words_cache[language]
end

.stop_words_cache ⇒ Hash<String, Array<String>>

Returns a lazily populated cache of stop words. Each language's list is loaded from its JSON file on first access and reused for subsequent calls; a language with no JSON file is cached as an empty array.

Returns:

  • (Hash<String, Array<String>>)

    A hash mapping language codes to arrays of stop words.



# File 'lib/stemmers.rb', line 85

def self.stop_words_cache
  @stop_words_cache ||= Hash.new do |hash, key|
    path = File.join(__dir__, "stemmers/stopwords/#{key}.json")
    hash[key] = File.file?(path) ? JSON.load_file(path) : []
  end
end
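
The lazy-loading behavior comes from the block passed to `Hash.new`, which runs only when a key is missing. The sketch below reproduces that pattern against a temporary directory so it can run without the gem; `build_stop_words_cache` and the `en.json` contents are illustrative assumptions, and `JSON.parse(File.read(path))` is used in place of `JSON.load_file` only to stay compatible with older Ruby versions.

```ruby
require "json"
require "tmpdir"

# Builds a hash whose default block loads "<key>.json" from dir on first
# access and stores the result, so later lookups skip the file system.
def build_stop_words_cache(dir)
  Hash.new do |hash, key|
    path = File.join(dir, "#{key}.json")
    hash[key] = File.file?(path) ? JSON.parse(File.read(path)) : []
  end
end

results = Dir.mktmpdir do |dir|
  path = File.join(dir, "en.json")
  File.write(path, JSON.generate(%w[the a an]))

  cache = build_stop_words_cache(dir)
  first = cache["en"]   # read from disk and stored under "en"
  File.delete(path)     # the file is gone...
  second = cache["en"]  # ...but the stored list is still returned
  missing = cache["xx"] # no file for this key, so [] is stored

  { first: first, second: second, missing: missing, keys: cache.keys }
end

results[:first]   # => ["the", "a", "an"]
results[:second]  # => ["the", "a", "an"]
results[:missing] # => []
results[:keys]    # => ["en", "xx"]
```

Note that misses are cached too: once a language has resolved to an empty array, adding its JSON file later has no effect until the cache is rebuilt.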

.supported_language?(language) ⇒ Boolean

Checks whether the language is supported by the stemmer.

Parameters:

  • language (String)

    The language to check.

Returns:

  • (Boolean)

    True if the language is supported, false otherwise.



# File 'lib/stemmers.rb', line 20

def self.supported_language?(language)
  Bindings.supported_language?(language)
end
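
Because `detect_language` may return nil and `stem` raises `ArgumentError` for unsupported languages, callers may want to check both before stemming. The following is a minimal sketch of such a guard. `safe_stem` and `StubStemmers` are hypothetical: the stub only imitates the method signatures documented above so the example runs without the gem, and the `"en"` language code and its toy detection rule are assumptions.

```ruby
# Hypothetical stub exposing the same method names as the Stemmers module.
module StubStemmers
  # Toy detection rule: text containing "the" is treated as English.
  def self.detect_language(text)
    text.match?(/\bthe\b/i) ? "en" : nil
  end

  def self.supported_language?(language)
    language == "en"
  end

  def self.stem(phrase, language:, clean: false, normalize: false)
    raise ArgumentError, "unsupported language: #{language}" unless supported_language?(language)

    phrase.downcase.strip.split(/\s+/).map { |word| word.sub(/s\z/, "") }
  end
end

# Stems text when a supported language is detected; otherwise falls back
# to plain lowercase whitespace tokens instead of raising.
def safe_stem(text, api:)
  language = api.detect_language(text)
  return text.downcase.strip.split(/\s+/) unless language && api.supported_language?(language)

  api.stem(text, language: language)
end

safe_stem("The cats", api: StubStemmers)  # => ["the", "cat"]
safe_stem("Les chats", api: StubStemmers) # => ["les", "chats"]
```

With the gem loaded, the same guard should work by passing `api: Stemmers`, since it depends only on the three class methods documented on this page.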