Module: Stemmers
- Defined in: lib/stemmers.rb, lib/stemmers/version.rb
Constant Summary
- VERSION = "0.0.1"
Class Method Summary
- .detect_language(text) ⇒ String?
  Detects the language of the given text.
- .normalize_word(word) ⇒ String
  Normalizes a word by removing accents and diacritics.
- .stem(phrase, language:, clean: false, normalize: false) ⇒ Array<String>
  Stems the given phrase in the specified language.
- .stem_word(word, language:, normalize: false, lowercase: false) ⇒ String
  Stems the given word in the specified language.
- .stop_words(language) ⇒ Array<String>
  Returns the stop words for the specified language.
- .stop_words_cache ⇒ Hash<String, Array<String>>
  Returns a cache of stop words loaded from a JSON file.
- .supported_language?(language) ⇒ Boolean
  Detects if the language is supported by the stemmers.
Class Method Details
.detect_language(text) ⇒ String?
Detects the language of the given text. If the language cannot be detected, it returns nil.
# File 'lib/stemmers.rb', line 12

def self.detect_language(text)
  Bindings.detect_language(text)
end
.normalize_word(word) ⇒ String
Normalizes a word by removing accents and diacritics. This is useful for languages where accents do not change the meaning of the word, such as Portuguese.
# File 'lib/stemmers.rb', line 77

def self.normalize_word(word)
  word.unicode_normalize(:nfkd).gsub(/\p{M}/, "")
end
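As a minimal sketch of this normalization step in isolation, the snippet below applies the same two calls to a few sample words. The helper name `strip_diacritics` and the sample words are illustrative only and are not part of the gem.

```ruby
# Standalone sketch of the accent-stripping step. `strip_diacritics` is a
# hypothetical name used so the snippet runs without the gem installed.
def strip_diacritics(word)
  # NFKD splits each accented character into a base letter plus combining
  # marks; the gsub then drops every character in the Unicode Mark category.
  word.unicode_normalize(:nfkd).gsub(/\p{M}/, "")
end

puts strip_diacritics("coração") # prints "coracao"
puts strip_diacritics("café")    # prints "cafe"
```

Because NFKD is a compatibility decomposition, it also rewrites compatibility characters, so a ligature such as "ﬁ" comes back as the two letters "fi".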
.stem(phrase, language:, clean: false, normalize: false) ⇒ Array<String>
Stems the given phrase in the specified language. If the language is not supported, it raises an `ArgumentError`.

# File 'lib/stemmers.rb', line 51

def self.stem(phrase, language:, clean: false, normalize: false)
  words = phrase.downcase.strip.split(/\s+/)

  if clean
    stop_words = stop_words(language)
    words = words.reject {|word| stop_words.include?(word) }
  end

  words.map {|word| stem_word(word, language:, normalize:) }
end
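The splitting and stop-word filtering that happen before each word is stemmed can be sketched on their own, without the native stemmer. This is an illustration under stated assumptions: `tokenize` and `SAMPLE_STOP_WORDS` are hypothetical names, and the three-word stop list is made up for the example rather than taken from the gem's JSON files.

```ruby
# Hypothetical stop-word list, for illustration only.
SAMPLE_STOP_WORDS = %w[the a of].freeze

# Mirrors the downcase/strip/split and optional filtering steps of `stem`,
# but returns the words unstemmed.
def tokenize(phrase, clean: false, stop_words: SAMPLE_STOP_WORDS)
  words = phrase.downcase.strip.split(/\s+/)
  words = words.reject { |word| stop_words.include?(word) } if clean
  words
end

p tokenize("  The Quick  brown fox ")              # => ["the", "quick", "brown", "fox"]
p tokenize("  The Quick  brown fox ", clean: true) # => ["quick", "brown", "fox"]
p tokenize("Hello, world")                         # => ["hello,", "world"]
```

The last call shows that splitting on whitespace leaves punctuation attached to the word, so a token such as "the," would not match a stop-word entry "the". Callers who need that may want to strip punctuation before passing the phrase in.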
.stem_word(word, language:, normalize: false, lowercase: false) ⇒ String
Stems the given word in the specified language. If the language is not supported, it raises an `ArgumentError`.

# File 'lib/stemmers.rb', line 34

def self.stem_word(word, language:, normalize: false, lowercase: false)
  word = word.downcase if lowercase
  stem = Bindings.stem_word(word, language)
  stem = normalize_word(stem) if normalize
  stem
end
.stop_words(language) ⇒ Array<String>
Returns the stop words for the specified language. If the language is not supported, an empty list is returned.
# File 'lib/stemmers.rb', line 67

def self.stop_words(language)
  stop_words_cache[language]
end
.stop_words_cache ⇒ Hash<String, Array<String>>
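Returns a cache of stop words loaded from JSON files. The cache hash is created only once and reused for subsequent calls. Each language's list is read from `stemmers/stopwords/<language>.json` the first time it is requested and stored for later lookups; a language with no matching file is stored as an empty array.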
# File 'lib/stemmers.rb', line 85

def self.stop_words_cache
  @stop_words_cache ||= Hash.new do |hash, key|
    path = File.join(__dir__, "stemmers/stopwords/#{key}.json")
    hash[key] = File.file?(path) ? JSON.load_file(path) : []
  end
end
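The lazy-loading behavior of a `Hash` with a default block can be seen in a small standalone sketch. It assumes a temporary directory holding one made-up `en.json` file; `build_cache` is a hypothetical helper, not part of the gem.

```ruby
require "json"
require "tmpdir"

# Builds a lazily populated cache over `dir`. The helper name and directory
# layout are assumptions made for this sketch.
def build_cache(dir)
  Hash.new do |hash, key|
    path = File.join(dir, "#{key}.json")
    # Assigning into the hash means the file is read at most once per key.
    hash[key] = File.file?(path) ? JSON.load_file(path) : []
  end
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, "en.json"), JSON.generate(%w[the a of]))
  cache = build_cache(dir)

  p cache.key?("en") # => false, nothing has been loaded yet
  p cache["en"]      # => ["the", "a", "of"]
  p cache["xx"]      # => [], no file exists for this key
  p cache.keys       # => ["en", "xx"], both lookups are now stored
end
```

Keys are stored exactly as passed, so in this sketch a lookup with `:en` would read the same file again and keep a second entry alongside `"en"`.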
.supported_language?(language) ⇒ Boolean
Detects if the language is supported by the stemmers.
# File 'lib/stemmers.rb', line 20

def self.supported_language?(language)
  Bindings.supported_language?(language)
end