Module: EStem
- Included in: String
- Defined in: lib/estem.rb
Overview
:title: Spanish Stemming
Description
This gem is for reducing Spanish words to their roots. It uses an algorithm based on Martin Porter’s specifications.
For more information, visit: snowball.tartarus.org/algorithms/spanish/stemmer.html
Description (from the Spanish original)
This gem reduces Spanish words to their respective roots, using an algorithm based on Martin Porter's specifications.
For more information, visit: snowball.tartarus.org/algorithms/spanish/stemmer.html
License
This code is provided under the terms of the MIT License.
Authors
* Manuel A. G
Instance Method Summary
- #es_stem ⇒ Object
  Stems a Spanish word and returns the result as a new string.
- #safe_es_stem ⇒ Object
  Use this method when you do not know the encoding of the data being handled.
Instance Method Details
#es_stem ⇒ Object
Stems a Spanish word and returns the result as a new string.
"albergues".es_stem # ==> "alberg"
"habitaciones".es_stem # ==> "habit"
"ALbeRGues".es_stem # ==> "ALbeRG"
"HaBiTaCiOnEs".es_stem # ==> "HaBiT"
"Hacinamiento".es_stem # ==> "Hacin"
If you do not know the encoding of your data, use String#safe_es_stem instead.
:call-seq: str.es_stem => "new_str"
# File 'lib/estem.rb', line 44
def es_stem
  str = self.dup
  case str.length
  when 0
    return str
  when 1
    return remove_accent(str)
  end
  step0(str)
  unless step1(str)
    step2b(str) unless step2a(str)
  end
  step3(str)
  remove_accent(str)
end
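The order in which the steps fire can be hard to read from the nested `unless`. The sketch below traces it under one assumption that the listing suggests but does not show: each step helper returns a truthy value when it removed a suffix. `stem_trace` and its boolean parameters are hypothetical stand-ins, not part of the gem.

```ruby
# Hypothetical trace of the dispatch order in es_stem.
# step1 / step2a are booleans standing in for "this step removed a suffix"
# (an assumption about the real helpers' return values).
def stem_trace(step1:, step2a:)
  trace = [:step0]                  # step0 always runs first
  trace << :step1                   # step1 is always attempted
  unless step1
    trace << :step2a                # attempted only when step1 removed nothing
    trace << :step2b unless step2a  # attempted only when step2a removed nothing too
  end
  trace << :step3 << :remove_accent # both always run last
  trace
end

p stem_trace(step1: true,  step2a: false)
p stem_trace(step1: false, step2a: true)
p stem_trace(step1: false, step2a: false)
```

Under that assumption, step2a and step2b act as successive fallbacks, while step0, step3, and remove_accent run for every word longer than one character.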
#safe_es_stem ⇒ Object
Use this method when you do not know the encoding of the data being handled. It returns a new string in the same encoding as the original, or nil when no valid characters remain or the conversion fails. Be aware that this method is a bit slower than String#es_stem.
:call-seq: str.safe_es_stem => "new_str"
# File 'lib/estem.rb', line 68
def safe_es_stem
  if self.encoding == Encoding::UTF_8
    # remove invalid characters
    return self.chars.select{|c| c.valid_encoding? }.join.es_stem
  end

  unless self.valid_encoding?
    tmp = self.dup
    if tmp.force_encoding('UTF-8').valid_encoding?
      begin
        return tmp.es_stem
      rescue
      end
    end
  end

  default_enc = self.encoding.name
  str = self.chars.select{|c| c.valid_encoding? }.join
  return nil if str.empty?

  begin
    tmp = str.encode('UTF-8', str.encoding.name).es_stem
    return tmp.encode(default_enc, 'UTF-8')
  rescue
    return nil
  end
end
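The last part of the listing above follows a common pattern: drop invalid characters, convert to UTF-8, do the work, and convert back to the original encoding. Here is a minimal sketch of that pattern using only core String methods. `with_utf8_roundtrip` is a hypothetical helper, and the block that strips a trailing "es" is a toy stand-in for the stemmer, not the gem's algorithm.

```ruby
# Hypothetical helper (not part of the gem): yields a cleaned UTF-8 copy of
# str to the block, then converts the block's result back to str's encoding.
# Returns nil when nothing valid remains or a conversion fails.
def with_utf8_roundtrip(str)
  original = str.encoding
  clean = str.chars.select(&:valid_encoding?).join # drop invalid characters
  return nil if clean.empty?

  result = yield clean.encode(Encoding::UTF_8)
  result.encode(original)
rescue EncodingError
  nil
end

# Toy stand-in for stemming: strip a trailing "es".
toy = ->(s) { s.sub(/es\z/, "") }

latin1 = "árboles".encode(Encoding::ISO_8859_1)
out = with_utf8_roundtrip(latin1, &toy)
p out.encoding                  # same encoding as the input
p out.encode(Encoding::UTF_8)

# A UTF-8 string with one invalid byte in front: the byte is dropped.
dirty = [0xFF].pack("C").force_encoding(Encoding::UTF_8) + "árboles"
p with_utf8_roundtrip(dirty, &toy)
```

Converting back at the end is what lets the caller receive a string in the encoding it passed in, at the cost of two extra conversions per call.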