Module: EStem

Included in:
String
Defined in:
lib/estem.rb

Overview

Spanish Stemming

Description

This gem reduces Spanish words to their stems. It uses an algorithm based on Martin Porter’s specification for Spanish.

For more information, visit: snowball.tartarus.org/algorithms/spanish/stemmer.html

License

This code is provided under the terms of the MIT License.

Authors

* Manuel A. G

Instance Method Summary

Instance Method Details

#es_stem ⇒ Object

Stems a Spanish word and returns the result as a new string. The original letter case is kept, as the examples below show.

"albergues".es_stem      # ==> "alberg"
"habitaciones".es_stem   # ==> "habit"
"ALbeRGues".es_stem      # ==> "ALbeRG"
"HaBiTaCiOnEs".es_stem   # ==> "HaBiT"
"Hacinamiento".es_stem   # ==> "Hacin"

If you do not know the encoding of your data, use String#safe_es_stem instead.

call-seq: str.es_stem => "new_str"



# File 'lib/estem.rb', line 44

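def es_stem
  # Work on a copy so the receiver is left untouched.
  str = self.dup
  case str.length
  when 0
    return str
  when 1
    # Nothing to stem in a single character; only strip its accent.
    return remove_accent(str)
  end

  # Step names follow the Snowball Spanish specification.
  step0(str)                      # attached pronouns
  unless step1(str)               # standard suffixes
    # Reached only when step1 returns a falsy value;
    # step2b runs only when step2a also returns a falsy value.
    step2b(str) unless step2a(str)
  end

  step3(str)                      # residual suffixes
  remove_accent(str)
end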
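
The remove_accent helper called above is not shown on this page. According to the Snowball Spanish specification, the last step of the algorithm removes acute accents from vowels. The following is a minimal sketch of that idea only, assuming a hypothetical method named strip_acute_accents; it is not the gem's own implementation and may differ from it in detail.

```ruby
# Hypothetical helper, NOT the gem's remove_accent. It only illustrates the
# accent-stripping step described in the Snowball Spanish specification.
def strip_acute_accents(word)
  # String#tr maps each character of the first set to the character at the
  # same position in the second set and returns a new string.
  word.tr('áéíóúÁÉÍÓÚ', 'aeiouAEIOU')
end

puts strip_acute_accents('habitación')  # prints "habitacion"
puts strip_acute_accents('ÁRBOL')       # prints "ARBOL"
```

Only acute-accented vowels are mapped in this sketch; other characters such as ñ and ü pass through unchanged.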

#safe_es_stem ⇒ Object

Use this method when you do not know the encoding of the data being handled. It returns a new string with the same encoding as the original. Be aware that it is somewhat slower than String#es_stem. As the source below shows, it can also return nil for non-UTF-8 input that is empty once invalid characters are dropped, or that cannot be transcoded.

call-seq: str.safe_es_stem => "new_str"



# File 'lib/estem.rb', line 68

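def safe_es_stem
  if self.encoding == Encoding::UTF_8
    # Already UTF-8: drop invalid characters, then stem.
    return self.chars.select{|c| c.valid_encoding? }.join.es_stem
  end

  # The bytes do not match the declared encoding; check whether they
  # are actually valid UTF-8.
  unless self.valid_encoding?
    tmp = self.dup
    if tmp.force_encoding('UTF-8').valid_encoding?
      begin
        return tmp.es_stem
      rescue
        # Stemming failed; fall through to the transcoding path below.
      end
    end
  end

  # Remember the original encoding and drop invalid characters.
  default_enc = self.encoding.name
  str = self.chars.select{|c| c.valid_encoding? }.join

  return nil if str.empty?

  begin
    # Stem in UTF-8, then convert the result back to the original encoding.
    tmp = str.encode('UTF-8', str.encoding.name).es_stem
    return tmp.encode(default_enc, 'UTF-8')
  rescue
    # Transcoding failed.
    return nil
  end
end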
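
The transcoding path above can be tried without the gem installed. The following is a minimal sketch, assuming a hypothetical helper named with_utf8_round_trip and a placeholder block that stands in for String#es_stem (it merely reverses the string). It illustrates only the technique of dropping invalid characters, working in UTF-8, and converting the result back to the original encoding.

```ruby
# Hypothetical helper (not part of the gem): runs a block on a UTF-8 copy of
# the string and converts the block's result back to the original encoding.
def with_utf8_round_trip(str)
  original = str.encoding
  # Keep only characters that are valid in the string's declared encoding.
  clean = str.chars.select { |c| c.valid_encoding? }.join
  return nil if clean.empty?

  begin
    result = yield clean.encode('UTF-8', original)
    result.encode(original, 'UTF-8')
  rescue EncodingError
    # Conversion errors are subclasses of EncodingError.
    nil
  end
end

latin1 = 'habitación'.encode('ISO-8859-1')
# The block is a placeholder for String#es_stem; it only reverses the text.
out = with_utf8_round_trip(latin1) { |s| s.reverse }
puts out.encoding          # prints "ISO-8859-1"
puts out.encode('UTF-8')   # prints "nóicatibah"
```

Unlike the bare rescue in the listing above, this sketch rescues only EncodingError, so unrelated failures inside the block are not silently turned into nil.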