Class: Treat::Workers::Inflectors::Stemmers::Porter
- Inherits:
- Object
- Inheritance hierarchy:
- Treat::Workers::Inflectors::Stemmers::Porter < Object
- Defined in:
- lib/treat/workers/inflectors/stemmers/porter.rb
Overview
Stems English words using a native Ruby implementation of the Porter stemming algorithm, a rule-based suffix-stripping stemmer that is widely used and generally regarded as the de facto standard for English stemming.
Original paper: Porter, M. F. (1980). An algorithm for suffix stripping. Program, vol. 14, no. 3, pp. 130-137.
Author: Ray Pereda. License: unknown.
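As a rough illustration of what "rule-based suffix stripping" means, the sketch below applies one conditional rule from Porter's paper, (m>0) ATIONAL -> ATE. The helper names `measure` and `apply_rule` are hypothetical and are not part of Treat, and `measure` is a simplification that always treats y as a vowel.

```ruby
# Hypothetical helper: approximate Porter's measure m, i.e. the number
# of vowel-run + consonant-run pairs in the stem.
# Simplifying assumption: 'y' always counts as a vowel.
def measure(stem)
  stem.scan(/[aeiouy]+[^aeiouy]+/).length
end

# Hypothetical helper: replace `suffix` with `replacement`, but only
# when the stem left after removing the suffix has measure > 0.
def apply_rule(word, suffix, replacement)
  return word unless word.end_with?(suffix)
  stem = word[0...-suffix.length]
  measure(stem) > 0 ? stem + replacement : word
end

puts apply_rule("relational", "ational", "ate")  # relate
puts apply_rule("rational", "ational", "ate")    # rational (stem "r" has m = 0)
```

The measure condition is what keeps short words such as "rational" from being over-stemmed. The class documented here expresses the same condition with the regular expressions listed in the constant summary rather than with a counting helper.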
Constant Summary
- STEP_2_LIST =
{ 'ational'=>'ate', 'tional'=>'tion', 'enci'=>'ence', 'anci'=>'ance', 'izer'=>'ize', 'bli'=>'ble', 'alli'=>'al', 'entli'=>'ent', 'eli'=>'e', 'ousli'=>'ous', 'ization'=>'ize', 'ation'=>'ate', 'ator'=>'ate', 'alism'=>'al', 'iveness'=>'ive', 'fulness'=>'ful', 'ousness'=>'ous', 'anati'=>'al', 'iviti'=>'ive', 'binati'=>'ble', 'logi'=>'log' }
- STEP_3_LIST =
{ 'icate'=>'ic', 'ative'=>'', 'alize'=>'al', 'iciti'=>'ic', 'ical'=>'ic', 'ful'=>'', 'ness'=>'' }
- SUFFIX_1_REGEXP =
/( ational | tional | enci | anci | izer | bli | alli | entli | eli | ousli | ization | ation | ator | alism | iveness | fulness | ousness | anati | iviti | binati | logi)$/x
- SUFFIX_2_REGEXP =
/( al | ance | ence | er | ic | able | ible | ant | ement | ment | ent | ou | ism | ate | iti | ous | ive | ize)$/x
- C =
consonant
"[^aeiou]"
- V =
vowel
"[aeiouy]"
- CC =
consonant sequence
"#{C}(?>[^aeiouy]*)"
- VV =
vowel sequence
"#{V}(?>[aeiou]*)"
- MGR0 =
[cc]vvcc… is m>0
/^(#{CC})?#{VV}#{CC}/o
- MEQ1 =
[cc]vvcc[vv] is m=1
/^(#{CC})?#{VV}#{CC}(#{VV})?$/o
- MGR1 =
[cc]vvccvvcc… is m>1
/^(#{CC})?#{VV}#{CC}#{VV}#{CC}/o
- VOWEL_IN_STEM =
vowel in stem
/^(#{CC})?#{V}/o
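The measure patterns are easier to follow when tried on a few stems. The sketch below copies the constants as documented above into a standalone script and classifies three example words taken from Porter's paper. `measure_class` is a hypothetical helper introduced only for this illustration.

```ruby
# Constants reproduced from the constant summary.
C  = "[^aeiou]"              # consonant
V  = "[aeiouy]"              # vowel
CC = "#{C}(?>[^aeiouy]*)"    # consonant sequence
VV = "#{V}(?>[aeiou]*)"      # vowel sequence

MGR0 = /^(#{CC})?#{VV}#{CC}/o               # m > 0
MEQ1 = /^(#{CC})?#{VV}#{CC}(#{VV})?$/o      # m = 1
MGR1 = /^(#{CC})?#{VV}#{CC}#{VV}#{CC}/o     # m > 1

# Hypothetical helper: report which measure class a stem falls into.
def measure_class(stem)
  if stem =~ MGR1
    :m_gt_1
  elsif stem =~ MEQ1
    :m_eq_1
  else
    :m_eq_0
  end
end

%w[tree trouble private].each do |stem|
  puts "#{stem}: #{measure_class(stem)}"
end
# tree: m_eq_0
# trouble: m_eq_1
# private: m_gt_1
```

The atomic groups `(?>...)` stop the regex engine from giving back characters it has already consumed as part of a consonant or vowel run, so each run is matched as a whole.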
Class Method Summary
- .stem(word, options = {}) ⇒ Object
Returns the stem of a word using a native Porter stemmer.
Class Method Details
.stem(word, options = {}) ⇒ Object
Returns the stem of a word using a native Porter stemmer.
Options: none. The options hash is accepted but not used by the method body.
# File 'lib/treat/workers/inflectors/stemmers/porter.rb', line 16

def self.stem(word, options = {})
  # Copy the word and convert it to a string.
  w = word.to_s
  return w if w.length < 3
  # Map initial y to Y so that the patterns
  # never treat it as a vowel.
  w[0] = 'Y' if w[0] == ?y
  # Step 1a
  if w =~ /(ss|i)es$/
    w = $` + $1
  elsif w =~ /([^s])s$/
    w = $` + $1
  end
  # Step 1b
  if w =~ /eed$/
    w.chop! if $` =~ MGR0
  elsif w =~ /(ed|ing)$/
    stem = $`
    if stem =~ VOWEL_IN_STEM
      w = stem
      case w
      when /(at|bl|iz)$/ then w << "e"
      when /([^aeiouylsz])\1$/ then w.chop!
      when /^#{CC}#{V}[^aeiouwxy]$/o then w << "e"
      end
    end
  end
  # Step 1c: turn a final y into i when the stem contains a vowel.
  if w =~ /y$/
    stem = $`
    w = stem + "i" if stem =~ VOWEL_IN_STEM
  end
  # Step 2
  if w =~ SUFFIX_1_REGEXP
    stem = $`
    suffix = $1
    if stem =~ MGR0
      w = stem + STEP_2_LIST[suffix]
    end
  end
  # Step 3
  if w =~ /(icate|ative|alize|iciti|ical|ful|ness)$/
    stem = $`
    suffix = $1
    if stem =~ MGR0
      w = stem + STEP_3_LIST[suffix]
    end
  end
  # Step 4
  if w =~ SUFFIX_2_REGEXP
    stem = $`
    if stem =~ MGR1
      w = stem
    end
  elsif w =~ /(s|t)(ion)$/
    stem = $` + $1
    if stem =~ MGR1
      w = stem
    end
  end
  # Step 5
  if w =~ /e$/
    stem = $`
    if (stem =~ MGR1) || (stem =~ MEQ1 && stem !~ /^#{CC}#{V}[^aeiouwxy]$/o)
      w = stem
    end
  end
  if w =~ /ll$/ && w =~ MGR1
    w.chop!
  end
  # and turn initial Y back to y
  w[0] = 'y' if w[0] == ?Y
  w
end
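To make Step 1a of the listing above concrete, here are the same two rules isolated in a standalone function. `step_1a` is a hypothetical name used for illustration only, since the class exposes just `.stem` and not the individual steps.

```ruby
# Hypothetical helper isolating Step 1a (plural stripping).
# Rule 1: "sses" -> "ss" and "ies" -> "i".
# Rule 2: a final "s" is dropped unless it is preceded by another "s".
def step_1a(word)
  w = word.to_s
  if w =~ /(ss|i)es$/
    $` + $1          # text before the match plus the kept "ss" or "i"
  elsif w =~ /([^s])s$/
    $` + $1          # text before the match plus the kept consonant or vowel
  else
    w
  end
end

puts step_1a("caresses")  # caress
puts step_1a("ponies")    # poni
puts step_1a("cats")      # cat
puts step_1a("caress")    # caress
```

A result such as "poni" is expected: the Porter algorithm produces stems that conflate related word forms, not dictionary words.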