Class: Treat::Workers::Inflectors::Stemmers::Porter

Inherits:
Object
Defined in:
lib/treat/workers/inflectors/stemmers/porter.rb

Overview

Stems words using a pure-Ruby implementation of the Porter stemming algorithm, a rule-based suffix-stripping stemmer that is widely used and generally regarded as the de facto standard for English stemming.

Original paper: Porter, 1980. An algorithm for suffix stripping. Program, vol. 14, no. 3, pp. 130-137.

Authors: Ray Pereda. License: Unknown.

Constant Summary

STEP_2_LIST =
{
  'ational'=>'ate', 'tional'=>'tion', 'enci'=>'ence', 'anci'=>'ance',
  'izer'=>'ize', 'bli'=>'ble',
  'alli'=>'al', 'entli'=>'ent', 'eli'=>'e', 'ousli'=>'ous',
  'ization'=>'ize', 'ation'=>'ate',
  'ator'=>'ate', 'alism'=>'al', 'iveness'=>'ive', 'fulness'=>'ful',
  'ousness'=>'ous', 'aliti'=>'al',
  'iviti'=>'ive', 'biliti'=>'ble', 'logi'=>'log'
}
STEP_3_LIST =
{
  'icate'=>'ic', 'ative'=>'', 'alize'=>'al', 'iciti'=>'ic',
  'ical'=>'ic', 'ful'=>'', 'ness'=>''
}
SUFFIX_1_REGEXP =
/(
ational  |
tional   |
enci     |
anci     |
izer     |
bli      |
alli     |
entli    |
eli      |
ousli    |
ization  |
ation    |
ator     |
alism    |
iveness  |
fulness  |
ousness  |
aliti    |
iviti    |
biliti   |
logi)$/x
SUFFIX_2_REGEXP =
/(
al       |
ance     |
ence     |
er       |
ic       |
able     |
ible     |
ant      |
ement    |
ment     |
ent      |
ou       |
ism      |
ate      |
iti      |
ous      |
ive      |
ize)$/x
C =

consonant

"[^aeiou]"
V =

vowel

"[aeiouy]"
CC =

consonant sequence

"#{C}(?>[^aeiouy]*)"
VV =

vowel sequence

"#{V}(?>[aeiou]*)"
MGR0 =

[cc]vvcc… is m>0

/^(#{CC})?#{VV}#{CC}/o
MEQ1 =

[cc]vvcc[vv] is m=1

/^(#{CC})?#{VV}#{CC}(#{VV})?$/o
MGR1 =

[cc]vvccvvcc… is m>1

/^(#{CC})?#{VV}#{CC}#{VV}#{CC}/o
VOWEL_IN_STEM =

vowel in stem

/^(#{CC})?#{V}/o
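
The measure patterns above can be tried out in isolation. The following is a minimal sketch, assuming the constant definitions are copied verbatim from this page; the `MeasureSketch` module and its `flags` helper are hypothetical names used only for illustration and are not part of Treat.

```ruby
# Hypothetical sandbox module (not part of Treat); the pattern
# definitions are copied from the constants documented above.
module MeasureSketch
  C  = "[^aeiou]"            # consonant
  V  = "[aeiouy]"            # vowel
  CC = "#{C}(?>[^aeiouy]*)"  # consonant sequence
  VV = "#{V}(?>[aeiou]*)"    # vowel sequence

  MGR0 = /^(#{CC})?#{VV}#{CC}/o                # m > 0
  MEQ1 = /^(#{CC})?#{VV}#{CC}(#{VV})?$/o       # m = 1
  MGR1 = /^(#{CC})?#{VV}#{CC}#{VV}#{CC}/o      # m > 1

  # Reports which of the three measure patterns match a stem.
  def self.flags(stem)
    {
      mgr0: stem.match?(MGR0),
      meq1: stem.match?(MEQ1),
      mgr1: stem.match?(MGR1)
    }
  end
end

%w[tree trouble oats private].each do |stem|
  puts "#{stem}: #{MeasureSketch.flags(stem)}"
end
```

In this sketch "tree" matches none of the patterns (m = 0), "trouble" and "oats" match MGR0 and MEQ1 (m = 1), and "private" matches MGR0 and MGR1 (m = 2), which agrees with the worked examples in Porter's paper.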

Class Method Summary

Class Method Details

.stem(word, options = {}) ⇒ Object

Returns the stem of a word using a native Porter stemmer.

Options: none.



# File 'lib/treat/workers/inflectors/stemmers/porter.rb', line 16

def self.stem(word, options = {})
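  # Convert the word to a string and copy it, so that the
  # caller's object is never mutated in place (String#to_s
  # returns the receiver itself for a plain String).
  w = word.to_s.dup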
  return w if w.length < 3
  # Map initial y to Y so that the patterns
  # never treat it as vowel.
  w[0] = 'Y' if w[0] == ?y
  # Step 1a
  if w =~ /(ss|i)es$/
    w = $` + $1
  elsif w =~ /([^s])s$/
    w = $` + $1
  end
  # Step 1b
  if w =~ /eed$/
    w.chop! if $` =~ MGR0
  elsif w =~ /(ed|ing)$/
    stem = $`
    if stem =~ VOWEL_IN_STEM
      w = stem
      case w
      when /(at|bl|iz)$/             then w << "e"
      when /([^aeiouylsz])\1$/       then w.chop!
      when /^#{CC}#{V}[^aeiouwxy]$/o then w << "e"
      end
    end
  end
  # Step 1c
  if w =~ /y$/
    stem = $`
    w = stem + "i" if stem =~ VOWEL_IN_STEM
  end
  # Step 2
  if w =~ SUFFIX_1_REGEXP
    stem = $`
    suffix = $1
    if stem =~ MGR0
      w = stem + STEP_2_LIST[suffix]
    end
  end
  # Step 3
  if w =~
    /(icate|ative|alize|iciti|ical|ful|ness)$/
    stem = $`
    suffix = $1
    if stem =~ MGR0
      w = stem + STEP_3_LIST[suffix]
    end
  end
  # Step 4
  if w =~ SUFFIX_2_REGEXP
    stem = $`
    if stem =~ MGR1
      w = stem
    end
  elsif w =~ /(s|t)(ion)$/
    stem = $` + $1
    if stem =~ MGR1
      w = stem
    end
  end
  # Step 5
  if w =~ /e$/
    stem = $`
    if (stem =~ MGR1) ||
      (stem =~ MEQ1 && stem !~
      /^#{CC}#{V}[^aeiouwxy]$/o)
      w = stem
    end
  end
  if w =~ /ll$/ && w =~ MGR1
    w.chop!
  end
  # and turn initial Y back to y
  w[0] = 'y' if w[0] == ?Y
  w
end
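
To make the regex idiom used throughout the method easier to follow, Step 1a can be run on its own. This is only a sketch: `step_1a` is a hypothetical standalone helper written for illustration, not a method that Treat exposes, and it reproduces just the first step of the listing above.

```ruby
# Hypothetical standalone version of Step 1a (not part of Treat's API).
def step_1a(word)
  w = word.to_s.dup  # work on a copy of the input
  if w =~ /(ss|i)es$/
    # $` holds the text before the match; $1 holds the captured "ss" or "i".
    w = $` + $1
  elsif w =~ /([^s])s$/
    # Drop a plural "s" unless it is preceded by another "s".
    w = $` + $1
  end
  w
end

%w[caresses ponies caress cats].each do |word|
  puts "#{word} -> #{step_1a(word)}"
end
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

Each later step follows the same pattern: match a suffix, take the pre-match as the candidate stem, and keep the shortened form only if the stem satisfies that step's measure condition.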