uea-stemmer

Similar to other stemmers, UEA-Lite operates on a set of rules which are used as steps. There are two groups of rules: the first to clean the tokens, and the second to alter suffixes.

The first group of rules first avoids a small list of six frequent problem words. An improvement to the stemmer would be to expand this list by adding other problem words which the second rule set cannot deal with. Second, possessive apostrophes are removed and contractions are expanded. All hyphens are removed and tokens containing digits are left untouched. Strings which are all upper case and digits are left untouched unless there is a lower case terminal ‘s’ (i.e. transforming plural forms of acronyms to singular forms).

Proper nouns should not usually be stemmed, except to remove possessives; our implementation will respect PoS tags if they are present. If the text is untagged the stemmer uses the simple heuristic that any capitalized token not preceded by sentence breaking punctuation is a proper noun.

Many texts, particularly scientific papers, contain sequences of digits, single letters, and other non-word tokens. Our implementation ignores tokens containing digits, single-letter tokens, and tokens with embedded punctuation.

The second group of rules contains 139 suffix rules, each testing for a specific type of suffix. The rules are set in a particular order so that the longest suffix applicable is used rather a shorter one which could lead to nonsense words and more words not stemmed entirely to their root form.

This is a port to Ruby from the port to Java from the original Perl script by Marie-Claire Jenkins and Dr. Dan J. Smith at the University of East Anglia.

Installation

Install the gem:

gem install uea-stemmer

Install the gem from source:

git clone git://github.com/ealdent/uea-stemmer.git
cd uea-stemmer
rake install

Depending on your setup, you may need to use sudo for either of these methods.

Example Usage

Typical usage:

require 'uea-stemmer'
stemmer = UEAStemmer.new

stemmer.stem('helpers')   # helper
stemmer.stem('dying')     # die
stemmer.stem('scarred')   # scar

'buries'.stem             # bury
'bodies'.stem             # body
'ordained'.stem           # ordain

You can also extract the stemmed word along with the rule by using the stem_with_rule method.

stem = stemmer.stem_with_rule('invited')   # Word('invite', Rule #22.3)
puts stem.rule  # rule #22.3 (remove -d when the word ends in -ited)

TODO

test handling of POS tags
add tests to mimic methodology used in the paper

Note on Patches/Pull Requests

Fork the project.
Make your feature addition or bug fix.
Add tests for it. This is important so I don’t break it in a future version unintentionally.
Commit, do not mess with rakefile, version, or history. (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
Send me a pull request. Bonus points for topic branches.

Relevant Web Pages

Copyright

This project is distributed under the Apache 2.0 License. See LICENSE for details.