IndonesianStemmer


Stems Indonesian words using an approach based on the Porter Stemmer, following the algorithm presented in "A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia" by Fadillah Z Tala.

Installation

Add this line to your application's Gemfile:

gem 'indonesian_stemmer'

And then execute:

$ bundle

Or install it yourself as:

$ gem install indonesian_stemmer

Usage

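require 'rubygems'            # only needed on Ruby 1.8; later versions load RubyGems automatically
require 'indonesian_stemmer'

# Stem a word through the module method
IndonesianStemmer.stem('mendengarkan')  # => "dengar"
# or call stem directly on a String
'beriman'.stem                          # => "iman"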
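
The `'beriman'.stem` form implies that the gem extends `String`. As a rough sketch of how such a convenience method can delegate to a module-level function, and assuming nothing about the gem's internals: `DemoStemmer`, its `LOOKUP` table and `String#demo_stem` are hypothetical names invented for illustration, and the lookup table merely stands in for real stemming logic.

```ruby
# Hypothetical stand-in, not the gem's implementation.
module DemoStemmer
  # Toy lookup table holding the two examples from this README.
  LOOKUP = {
    'mendengarkan' => 'dengar',
    'beriman'      => 'iman'
  }.freeze

  # Returns the known stem, or the word unchanged when it is not in the table.
  def self.stem(word)
    LOOKUP.fetch(word, word)
  end
end

# Reopen String so that any string can be stemmed directly.
class String
  def demo_stem
    DemoStemmer.stem(self)
  end
end

DemoStemmer.stem('mendengarkan') # => "dengar"
'beriman'.demo_stem              # => "iman"
```

The method is named `demo_stem` here so the sketch does not clash with the `stem` method the gem itself provides.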

Known Problems

This gem is in active development, so don't rely on it for your analysis or data mining projects. There are currently no known problems stemming Indonesian words. Please submit a ticket if you find one.

Contributing

This gem was originally based on Apache Lucene, and it is currently a Ruby port of Lucene's analyzer for Indonesian. Lucene's stemmer only works out the length of the stemmed word, so some modifications were added to return the actual stemmed word. Feel free to download Lucene's source code and look under analysis/common/src/java/org/apache/lucene/analysis/id/.
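
To make the difference between a length and a stemmed word concrete, here is a minimal sketch of the idea. It is a hypothetical illustration, not Lucene's code and not this gem's code: `SUFFIXES`, `stem_length` and `stem_word` are invented names, only one suffix is removed, and the three-character minimum is an assumption chosen for the example.

```ruby
# Hypothetical illustration only: neither Lucene's nor this gem's actual code.
SUFFIXES = %w[kan an i].freeze

# Returns the length the word would have after removing one known suffix,
# mimicking a stemmer that reports a length rather than a string.
def stem_length(chars, length)
  word = chars[0, length].join
  SUFFIXES.each do |suffix|
    remaining = length - suffix.length
    # Assumed rule: keep at least three characters of the word.
    return remaining if word.end_with?(suffix) && remaining >= 3
  end
  length
end

# Wraps the length-returning helper so the caller gets the stemmed string.
def stem_word(word)
  chars = word.chars
  chars[0, stem_length(chars, chars.length)].join
end

stem_word('dengarkan') # => "dengar"
```

A port that wants to return strings needs a wrapper along the lines of `stem_word`, since the length alone is only useful to a caller that still holds the original character buffer.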

References

Some references to help your contribution:

  1. The Official Kamus Bahasa Indonesia
  2. To search Indonesian words and their roots, use the Unofficial Kamus Besar Bahasa Indonesia
  3. Wikipedia's Prefiks dalam Bahasa Indonesia

Steps

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request