Ruby port of TinySegmenter.js for tokenizing Japanese text. Ruby 1.9 or higher required.

Install

Run gem install tiny_segmenter, or add tiny_segmenter to your Gemfile.

Usage

require 'tiny_segmenter'

# Create a segmenter and split a sentence into tokens.
ts = TinySegmenter.new
p ts.segment("今晩は!良い天気ですね")
# => ["今晩", "は", "!", "良い", "天気", "です", "ね"]

Input text should be UTF-8 encoded.
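
If your text comes from a legacy source (for example a Shift_JIS file), it can be converted before segmenting. The snippet below is a minimal sketch using only Ruby's built-in String encoding methods; to_utf8 is a hypothetical helper name, not part of tiny_segmenter, and it assumes you already know the source encoding.

```ruby
# Hypothetical helper (not part of tiny_segmenter).
# Returns a UTF-8 copy of +text+. If +source_encoding+ is given, the bytes
# are first relabelled as that encoding and then transcoded.
def to_utf8(text, source_encoding = nil)
  str = text.dup
  str.force_encoding(source_encoding) if source_encoding
  str.encode(Encoding::UTF_8)
end

# Simulate raw Shift_JIS bytes, e.g. as read from a file opened in binary mode.
raw = "今晩は".encode("Shift_JIS").force_encoding(Encoding::BINARY)

utf8 = to_utf8(raw, "Shift_JIS")
puts utf8.encoding        # the string is now tagged as UTF-8
puts utf8.valid_encoding? # and its bytes are valid for that encoding
```

The resulting string can then be passed to segment as in the usage example above.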

How it works

The Naive Bayes model was trained on the RWCP corpus and optimized with L1-norm regularization. The resulting model is quite compact, yet (according to the author) achieves roughly 95% accuracy.
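
To give a rough feel for the idea, here is a toy sketch of boundary scoring. It is not the trained model: the weights, the single character-type feature, and the names toy_segment, char_type, BIAS and TYPE_BIGRAM_WEIGHTS are all invented for illustration. It only assumes the general shape of such a classifier, where each candidate boundary between two characters gets a score that is a sum of feature weights, and the text is split wherever the score is positive.

```ruby
# Toy illustration only: these weights are invented, not TinySegmenter's model.
BIAS = -1

# Hypothetical weights keyed by the character types on either side of a boundary.
TYPE_BIGRAM_WEIGHTS = {
  [:kanji, :hiragana] => 3,
  [:hiragana, :kanji] => 3,
  [:hiragana, :other] => 3,
  [:other, :kanji]    => 3
}.freeze

# Classify a single character by script.
def char_type(ch)
  case ch
  when /\p{Han}/      then :kanji
  when /\p{Hiragana}/ then :hiragana
  when /\p{Katakana}/ then :katakana
  else :other
  end
end

# Split +text+ wherever the summed score for a boundary is positive.
def toy_segment(text)
  chars = text.chars
  return [] if chars.empty?

  words = [chars.first.dup]
  chars.each_cons(2) do |left, right|
    score = BIAS + TYPE_BIGRAM_WEIGHTS.fetch([char_type(left), char_type(right)], 0)
    if score > 0
      words << right.dup   # start a new word
    else
      words.last << right  # extend the current word
    end
  end
  words
end

p toy_segment("今晩は!良い天気ですね")
# => ["今晩", "は", "!", "良", "い", "天気", "ですね"]
```

With a single hand-written feature the toy splits "良い" and fails to split "ですね". A trained model that combines many features with learned weights is what lets the real segmenter handle such cases.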

License

BSD - see http://chasen.org/~taku/software/TinySegmenter/LICENCE.txt