Introduction

Rseg is a Chinese Word Segmentation(中文分词) routine in pure Ruby.

The algorithm is based on this article: xiecc.blog.163.com/blog/static/14032200671110224190/

Usage

Rseg now support two modes: inline and C/S mode.

  1. Inline mode

> require ‘rubygems’ > require ‘rseg’ > Rseg.segment(“需要分词的文章”)

‘需要’, ‘分词’, ‘的’, ‘文章’

The first call to Rseg#segment will need about 30 seconds to load the dictionary, the second call will be very fast, you can also call Rseg#load to load dictionaries manually.

  1. C/S mode

$ rseg_server

Sinatra/0.9.4 has taken the stage on 4100

This will start rseg server on localhost:4100

You can visit it via your browser or the rseg command.

$ rseg ‘需要分词的文章’ 需要 分词 的 文章

You can also access server with the Rseg#remote_segment

$ irb > require ‘rubygems’ > require ‘rseg’ > Rseg.remote_segment(“需要分词的文章”) # This will be very fast

‘需要’, ‘分词’, ‘的’, ‘文章’

Performance

About 5M character/s on my Macbook (Intel Core 2 Duo 2GHz/4G mem).

License

Rseg includes two built-in dictionaries:

The codes and others in Rseg are licensed under MIT license.

Feedback

All feedback are welcome, Yuanyi Zhang(zhangyuanyi#gmail.com)