chawan

A cup for chasen that provides an easy to use for extracting Japanese

Methods

* Chawan.parse(text)
  parse the given text by analyzer, where default analyzer is :mecab

* Chawan.analyzer(xxx)      (same as Chawan[xxx], Chawan.xxx)
  specify analyzer

Class

* Chawan::Nodes (Chawan.parse returns a Chawan::Nodes)
    #noun    : scope category with noun
    #verb    : scope category with verb
    #grep    : scope category with given pattern
    #compact : mix the category-consecutive nodes

* Chawan::Node (Chawan::Nodes has many Chawan::Node(s))
    #category   : part of speech
    #word       : text
    #attributes : keys and vals hash

Example

text = '登録された利用者'

# 'parse' returns a Chawan::Nodes
Chawan.parse(text)
=> [<名詞: '登録'>, <動詞: 'さ'>, <動詞: 'れ'>, <助動詞: 'た'>, <名詞: '利用'>, <名詞: '者'>]

# Chawan::Nodes is enumerable
Chawan.parse(text).select{|node| node.category == '名詞'}
=> [<名詞: '登録'>, <名詞: '利用'>, <名詞: '者'>]

# gateway interface: noun
Chawan.parse(text).noun
=> [<名詞: '登録'>, <名詞: '利用'>, <名詞: '者'>]

# gateway interface: verb
Chawan.parse(text).verb
=> [<動詞: 'さ'>, <動詞: 'れ'>, <助動詞: 'た'>]

# gateway interface: grep
Chawan.parse(text).grep(/動詞/)
=> [<動詞: 'さ'>, <動詞: 'れ'>, <助動詞: 'た'>]
Chawan.parse(text).grep('動詞')
=> [<動詞: 'さ'>, <動詞: 'れ'>]

# gateway interface: compact
Chawan.parse(text).compact
=> [<名詞: '登録'>, <動詞: 'され'>, <助動詞: 'た'>, <名詞: '利用者'>]
Chawan.parse(text).compact(/動詞/)
=> [<名詞: '登録'>, <動詞: 'された'>, <名詞: '利用'>, <名詞: '者'>]

# gateway interface is chainable
Chawan.parse(text).noun.verb
=> []

# chainable is fun!
Chawan.parse(text).noun
=> [<名詞: '登録'>, <名詞: '利用'>, <名詞: '者'>]
Chawan.parse(text).compact.noun
=> [<名詞: '登録'>, <名詞: '利用者'>]
Chawan.parse(text).noun.compact
=> [<名詞: '登録利用者'>]

Analyzer

Parser engine is defined as 'analyzer'.
Available analyzers are:

  * mecab : (default)
  * chasen

Chawan[:mecab].parse('test')
=> [<名詞: 'test'>]

# same as
#   Chawan.mecab.parse('test')
#   Chawan.analyzer(:mecab).parse('test')
#   Chawan.parse('test')  # default analyzer is :mecab

Chawan[:chasen].parse('test')
=> [<記号: 't'>, <記号: 'e'>, <記号: 's'>, <記号: 't'>]

Required

* UTF-8
* 'mecab' unix command (and its path)

Todo

* use open3 rather than backquote for executing unix commands

Author

maiha@wota.jp