Ansj中文分词 for JRuby
引用原项目摘要
这是一个基于 n-Gram + CRF + HMM 的中文分词的 Java 实现。
分词速度达到每秒钟大约 200万字左右(原作者 MacbookAir 下测试),准确率能达到 96% 以上
目前实现了 中文分词、中文姓名识别、用户自定义词典、关键字提取、自动摘要、关键字标记等功能。
可以应用到自然语言处理等方面,适用于对分词效果要求高的各种项目。
Read Ansj中文分词 docs for more details: http://nlpchina.github.io/ansj_seg
Installing
Add to your Gemfile
:
gem 'ansj_seg'
Then bundle install
.
Or install it yourself as:
$ gem install ansj_seg
Usage
require 'jrjackson' # 可选项, JRuby 下最快的 json 解析器, 需要提前:gem install jrjackson -V
require 'ansj_seg'
# 设置用户(默认)词典
AnsjSeg::Config::DIC['dic'] = '/Users/howl/Desktop/library/default.dic'
# 设置用户自定义词典,key 的规则是以 dic_ 为前缀
AnsjSeg::Config::DIC['dic_souhu'] = '/Users/howl/Desktop/library/souhu.dic'
# 设置CRF模型,用户自定义CRF模型时,key 的规则是以 crf_ 为前缀
AnsjSeg::Config::CRF['crf'] = '/Users/howl/Desktop/library/crf.model'
# 设置歧义词典
AnsjSeg::Config.ambiguityLibrary = '/Users/howl/Desktop/library/ambiguity.dic'
# 设置过滤器(过滤:词性为空或标点,是和的)
AnsjSeg.fitler.ignore(natures: ['null', 'w'], words: ['是', '的'])
text = "Ruby China,对!没错!这里就是 Ruby 社区,目前这里已经是国内最权威的 Ruby 社区,拥有国内所有资深的 Ruby 工程师。"
# 分词
# 第二个参数可选::to, :nlp, :index 三种分词模式
text.to_a(:terms) # text.to_a(:terms, :nlp)
[ {:name=>"ruby", :natureStr=>"en", :newWord=>false, :offe=>0, :realName=>"ruby"}, {:name=>"china", :natureStr=>"en", :newWord=>false, :offe=>5, :realName=>"china"}, {:name=>"对", :natureStr=>"p", :newWord=>false, :offe=>11, :realName=>"对"}, {:name=>"没错", :natureStr=>"v", :newWord=>false, :offe=>13, :realName=>"没错"}, {:name=>"这里", :natureStr=>"r", :newWord=>false, :offe=>16, :realName=>"这里"}, {:name=>"就", :natureStr=>"d", :newWord=>false, :offe=>18, :realName=>"就"}, {:name=>"ruby", :natureStr=>"en", :newWord=>false, :offe=>21, :realName=>"ruby"}, {:name=>"社区", :natureStr=>"n", :newWord=>false, :offe=>26, :realName=>"社区"}, {:name=>"目前", :natureStr=>"t", :newWord=>false, :offe=>29, :realName=>"目前"}, {:name=>"这", :natureStr=>"r", :newWord=>false, :offe=>31, :realName=>"这"}, {:name=>"里", :natureStr=>"f", :newWord=>false, :offe=>32, :realName=>"里"}, {:name=>"已经", :natureStr=>"d", :newWord=>false, :offe=>33, :realName=>"已经"}, {:name=>"国内", :natureStr=>"s", :newWord=>false, :offe=>36, :realName=>"国内"}, {:name=>"最", :natureStr=>"d", :newWord=>false, :offe=>38, :realName=>"最"}, {:name=>"权威", :natureStr=>"n", :newWord=>false, :offe=>39, :realName=>"权威"}, {:name=>"ruby", :natureStr=>"en", :newWord=>false, :offe=>43, :realName=>"ruby"}, {:name=>"社区", :natureStr=>"n", :newWord=>false, :offe=>48, :realName=>"社区"}, {:name=>"拥有", :natureStr=>"v", :newWord=>false, :offe=>51, :realName=>"拥有"}, {:name=>"国内", :natureStr=>"s", :newWord=>false, :offe=>53, :realName=>"国内"}, {:name=>"所有", :natureStr=>"b", :newWord=>false, :offe=>55, :realName=>"所有"}, {:name=>"资深", :natureStr=>"b", :newWord=>false, :offe=>57, :realName=>"资深"}, {:name=>"ruby", :natureStr=>"en", :newWord=>false, :offe=>61, :realName=>"ruby"}, {:name=>"工程师", :natureStr=>"n", :newWord=>false, :offe=>66, :realName=>"工程师"} ]
# 提取关键词
# 第二个参数定义分词个数,默认:20
text.to_a(:words) # text.to_a(:words, 5)
[ {:freq=>2, :name=>"这里", :score=>16.315514814428745}, {:freq=>2, :name=>"社区", :score=>14.99970404519092}, {:freq=>2, :name=>"国内", :score=>13.684318222044968}, {:freq=>1, :name=>"目前", :score=>5.3946562994797125}, {:freq=>1, :name=>"已经", :score=>4.868333845606951}, {:freq=>1, :name=>"权威", :score=>4.078866889481741}, {:freq=>1, :name=>"所有", :score=>1.973668895869867}, {:freq=>1, :name=>"资深", :score=>1.7105130430872182}, {:freq=>1, :name=>"没错", :score=>1.4999705998354727}, {:freq=>1, :name=>"就是", :score=>1.3683942314288522}, {:freq=>1, :name=>"工程师", :score=>0.5263054050944183}, {:freq=>1, :name=>"拥有", :score=>0.4999901999451576} ]
PS. Built and tested on JRuby 9.1.6