Class: Natto::MeCab

Inherits:
Object
  • Object
Includes:
Binding, OptionParse
Defined in:
lib/natto/natto.rb

Overview

MeCab is a class providing an interface to the MeCab library. Options to the MeCab Model, Tagger and Lattice are passed in as a string (MeCab command-line style) or as a Ruby-style hash at initialization.

Usage

require 'natto'

text = '凡人にしか見えねえ風景ってのがあるんだよ。'

nm = Natto::MeCab.new
=> #<Natto::MeCab:0x0000080318d278                                  \
     @model=#<FFI::Pointer address=0x000008039174c0>,               \
     @tagger=#<FFI::Pointer address=0x0000080329ba60>,              \
     @lattice=#<FFI::Pointer address=0x000008045bd140>,             \
     @libpath="/usr/local/lib/libmecab.so",                         \
     @options={},                                                   \
     @dicts=[#<Natto::DictionaryInfo:0x0000080318ce90               \
               @filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic", \
               charset=utf8,                                        \
               type=0>],                                            \
     @version=0.996>

# print entire MeCab result to stdout
#
puts nm.parse(text)
凡人    名詞,一般,*,*,*,*,凡人,ボンジン,ボンジン
に      助詞,格助詞,一般,*,*,*,に,ニ,ニ
しか    助詞,係助詞,*,*,*,*,しか,シカ,シカ
見え    動詞,自立,*,*,一段,未然形,見える,ミエ,ミエ
ねえ    助動詞,*,*,*,特殊・ナイ,音便基本形,ない,ネエ,ネー
風景    名詞,一般,*,*,*,*,風景,フウケイ,フーケイ
って    助詞,格助詞,連語,*,*,*,って,ッテ,ッテ
の      名詞,非自立,一般,*,*,*,の,ノ,ノ
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
ある    動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
ん      名詞,非自立,一般,*,*,*,ん,ン,ン
だ      助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
よ      助詞,終助詞,*,*,*,*,よ,ヨ,ヨ
。      記号,句点,*,*,*,*,。,。,。
EOS


# pass a block to iterate over each MeCabNode instance
#
nm.parse(text) do |n| 
  puts "#{n.surface},#{n.feature}" if !n.is_eos?
end 
凡人,名詞,一般,*,*,*,*,凡人,ボンジン,ボンジン 
に,助詞,格助詞,一般,*,*,*,に,ニ,ニ 
しか,助詞,係助詞,*,*,*,*,しか,シカ,シカ 
見え,動詞,自立,*,*,一段,未然形,見える,ミエ,ミエ 
ねえ,助動詞,*,*,*,特殊・ナイ,音便基本形,ない,ネエ,ネー 
風景,名詞,一般,*,*,*,*,風景,フウケイ,フーケイ 
って,助詞,格助詞,連語,*,*,*,って,ッテ,ッテ 
の,名詞,非自立,一般,*,*,*,の,ノ,ノ 
が,助詞,格助詞,一般,*,*,*,が,ガ,ガ 
ある,動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル 
ん,名詞,非自立,一般,*,*,*,ん,ン,ン 
だ,助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ 
よ,助詞,終助詞,*,*,*,*,よ,ヨ,ヨ 
。,記号,句点,*,*,*,*,。,。,。 


# customize MeCabNode feature attribute with node-formatting
# %m   ... morpheme surface
# %F,  ... comma-delimited ChaSen feature values
#          reading (index 7) 
#          part-of-speech (index 0) 
# %h   ... part-of-speech ID (IPADIC)
#
nm = Natto::MeCab.new('-F%m,%F,[7,0],%h')

# Enumerator effectively iterates the MeCabNodes
#
enum = nm.enum_parse(text)
=> #<Enumerator: #<Enumerator::Generator:0x29cc5f8>:each>

# output the feature attribute of each MeCabNode
# only output normal nodes, ignoring any end-of-sentence 
# or unknown nodes 
#
enum.each_with_index {|n,i| puts "#{i}: #{n.feature}" if n.is_nor?}
0: 凡人,ボンジン,名詞,38
1: に,ニ,助詞,13
2: しか,シカ,助詞,16
3: 見え,ミエ,動詞,31
4: ねえ,ネー,助動詞,25
5: 風景,フーケイ,名詞,38
6: って,ッテ,助詞,15
7: の,ノ,名詞,63
8: が,ガ,助詞,13
9: ある,アル,動詞,31
10: ん,ン,名詞,63
11: だ,ダ,助動詞,25
12: よ,ヨ,助詞,17
13: 。,。,記号,7


# Boundary constraint parsing with output formatting.
# %m   ... morpheme surface
# %f   ... tab-delimited ChaSen feature values
#          part-of-speech (index 0) 
# %s   ... MeCab node status value (1 unknown)
#
nm = Natto::MeCab.new('-F%m,\s%f[0],\s%s')

enum = nm.enum_parse(text, boundary_constraints: /見えねえ風景/)
=> #<Enumerator: #<Enumerator::Generator:0x00000801d7aa38>:each>

# output the feature attribute of each MeCabNode
# ignoring any beginning- or end-of-sentence nodes
#
enum.each do |n|
  puts n.feature if !(n.is_bos? or n.is_eos?)
end
凡人, 名詞, 0
に, 助詞, 0
しか, 助詞, 0
見えねえ風景, 名詞, 1
って, 助詞, 0
の, 名詞, 0
が, 助詞, 0
ある, 動詞, 0
ん, 名詞, 0
だ, 助動詞, 0
よ, 助詞, 0
。, 記号, 0
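
The feature attribute printed in the usage examples above is a plain comma-delimited string. A minimal sketch of splitting it into named fields follows. The FEATURE_FIELDS constant, its field names, and the feature_to_h helper are hypothetical and are not part of natto; the sketch assumes the nine-column IPADIC layout seen in the output above.

```ruby
# Hypothetical field names for the nine IPADIC feature columns.
FEATURE_FIELDS = %i[pos pos1 pos2 pos3 conj_type conj_form
                    base reading pronunciation].freeze

# Splits a comma-delimited feature string into a Hash keyed by field name.
# MeCab prints '*' for an empty column; those are mapped to nil here.
# Columns that are missing altogether also come back as nil.
def feature_to_h(feature)
  values = feature.split(',').map { |v| v == '*' ? nil : v }
  FEATURE_FIELDS.zip(values).to_h
end

h = feature_to_h('名詞,一般,*,*,*,*,凡人,ボンジン,ボンジン')
puts h[:pos]      # 名詞
puts h[:reading]  # ボンジン
```

Other dictionaries may use a different column layout, in which case the field list would need to change.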

Constant Summary

MECAB_LATTICE_ONE_BEST = 1
MECAB_LATTICE_NBEST = 2
MECAB_LATTICE_PARTIAL = 4
MECAB_LATTICE_MARGINAL_PROB = 8
MECAB_LATTICE_ALTERNATIVE = 16
MECAB_LATTICE_ALL_MORPHS = 32
MECAB_LATTICE_ALLOCATE_SENTENCE = 64
MECAB_ANY_BOUNDARY = 0
MECAB_TOKEN_BOUNDARY = 1
MECAB_INSIDE_TOKEN = 2
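
The MECAB_LATTICE_* values are distinct powers of two, so several request types can be combined in one integer with bitwise OR. The sketch below mirrors, in plain Ruby and without calling libmecab, the order in which the constructor sets and then adds request types. The request_type_for helper is hypothetical, and the sketch assumes that adding a request type is equivalent to OR-ing the flag into the current value.

```ruby
# Constant values copied from the constant summary.
MECAB_LATTICE_ONE_BEST          = 1
MECAB_LATTICE_NBEST             = 2
MECAB_LATTICE_PARTIAL           = 4
MECAB_LATTICE_MARGINAL_PROB     = 8
MECAB_LATTICE_ALL_MORPHS        = 32
MECAB_LATTICE_ALLOCATE_SENTENCE = 64

# Hypothetical helper: computes the combined request type that the
# constructor's set/add calls would accumulate for an options hash.
def request_type_for(options)
  nbest = options[:nbest]
  type  = nbest && nbest > 1 ? MECAB_LATTICE_NBEST : MECAB_LATTICE_ONE_BEST
  type |= MECAB_LATTICE_PARTIAL           if options[:partial]
  type |= MECAB_LATTICE_MARGINAL_PROB     if options[:marginal]
  type |= MECAB_LATTICE_ALL_MORPHS        if options[:all_morphs]
  type |= MECAB_LATTICE_ALLOCATE_SENTENCE if options[:allocate_sentence]
  type
end

flags = request_type_for(nbest: 3, partial: true)
puts flags                                   # 6
puts((flags & MECAB_LATTICE_PARTIAL) != 0)   # true
```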

Constants included from OptionParse

OptionParse::SUPPORTED_OPTS, OptionParse::WARNING_LATTICE_LEVEL

Constants included from Binding

Binding::MECAB_PATH

Methods included from Binding

find_library

Constructor Details

#initialize(options = {}) ⇒ MeCab

Initializes the wrapped Tagger instance with the given options.

Options supported are:

  • :rcfile -- resource file
  • :dicdir -- system dicdir
  • :userdic -- user dictionary
  • :lattice_level -- lattice information level (DEPRECATED)
  • :output_format_type -- output format type (wakati, chasen, yomi, etc.)
  • :all_morphs -- output all morphs (default false)
  • :nbest -- output N best results (integer, default 1), requires lattice level >= 1
  • :partial -- partial parsing mode
  • :marginal -- output marginal probability
  • :max_grouping_size -- maximum grouping size for unknown words (default 24)
  • :node_format -- user-defined node format
  • :unk_format -- user-defined unknown node format
  • :bos_format -- user-defined beginning-of-sentence format
  • :eos_format -- user-defined end-of-sentence format
  • :eon_format -- user-defined end-of-NBest format
  • :unk_feature -- feature for unknown word
  • :input_buffer_size -- set input buffer size (default 8192)
  • :allocate_sentence -- allocate new memory for input sentence
  • :theta -- temperature parameter theta (float, default 0.75)
  • :cost_factor -- cost factor (integer, default 700)

MeCab command-line arguments, either short (-F) or long (--node-format), may be used in addition to Ruby-style hashes.
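
The hash keys listed above correspond to MeCab's long command-line options with underscores replaced by hyphens (for example, :node_format and --node-format). The sketch below illustrates that correspondence only. The to_mecab_args helper is hypothetical; natto's own conversion is done by OptionParse and is not shown on this page, so its exact output may differ.

```ruby
# Hypothetical helper, not natto's own implementation: renders a
# Ruby-style options hash as MeCab long-option command-line text.
def to_mecab_args(options)
  options
    .reject { |_, value| value.nil? || value == false } # skip disabled flags
    .map do |key, value|
      flag = "--#{key.to_s.tr('_', '-')}"
      value == true ? flag : "#{flag}=#{value}"
    end
    .join(' ')
end

puts to_mecab_args(output_format_type: 'wakati', nbest: 2)
# --output-format-type=wakati --nbest=2
puts to_mecab_args(all_morphs: true, partial: false)
# --all-morphs
```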

Use single-quotes to preserve format options that contain escape chars.
e.g.

nm = Natto::MeCab.new(node_format: '%m\t%f[7]\n')
=> #<Natto::MeCab:0x00000803503ee8                                 \
     @model=#<FFI::Pointer address=0x00000802b6d9c0>,              \
     @tagger=#<FFI::Pointer address=0x00000802ad3ec0>,             \
     @lattice=#<FFI::Pointer address=0x000008035f3980>,            \
     @libpath="/usr/local/lib/libmecab.so",                        \
     @options={:node_format=>"%m\\t%f[7]\\n"},                     \
     @dicts=[#<Natto::DictionaryInfo:0x000008035038f8              \
               @filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic", \
               charset=utf8,                                       \
               type=0>],                                           \
     @version=0.996>

puts nm.parse('才能とは求める人間に与えられるものではない。')
才能    サイノウ
と      ト
は      ハ
求める  モトメル
人間    ニンゲン
に      ニ
与え    アタエ
られる  ラレル
もの    モノ
で      デ
は      ハ
ない    ナイ
。      。
EOS

Parameters:

  • options (Hash, String) (defaults to: {})

    the MeCab options

Raises:

  • (MeCabError)

    if MeCab cannot be initialized with the given options



# File 'lib/natto/natto.rb', line 228

def initialize(options={})
  @options = self.class.parse_mecab_options(options) 
  opt_str  = self.class.build_options_str(@options)

  @model   = self.class.mecab_model_new2(opt_str)
  if @model.address == 0x0
    raise MeCabError.new("Could not initialize Model with options: '#{opt_str}'")
  end

  @tagger  = self.class.mecab_model_new_tagger(@model)
  if @tagger.address == 0x0
    raise MeCabError.new("Could not initialize Tagger with options: '#{opt_str}'")
  end

  @lattice = self.class.mecab_model_new_lattice(@model)
  if @lattice.address == 0x0
    raise MeCabError.new("Could not initialize Lattice with options: '#{opt_str}'")
  end

  @libpath = self.class.find_library

  if @options[:nbest] && @options[:nbest] > 1
    self.mecab_lattice_set_request_type(@lattice, MECAB_LATTICE_NBEST)
  else
    self.mecab_lattice_set_request_type(@lattice, MECAB_LATTICE_ONE_BEST)
  end
  if @options[:partial]
    self.mecab_lattice_add_request_type(@lattice, MECAB_LATTICE_PARTIAL)
  end
  if @options[:marginal]
    self.mecab_lattice_add_request_type(@lattice,
                                        MECAB_LATTICE_MARGINAL_PROB)
  end
  if @options[:all_morphs]
    # required when node parsing
    #self.mecab_lattice_add_request_type(@lattice, MECAB_LATTICE_NBEST)
    self.mecab_lattice_add_request_type(@lattice,
                                        MECAB_LATTICE_ALL_MORPHS)
  end
  if @options[:allocate_sentence]
    self.mecab_lattice_add_request_type(@lattice, 
                                        MECAB_LATTICE_ALLOCATE_SENTENCE)
  end

  if @options[:theta]
    self.mecab_lattice_set_theta(@lattice, @options[:theta]) 
  end

  @parse_tostr = ->(text, constraints) {
    begin
      if @options[:nbest] && @options[:nbest] > 1
        n = @options[:nbest]
      else
        n = 1
      end

      if constraints[:boundary_constraints]
        tokens = tokenize_by_pattern(text,
                                     constraints[:boundary_constraints])
        text = tokens.map {|t| t.first}.join
        self.mecab_lattice_set_sentence(@lattice, text)

        bpos = 0
        tokens.each do |token|
          c = token.first.bytes.count

          self.mecab_lattice_set_boundary_constraint(@lattice,
                                                     bpos,
                                                     MECAB_TOKEN_BOUNDARY)
          bpos += 1

          mark = token.last ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
          (c-1).times do
            self.mecab_lattice_set_boundary_constraint(@lattice,
                                                       bpos,
                                                       mark)
            bpos += 1
          end
        end
      elsif constraints[:feature_constraints]
        features = constraints[:feature_constraints]
        tokens = tokenize_by_features(text,
                                      features.keys)
        text = tokens.map {|t| t.first}.join
        self.mecab_lattice_set_sentence(@lattice, text)

        bpos = 0
        tokens.each do |token|
          chunk = token.first
          c = chunk.bytes.count
          if token.last
            self.mecab_lattice_set_feature_constraint(@lattice,
                                                      bpos,
                                                      bpos+c,
                                                      features[chunk])
          end
          bpos += c
        end
      else
        self.mecab_lattice_set_sentence(@lattice, text)
      end

      self.mecab_parse_lattice(@tagger, @lattice)
      
      if n > 1
        retval = self.mecab_lattice_nbest_tostr(@lattice, n)
      else
        retval = self.mecab_lattice_tostr(@lattice)
      end
      retval.force_encoding(Encoding.default_external)
    rescue => ex
      message = self.mecab_lattice_strerror(@lattice)
      raise ex if message == ''
      raise MeCabError.new(message)
    end
  }
    
  @parse_tonodes = ->(text, constraints) {
    Enumerator.new do |y|
      begin
        if @options[:nbest] && @options[:nbest] > 1
          n = @options[:nbest]
        else
          n = 1
        end

        if constraints[:boundary_constraints]
          tokens = tokenize_by_pattern(text,
                                       constraints[:boundary_constraints])
          text = tokens.map {|t| t.first}.join
          self.mecab_lattice_set_sentence(@lattice, text)

          bpos = 0
          tokens.each do |token|
            c = token.first.bytes.count

            self.mecab_lattice_set_boundary_constraint(@lattice,
                                                       bpos,
                                                       MECAB_TOKEN_BOUNDARY)
            bpos += 1

            mark = token.last ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
            (c-1).times do
              self.mecab_lattice_set_boundary_constraint(@lattice, bpos, mark)
              bpos += 1
            end
          end
        elsif constraints[:feature_constraints]
          features = constraints[:feature_constraints]
          tokens = tokenize_by_features(text,
                                        features.keys)
          text = tokens.map {|t| t.first}.join
          self.mecab_lattice_set_sentence(@lattice, text)

          bpos = 0
          tokens.each do |token|
            chunk = token.first
            c = chunk.bytes.count
            if token.last
              self.mecab_lattice_set_feature_constraint(@lattice,
                                                        bpos,
                                                        bpos+c,
                                                        features[chunk])
            end
            bpos += c
          end
        else
          self.mecab_lattice_set_sentence(@lattice, text)
        end

        self.mecab_parse_lattice(@tagger, @lattice)

        n.times do
          check = self.mecab_lattice_next(@lattice)
          if check
            nptr = self.mecab_lattice_get_bos_node(@lattice)
      
            while nptr && nptr.address!=0x0
              mn = Natto::MeCabNode.new(nptr)
              if !mn.is_bos?
                surf = mn[:surface].bytes.to_a.slice(0,mn.length).pack('C*')
                mn.surface = surf.force_encoding(Encoding.default_external)
                if @options[:output_format_type] || @options[:node_format]
                  mn.feature = self.mecab_format_node(@tagger, nptr).force_encoding(Encoding.default_external)
                end
                y.yield mn
              end
              nptr = mn[:next]
            end
          end
        end
        nil
      rescue => ex
        message = self.mecab_lattice_strerror(@lattice)
        raise ex if message == ''
        raise MeCabError.new(message)
      end
    end
  }

  @dicts = []
  @dicts << Natto::DictionaryInfo.new(self.mecab_model_dictionary_info(@model))
  while @dicts.last.next.address != 0x0
    @dicts << Natto::DictionaryInfo.new(@dicts.last.next)
  end

  @version = self.mecab_version

  ObjectSpace.define_finalizer(self, self.class.create_free_proc(@model,
                                                                 @tagger,
                                                                 @lattice))
end

Instance Attribute Details

#dicts ⇒ Array (readonly)

Returns a listing of all dictionaries referenced.

Returns:

  • (Array)

    a listing of all dictionaries referenced.



# File 'lib/natto/natto.rb', line 164

def dicts
  @dicts
end

#lattice ⇒ FFI::Pointer (readonly)

Returns pointer to MeCab Lattice.

Returns:

  • (FFI::Pointer)

    pointer to MeCab Lattice.



# File 'lib/natto/natto.rb', line 158

def lattice
  @lattice
end

#libpath ⇒ String (readonly)

Returns absolute filepath to MeCab library.

Returns:

  • (String)

    absolute filepath to MeCab library.



# File 'lib/natto/natto.rb', line 160

def libpath
  @libpath
end

#model ⇒ FFI::Pointer (readonly)

Returns pointer to MeCab Model.

Returns:

  • (FFI::Pointer)

    pointer to MeCab Model.



# File 'lib/natto/natto.rb', line 154

def model
  @model
end

#options ⇒ Hash (readonly)

Returns MeCab options as key-value pairs.

Returns:

  • (Hash)

    MeCab options as key-value pairs.



# File 'lib/natto/natto.rb', line 162

def options
  @options
end

#tagger ⇒ FFI::Pointer (readonly)

Returns pointer to MeCab Tagger.

Returns:

  • (FFI::Pointer)

    pointer to MeCab Tagger.



# File 'lib/natto/natto.rb', line 156

def tagger
  @tagger
end

#version ⇒ String (readonly)

Returns MeCab version.

Returns:

  • (String)

    MeCab version.



# File 'lib/natto/natto.rb', line 166

def version
  @version
end

Class Method Details

.create_free_proc(mptr, tptr, lptr) ⇒ Proc

Returns a Proc that will properly free resources when this instance is garbage collected.

Parameters:

  • mptr (FFI::Pointer)

    pointer to Model

  • tptr (FFI::Pointer)

    pointer to Tagger

  • lptr (FFI::Pointer)

    pointer to Lattice

Returns:

  • (Proc)

    to release MeCab resources properly



# File 'lib/natto/natto.rb', line 573

def self.create_free_proc(mptr, tptr, lptr)
  Proc.new do
    self.mecab_lattice_destroy(lptr)
    self.mecab_destroy(tptr)
    self.mecab_model_destroy(mptr)
  end
end
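
A likely reason for building the Proc in a class method is that the closure then captures only the pointers passed in, never the instance itself; a finalizer that references its own object would keep that object from being collected. The sketch below shows the pattern with a hypothetical Resource class that records release order in an array instead of freeing native memory. It is an illustration of the general technique, not natto's code.

```ruby
# Hypothetical stand-in for a class that wraps native handles.
class Resource
  RELEASED = [] # records release order, in place of real native frees

  # Built in a class method so the Proc closes over the handles only,
  # never over the instance being finalized.
  def self.create_free_proc(*handles)
    proc { handles.reverse_each { |h| RELEASED << h } }
  end

  def initialize
    @model, @tagger, @lattice = :model, :tagger, :lattice
    ObjectSpace.define_finalizer(
      self, self.class.create_free_proc(@model, @tagger, @lattice)
    )
  end
end

# Calling the Proc directly shows the release order: last created, first freed.
free = Resource.create_free_proc(:model, :tagger, :lattice)
free.call
p Resource::RELEASED # [:lattice, :tagger, :model]
```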

Instance Method Details

#enum_parse(text, constraints = {}) ⇒ Enumerator

Parses the given string text, returning an Enumerator that may be used to iterate over the resulting Natto::MeCabNode objects. This is more efficient than parsing to a simple string, since each node's information will not be materialized all at once as it is with string output.
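
The on-demand behavior described above can be sketched with a plain Ruby Enumerator that walks a linked list. The Node struct and the each_node helper are hypothetical stand-ins for MeCab's node list and are not part of natto; the point is only that nodes are handed to the caller one at a time.

```ruby
# Hypothetical stand-in for a MeCab node: a surface string plus a link
# to the next node (nil at the end of the list).
Node = Struct.new(:surface, :next_node)

# Wraps the linked-list walk in an Enumerator so that nodes are produced
# as the caller asks for them instead of being collected up front.
def each_node(head)
  Enumerator.new do |y|
    node = head
    while node
      y << node
      node = node.next_node
    end
  end
end

head = Node.new('凡人', Node.new('に', Node.new('よ', nil)))
enum = each_node(head)

p enum.first(2).map(&:surface) # ["凡人", "に"]
puts enum.map(&:surface).join  # 凡人によ
```

Because the walk happens inside the Enumerator, first(2) stops after two nodes, and each new iteration starts again from the head.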

MeCab nodes contain much more detailed information about the morpheme. Node-formatting may also be used to customize the resulting node's feature attribute.

Boundary constraint parsing is available by passing the boundary_constraints key in the constraints hash. It gives MeCab hints about where the morpheme boundaries in the given text are located. The boundary_constraints value may be either a Regexp or a String; see String#scan.
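
Boundary constraints are set per byte position, as the constructor source shows: the first byte of every chunk is marked as a token boundary, and the remaining bytes are marked as inside a token when the chunk matched the pattern. The sketch below reproduces that marking in plain Ruby. The split_by_pattern and boundary_marks helpers are hypothetical (natto's own tokenizer is not shown on this page), and the byte counts assume UTF-8 text.

```ruby
MECAB_ANY_BOUNDARY   = 0
MECAB_TOKEN_BOUNDARY = 1
MECAB_INSIDE_TOKEN   = 2

# Hypothetical tokenizer: splits text into [chunk, matched?] pairs.
def split_by_pattern(text, pattern)
  tokens = []
  rest = text
  until rest.empty?
    before, match, rest = rest.partition(pattern)
    break if before.empty? && match.empty? # zero-length match: stop looping
    tokens << [before, false] unless before.empty?
    tokens << [match, true] unless match.empty?
  end
  tokens
end

# One mark per byte: a token boundary at the start of each chunk, then
# INSIDE_TOKEN for matched chunks or ANY_BOUNDARY for unmatched ones.
def boundary_marks(tokens)
  tokens.flat_map do |chunk, matched|
    inner = matched ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
    [MECAB_TOKEN_BOUNDARY] + [inner] * (chunk.bytesize - 1)
  end
end

tokens = split_by_pattern('凡人にしか', /にしか/)
p tokens                 # [["凡人", false], ["にしか", true]]
p boundary_marks(tokens) # [1, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2]
```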

Feature constraint parsing is available by passing the feature_constraints key in the constraints hash. It instructs MeCab to use the indicated feature for any morpheme that exactly matches the given key. feature_constraints is a hash mapping a specific morpheme (String) to a corresponding feature value (String).
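
In the constructor source, each matching morpheme is handed to MeCab as a byte range (begin, end) together with its feature. The sketch below computes those ranges in plain Ruby. The feature_ranges helper is hypothetical and is not how natto tokenizes internally; the byte offsets assume UTF-8 text.

```ruby
# Hypothetical helper: returns [begin_byte, end_byte, feature] triples for
# every occurrence of a key from the features hash within the text.
def feature_ranges(text, features)
  keys = features.keys.reject(&:empty?) # empty keys would match everywhere
  pattern = Regexp.union(keys)          # matches nothing when keys is empty
  ranges = []
  bpos = 0
  rest = text
  until rest.empty?
    before, match, rest = rest.partition(pattern)
    bpos += before.bytesize
    next if match.empty?
    ranges << [bpos, bpos + match.bytesize, features[match]]
    bpos += match.bytesize
  end
  ranges
end

p feature_ranges('凡人にしか', { 'しか' => '助詞' })
# [[9, 15, "助詞"]]
```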

Parameters:

  • text (String)

    the Japanese text to parse

  • constraints (Hash) (defaults to: {})

    boundary_constraints or feature_constraints

Returns:

  • (Enumerator)

    of MeCabNode instances

Raises:

  • (MeCabError)

    if the MeCab Tagger cannot parse the given text

  • (ArgumentError)

    if the given string text argument is nil



# File 'lib/natto/natto.rb', line 517

def enum_parse(text, constraints={})
  if text.nil?
    raise ArgumentError.new 'Text to parse cannot be nil'
  elsif constraints[:boundary_constraints]
    if !(constraints[:boundary_constraints].is_a?(Regexp) ||
         constraints[:boundary_constraints].is_a?(String))
      raise ArgumentError.new 'boundary constraints must be a Regexp or String'
    end
  elsif constraints[:feature_constraints] && !constraints[:feature_constraints].is_a?(Hash)
    raise ArgumentError.new 'feature constraints must be a Hash'
  elsif @options[:partial] && !text.end_with?("\n")
    raise ArgumentError.new 'partial parsing requires new-line char at end of text'
  end

  @parse_tonodes.call(text, constraints)
end

#inspect ⇒ String

Overrides Object#inspect.

Returns:

  • (String)

    encoded object id, FFI pointer, options hash, list of dictionaries, and MeCab version



# File 'lib/natto/natto.rb', line 563

def inspect
  self.to_s
end

#parse(text, constraints = {}) ⇒ String

Parses the given text, returning the MeCab output as a single string. If a block is passed to this method, then node parsing will be used and each node yielded to the given block.

Boundary constraint parsing is available by passing the boundary_constraints key in the constraints hash. It gives MeCab hints about where the morpheme boundaries in the given text are located. The boundary_constraints value may be either a Regexp or a String; see String#scan. The boundary-constraint parsed output is returned as a single string, unless a block is passed to this method for node parsing.

Feature constraint parsing is available by passing the feature_constraints key in the constraints hash. It instructs MeCab to use the indicated feature for any morpheme that exactly matches the given key. feature_constraints is a hash mapping a specific morpheme (String) to a corresponding feature value (String).

Parameters:

  • text (String)

    the Japanese text to parse

  • constraints (Hash) (defaults to: {})

    boundary_constraints or feature_constraints

Returns:

  • (String)

    parsing result from MeCab

Raises:

  • (MeCabError)

    if the MeCab Tagger cannot parse the given text

  • (ArgumentError)

    if the given string text argument is nil



# File 'lib/natto/natto.rb', line 465

def parse(text, constraints={})
  if text.nil?
    raise ArgumentError.new 'Text to parse cannot be nil'
  elsif constraints[:boundary_constraints]
    if !(constraints[:boundary_constraints].is_a?(Regexp) ||
         constraints[:boundary_constraints].is_a?(String))
      raise ArgumentError.new 'boundary constraints must be a Regexp or String'
    end
  elsif constraints[:feature_constraints] && !constraints[:feature_constraints].is_a?(Hash)
    raise ArgumentError.new 'feature constraints must be a Hash'
  elsif @options[:partial] && !text.end_with?("\n")
    raise ArgumentError.new 'partial parsing requires new-line char at end of text'
  end

  if block_given?
    @parse_tonodes.call(text, constraints).each {|n| yield n }
  else
    @parse_tostr.call(text, constraints)
  end
end

#to_s ⇒ String

Returns human-readable details for the wrapped MeCab library. Overrides Object#to_s.

  • encoded object id
  • underlying FFI pointer to the MeCab Model
  • underlying FFI pointer to the MeCab Tagger
  • underlying FFI pointer to the MeCab Lattice
  • real file path to MeCab library
  • options hash
  • list of dictionaries
  • MeCab version

Returns:

  • (String)

    encoded object id, underlying FFI pointer, file path to MeCab library, options hash, list of dictionaries and MeCab version



# File 'lib/natto/natto.rb', line 548

def to_s
  [ super.chop,
    "@model=#{@model},", 
    "@tagger=#{@tagger},", 
    "@lattice=#{@lattice},", 
    "@libpath=\"#{@libpath}\",",
    "@options=#{@options.inspect},", 
    "@dicts=#{@dicts.to_s},", 
    "@version=#{@version.to_s}>" ].join(' ')
end