Module: CRFPP

Defined in:
lib/crfpp/macro.rb,
lib/crfpp/model.rb,
lib/crfpp/errors.rb,
lib/crfpp/feature.rb,
lib/crfpp/version.rb,
lib/crfpp/filelike.rb,
lib/crfpp/template.rb,
lib/crfpp/utilities.rb

Defined Under Namespace

Modules: Filelike

Classes: Error, Feature, Macro, Model, NativeError, Template

Constant Summary

VERSION = '0.0.2'.freeze

Class Method Summary

Class Method Details

.learn(template, data, options = {}) ⇒ Object

Creates a new Model based on a template and training data.

The data parameter can be either an array of strings or a filename. The possible options are:

:threads: False or the number of threads to use (default is 2).

:algorithm: L1 or L2 (default). In the source shown for this method, the option is merged into the defaults but the corresponding argument is commented out, so it is not passed on to CRF++.

:cost: Sets the hyper-parameter C for the CRFs (default is 1.0). With a larger C value, the CRF tends to overfit to the given training corpus. This parameter controls the balance between overfitting and underfitting, and the results are significantly influenced by it. You can find an optimal value by using held-out data or a more general model selection method such as cross-validation.

:frequency: Sets the cut-off threshold for the features (default is 1). CRF++ uses only the features that occur no fewer than NUM times in the given training data. When you apply CRF++ to large data, the number of unique features can amount to several million. This option is useful in such cases.
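To make the :frequency cut-off concrete, here is a minimal sketch of the idea in plain Ruby. It is not how CRF++ implements the cut-off internally; the helper name `frequent_features` and the sample feature strings are hypothetical and only illustrate "keep features that occur at least NUM times".

```ruby
# Hypothetical helper (not part of the crfpp gem): keeps only the features
# that occur at least `min_count` times in the given list.
def frequent_features(features, min_count = 1)
  counts = Hash.new(0)
  features.each { |feature| counts[feature] += 1 }
  counts.select { |_, count| count >= min_count }.keys
end

# Made-up feature strings, purely for illustration.
observed = %w[word=the word=cat word=the word=sat word=the word=cat]

p frequent_features(observed, 1) # keeps every feature
p frequent_features(observed, 2) # drops features seen only once
```

With a threshold of 1 (the default) nothing is discarded; raising it shrinks the feature set, which is the effect the option is meant to have on large corpora.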


# File 'lib/crfpp/utilities.rb', line 26

def learn(template, data, options = {})
  options = { :threads => 2, :algorithm => :L2, :cost => 1.0, :frequency => 1}.merge(options)
  
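  # Treat data as a filename only if it is a String naming an existing file.
  # File.exists? was removed in Ruby 3.2, and File.exist? raises a TypeError
  # when given an array.
  unless data.is_a?(String) && File.exist?(data)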
    data = save_data_to_tempfile([data].flatten)
    temporary = true
  end

  template = Template.new(template) unless template.is_a?(Template)
  model = Model.new
  
  arguments = []
  
  # TODO check algorithm names
  # arguments << "--algorithm=#{options[:algorithm]}"
  
  arguments << "--cost=#{options[:cost]}"
  arguments << "--thread=#{options[:threads]}"
  arguments << "--freq=#{options[:frequency]}"
  
  arguments << template.path
  arguments << data
  arguments << model.path
  
  success = Native.learn(arguments.join(' '))
  raise NativeError, 'crfpp learn failed' unless success
  
  model
ensure
  data.unlink if temporary
end
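The way the options end up in the argument string can be sketched as a standalone function. This is an illustration that mirrors the listing above, not part of the gem: the helper name `learn_arguments`, the constant `LEARN_DEFAULTS`, and the file names are hypothetical, and no call to the native library is made.

```ruby
# Hypothetical sketch mirroring the option handling in .learn.
# The defaults are copied from the listing above.
LEARN_DEFAULTS = { :threads => 2, :algorithm => :L2, :cost => 1.0, :frequency => 1 }.freeze

def learn_arguments(template_path, data_path, model_path, options = {})
  options = LEARN_DEFAULTS.merge(options)

  arguments = []
  # :algorithm is merged into the options but, as in the listing,
  # it is not forwarded.
  arguments << "--cost=#{options[:cost]}"
  arguments << "--thread=#{options[:threads]}"
  arguments << "--freq=#{options[:frequency]}"
  arguments << template_path << data_path << model_path

  arguments.join(' ')
end

puts learn_arguments('template.txt', 'train.data', 'model.bin', :cost => 4.0)
# prints: --cost=4.0 --thread=2 --freq=1 template.txt train.data model.bin
```

Because the arguments are joined with single spaces, a path that itself contains a space would be split into two arguments in this sketch.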

.train ⇒ Object

Creates a new Model based on a template and training data.

The data parameter can be either an array of strings or a filename. The possible options are:

:threads: False or the number of threads to use (default is 2).

:algorithm: L1 or L2 (default). In the source shown for this method, the option is merged into the defaults but the corresponding argument is commented out, so it is not passed on to CRF++.

:cost: Sets the hyper-parameter C for the CRFs (default is 1.0). With a larger C value, the CRF tends to overfit to the given training corpus. This parameter controls the balance between overfitting and underfitting, and the results are significantly influenced by it. You can find an optimal value by using held-out data or a more general model selection method such as cross-validation.

:frequency: Sets the cut-off threshold for the features (default is 1). CRF++ uses only the features that occur no fewer than NUM times in the given training data. When you apply CRF++ to large data, the number of unique features can amount to several million. This option is useful in such cases.
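The :cost description suggests choosing a value with held-out data or cross-validation. One possible way to organize that search is sketched below. The helpers `each_fold` and `select_cost` are hypothetical and not part of the gem; scoring is left to a block supplied by the caller, because how a trained model is evaluated is outside what this page documents.

```ruby
# Hypothetical helper: splits the items into k folds and yields
# [training, held_out] once per fold.
def each_fold(sequences, k)
  return enum_for(:each_fold, sequences, k) unless block_given?

  folds = Array.new(k) { [] }
  sequences.each_with_index { |sequence, i| folds[i % k] << sequence }

  folds.each_index do |i|
    held_out = folds[i]
    training = folds.each_with_index.reject { |_, j| j == i }.flat_map(&:first)
    yield training, held_out
  end
end

# Hypothetical helper: returns the candidate cost with the highest average
# score. The block receives (cost, training, held_out) and must return a
# number; it could, for instance, train a model on `training` with that
# cost and measure accuracy on `held_out`.
def select_cost(sequences, costs, k = 5)
  costs.max_by do |cost|
    scores = each_fold(sequences, k).map do |training, held_out|
      yield(cost, training, held_out)
    end
    scores.sum / scores.size.to_f
  end
end
```

The scoring block is the only place that would touch the trained model, so the fold logic can be checked on its own with a dummy scoring function before any training is run.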


# File 'lib/crfpp/utilities.rb', line 58

def learn(template, data, options = {})
  options = { :threads => 2, :algorithm => :L2, :cost => 1.0, :frequency => 1}.merge(options)
  
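  # Treat data as a filename only if it is a String naming an existing file.
  # File.exists? was removed in Ruby 3.2, and File.exist? raises a TypeError
  # when given an array.
  unless data.is_a?(String) && File.exist?(data)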
    data = save_data_to_tempfile([data].flatten)
    temporary = true
  end

  template = Template.new(template) unless template.is_a?(Template)
  model = Model.new
  
  arguments = []
  
  # TODO check algorithm names
  # arguments << "--algorithm=#{options[:algorithm]}"
  
  arguments << "--cost=#{options[:cost]}"
  arguments << "--thread=#{options[:threads]}"
  arguments << "--freq=#{options[:frequency]}"
  
  arguments << template.path
  arguments << data
  arguments << model.path
  
  success = Native.learn(arguments.join(' '))
  raise NativeError, 'crfpp learn failed' unless success
  
  model
ensure
  data.unlink if temporary
end