Class: Eluka::Model

Inherits:
Object
Includes:
Ferret::Analysis
Defined in:
lib/eluka/model.rb

Overview

A binary classifier sorts data into two classes with respect to a category (the class label):

  1. Data which is indicative of the category – positive data

  2. Data which is not indicative of the category – negative data

Model

A classifier model observes positive and negative data and learns the properties of each set. Later, when given an unlabelled data point, it decides whether the data point is a positive or negative instance of the category.

Internal Data Representation

A classifier model internally represents a data instance as a point in a vector space. The dimensions of the vector space are called features.
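As a rough illustration of this representation (not Eluka's actual internals), the sketch below gives each feature name its own dimension number and stores a data point as a sparse dimension-to-value hash. `FeatureIndex` and its methods are hypothetical names introduced only for this example.

```ruby
# Hypothetical helper: maps feature names to dimension numbers (1-based).
class FeatureIndex
  def initialize
    @ids = {}
  end

  # Return the dimension for a feature, allocating a new one when needed.
  def id_for(feature)
    @ids[feature] ||= @ids.size + 1
  end

  # Convert a {feature => value} hash into a sparse {dimension => value} hash.
  def vectorize(point)
    point.each_with_object({}) do |(feature, value), vector|
      vector[id_for(feature)] = value
    end
  end

  # Number of distinct features seen so far.
  def dimensions
    @ids.size
  end
end

index = FeatureIndex.new
a = index.vectorize({ colour: 1.0, weight: 0.4 })
b = index.vectorize({ weight: 0.9, height: 2.0 })
```

Both points share the dimension assigned to `weight`, which is what makes them comparable as points in the same space.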

Eluka::Model

An Eluka model takes a hash of features and their values and internally processes them as points in a vector space. If the input is a string of words, as in a document, it relies on Ferret’s text analysis modules to convert it into a data point.
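To make the string-to-features step concrete, here is a minimal sketch in plain Ruby. It is a stand-in for an analyzer and does not use Ferret or reproduce its behaviour; `term_counts` is a hypothetical name, and the tokenization rule (lowercase, alphanumeric runs) is an assumption made for the example.

```ruby
# Hypothetical stand-in for an analyzer: lowercases the text, splits it into
# alphanumeric tokens, and counts how often each token occurs.
def term_counts(text)
  text.downcase.scan(/[a-z0-9]+/).each_with_object(Hash.new(0)) do |token, counts|
    counts[token] += 1
  end
end

features = term_counts("The cat sat on the mat")
```

The resulting hash has the same shape as a hand-written hash of features and values, so both kinds of input can be treated alike afterwards.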


Constructor Details

#initialize(params = {}) ⇒ Model

Initializes the classifier, falling back to sane defaults for any parameter that is not provided



# File 'lib/eluka/model.rb', line 29

def initialize (params = {})
  #Set the labels
  @labels             = Bijection.new
  @labels[:positive]  =  1
  @labels[:negative]  = -1
  @labels[:unknown]   =  0
  
  @gem_root           = File.expand_path(File.join(File.dirname(__FILE__), '..'))
  @bin_dir            = File.expand_path(File.join(File.dirname(@gem_root), 'bin'))

  @analyzer           = StandardAnalyzer.new
  @features           = Eluka::Features.new
  @fv_train           = Eluka::FeatureVectors.new(@features, true)
  @fv_test            = nil
  
  @directory          = (params[:directory]         or "/tmp")
  @svm_train_path     = (params[:svm_train_path]    or "#{@bin_dir}/eluka-svm-train")
  @svm_scale_path     = (params[:svm_scale_path]    or "#{@bin_dir}/eluka-svm-scale")
  @svm_predict_path   = (params[:svm_predict_path]  or "#{@bin_dir}/eluka-svm-predict")
  @grid_py_path       = (params[:grid_py_path]      or "python rsvm/tools/grid.py")
  @fselect_py_path    = (params[:fselect_py_path]   or "python rsvm/tools/fselect.py")
  @verbose            = (params[:verbose]           or false)
  
  #Convert directory to absolute path
  Dir.chdir(@directory) do @directory = Dir.pwd end
end
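The constructor stores labels in a `Bijection` so that they can be read in both directions: label to number when training, number to label when classifying. The sketch below shows one way such a two-way map could work. `LabelMap` is a hypothetical name, and the real `Bijection` class may be implemented differently; only `[]=` and `lookup` are taken from the listings on this page.

```ruby
# Hypothetical two-way map; the real Bijection class may differ.
class LabelMap
  def initialize
    @forward = {}
    @reverse = {}
  end

  # Store the pair in both directions.
  def []=(label, value)
    @forward[label] = value
    @reverse[value] = label
  end

  # Label -> numeric value.
  def [](label)
    @forward[label]
  end

  # Numeric value -> label.
  def lookup(value)
    @reverse[value]
  end
end

labels = LabelMap.new
labels[:positive] =  1
labels[:negative] = -1
labels[:unknown]  =  0
```

With this in place, `labels[:negative]` yields the number written to the training file and `labels.lookup(-1)` recovers the symbol from a prediction.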

Instance Method Details

#add(data, label) ⇒ Object

Add a data point to the training data



# File 'lib/eluka/model.rb', line 58

def add (data, label)
  raise "No meaningful label associated with data" unless ([:positive, :negative].include? label)

  data_point = Eluka::DataPoint.new(data, @analyzer)
  @fv_train.add(data_point.vector, @labels[label])
end

#build(features = nil) ⇒ Object

Build a model from the training data using LibSVM



# File 'lib/eluka/model.rb', line 67

def build (features = nil)
  File.open(@directory + "/train", "w") do |f| f.puts @fv_train.to_libSVM(features) end
  
  output = `#{@svm_train_path} #{@directory}/train #{@directory}/model`
  
  puts output if (@verbose)
  
  @fv_test  = Eluka::FeatureVectors.new(@features, false)
  return output
end
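The training file passed to the SVM trainer uses LibSVM's sparse text format: one example per line, written as a label followed by `index:value` pairs in ascending index order. The sketch below shows how a single line could be produced. `libsvm_line` is a hypothetical helper, and it is an assumption that `to_libSVM` emits lines of this shape.

```ruby
# Hypothetical serializer: one training example per line, written as
# "<label> <index>:<value> ...", with feature indices in ascending order.
def libsvm_line(label, vector)
  pairs = vector.sort.map { |index, value| "#{index}:#{value}" }
  ([label] + pairs).join(" ")
end

line = libsvm_line(1, { 3 => 0.5, 1 => 2 })
```

Features that are absent from the hash are simply left out of the line, which is why the format suits sparse data such as term counts.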

#classify(data, features = nil) ⇒ Object

Classify a data point



# File 'lib/eluka/model.rb', line 80

def classify (data, features = nil)
  raise "Untrained model" unless (@fv_test)
  
  data_point = Eluka::DataPoint.new(data, @analyzer)
  @fv_test.add(data_point.vector)
  
  File.open(@directory + "/classify", "w") do |f| f.puts @fv_test.to_libSVM(features) end
  output = `#{@svm_predict_path} #{@directory}/classify #{@directory}/model #{@directory}/result`
  
  puts output if (@verbose)
  
  # File.read closes the file handle once the contents have been read
  return @labels.lookup( File.read( @directory + "/result" ).to_i )
end

#suggest_features ⇒ Object

Suggests the best set of features, chosen using fselect.py.
IMPROVE: Depending on fselect.py (an unnecessary Python dependency) is stupid.
TODO: Finish writing fselect.rb and integrate it.



# File 'lib/eluka/model.rb', line 98

def suggest_features 
  sel_features = Array.new
  
  File.open(@directory + "/train", "w") do |f| f.puts @fv_train.to_libSVM end
  
  Dir.chdir('./rsvm/bin/tools') do
    output = `python fselect.py #{@directory}/train`
  
    puts output if (@verbose)
    
    x = File.read("train.select")
    sel_f_ids = x[1..-2].split(", ")
    sel_f_ids.each do |f|
      s_f = @features.term(f.to_i)
      if s_f.instance_of? String then
        s_f     = s_f.split("||")
        s_f[0]  = s_f[0].to_sym
      end
      sel_features.push(s_f)
    end
    
    #Remove temporary files
    File.delete("train.select") if File.exist?("train.select")
    File.delete("train.fscore") if File.exist?("train.fscore")
    File.delete("train.tr.out") if File.exist?("train.tr.out")
  end
  
  return sel_features
end
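The parsing steps in `suggest_features` can be isolated as two small functions, sketched below. Both names are hypothetical. The input shapes are assumptions inferred from the listing: the selection file is taken to hold a bracketed, comma-separated list of feature ids, and string terms are taken to use `||` as the separator between field and term.

```ruby
# Hypothetical helpers mirroring the parsing steps in suggest_features.

# Turn text such as "[3, 1, 2]" into [3, 1, 2].
def parse_selected_ids(text)
  text.strip[1..-2].split(", ").map(&:to_i)
end

# Turn a stored term such as "title||ruby" into [:title, "ruby"];
# non-string terms are returned unchanged.
def decode_term(term)
  return term unless term.is_a?(String)

  field, *rest = term.split("||")
  [field.to_sym, *rest]
end

ids     = parse_selected_ids("[3, 1, 2]\n")
decoded = decode_term("title||ruby")
```

Unlike the listing, this sketch strips surrounding whitespace before removing the brackets, so a trailing newline in the file does not leave a stray bracket on the last id.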