Class: Basset::Classifier

Inherits:
Object
Includes:
YamlSerialization
Defined in:
lib/basset/classifier.rb

Overview

Classifier wraps the operations spread across Document and its related classes, FeatureExtractor, FeatureSelector, and specific classifiers such as NaiveBayes in one convenient interface.

Direct Known Subclasses

AnomalyDetector

Constant Summary

DEFAULTS =
{:type => "naive_bayes", :doctype => "document"}

Instance Attribute Summary

Instance Method Summary

Methods included from YamlSerialization

included, #save_to_file

Constructor Details

#initialize(opts = {}) ⇒ Classifier

Creates a new classifier object. The options select the type of classifier and the kind of documents it works on. The defaults, taken from DEFAULTS, are :type => "naive_bayes" and :doctype => "document". A uri_document doctype is also available, e.g. opts = {:type => :naive_bayes, :doctype => :uri_document}.



# File 'lib/basset/classifier.rb', line 22

def initialize(opts={})
  @engine = constanize_opt(opts[:type] || DEFAULTS[:type]).new
  @doctype = constanize_opt(opts[:doctype] || DEFAULTS[:doctype])
end
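
The constructor resolves each option to a class through constanize_opt, whose body is not shown on this page. The following is a minimal sketch of how such a lookup might work, assuming the helper camel-cases the option and looks the result up as a constant. The SketchBasset module, its empty classes, and the constantize_opt helper are hypothetical stand-ins, not the library's code.

```ruby
# Hypothetical stand-in namespace; the real classes live in Basset.
module SketchBasset
  class NaiveBayes; end
  class Document; end
  class UriDocument; end

  DEFAULTS = { :type => "naive_bayes", :doctype => "document" }

  # Assumed behavior: "uri_document" or :uri_document resolves to UriDocument.
  def self.constantize_opt(opt)
    class_name = opt.to_s.split("_").map(&:capitalize).join
    const_get(class_name)
  end
end

opts = { :doctype => :uri_document }

# Fall back to the default when an option is missing, as the constructor does.
engine_class  = SketchBasset.constantize_opt(opts[:type] || SketchBasset::DEFAULTS[:type])
doctype_class = SketchBasset.constantize_opt(opts[:doctype] || SketchBasset::DEFAULTS[:doctype])

puts engine_class
puts doctype_class
```

Under this assumption, strings and symbols resolve the same way, which would explain why the defaults can be stored as strings while callers pass symbols.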

Instance Attribute Details

#doctype ⇒ Object (readonly)

Returns the value of attribute doctype.



# File 'lib/basset/classifier.rb', line 15

def doctype
  @doctype
end

#engine ⇒ Object (readonly)

Returns the value of attribute engine.



# File 'lib/basset/classifier.rb', line 15

def engine
  @engine
end

Instance Method Details

#==(other) ⇒ Object



# File 'lib/basset/classifier.rb', line 63

def ==(other)
  other.is_a?(self.class) && other.engine == engine && other.doctype == doctype
end

#classify(text) ⇒ Object

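Classifies text based on prior training. The features of text are extracted and passed to classify_features, and the last element of that result is returned.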



# File 'lib/basset/classifier.rb', line 50

def classify(text)
  classify_features(features_of(text)).last
end

#similarity_score(classification, text) ⇒ Object

Returns a numeric measure of how similar text is to previously seen texts of class classification. For a Naive Bayes filter, this is the log10 of the probabilities of each token in text occurring in a text of class classification, normalized by the number of tokens.



# File 'lib/basset/classifier.rb', line 59

def similarity_score(classification, text)
  similarity_score_for_features(classification, features_of(text))
end
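
The Naive Bayes description above amounts to a mean log10 probability over the tokens of the text. The sketch below illustrates that arithmetic only. The token counts are made-up data, the helper names are hypothetical, and add-one smoothing is an assumption introduced here so that unseen tokens do not produce log10(0); the library's NaiveBayes engine may compute its probabilities differently.

```ruby
# Hypothetical token counts for one class; not data from the library.
counts = { "cheap" => 3, "pills" => 2, "hello" => 1 }
total  = counts.values.sum   # tokens seen for this class
vocab  = counts.size         # distinct tokens seen for this class

# Assumed add-one smoothing: every token gets a nonzero probability.
def token_probability(token, counts, total, vocab)
  (counts.fetch(token, 0) + 1).to_f / (total + vocab)
end

# Sum of log10 token probabilities, divided by the number of tokens.
def sketch_similarity(tokens, counts, total, vocab)
  log_sum = tokens.sum { |t| Math.log10(token_probability(t, counts, total, vocab)) }
  log_sum / tokens.size
end

seen   = sketch_similarity(%w[cheap pills], counts, total, vocab)
unseen = sketch_similarity(%w[meeting notes], counts, total, vocab)

# Scores are negative; a score closer to zero means a closer match.
puts seen
puts unseen
```

Dividing by the token count keeps long and short texts comparable, since a raw sum of logs would otherwise grow more negative with every extra token.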

#train(classification, *texts) ⇒ Object

Trains the classifier with texts of class classification. texts is flattened, so you can pass individual strings, an array of strings, or a mix of both.



# File 'lib/basset/classifier.rb', line 31

def train(classification, *texts)
  texts.flatten.each do |text| 
    train_with_features(classification, features_of(text, classification))
  end
end
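
The effect of the splat plus flatten can be seen without the rest of the library. In this sketch, SketchTrainer is a hypothetical recorder that copies only the method signature and the flattening step; it stores each (classification, text) pair instead of extracting features.

```ruby
# Hypothetical recorder that mimics only the splat-and-flatten signature.
class SketchTrainer
  attr_reader :seen

  def initialize
    @seen = []
  end

  def train(classification, *texts)
    # flatten removes every level of array nesting before iterating
    texts.flatten.each { |text| @seen << [classification, text] }
  end
end

trainer = SketchTrainer.new
trainer.train(:spam, "buy now", "cheap pills")       # separate arguments
trainer.train(:ham, ["see you at lunch", "thanks"])  # a single array
trainer.train(:ham, "hi", ["nested", ["deeper"]])    # mixed and nested

trainer.seen.each { |classification, text| puts "#{classification}: #{text}" }
```

All three call styles end up as one training step per string, which is why an array argument does not break anything.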

#train_iterative(classification, text) ⇒ Object

Trains the classifier on a text repeatedly until the classifier recognizes it as being in class classification (up to a maximum of 5 retrainings). Handy for training the classifier quickly or when it has been mistrained.



# File 'lib/basset/classifier.rb', line 41

def train_iterative(classification, text)
  (1 .. 5).each do |i|
    train(classification, text)
    break if classify(text) == classification
  end
end
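
To show how the retraining loop stops early or gives up, the sketch below pairs the same loop shape with a deliberately simple stand-in. SketchCounter is hypothetical: it classifies a text as whichever label it has been trained with most often, which is far cruder than a real engine but enough to exercise the loop.

```ruby
# Hypothetical classifier: picks the label a text was trained with most often.
class SketchCounter
  attr_reader :train_calls

  def initialize
    @counts = Hash.new(0)
    @train_calls = 0
  end

  def train(classification, text)
    @train_calls += 1
    @counts[[classification, text]] += 1
  end

  def classify(text)
    candidates = @counts.select { |(_label, seen_text), _n| seen_text == text }
    best = candidates.max_by { |_key, n| n }
    best && best.first.first
  end

  # Same loop shape as the documented method: at most 5 retrainings.
  def train_iterative(classification, text)
    (1..5).each do
      train(classification, text)
      break if classify(text) == classification
    end
  end
end

mistrained = SketchCounter.new
3.times { mistrained.train(:spam, "lunch at noon") }  # simulate mistraining
mistrained.train_iterative(:ham, "lunch at noon")
puts mistrained.classify("lunch at noon")

stubborn = SketchCounter.new
10.times { stubborn.train(:spam, "lunch at noon") }
stubborn.train_iterative(:ham, "lunch at noon")       # gives up after 5 rounds
puts stubborn.classify("lunch at noon")
```

The second case illustrates the cap: when five retrainings are not enough to overturn earlier training, the method returns without raising, so callers who need certainty should check classify afterwards.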