Class: Basset::FeatureSelector

Inherits:
Object
Defined in:
lib/basset/feature_selector.rb

Overview

This class is the feature selector. Add every document in the training set to the selector; features can then be selected by their chi-square value. When in doubt, call select_features with no arguments. With its defaults it returns the names of all features that have at least some statistical significance (a chi-square value of 1.0 or more for the first classification seen) and that occur in more than one document.
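
To make the chi-square criterion concrete, here is a minimal sketch of the textbook chi-square statistic for a 2x2 table relating one feature to one class. The method name `chi_square_2x2` and its arguments are hypothetical, and the library's own computation is not shown on this page, so treat this as an illustration of the statistic rather than Basset's implementation.

```ruby
# Hypothetical helper, not part of Basset: the textbook chi-square statistic
# for a 2x2 contingency table of (feature present?, document in class?).
#
#   a = docs in the class that contain the feature
#   b = docs outside the class that contain the feature
#   c = docs in the class without the feature
#   d = docs outside the class without the feature
def chi_square_2x2(a, b, c, d)
  n = a + b + c + d
  denominator = (a + b) * (c + d) * (a + c) * (b + d)
  return 0.0 if denominator.zero? # an empty row or column carries no evidence

  n * ((a * d) - (b * c))**2 / denominator.to_f
end

# A feature found in 8 of 10 documents of one class and 1 of 10 of the other.
strong = chi_square_2x2(8, 1, 2, 9)

# A feature found equally often in both classes scores zero.
neutral = chi_square_2x2(5, 5, 5, 5)
```

A feature that is spread evenly across classes scores 0.0, while one concentrated in a single class scores well above the default threshold of 1.0.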

Defined Under Namespace

Classes: FeatureValues

Constructor Details

#initialize ⇒ FeatureSelector

Returns a new instance of FeatureSelector.



# File 'lib/basset/feature_selector.rb', line 11

def initialize
  @docs           = 0
  @docs_in_class  = Hash.new(0)
  @features       = Hash.new { |h, k| h[k] = FeatureValues.new }
end

Instance Attribute Details

#docs ⇒ Object (readonly)

Returns the value of attribute docs, the number of documents added so far.



# File 'lib/basset/feature_selector.rb', line 9

def docs
  @docs
end

Instance Method Details

#add_document(document) ⇒ Object

Adds a document to the feature selector. The document must respond to classification and to vector_of_features, which returns a collection of unique features, each responding to name.



# File 'lib/basset/feature_selector.rb', line 19

def add_document(document)
  @docs += 1
  @docs_in_class[document.classification] += 1
  
  document.vector_of_features.each do |feature| 
    @features[feature.name].add_document_with_class(document.classification)
  end
end
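
The duck type that add_document expects can be sketched with plain structs. The names `sketch_feature` and `sketch_document` are hypothetical stand-ins, not Basset classes, and the two hashes only mirror the counting visible in the method above; the real selector stores FeatureValues objects instead.

```ruby
# Hypothetical stand-ins, not Basset classes: any object with these readers
# offers what add_document calls on its argument.
sketch_feature  = Struct.new(:name)
sketch_document = Struct.new(:classification, :vector_of_features)

doc = sketch_document.new(
  :spam,
  [sketch_feature.new("offer"), sketch_feature.new("free")]
)

# Mirror the bookkeeping shown in add_document using plain hashes.
docs_in_class     = Hash.new(0) # documents seen per classification
docs_with_feature = Hash.new(0) # documents containing each feature name

docs_in_class[doc.classification] += 1
doc.vector_of_features.each do |feature|
  docs_with_feature[feature.name] += 1
end
```

Because each feature is counted once per document, vector_of_features should not repeat a feature within the same document.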

#all_feature_names ⇒ Object

Returns the names of all features, regardless of chi-square value or frequency.



# File 'lib/basset/feature_selector.rb', line 29

def all_feature_names
  @features.keys
end

#best_features(count = 10, classification = nil) ⇒ Object

Returns an array of the names of the best features for a given classification: at most count names, ranked from highest to lowest chi-square value.



# File 'lib/basset/feature_selector.rb', line 38

def best_features(count = 10, classification = nil)
  select_features(1.0, classification).first(count)
end

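#features_with_chi(classification) ⇒ Object

Returns one Feature per known feature name, each carrying its chi-square value for the given classification.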



# File 'lib/basset/feature_selector.rb', line 42

def features_with_chi(classification)
  @features.keys.map do |feature_name|
    Feature.new(feature_name, chi_squared(feature_name, classification))
  end
end

#number_of_features ⇒ Object

Returns the number of distinct features seen so far.



# File 'lib/basset/feature_selector.rb', line 33

def number_of_features
  @features.size
end

#select_features(chi_value = 1.0, classification = nil) ⇒ Object

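Returns an array of the names of features whose chi-square value is at least chi_value and that occur in more than one document, sorted from highest to lowest chi-square value. If classification is nil, the first classification seen is used.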



# File 'lib/basset/feature_selector.rb', line 49

def select_features(chi_value = 1.0, classification = nil)
  classification ||= @docs_in_class.keys.first

  selected_features = features_with_chi(classification).select do |feature|
    (docs_with_feature(feature.name) > 1) && (feature.value >= chi_value)
  end
  
  selected_features.sort_by(&:value).reverse.collect(&:name)
end
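
The filter-then-rank step above can be tried in isolation. This is a minimal sketch with made-up scores and document counts; `scored_feature` and `doc_counts` are hypothetical stand-ins for the Feature objects and the document-frequency lookup the selector uses internally.

```ruby
# Hypothetical stand-in for the Feature objects returned by features_with_chi.
scored_feature = Struct.new(:name, :value)

# Made-up chi-square values and document counts, for illustration only.
features = [
  scored_feature.new("free",   9.9),
  scored_feature.new("hello",  0.2),
  scored_feature.new("prize",  4.1),
  scored_feature.new("zebra", 12.0)
]
doc_counts = { "free" => 9, "hello" => 10, "prize" => 4, "zebra" => 1 }

chi_value = 1.0

# Keep features seen in more than one document whose score meets the threshold.
selected = features.select do |feature|
  doc_counts[feature.name] > 1 && feature.value >= chi_value
end

# Highest score first, names only; best_features then takes the first `count`.
ranked  = selected.sort_by(&:value).reverse.map(&:name)
top_one = ranked.first(1)
```

In this data "zebra" has the highest score but is dropped because it appears in only one document, and "hello" is dropped because its score falls below the threshold.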