Class: Basset::FeatureExtractor

Inherits:
Object
Includes:
YamlSerialization
Defined in:
lib/basset/feature_extractor.rb

Overview

Extracts features from a document. The constructor takes the set of feature names to extract, and each name is assigned an integer identifier in ascending order, starting at 1. Numbered features are easy to write out as feature sets for libraries such as svmlight.

Instance Method Summary

Methods included from YamlSerialization

included, #save_to_file

Constructor Details

#initialize(feature_names) ⇒ FeatureExtractor

The constructor takes an array of feature names. These are the features that will be extracted from documents; all others are ignored.



# File 'lib/basset/feature_extractor.rb', line 13

def initialize(feature_names)
  # Map each feature name to a 1-based integer identifier.
  @feature_names = {}
  feature_names.each_with_index { |feature_name, index| @feature_names[feature_name] = index + 1 }
end
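
The numbering scheme can be tried in isolation. The following is a minimal sketch that mirrors the loop in the constructor without loading Basset; the feature names are hypothetical and chosen only for illustration.

```ruby
# Stand-alone sketch of the mapping built in #initialize (not the Basset class itself).
feature_names = %w[free money meeting] # hypothetical feature names

ids = {}
feature_names.each_with_index { |name, index| ids[name] = index + 1 }

ids["free"]    # 1, identifiers start at 1 rather than 0
ids["meeting"] # 3
ids["lunch"]   # nil, so a feature with this name would be ignored
```

Because the identifiers are stored in a hash keyed by name, a name that appears twice in the input array keeps only its last index.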

Instance Method Details

#extract(document) ⇒ Object

Returns only those features of the document whose names the extractor was initialized with. All other features are dropped.



# File 'lib/basset/feature_extractor.rb', line 33

def extract(document)
  # Keep only the features whose name has an assigned identifier.
  document.vector_of_features.find_all do |feature|
    @feature_names[feature.name]
  end
end
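
The filtering step can be sketched without Basset as follows. `SketchFeature` and `SketchDocument` are hypothetical stand-ins that assume only what the method above relies on: a document responds to `vector_of_features`, and each feature responds to `name`.

```ruby
# Hypothetical stand-ins for Basset's feature and document objects.
SketchFeature  = Struct.new(:name, :value)
SketchDocument = Struct.new(:vector_of_features)

# name => identifier mapping, as the constructor would build it
known = { "free" => 1, "money" => 2 }

doc = SketchDocument.new([
  SketchFeature.new("money", 3),
  SketchFeature.new("lunch", 1), # not in `known`, so it is filtered out
  SketchFeature.new("free", 2)
])

# Same test as in #extract: a missing name looks up to nil, which is falsy.
kept = doc.vector_of_features.find_all { |feature| known[feature.name] }
kept.map(&:name) # ["money", "free"], document order is preserved
```

The lookup works as a filter because every stored identifier is a positive integer, which Ruby treats as truthy.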

#extract_numbered(document) ⇒ Object

Returns an array of features with each name replaced by its integer identifier. The array is sorted, which should put it in ascending identifier order. This generic representation works well with other machine learning packages such as svmlight.



# File 'lib/basset/feature_extractor.rb', line 25

def extract_numbered(document)
  # Swap each feature name for its integer identifier, keeping the value.
  numbered_features = extract(document).collect do |feature|
    Feature.new(@feature_names[feature.name], feature.value)
  end
  numbered_features.sort
end
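
As an illustration of why the sorted, numbered form is convenient, the sketch below turns numbered features into one line of svmlight-style input (`<label> <id>:<value> ...`, with identifiers in increasing order). `NumberedFeature` is a hypothetical stand-in that assumes features compare by name; the label and the values are made up for the example.

```ruby
# Hypothetical stand-in: assumes features are ordered by name,
# which after numbering is the integer identifier.
NumberedFeature = Struct.new(:name, :value) do
  include Comparable

  def <=>(other)
    name <=> other.name
  end
end

ids = { "free" => 1, "money" => 2, "meeting" => 3 }
extracted = [["money", 3], ["free", 2]] # name/value pairs in document order

# Same steps as #extract_numbered: replace names with identifiers, then sort.
numbered = extracted.map { |name, value| NumberedFeature.new(ids[name], value) }.sort

label = "+1" # hypothetical class label
line = ([label] + numbered.map { |f| "#{f.name}:#{f.value}" }).join(" ")
line # "+1 1:2 2:3"
```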

#number_of_features ⇒ Object

Returns the number of distinct feature names the extractor knows about.



# File 'lib/basset/feature_extractor.rb', line 18

def number_of_features
  @feature_names.size
end