Class: Basset::FeatureExtractor

Inherits: Object

Includes: YamlSerialization

Defined in: lib/basset/feature_extractor.rb

Overview

Extracts features from a document. On initialization it expects the set of feature names that are to be extracted from documents. Each name is assigned an integer identifier in ascending order, starting at 1 and following the order in which the names were passed. This makes it easy to output feature sets for libraries like svm_light.

Instance Method Summary

#extract(document) ⇒ Object

Returns only the features from the document that the extractor is interested in.

#extract_numbered(document) ⇒ Object

Returns an array of features with their names replaced by integer identifiers.

#initialize(feature_names) ⇒ FeatureExtractor constructor

Takes an array of feature names.

#number_of_features ⇒ Object

Returns the number of distinct feature names the extractor knows about.

Methods included from YamlSerialization

included, #save_to_file

Constructor Details

#initialize(feature_names) ⇒ `FeatureExtractor`

Takes an array of feature names. These are the features that will be extracted from documents; all others are ignored. Each name is mapped to an integer identifier, starting at 1, in the order the names appear in the array.

# File 'lib/basset/feature_extractor.rb', line 13

def initialize(feature_names)
  @feature_names = {}
  feature_names.each_with_index {|feature_name, index| @feature_names[feature_name] = index + 1}
end
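
The numbering scheme used by the constructor can be sketched on plain Ruby data. This is only an illustration: the feature names and the local variable `feature_ids` are made up for the example, and nothing from Basset is loaded.

```ruby
# Hypothetical feature names, chosen only for illustration.
feature_names = %w[cat dog fish]

# Same loop as the constructor: each name gets its position plus one.
feature_ids = {}
feature_names.each_with_index do |feature_name, index|
  feature_ids[feature_name] = index + 1 # identifiers start at 1, not 0
end

feature_ids.each { |name, id| puts "#{name} -> #{id}" }
```

Because the identifiers live in a hash keyed by name, passing the same name twice would keep only the identifier from its last occurrence.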

Instance Method Details

#extract(document) ⇒ `Object`

Returns only those features of the document whose names were passed to the constructor. Features with any other name are dropped.

# File 'lib/basset/feature_extractor.rb', line 33

def extract(document)
  document.vector_of_features.find_all do |feature|
    @feature_names[feature.name]
  end
end
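
The filtering step can be tried in isolation with a minimal sketch. `SketchFeature` and `SketchDocument` are hypothetical stand-ins defined here so the example runs on its own; the real Basset document and feature classes are assumed to expose `vector_of_features`, `name` and `value` as the method above implies.

```ruby
# Hypothetical stand-ins; the real Basset classes are not reproduced here.
SketchFeature  = Struct.new(:name, :value)
SketchDocument = Struct.new(:vector_of_features)

# Name => identifier index, as the constructor would build it.
feature_ids = { "cat" => 1, "dog" => 2 }

document = SketchDocument.new([
  SketchFeature.new("cat", 3),
  SketchFeature.new("bird", 1), # not a known feature name, so it is dropped
  SketchFeature.new("dog", 2)
])

# A hash lookup returns nil for unknown names, so find_all skips them.
kept = document.vector_of_features.find_all do |feature|
  feature_ids[feature.name]
end

puts kept.map(&:name).join(", ") # prints "cat, dog"
```

The block relies on Ruby treating every integer as truthy, so the lookup result doubles as the membership test.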

#extract_numbered(document) ⇒ `Object`

Returns an array of features with their names replaced by integer identifiers. The result is passed through Array#sort and should come out in ascending identifier order. This is a generic representation that works well with other machine learning packages such as svm_light.

# File 'lib/basset/feature_extractor.rb', line 25

def extract_numbered(document)
  numbered_features = extract(document).collect do |feature|
    Feature.new(@feature_names[feature.name], feature.value)
  end
  numbered_features.sort
end
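
As a rough sketch of why the numbered, sorted form is convenient, the following turns already-extracted features into one line of svm_light input, which lists `id:value` pairs in increasing identifier order after a target label. `NumberedFeature`, the `+1` label and the sample data are assumptions made for the example, and `sort_by` is used in place of the comparison that Basset's own `Feature` class is assumed to provide.

```ruby
# Hypothetical stand-in for Basset's Feature class.
NumberedFeature = Struct.new(:name, :value)

# Name => identifier index, as the constructor would build it.
feature_ids = { "cat" => 1, "dog" => 2, "fish" => 3 }

# Features as #extract might return them, in document order.
extracted = [NumberedFeature.new("fish", 2), NumberedFeature.new("cat", 1)]

# Replace each name with its identifier, then order by identifier.
numbered = extracted
  .map { |feature| NumberedFeature.new(feature_ids[feature.name], feature.value) }
  .sort_by(&:name)

# One training example: target label followed by id:value pairs.
svmlight_line = "+1 " + numbered.map { |f| "#{f.name}:#{f.value}" }.join(" ")
puts svmlight_line # prints "+1 1:1 3:2"
```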

#number_of_features ⇒ `Object`

Returns the number of distinct feature names the extractor was initialized with.

# File 'lib/basset/feature_extractor.rb', line 18

def number_of_features
  @feature_names.size
end