Paul Dix ([email protected])
This is Daniel DeLeo’s fork of Paul Dix’s basset, a library for machine learning.
Basset includes a generic document representation class, a feature selector, a feature extractor, naive bayes and SVM classifiers, and a classification evaluator for running tests. The goal is to create a general framework that is easy to modify for specific problems. It is designed to be extensible so it should be easy to add more classification and clustering algorithms.
I have added lots of tests (TATFT!), a document class for URIs, a high level classifier class, SVM support using libsvm-ruby-swig and an anomaly detector (one-class classifier). I have also modified the naive bayes classifier so that it does not require you to use the feature selector and extractor. This should help you to get encouraging results when you first start. Once you have a large training dataset, you can use the feature selector and extractor for better performance.
The most popular task is spam identification, though there are many others, such as document retrieval, security/intrusion detection, and applications in Biology and Astrophysics.
To build a classifier, you’ll first need a set of training documents. For a spam/non-spam classifier, this would consist of a number of documents which you have labeled as either spam or not. With training sets, bigger is better. You should probably have at least 100 of each type (spam and not spam). Really 1,000 of each type would be better and 10,000 of each would be super sweet.
The Classifier class takes care of all of the messy details for you. It defaults to a naive bayes classifier and using the Document (best for western natural language) class to represent documents.
# This example based on the song “Losing My Edge” by LCD Soundsystem
classifier = Basset::Classifier.new #default options
classifier.train(:hip, “turntables”, “techno music”, “DJs with turntables”, “techno DJs”) classifier.train(:unhip, “rock music”, “guitar bass drums”, “guitar rock”, “guitar players”) classifier.classify(“guitar music”) => :unhip
# now everyone likes rock music again! retrain the classifier fast!
classifier.train_iterative(:hip, “guitars”) # takes 3 iterations classifier.classify(“guitars”) => :hip
For more control over the various stages of the training and classification process, you can create document, feature selector, feature extractor, and document classifier objects directly. The process is as follows:
Create each as a Document (a class in this library)
Pass those documents into the FeatureSelector
Get the best features and pass those into the FeatureExtractor
Now extract features from each document using the extractor and
Pass those extracted features to NaiveBayes or Svm as part of the training set
Now you can save the FeatureExtractor and NaiveBayes (or Svm) to a file
That represents the process of selecting features and training the classifier. Once you’ve done that you can predict if a new previously unseen document is spam or not by just doing the following:
Load the feature extractor and document classifier from their files
Create a new document object from your new unseen document
Extract the features from that document using the feature extractor and
Pass those to the classify method of the naive bayes classifier
Something that you’ll probably want to do before doing real classification is to test things. Use the ClassificationEvaluator for this. Using the evaluator you can pass your training documents in and have it run through a series of tests to give you an estimate of how successful the classifier will be at predicting unseen documents. Easy classification tasks will generally be > 90% accurate while others can be much harder. Each classification task is different and most of the time you won’t know until you actually test it out.
I love machine learning and classification so if you have a problem that is giving you trouble don’t hesitate to get a hold of me. The same applies for anyone who wants to write additional classifiers, better document representations, or just to tell my my code is amateur.