Weka

Machine Learning & Data Mining with JRuby based on the Weka Java library.

Installation

Add this line to your application's Gemfile:

gem 'weka'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install weka

Usage

Start using Weka's Machine Learning and Data Mining algorithms by requiring the gem:

require 'weka'

The weka gem tries to carry over the namespaces defined in Weka and enhances some interfaces in order to allow a more Ruby-ish programming style when using the Weka library.

The idea behind keeping the namespaces is that you can also use the Weka documentation to look up functionality and classes.

Analogous to the Weka documentation, you can find the following namespaces:

  • Weka::Core – defines base classes for loading, saving, creating, and editing a dataset
  • Weka::Classifiers – defines classifier classes in different sub-modules (Bayes, Functions, Lazy, Meta, Rules, and Trees)
  • Weka::Filters – defines filter classes for processing datasets in the Supervised or Unsupervised, and Attribute or Instance sub-modules
  • Weka::Clusterers – defines clusterer classes
  • Weka::AttributeSelection – defines classes for selecting attributes from a dataset
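
The Ruby class names mirror the Java class names of the Weka library, so the Weka Javadoc maps directly onto the gem's constants. For example:

require 'weka'

# weka.classifiers.trees.RandomForest in Java becomes:
Weka::Classifiers::Trees::RandomForest

# weka.filters.unsupervised.attribute.Normalize in Java becomes:
Weka::Filters::Unsupervised::Attribute::Normalize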

Instances

Instances objects hold the dataset that is used to train a classifier or that should be classified based on training data.

Instances can be loaded from files and saved to files. Supported formats are ARFF, CSV, and JSON.

Loading Instances from a file

Instances can be loaded from ARFF, CSV, and JSON files:

instances = Weka::Core::Instances.from_arff('weather.arff')
instances = Weka::Core::Instances.from_csv('weather.csv')
instances = Weka::Core::Instances.from_json('weather.json')

Creating Instances

Attributes of an Instances object can be defined in a block using the with_attributes method. The class attribute can be set on the fly by passing the class_attribute: true option when defining an attribute.

# create instances with relation name 'weather' and attributes
instances = Weka::Core::Instances.new(relation_name: 'weather').with_attributes do
  nominal :outlook, values: ['sunny', 'overcast', 'rainy']
  numeric :temperature
  numeric :humidity
  nominal :windy, values: [true, false]
  date    :last_storm, 'yyyy-MM-dd'
  nominal :play, values: [:yes, :no], class_attribute: true
end

You can also pass an array of Attributes when instantiating new Instances. This is useful if you want to create a new, empty Instances object with the same attributes as an existing one:

# Take attributes from existing instances
attributes = instances.attributes

# create an empty Instances object with the given attributes
test_instances = Weka::Core::Instances.new(attributes: attributes)

Saving Instances as files

You can save Instances as an ARFF, CSV, or JSON file.

instances.to_arff('weather.arff')
instances.to_csv('weather.csv')
instances.to_json('weather.json')

Adding additional attributes

You can add additional attributes to the Instances after its initialization. All records that are already in the dataset will get an unknown value (?) for the new attribute.

instances.add_numeric_attribute(:pressure)
instances.add_nominal_attribute(:grandma_says, values: [:hm, :bad, :terrible])
instances.add_date_attribute(:last_rain, 'yyyy-MM-dd HH:mm')

Adding a data instance

You can add a data instance to the Instances by using the add_instance method:

data = [:sunny, 70, 80, true, '2015-12-06', :yes, 1.1, :hm, '2015-12-24 20:00']
instances.add_instance(data)

# with custom weight:
instances.add_instance(data, weight: 2.0)

Multiple instances can be added with the add_instances method:

data = [
  [:sunny, 70, 80, true, '2015-12-06', :yes, 1.1, :hm, '2015-12-24 20:00'],
  [:overcast, 80, 85, false, '2015-11-11', :no, 0.9, :bad, '2015-12-25 18:13']
]

instances.add_instances(data, weight: 2.0)

If the weight argument is not given, then a default weight of 1.0 is used. The weight in add_instances is used for all the added instances.

Setting a class attribute

You can set a previously defined attribute as the class attribute of the dataset. This allows classifiers to use the class for building a classification model during training.

instances.add_nominal_attribute(:size, values: ['L', 'XL'])
instances.class_attribute = :size

The added attribute can also be directly set as the class attribute:

instances.add_nominal_attribute(:size, values: ['L', 'XL'], class_attribute: true)

Keep in mind that only existing attributes can be assigned as the class attribute. Once set, the class attribute no longer appears in instances.attributes and can be accessed with the class_attribute method.
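
A minimal sketch of accessing the class attribute after setting :size above (the attribute's name method is assumed to be exposed from the underlying Java Attribute class):

instances.class_attribute       # => the :size attribute
instances.class_attribute.name  # => "size"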

Alias methods

Weka::Core::Instances has the following alias methods:

method                 alias
numeric                add_numeric_attribute
nominal                add_nominal_attribute
date                   add_date_attribute
string                 add_string_attribute
set_class_attribute    class_attribute=
with_attributes        add_attributes

The methods in the left column are meant to be used when defining attributes in a block passed to #with_attributes (or #add_attributes).

The alias methods are meant to be used for explicitly adding attributes to an Instances object or defining its class attribute later on.
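
For example, the following defines numeric attributes once with the short method name inside an attribute block and once explicitly via the alias method:

# inside a block, use the short method names:
instances = Weka::Core::Instances.new.with_attributes do
  numeric :temperature
end

# later on, use the explicit alias methods:
instances.add_numeric_attribute(:humidity)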

Filters

Filters are used to preprocess datasets.

There are two categories of filters which are also reflected by the namespaces:

  • supervised – The filter requires a class attribute to be set
  • unsupervised – A class attribute is not required to be present

In each category there are two sub-categories:

  • attribute-based – Attributes (columns) are processed
  • instance-based – Instances (rows) are processed

Thus, Filter classes are organized in the following four namespaces:

Weka::Filters::Supervised::Attribute
Weka::Filters::Supervised::Instance

Weka::Filters::Unsupervised::Attribute
Weka::Filters::Unsupervised::Instance

Filtering Instances

Filters can be used directly to filter Instances:

# create filter
filter = Weka::Filters::Unsupervised::Attribute::Normalize.new

# filter instances
filtered_data = filter.filter(instances)

You can also apply a Filter on an Instances object:

# create filter
filter = Weka::Filters::Unsupervised::Attribute::Normalize.new

# apply filter on instances
filtered_data = instances.apply_filter(filter)

With this approach, it is possible to chain multiple filters on a dataset:

# create filters
include Weka::Filters::Unsupervised::Attribute

normalize  = Normalize.new
discretize = Discretize.new

# apply a filter chain on instances
filtered_data = instances.apply_filter(normalize).apply_filter(discretize)

Setting Filter options

Every Filter has several options. You can print a description of all available options of a filter:

puts Weka::Filters::Unsupervised::Attribute::Normalize.options
# -S <num> The scaling factor for the output range.
#   (default: 1.0)
# -T <num>  The translation of the output range.
#   (default: 0.0)
# -unset-class-temporarily  Unsets the class index temporarily before the filter is
#   applied to the data.
#   (default: no)

To get the default option set of a Filter you can run .default_options:

Weka::Filters::Unsupervised::Attribute::Normalize.default_options
# => '-S 1.0 -T 0.0'

Options can be set while building a Filter:

filter = Weka::Filters::Unsupervised::Attribute::Normalize.build do
  use_options '-S 0.5'
end

Or they can be set or changed after you have created the Filter:

filter = Weka::Filters::Unsupervised::Attribute::Normalize.new
filter.use_options('-S 0.5')

Attribute selection

Selecting attributes (features) from a set of instances is important for getting the best results out of classification or clustering. Attribute selection reduces the number of attributes and can thereby speed up the runtime of the algorithms. It also avoids processing too many attributes when only a certain subset is essential for building a good model.

For attribute selection you need to apply a search and an evaluation method on a dataset.

Search methods are defined in the Weka::AttributeSelection::Search module. There are search methods for subset search and individual attribute search.

Evaluators are defined in the Weka::AttributeSelection::Evaluator module. Corresponding to the search method types, there are two evaluator types: one for subset evaluation and one for individual attribute evaluation.

The search methods and evaluators from each category can be combined to perform an attribute selection.

Classes for attribute subset selection:

Search:      BestFirst, GreedyStepwise
Evaluators:  CfsSubset, WrapperSubset

Classes for individual attribute selection:

Search:      Ranker
Evaluators:  CorrelationAttribute, GainRatioAttribute, InfoGainAttribute, OneRAttribute, ReliefFAttribute, SymmetricalUncertAttribute

An attribute selection can be performed either with the Weka::AttributeSelection::AttributeSelection class:

instances = Weka::Core::Instances.from_arff('weather.arff')

selection           = Weka::AttributeSelection::AttributeSelection.new
selection.search    = Weka::AttributeSelection::Search::Ranker.new
selection.evaluator = Weka::AttributeSelection::Evaluator::PrincipalComponents.new

selection.select_attributes(instances)
puts selection.summary

Or you can use the supervised AttributeSelection filter to directly filter instances:

instances = Weka::Core::Instances.from_arff('weather.arff')
search    = Weka::AttributeSelection::Search::Ranker.new
evaluator = Weka::AttributeSelection::Evaluator::PrincipalComponents.new

filter = Weka::Filters::Supervised::Attribute::AttributeSelection.build do
  use_search    search
  use_evaluator evaluator
end

filtered_instances = instances.apply_filter(filter)

Classifiers

Weka's classification and regression algorithms can be found in the Weka::Classifiers namespace.

The classifier classes are organised in the following submodules:

Weka::Classifiers::Bayes
Weka::Classifiers::Functions
Weka::Classifiers::Lazy
Weka::Classifiers::Meta
Weka::Classifiers::Rules
Weka::Classifiers::Trees

Getting information about a classifier

To get a description of a classifier class and its available options, you can use the class methods .description and .options on each classifier:

puts Weka::Classifiers::Trees::RandomForest.description
# Class for constructing a forest of random trees.
# For more information see:
# Leo Breiman (2001). Random Forests. Machine Learning. 45(1):5-32.

puts Weka::Classifiers::Trees::RandomForest.options
# -I <number of trees>  Number of trees to build.
#   (default 100)
# -K <number of features> Number of features to consider (<1=int(log_2(#predictors)+1)).
#   (default 0)
# ...

The default options that are used for a classifier can be displayed with:

Weka::Classifiers::Trees::RandomForest.default_options
# => "-I 100 -K 0 -S 1 -num-slots 1"

Creating a new classifier

To build a new classifier model based on training instances, you can use the following syntax:

instances = Weka::Core::Instances.from_arff('weather.arff')
instances.class_attribute = :play

classifier = Weka::Classifiers::Trees::RandomForest.new
classifier.use_options('-I 200 -K 5')
classifier.train_with_instances(instances)

You can also build a classifier by using the block syntax:

classifier = Weka::Classifiers::Trees::RandomForest.build do
  use_options '-I 200 -K 5'
  train_with_instances instances
end

Evaluating a classifier model

You can evaluate the trained classifier using cross-validation:

# default number of folds is 3
evaluation = classifier.cross_validate

# with a custom number of folds
evaluation = classifier.cross_validate(folds: 10)

The cross-validation returns a Weka::Classifiers::Evaluation object which can be used to get details about the accuracy of the trained classification model:

puts evaluation.summary
#
# Correctly Classified Instances          10               71.4286 %
# Incorrectly Classified Instances         4               28.5714 %
# Kappa statistic                          0.3778
# Mean absolute error                      0.4098
# Root mean squared error                  0.4657
# Relative absolute error                 87.4588 %
# Root relative squared error             96.2945 %
# Coverage of cases (0.95 level)         100      %
# Mean rel. region size (0.95 level)      96.4286 %
# Total Number of Instances               14

The evaluation holds detailed information about a number of different measures of interest, like precision and recall, the FP/FN/TP/TN rates, the F-Measure, and the areas under the PRC and ROC curves.
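
A minimal sketch of reading some of these measures, assuming the corresponding methods of Weka's Java Evaluation class are exposed in snake_case like the other wrapped methods:

evaluation.precision(0)        # precision for the first class value
evaluation.recall(0)           # recall for the first class value
evaluation.f_measure(0)        # F-Measure for the first class value
evaluation.area_under_roc(0)   # area under the ROC curve for the first class value
evaluation.pct_correct         # percentage of correctly classified instances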

If you want to evaluate your trained classifier against a set of test instances, you can use evaluate:

test_instances = Weka::Core::Instances.from_arff('test_data.arff')
test_instances.class_attribute = :play

evaluation = classifier.evaluate(test_instances)

Classifying new data

Each classifier implements a classify method, a distribution_for method, or both.

The classify method takes a Weka::Core::DenseInstance or an Array of values as argument and returns the predicted class value:

instances = Weka::Core::Instances.from_arff('unclassified_data.arff')

# with an instance as argument
instances.map do |instance|
  classifier.classify(instance)
end
# => ['no', 'yes', 'yes', ...]

# with an Array of values as argument
classifier.classify [:sunny, 80, 80, :FALSE, '?']
# => 'yes'

The distribution_for method takes a Weka::Core::DenseInstance or an Array of values as argument as well and returns a hash with the distributions per class value:

instances = Weka::Core::Instances.from_arff('unclassified_data.arff')

# with an instance as argument
classifier.distribution_for(instances.first)
# => { "yes" => 0.26, "no" => 0.74 }

# with an Array of values as argument
classifier.distribution_for [:sunny, 80, 80, :FALSE, '?']
# => { "yes" => 0.62, "no" => 0.38 }

Clusterers

Clustering is an unsupervised machine learning technique that tries to find patterns in data and group similar instances. Clustering algorithms work without class attributes.

Weka's clustering algorithms can be found in the Weka::Clusterers namespace.

The following clusterer classes are available:

Weka::Clusterers::Canopy
Weka::Clusterers::Cobweb
Weka::Clusterers::EM
Weka::Clusterers::FarthestFirst
Weka::Clusterers::HierarchicalClusterer
Weka::Clusterers::SimpleKMeans

Getting information about a clusterer

To get a description of a clusterer class and its available options, you can use the class methods .description and .options on each clusterer:

puts Weka::Clusterers::SimpleKMeans.description
# Cluster data using the k means algorithm.
# ...

puts Weka::Clusterers::SimpleKMeans.options
# -N <num>  Number of clusters.
#   (default 2).
# -init Initialization method to use.
#   0 = random, 1 = k-means++, 2 = canopy, 3 = farthest first.
#   (default = 0)
# ...

The default options that are used for a clusterer can be displayed with:

Weka::Clusterers::SimpleKMeans.default_options
# => "-init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25
#     -t2 -1.0 -N 2 -A weka.core.EuclideanDistance -R first-last -I 500 -num-slots 1 -S 10"

Creating a new Clusterer

To build a new clusterer model based on training instances you can use the following syntax:

instances = Weka::Core::Instances.from_arff('weather.arff')

clusterer = Weka::Clusterers::SimpleKMeans.new
clusterer.use_options('-N 3 -I 600')
clusterer.train_with_instances(instances)

You can also build a clusterer by using the block syntax:

clusterer = Weka::Clusterers::SimpleKMeans.build do
  use_options '-N 5 -I 600'
  train_with_instances instances
end

Evaluating a clusterer model

You can evaluate a trained density-based clusterer using cross-validation (at the moment, the only density-based clusterer in the Weka library is EM).

The cross-validation returns the cross-validated log-likelihood:

# default number of folds is 3
log_likelihood = clusterer.cross_validate
# => -10.556166997137497

# with a custom number of folds
log_likelihood = clusterer.cross_validate(folds: 10)
# => -10.262696653333032

If you want to evaluate your trained clusterer against a set of test instances, you can use evaluate. The evaluation returns a Weka::Clusterers::ClusterEvaluation object which can be used to get details about the accuracy of the trained clusterer model:

test_instances = Weka::Core::Instances.from_arff('test_data.arff')
evaluation     = clusterer.evaluate(test_instances)

puts evaluation.summary
# EM
# ==
#
# Number of clusters: 2
# Number of iterations performed: 7
#
#             Cluster
# Attribute           0       1
#                (0.35)  (0.65)
# ==============================
# outlook
#   sunny         3.8732  3.1268
#   overcast      1.7746  4.2254
#   rainy         2.1889  4.8111
#   [total]       7.8368 12.1632
# ...

Clustering new data

Similar to classifiers, clusterers come with either a cluster method or a distribution_for method, both of which take a Weka::Core::DenseInstance or an Array of values as argument.

The cluster method returns the index of the predicted cluster:

instances = Weka::Core::Instances.from_arff('unlabeled_data.arff')

clusterer = Weka::Clusterers::Canopy.build do
  train_with_instances instances
end

# with an instance as argument
instances.map do |instance|
  clusterer.cluster(instance)
end
# => [3, 3, 4, 0, 0, 1, 2, 3, 0, 0, 2, 2, 4, 1]

# with an Array of values as argument
clusterer.cluster [:sunny, 80, 80, :FALSE]
# => 4

The distribution_for method returns an Array with the distribution for each cluster, indexed by cluster:

# with an instance as argument
clusterer.distribution_for(instances.first)
# => [0.17229465277140552, 0.1675583309853506, 0.15089102301329346, 0.3274056122786787, 0.18185038095127165]

# with an Array of values as argument
clusterer.distribution_for [:sunny, 80, 80, :FALSE]
# => [0.21517055355632506, 0.16012256401406233, 0.17890840384466453, 0.2202344150907843, 0.2255640634941639]

Adding a cluster attribute to a dataset

After building and training a clusterer with training instances you can use the clusterer in the unsupervised attribute filter AddCluster to assign a cluster to each instance of a dataset:

filter = Weka::Filters::Unsupervised::Attribute::AddCluster.new
filter.clusterer = clusterer

instances = Weka::Core::Instances.from_arff('unlabeled_data.arff')
clustered_instances = instances.apply_filter(filter)

puts clustered_instances.to_s

clustered_instances now has a nominal cluster attribute as its last attribute. The values of the cluster attribute are the N cluster names, e.g. with N = 2 clusters, the ARFF representation looks like this:

...
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE,FALSE}
@attribute cluster {cluster1,cluster2}
...

Each instance is now assigned to a cluster, e.g.:

...
@data
sunny,85,85,FALSE,cluster1
sunny,80,90,TRUE,cluster1
...

Development

After checking out the repo, run bin/setup to install dependencies. To install this gem onto your local machine, run bundle exec rake install.

Then, run rake spec to run the tests. You can also run bin/console or rake irb for an interactive prompt that will allow you to experiment.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/paulgoetze/weka-jruby. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

For development we use the git branching model described by nvie.

Here's how to contribute:

  1. Fork it ( https://github.com/paulgoetze/weka-jruby/fork )
  2. Create your feature branch (git checkout -b feature/my-new-feature develop)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin feature/my-new-feature)
  5. Create a new Pull Request

Please try to add RSpec tests along with your new features. This will ensure that your code does not break existing functionality and that your feature is working as expected.

Acknowledgement

The original ideas for wrapping Weka in JRuby come from @arrigonialberto86 and his ruby-band gem. Great thanks!

License

The gem is available as open source under the terms of the MIT License.