Classifier

A Ruby library for text classification using Bayesian and Latent Semantic Indexing (LSI) algorithms.

Installation

Add to your Gemfile:

gem 'classifier'

Then run:

bundle install

Or install directly:

gem install classifier

Optional: GSL for Faster LSI

For significantly faster LSI operations, install the GNU Scientific Library.

Ruby 3+

The released `gsl` gem doesn't support Ruby 3+. Install from source:

```bash
# Install GSL library
brew install gsl                # macOS
apt-get install libgsl-dev      # Ubuntu/Debian

# Build and install the gem
git clone https://github.com/cardmagic/rb-gsl.git
cd rb-gsl
git checkout fix/ruby-3.4-compatibility
gem build gsl.gemspec
gem install gsl-*.gem
```

Ruby 2.x

```bash
# macOS
brew install gsl
gem install gsl

# Ubuntu/Debian
apt-get install libgsl-dev
gem install gsl
```

When GSL is installed, Classifier automatically uses it. To suppress the GSL notice:

SUPPRESS_GSL_WARNING=true ruby your_script.rb
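To check from Ruby whether the optional `gsl` gem can be loaded in the current environment, something like the following may help. It is a minimal sketch: `gsl_available?` is a hypothetical helper, not part of Classifier's API, and it does not show how Classifier itself detects GSL.

```ruby
# Hypothetical helper (not part of Classifier): reports whether the
# optional `gsl` gem can be required in this Ruby environment.
def gsl_available?
  require 'gsl'
  true
rescue LoadError
  # The gem (or the underlying GSL library) is not installed.
  false
end

puts gsl_available? ? 'gsl gem found' : 'gsl gem not found'
```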

Compatibility

| Ruby Version | Status            |
|--------------|-------------------|
| 4.0          | Supported         |
| 3.4          | Supported         |
| 3.3          | Supported         |
| 3.2          | Supported         |
| 3.1          | EOL (unsupported) |

Bayesian Classifier

Fast, accurate classification with modest memory requirements. Ideal for spam filtering, sentiment analysis, and content categorization.

Quick Start

require 'classifier'

classifier = Classifier::Bayes.new('Spam', 'Ham')

# Train the classifier
classifier.train_spam "Buy cheap viagra now! Limited offer!"
classifier.train_spam "You've won a million dollars! Claim now!"
classifier.train_ham "Meeting scheduled for tomorrow at 10am"
classifier.train_ham "Please review the attached document"

# Classify new text
classifier.classify "Congratulations! You've won a prize!"
# => "Spam"
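For intuition about what a Bayesian text classifier does with the training data, here is a small self-contained sketch of multinomial naive Bayes with add-one smoothing. `TinyBayes` is a hypothetical class written for illustration. It is not `Classifier::Bayes`, it ignores category priors, and the library's actual tokenization and scoring may differ.

```ruby
# Illustrative sketch only: a tiny multinomial naive Bayes with add-one
# smoothing. TinyBayes is hypothetical and is NOT Classifier::Bayes.
class TinyBayes
  def initialize(*categories)
    # Per-category word counts, e.g. @counts["Spam"]["won"] => 1
    @counts = categories.to_h { |category| [category, Hash.new(0)] }
  end

  def train(category, text)
    tokenize(text).each { |word| @counts.fetch(category)[word] += 1 }
  end

  def classify(text)
    vocab_size = @counts.values.flat_map(&:keys).uniq.size
    words = tokenize(text)
    best_category, = @counts.max_by do |_category, word_counts|
      total = word_counts.values.sum
      # Sum of log P(word | category), smoothed so unseen words are not zero
      words.sum { |w| Math.log((word_counts[w] + 1.0) / (total + vocab_size)) }
    end
    best_category
  end

  private

  # Assumption: lowercase words made of letters and apostrophes only
  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

tiny = TinyBayes.new('Spam', 'Ham')
tiny.train 'Spam', "Buy cheap viagra now! Limited offer!"
tiny.train 'Spam', "You've won a million dollars! Claim now!"
tiny.train 'Ham', "Meeting scheduled for tomorrow at 10am"
tiny.train 'Ham', "Please review the attached document"

puts tiny.classify("Congratulations! You've won a prize!")
# => Spam
```

The query shares "you've", "won", and "a" with the spam examples and nothing with the ham examples, so the summed log-probabilities favor "Spam".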

Persistence with Madeleine

require 'classifier'
require 'madeleine'

m = SnapshotMadeleine.new("classifier_data") {
  Classifier::Bayes.new('Interesting', 'Uninteresting')
}

m.system.train_interesting "fascinating article about science"
m.system.train_uninteresting "boring repetitive content"
m.take_snapshot

# Later, restore and use:
m.system.classify "new scientific discovery"
# => "Interesting"

LSI (Latent Semantic Indexing)

Semantic analysis using Singular Value Decomposition (SVD). More flexible than Bayesian classifiers, providing search, clustering, and classification based on meaning rather than just keywords.

Quick Start

require 'classifier'

lsi = Classifier::LSI.new

# Add documents with categories
lsi.add_item "Dogs are loyal pets that love to play fetch", :pets
lsi.add_item "Cats are independent and love to nap", :pets
lsi.add_item "Ruby is a dynamic programming language", :programming
lsi.add_item "Python is great for data science", :programming

# Classify new text
lsi.classify "My puppy loves to run around"
# => :pets

# Get classification with confidence score
lsi.classify_with_confidence "Learning to code in Ruby"
# => [:programming, 0.89]

Search and Discovery

# Find similar documents
lsi.find_related "Dogs are great companions", 2
# => ["Dogs are loyal pets that love to play fetch", "Cats are independent..."]

# Search by keyword
lsi.search "programming", 3
# => ["Ruby is a dynamic programming language", "Python is great for..."]
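To illustrate the general idea of ranking documents by similarity, the sketch below scores documents against a query using cosine similarity over raw term counts. It is a simplification: LSI compares documents after reducing the term space with SVD, and that step is omitted here. All method names are hypothetical and none of this is the library's implementation.

```ruby
# Illustrative sketch only: ranks documents by cosine similarity of raw
# term-count vectors. The SVD step that LSI relies on is omitted.
# All names below are hypothetical, not part of Classifier.
def term_counts(text)
  text.downcase.scan(/[a-z]+/).tally
end

def cosine_similarity(a, b)
  dot = a.sum { |term, count| count * b.fetch(term, 0) }
  norm = ->(vector) { Math.sqrt(vector.values.sum { |c| c * c }) }
  denominator = norm.call(a) * norm.call(b)
  denominator.zero? ? 0.0 : dot / denominator
end

# Returns the `limit` documents most similar to `query`, best first.
def rank_by_similarity(query, documents, limit)
  query_vector = term_counts(query)
  documents.max_by(limit) { |doc| cosine_similarity(query_vector, term_counts(doc)) }
end

docs = [
  "Dogs are loyal pets that love to play fetch",
  "Cats are independent and love to nap",
  "Ruby is a dynamic programming language",
  "Python is great for data science"
]

puts rank_by_similarity("Dogs are great companions", docs, 1)
```

Because this version matches only literal words, it cannot relate "puppy" to "dogs". Capturing that kind of relationship is what the SVD step in LSI is for.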

Performance

GSL vs Native Ruby

GSL speeds up LSI operations substantially, most of all `build_index` (the SVD computation), and the gain grows with the number of documents:

| Documents | build_index | Overall |
|-----------|-------------|---------|
| 5         | 4x faster   | 2.5x    |
| 10        | 24x faster  | 5.5x    |
| 15        | 116x faster | 17x     |

Detailed benchmark (15 documents)

```
Operation            Native          GSL      Speedup
----------------------------------------------------------
build_index          0.1412       0.0012       116.2x
classify             0.0142       0.0049         2.9x
search               0.0102       0.0026         3.9x
find_related         0.0069       0.0016         4.2x
----------------------------------------------------------
TOTAL                0.1725       0.0104        16.6x
```
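
The Speedup column is the native time divided by the GSL time. The short sketch below recomputes it from the times printed in the detailed benchmark. Because those times are rounded to four decimal places, the recomputed ratios can differ slightly from the printed column (for example, `build_index` comes out near 117.7x rather than 116.2x).

```ruby
# Recompute speedup as native time / GSL time, using the rounded times
# printed in the detailed benchmark above.
timings = {
  'build_index'  => [0.1412, 0.0012],
  'classify'     => [0.0142, 0.0049],
  'search'       => [0.0102, 0.0026],
  'find_related' => [0.0069, 0.0016],
  'TOTAL'        => [0.1725, 0.0104]
}

speedups = timings.transform_values { |native, gsl| native / gsl }
speedups.each { |operation, ratio| puts format('%-13s %.1fx', operation, ratio) }
```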

Running Benchmarks

rake benchmark              # Run with current configuration
rake benchmark:compare      # Compare GSL vs native Ruby

Development

Setup

git clone https://github.com/cardmagic/classifier.git
cd classifier
bundle install

Running Tests

rake test                        # Run all tests
ruby -Ilib test/bayes/bayesian_test.rb  # Run specific test file

# Test without GSL (pure Ruby)
NATIVE_VECTOR=true rake test

Console

rake console

Contributing

  1. Fork the repository
  2. Create your feature branch (`git checkout -b feature/amazing-feature`)
  3. Commit your changes (`git commit -am 'Add amazing feature'`)
  4. Push to the branch (`git push origin feature/amazing-feature`)
  5. Open a Pull Request

Authors

License

This library is released under the GNU Lesser General Public License (LGPL) 2.1.