DBClustering

Please note that this gem is still in its very early stages and should not be considered stable. It currently supports only the in-memory datasource adapter. An ActiveRecord adapter is planned for a future version but is not yet implemented. Stay tuned.

Requirements

Ruby 2.1 or later is required. Earlier Rubies may work but are not officially supported.

Getting Started

This gem was developed to work best in Ruby on Rails projects.

  1. Add this gem to your Gemfile:

    gem 'db_clustering'
    
  2. Run bundle install in your terminal

  3. Implement the clustering_vector method in your model class and return either:

    • an array with numeric values for similarity comparison
    • a hash with numeric values for similarity comparison between keys existing in both hashes

See TestModel class within the spec/support directory for a very simple example.
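
As a rough illustration of step 3, a model might expose its numeric attributes as shown below. The Customer and TaggedArticle classes and their attributes are hypothetical and not part of the gem. In a Rails project the method would live in your model class instead of a plain Ruby class.

```ruby
# Hypothetical plain-Ruby model; in Rails this would be one of your model classes.
class Customer
  attr_reader :age, :orders_count, :average_basket

  def initialize(age:, orders_count:, average_basket:)
    @age = age
    @orders_count = orders_count
    @average_basket = average_basket
  end

  # Array variant: every object returns the same attributes in the same order,
  # so values at the same position can be compared with each other.
  def clustering_vector
    [age, orders_count, average_basket]
  end
end

# Hypothetical model using the hash variant, e.g. for sparse data such as tag counts.
class TaggedArticle
  def initialize(tag_counts)
    @tag_counts = tag_counts
  end

  # Hash variant: only keys that exist in both compared hashes are taken into account.
  def clustering_vector
    @tag_counts
  end
end

customer = Customer.new(age: 34, orders_count: 12, average_basket: 59.9)
customer.clustering_vector # => [34, 12, 59.9]

article = TaggedArticle.new('ruby' => 3, 'rails' => 1)
article.clustering_vector  # => {"ruby"=>3, "rails"=>1}
```

If the attributes live on very different scales (an age next to a price, for example), it may be worth normalizing them before returning the vector, since most distance metrics are otherwise dominated by the largest values.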

  4. Decide on a distance metric and initialize it, e.g.:
   average_difference = DbClustering::DistanceMetrics::AverageDifference.new

   # Alternatively, you can use one of the following:
   cosine_similarity = DbClustering::DistanceMetrics::CosineSimilarity.new
   euclidean_distance = DbClustering::DistanceMetrics::EuclideanDistance.new
   pearson_correlation = DbClustering::DistanceMetrics::PearsonCorrelation.new
  5. Decide on a datasource adapter (currently only the in-memory datasource is available), e.g.:
    in_memory_datasource = DbClustering::DatasourceAdapters::InMemory.new(array: your_array)

Please note that your_array should be an array of objects whose class implements the clustering_vector method from step 3.

An ActiveRecord datasource type is planned but not yet implemented. Please stay tuned.

  6. Decide on an algorithm and initialize it:
   dbscan = DbClustering::Algorithms::Dbscan.new(datasource: in_memory_datasource, distance_metric: average_difference)

Please note that currently only one algorithm is available. More algorithms aren't currently planned but may be added if needed. Contributions are welcome, of course.

  7. Decide on the algorithm parameters and start clustering your data:
   dbscan.cluster(max_distance: 10, min_neighbors: 5)

The max_distance is the epsilon parameter and min_neighbors is the minPts parameter from the usual DBSCAN algorithm documentation (e.g. Wikipedia). You might want to try different values before settling on the right ones for your purpose.
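
To make the two parameters more concrete, here is a small sketch of the neighborhood check at the heart of DBSCAN, written in plain Ruby and independent of this gem. The neighbor_indices and core_point? helpers are made up for illustration. Whether a point counts as its own neighbor varies between DBSCAN descriptions; this sketch assumes it does not, and the gem's internals may differ.

```ruby
# Illustrative only: one-dimensional points with the absolute difference as distance.
points = [1.0, 1.5, 2.0, 2.2, 9.0]
distance = ->(a, b) { (a - b).abs }

# Indices of all other points within max_distance (epsilon) of the point at `index`.
def neighbor_indices(index, points, distance, max_distance)
  points.each_index.select do |other|
    other != index && distance.call(points[index], points[other]) <= max_distance
  end
end

# A point is a core point if it has at least min_neighbors (minPts) neighbors.
def core_point?(index, points, distance, max_distance:, min_neighbors:)
  neighbor_indices(index, points, distance, max_distance).size >= min_neighbors
end

core = points.each_index.select do |i|
  core_point?(i, points, distance, max_distance: 0.6, min_neighbors: 2)
end
core # => [1, 2]

# 1.0 and 2.2 are not core points themselves but lie within reach of one,
# which makes them edge points; 9.0 has no neighbors at all and is noise.
```

Raising max_distance or lowering min_neighbors turns more points into core points and tends to merge clusters, while the opposite produces more noise points.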

Please also note that the max_distance value depends heavily on the metric you chose. For the AverageDifference and EuclideanDistance metrics it can be any positive value with no upper bound. For the CosineSimilarity and PearsonCorrelation metrics it must be a value between 0 and 2, where 0 means "100% positive correlation/similarity", 1 means "no correlation/similarity at all" and 2 means "100% negative correlation/similarity". Any decimal value in between (e.g. 0.25) expresses a partly positive or negative correlation.
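
These value ranges follow naturally if a similarity in the range -1 to 1 is turned into a distance as 1 minus the similarity. The plain-Ruby sketch below illustrates that idea next to an open-ended metric. It is an assumption about how such a conversion can work, not a copy of the gem's implementation, and the method names are made up.

```ruby
# Euclidean distance: 0 for identical vectors, no upper bound.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).map { |x, y| (x - y)**2 }.reduce(0.0, :+))
end

# Cosine similarity lies between -1 and 1, so 1 - similarity lies between 0 and 2.
# Note: a zero vector has no direction and would yield NaN here.
def cosine_distance(a, b)
  dot = a.zip(b).map { |x, y| x * y }.reduce(0.0, :+)
  norm_a = Math.sqrt(a.map { |x| x * x }.reduce(0.0, :+))
  norm_b = Math.sqrt(b.map { |x| x * x }.reduce(0.0, :+))
  1.0 - dot / (norm_a * norm_b)
end

euclidean_distance([0, 0], [3, 4]) # => 5.0
cosine_distance([1, 0], [1, 0])    # => 0.0 (same direction)
cosine_distance([1, 0], [0, 1])    # => 1.0 (orthogonal, no similarity)
cosine_distance([1, 0], [-1, 0])   # => 2.0 (opposite direction)
```

Under this reading, a max_distance of 0.25 with a correlation-style metric would group only objects whose vectors point in largely the same direction, regardless of their magnitude.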

  8. Wait for the calculations to finish and use the results the way you want:
   clusters = dbscan.clusters # the resulting Clusters, each cluster contains Points
   first_cluster = clusters.first
   point = first_cluster.points.first
   # a point knows its cluster, and its position in there
   point.cluster # will return the same object as `first_cluster`
   point.is_edge_point? # boolean specifying if it's an edge point of its cluster
   point.is_core_point? # boolean specifying if it's a core point of its cluster
   point.is_noise_point? # boolean specifying if it's a noise point without a cluster

   # a point also contains the source object implementing the `clustering_vector` method
   your_model = point.datasource_point

For more details, have a look at the underlying models in the lib/models directory as well as the corresponding specs.

That's it. It looks more complicated than it actually is, so just try it out! You can find complete usage examples in the spec/algorithms/density_based/dbscan_spec.rb file.

Contributing

Contributions are welcome. Please fork this project, make your changes and file a pull request. Please also make sure to write tests to ensure your changes persist over time.

License

This gem is released under the MIT License.