This is a work in progress.
Introduction
This library is a Ruby extension, a wrapper around the Aho-Corasick implementation in C found in the Strmat package.
The source code (ac.c and ac.h) was “adapted” from Strmat. In fact, I’ve changed only 3-4 lines of code from the original implementation so that it fits my needs: the search needed to return the current position in the searched string.
Okay, so what’s the idea?
Given a dictionary of known sentences (note: not words!), this kick-ass algorithm can find individual patterns in an incoming stream of data. Kinda fast.
The algorithm has 2 stages: in the first, an internal tree is built from the given dictionary; the second does the actual search.
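To make the two stages a bit more concrete, here is a rough pure-Ruby sketch of the textbook Aho-Corasick idea. It is not the Strmat C code and not this gem's API: `SketchTree`, `Node` and every method name below are made up for illustration. The result hashes just borrow the `:value`, `:starts_at` and `:ends_at` keys from the usage example further down, assuming `:ends_at` is exclusive.

```ruby
# Illustrative pure-Ruby sketch of the two stages. NOT the Strmat C code
# and NOT this gem's API: SketchTree and its methods are made-up names.
class SketchTree
  class Node
    attr_reader :children, :outputs
    attr_accessor :fail

    def initialize
      @children = {}  # char => Node
      @outputs  = []  # keywords that end at this node
      @fail     = nil # where to fall back when no child matches
    end
  end

  # Stage 1: build the trie from the dictionary, then wire up failure links.
  def initialize(dictionary)
    @root = Node.new
    dictionary.each { |keyword| insert(keyword) }
    build_failure_links
  end

  # Stage 2: walk the text once, following failure links on a mismatch.
  def find_all(text)
    results = []
    node = @root
    text.each_char.with_index do |ch, i|
      node = node.fail until node.equal?(@root) || node.children.key?(ch)
      node = node.children.fetch(ch, @root)
      node.outputs.each do |keyword|
        results << { value: keyword,
                     starts_at: i - keyword.length + 1,
                     ends_at: i + 1 } # exclusive end (assumption)
      end
    end
    results
  end

  private

  def insert(keyword)
    return if keyword.empty?
    node = @root
    keyword.each_char { |ch| node = (node.children[ch] ||= Node.new) }
    node.outputs << keyword unless node.outputs.include?(keyword)
  end

  # Breadth-first pass: each node's failure link points at the node for the
  # longest proper suffix of its path that is also a path in the trie.
  def build_failure_links
    queue = @root.children.values
    queue.each { |child| child.fail = @root }
    until queue.empty?
      current = queue.shift
      current.children.each do |ch, child|
        fallback = current.fail
        fallback = fallback.fail until fallback.equal?(@root) || fallback.children.key?(ch)
        child.fail = fallback.children.fetch(ch, @root)
        child.outputs.concat(child.fail.outputs) # inherit shorter matches
        queue << child
      end
    end
  end
end

tree = SketchTree.new(["foo-- Z@!bar", "cervantes"])
tree.find_all("1011000129 foo-- Z@!bar761 ! 001211 6xU")
# => [{:value=>"foo-- Z@!bar", :starts_at=>11, :ends_at=>23}]
```

The point of the failure links is that the text is scanned only once, no matter how many entries the dictionary has. The real extension does this work in C, so treat the sketch as a reading aid only.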
Okay, so where can I use this?
Well, you can do some crazy things with it: you can look up DNA patterns, or maybe analyze network sequences (read: strange and maybe proprietary network protocols), or do domestic stuff like building contextual links in your blog posts to enrich your users’ experience.
Okay, so how can I install it?
Rubygems – Development Version
gem install aurelian-ruby-ahocorasick --source=http://gems.github.com
Build it from source
$ git clone git://github.com/aurelian/ruby-ahocorasick.git
$ cd ruby-ahocorasick
To build and install the gem on your machine (run with sudo if needed):
$ rake install
rake -T
will list other cool tasks.
Rubygems – Stable Version
There’s no stable version right now.
Notes
It’s known to work / compile / install on Ubuntu 8.04 and Mac OS 10.4.*. It should work out of the box if you have gcc around. Unfortunately, I have neither a Windows PC nor the required knowledge about Microsoft compilers.
Okay, so how do I use it?
require 'ahocorasick'
keyword_tree= AhoCorasick::KeywordTree.new # creates a new tree
keyword_tree.add_string( "foo-- Z@!bar" ) # adds a keyword to the tree
keyword_tree.add_string( "cervantes" ) # even more
results= keyword_tree.find_all( "1011000129 foo-- Z@!bar761 ! 001211 6xU" ).each do | result |
result[:value] # => "foo-- Z@!bar"
result[:starts_at] # => 11
result[:ends_at] # => 23
result[:id] # => 1
end
You can get some API reference on the wiki.
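As a small taste of the “contextual links” idea mentioned earlier, here is a sketch that turns results into HTML links. It only assumes an array of hashes shaped like the ones in the example above (`:value`, `:starts_at`, `:ends_at`, with `:ends_at` exclusive). The `linkify` helper and the `/tags/` base URL are hypothetical, not part of this library, and the sketch treats the offsets as character positions, which holds for plain ASCII text.

```ruby
require 'cgi'

# Hypothetical helper (not part of the gem): wraps every match in a link.
# `results` is assumed to look like the hashes yielded by find_all above.
def linkify(text, results, base_url = "/tags/")
  parts  = []
  cursor = 0
  results.sort_by { |r| r[:starts_at] }.each do |r|
    next if r[:starts_at] < cursor # skip matches overlapping an earlier link
    parts << CGI.escapeHTML(text[cursor...r[:starts_at]])
    label = text[r[:starts_at]...r[:ends_at]]
    parts << %(<a href="#{base_url}#{CGI.escape(label)}">#{CGI.escapeHTML(label)}</a>)
    cursor = r[:ends_at]
  end
  parts << CGI.escapeHTML(text[cursor..-1])
  parts.join
end

# Hand-written results, standing in for what find_all would return.
results = [{ :value => "cervantes", :starts_at => 5, :ends_at => 14, :id => 2 }]
linkify("read cervantes today", results)
# => "read <a href=\"/tags/cervantes\">cervantes</a> today"
```

Overlapping matches are simply dropped in favour of the earliest one; a real blog would probably want a smarter rule.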
Bugs? Suggestions? Ideas? Patches?
For now, just use the email address.
Additional Reading
Other suffix – tree implementations:
License
© 2008 – Aurelian Oancea, < oancea at gmail dot com >
released under MIT-LICENCE