Class: Ferret::Analysis::RegExpTokenizer

Inherits:
Object
Defined in:
ext/r_analysis.c

Overview

Summary

A tokenizer that recognizes tokens based on a regular expression passed to the constructor. Each match of the expression in the input becomes one token, so most simple tokenizers can be built with this class by choosing a suitable pattern.

Example

Below is a simple implementation of a LetterTokenizer using a RegExpTokenizer. Here a token is a maximal run of alphabetic characters; tokens are separated by one or more non-alphabetic characters, which are discarded.

# of course you would add more than just é
RegExpTokenizer.new(input, /[[:alpha:]é]+/)

"Dave's résumé, at http://www.davebalmain.com/ 1234"
  => ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
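
To check a candidate pattern before handing it to the tokenizer, the matching behaviour described above can be approximated in plain Ruby with String#scan. This is only a sketch and does not use Ferret: the helper name letter_tokens is hypothetical, it returns bare strings rather than Ferret token objects, and it assumes a current Ruby with UTF-8 strings, where [[:alpha:]] already matches accented letters such as é.

```ruby
# Plain-Ruby approximation of the matching step; Ferret is not used here.
# LETTERS and letter_tokens are hypothetical names for illustration only.
LETTERS = /[[:alpha:]]+/

# Returns every maximal run of alphabetic characters in +text+.
# Non-alphabetic characters (digits, punctuation, whitespace) act as separators.
def letter_tokens(text)
  text.scan(LETTERS)
end

# Prints the matched strings so they can be compared with the list above.
p letter_tokens("Dave's résumé, at http://www.davebalmain.com/ 1234")
```

If the printed list matches the tokens you expect, the same pattern can be passed as the second argument to RegExpTokenizer.new as shown in the example.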