Class: Ferret::Analysis::StandardTokenizer

Inherits:
RegExpTokenizer
Defined in:
lib/ferret/analysis/standard_tokenizer.rb

Overview

The standard tokenizer is an advanced tokenizer that tokenizes most words correctly and also recognizes email addresses, web addresses, phone numbers, etc. as single tokens.

Constant Summary

ALPHA =
/[[:alpha:]]+/
APOSTROPHE =
/#{ALPHA}('#{ALPHA})+/
ACRONYM =
/#{ALPHA}\.(#{ALPHA}\.)+/
P =
/[_\/.,-]/
HASDIGIT =
/\w*\d\w*/
TOKEN_RE =
/[[:alpha:]]+(('[[:alpha:]]+)+
             |\.([[:alpha:]]\.)+
             |(@|\&)\w+([-.]\w+)*
             )
|\w+(([\-._]\w+)*\@\w+([-.]\w+)+
    |#{P}#{HASDIGIT}(#{P}\w+#{P}#{HASDIGIT})*(#{P}\w+)?
    |(\.\w+)+
    |
    )
/x
ACRONYM_WORD =
/^#{ACRONYM}$/
APOSTROPHE_WORD =
/^#{APOSTROPHE}$/
DOT =
/\./
APOSTROPHE_S =
/'[sS]$/
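
The constants above can be exercised without loading Ferret. The following is a minimal sketch in plain Ruby: the regular expressions are copied from the summary, while `scan_tokens` and `normalize_sketch` are hypothetical helpers introduced only for illustration. This page does not show how the class applies ACRONYM_WORD, APOSTROPHE_WORD, DOT and APOSTROPHE_S, so the normalization step below is an assumption about their likely purpose (removing the dots from acronyms and a trailing 's from apostrophe words), not the library's confirmed behavior.

```ruby
# Constants copied from the Constant Summary.
ALPHA      = /[[:alpha:]]+/
APOSTROPHE = /#{ALPHA}('#{ALPHA})+/
ACRONYM    = /#{ALPHA}\.(#{ALPHA}\.)+/
P          = /[_\/.,-]/
HASDIGIT   = /\w*\d\w*/
TOKEN_RE   = /[[:alpha:]]+(('[[:alpha:]]+)+
                          |\.([[:alpha:]]\.)+
                          |(@|\&)\w+([-.]\w+)*
                          )
             |\w+(([\-._]\w+)*\@\w+([-.]\w+)+
                 |#{P}#{HASDIGIT}(#{P}\w+#{P}#{HASDIGIT})*(#{P}\w+)?
                 |(\.\w+)+
                 |
                 )
             /x
ACRONYM_WORD    = /^#{ACRONYM}$/
APOSTROPHE_WORD = /^#{APOSTROPHE}$/
DOT             = /\./
APOSTROPHE_S    = /'[sS]$/

# Hypothetical helper: collect every full match of TOKEN_RE in the text.
# The block form of String#scan is used because TOKEN_RE contains groups,
# and without a block scan would return the groups, not the whole match.
def scan_tokens(text)
  tokens = []
  text.scan(TOKEN_RE) { tokens << Regexp.last_match[0] }
  tokens
end

# Hypothetical helper (assumed usage of the *_WORD constants):
# remove the dots from acronyms and a trailing 's from apostrophe words.
def normalize_sketch(term)
  if term =~ ACRONYM_WORD
    term.gsub(DOT, "")
  elsif term =~ APOSTROPHE_WORD
    term.sub(APOSTROPHE_S, "")
  else
    term
  end
end

p scan_tokens("Email dave@example.com")  # => ["Email", "dave@example.com"]
p scan_tokens("www.example.com")         # => ["www.example.com"]
p scan_tokens("2004-10-31")              # => ["2004-10-31"]
p normalize_sketch("U.S.A.")             # => "USA"
p normalize_sketch("O'Reilly's")         # => "O'Reilly"
```

Plain words fall through to the empty final alternative of TOKEN_RE, which is why a bare `\w+` run such as "Email" still matches as a token.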

Method Summary

Methods inherited from RegExpTokenizer

#close, #initialize, #next

Methods inherited from Tokenizer

#close

Methods inherited from TokenStream

#close, #each, #next
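
The inherited #next, #each and #close methods suggest a pull-style stream interface. The sketch below illustrates that shape with a hypothetical `SketchTokenizer` class built on Ruby's standard `StringScanner`. It is not Ferret's implementation: the simplified `WORD` pattern, the plain-string return values, and the convention that `next` returns nil when the input is exhausted are all assumptions made for illustration.

```ruby
require "strscan"

# Hypothetical stand-in for a regexp-driven token stream.
class SketchTokenizer
  include Enumerable

  WORD = /\w+/ # simplified pattern used only in this sketch

  def initialize(input)
    @scanner = StringScanner.new(input)
  end

  # Return the next token text, or nil once the input is exhausted
  # or the stream has been closed (assumed convention).
  def next
    return nil if @scanner.nil?
    @scanner.scan_until(WORD) ? @scanner.matched : nil
  end

  # Yield each remaining token in order, built on top of #next.
  def each
    while (token = self.next)
      yield token
    end
  end

  # Release the scanner; later calls to #next return nil.
  def close
    @scanner = nil
  end
end

stream = SketchTokenizer.new("one two three")
stream.each { |token| puts token }
stream.close
```

Building #each on top of #next keeps a single scanning code path, so iterating with a block and pulling tokens one at a time cannot drift apart.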

Constructor Details

This class inherits a constructor from Ferret::Analysis::RegExpTokenizer