Class: Ferret::Analysis::StandardTokenizer
- Inherits: RegExpTokenizer
- Object
- TokenStream
- Tokenizer
- RegExpTokenizer
- Ferret::Analysis::StandardTokenizer
- Defined in: lib/ferret/analysis/standard_tokenizer.rb
Overview
The standard tokenizer is an advanced tokenizer that tokenizes most words correctly and also recognizes things like email addresses, web addresses, phone numbers, etc.
Constant Summary
- ALPHA = /[[:alpha:]]+/
- APOSTROPHE = /#{ALPHA}('#{ALPHA})+/
- ACRONYM = /#{ALPHA}\.(#{ALPHA}\.)+/
- P = /[_\/.,-]/
- HASDIGIT = /\w*\d\w*/
- TOKEN_RE = /[[:alpha:]]+(('[[:alpha:]]+)+ |\.([[:alpha:]]\.)+ |(@|\&)\w+([-.]\w+)* ) |\w+(([\-._]\w+)*\@\w+([-.]\w+)+ |#{P}#{HASDIGIT}(#{P}\w+#{P}#{HASDIGIT})*(#{P}\w+)? |(\.\w+)+ | ) /x
- ACRONYM_WORD = /^#{ACRONYM}$/
- APOSTROPHE_WORD = /^#{APOSTROPHE}$/
- DOT = /\./
- APOSTROPHE_S = /'[sS]$/
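The constants above can be tried out without Ferret installed. The following is a minimal sketch, not the library's implementation: the module name `StandardTokenizerSketch` and the helpers `raw_tokens`, `normalize`, and `tokenize` are hypothetical, and the normalization step (collapsing acronyms with DOT, dropping a trailing 's with APOSTROPHE_S) is an assumption about how ACRONYM_WORD and APOSTROPHE_WORD are meant to be used, since this page does not show the methods. TOKEN_RE is the same pattern as above, only spread over several lines, which the /x flag permits.

```ruby
# Stand-alone sketch; the module and method names are hypothetical.
module StandardTokenizerSketch
  # Constants copied from the Constant Summary.
  ALPHA      = /[[:alpha:]]+/
  APOSTROPHE = /#{ALPHA}('#{ALPHA})+/
  ACRONYM    = /#{ALPHA}\.(#{ALPHA}\.)+/
  P          = /[_\/.,-]/
  HASDIGIT   = /\w*\d\w*/
  TOKEN_RE   = /[[:alpha:]]+(('[[:alpha:]]+)+
                            |\.([[:alpha:]]\.)+
                            |(@|\&)\w+([-.]\w+)*
                            )
               |\w+(([\-._]\w+)*\@\w+([-.]\w+)+
                   |#{P}#{HASDIGIT}(#{P}\w+#{P}#{HASDIGIT})*(#{P}\w+)?
                   |(\.\w+)+
                   |
                   )
               /x
  ACRONYM_WORD    = /^#{ACRONYM}$/
  APOSTROPHE_WORD = /^#{APOSTROPHE}$/
  DOT             = /\./
  APOSTROPHE_S    = /'[sS]$/

  # Collect every full match of TOKEN_RE. The pattern contains capture
  # groups, so String#scan would yield groups; read the whole match instead.
  def self.raw_tokens(text)
    tokens = []
    text.scan(TOKEN_RE) { tokens << Regexp.last_match(0) }
    tokens
  end

  # Assumed normalization: "U.S.A." becomes "USA", "Dave's" becomes "Dave".
  def self.normalize(token)
    if token =~ ACRONYM_WORD
      token.gsub(DOT, '')
    elsif token =~ APOSTROPHE_WORD
      token.sub(APOSTROPHE_S, '')
    else
      token
    end
  end

  def self.tokenize(text)
    raw_tokens(text).map { |token| normalize(token) }
  end
end

text = "Dave's email is dave@example.com and U.S.A. ships 3.14"

p StandardTokenizerSketch.raw_tokens(text)
# => ["Dave's", "email", "is", "dave@example.com", "and", "U.S.A.", "ships", "3.14"]

p StandardTokenizerSketch.tokenize(text)
# => ["Dave", "email", "is", "dave@example.com", "and", "USA", "ships", "3.14"]
```

In this sample the email address and the decimal number each survive as a single token, which is the behaviour the overview describes. No lowercasing is applied here; whether the real tokenizer or a later filter does that is not stated on this page.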
Method Summary
Methods inherited from RegExpTokenizer
Methods inherited from Tokenizer
Methods inherited from TokenStream
Constructor Details
This class inherits a constructor from Ferret::Analysis::RegExpTokenizer