Class: Ferret::Analysis::RegExpTokenizer
- Inherits: Tokenizer
  - Object
  - TokenStream
  - Tokenizer
  - Ferret::Analysis::RegExpTokenizer
- Defined in:
- lib/ferret/analysis/tokenizers.rb
Overview
An abstract base class for simple regular-expression-based tokenizers. Very powerful tokenizers can be built on this class, as the StandardTokenizer class shows. Below is an example of a simple LetterTokenizer implemented with a RegExpTokenizer. Basically, a token is a sequence of alphabetic characters separated by one or more non-alphabetic characters.
class LetterTokenizer < RegExpTokenizer
  def token_re()
    /[[:alpha:]]+/
  end
end
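As a self-contained sketch of the idea above (the `SimpleToken` struct and `letter_tokens` helper are stand-ins for illustration, not part of Ferret's API), the same scanning loop can be exercised with Ruby's standard StringScanner:

```ruby
require 'strscan'

# Hypothetical stand-in for Ferret's Token class, for illustration only.
SimpleToken = Struct.new(:text, :start, :end_off)

# Minimal re-creation of the LetterTokenizer behaviour: repeatedly scan
# for runs of alphabetic characters and record each run's offsets.
def letter_tokens(input)
  ss = StringScanner.new(input)
  tokens = []
  while ss.scan_until(/[[:alpha:]]+/)
    term = ss.matched
    term_end = ss.pos
    tokens << SimpleToken.new(term, term_end - term.size, term_end)
  end
  tokens
end

letter_tokens("One, two; THREE!").each do |t|
  puts "#{t.text} [#{t.start},#{t.end_off}]"
end
# prints:
# One [0,3]
# two [5,8]
# THREE [10,15]
```

Note that punctuation and whitespace never appear in a token; they only serve as separators, which is exactly what the `token_re` of the LetterTokenizer expresses.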
Instance Method Summary
- #close ⇒ Object
- #initialize(input) ⇒ RegExpTokenizer (constructor)
  Initialize with an IO-like input, such as a File.
- #next ⇒ Object
  Returns the next token in the stream, or nil at EOS.
Constructor Details
#initialize(input) ⇒ RegExpTokenizer
Initialize with an IO-like input, such as a File.
- input: must have a read(count) method which returns an array or string of count chars.
# File 'lib/ferret/analysis/tokenizers.rb', line 38

def initialize(input)
  if input.is_a? String
    @ss = StringScanner.new(input)
  else
    @ss = StringScanner.new(input.read())
  end
end
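A quick sketch of why both input types work: StringScanner only needs a String, and any IO-like object responding to read can supply one. The `make_scanner` helper below is hypothetical (not Ferret API) and mirrors the constructor's branch using only the Ruby standard library, with StringIO standing in for a File:

```ruby
require 'strscan'
require 'stringio'

# The constructor's branch in miniature: accept either a String
# or anything with a read method (here a StringIO stands in for a file).
def make_scanner(input)
  input.is_a?(String) ? StringScanner.new(input) : StringScanner.new(input.read)
end

s1 = make_scanner("from a string")
s2 = make_scanner(StringIO.new("from an io"))
puts s1.rest  # the full, still-unconsumed input: "from a string"
puts s2.rest  # "from an io"
```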
Instance Method Details
#close ⇒ Object
# File 'lib/ferret/analysis/tokenizers.rb', line 59

def close()
  @ss = nil
end
#next ⇒ Object
Returns the next token in the stream, or nil at EOS (end of stream).
# File 'lib/ferret/analysis/tokenizers.rb', line 47

def next()
  if @ss.scan_until(token_re)
    term = @ss.matched
    term_end = @ss.pos
    term_start = term_end - term.size
  else
    return nil
  end
  return Token.new(normalize(term), term_start, term_end)
end
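The offset arithmetic in next can be seen in isolation: after scan_until, the scanner's pos points just past the match and matched holds the matched text, so the token's start offset is pos minus the match length. A small demonstration using only the standard strscan library (the variable names mirror those in the method above):

```ruby
require 'strscan'

ss = StringScanner.new("  foo bar")
ss.scan_until(/[[:alpha:]]+/)       # consumes up to and including "foo"
term       = ss.matched             # "foo"
term_end   = ss.pos                 # 5, the index just past the match
term_start = term_end - term.size   # 2, where the match began
puts "#{term} spans #{term_start}...#{term_end}"
# prints: foo spans 2...5
```

When scan_until finds no match it returns nil and leaves pos unchanged, which is why the method returns nil at EOS rather than constructing a Token.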