Class: Ferret::Analysis::RegExpTokenizer

Inherits:
Tokenizer
Defined in:
lib/ferret/analysis/tokenizers.rb

Overview

An abstract base class for simple regular-expression-oriented tokenizers. Very powerful tokenizers can be built on this class, as the StandardTokenizer class shows. Below is an example of a simple LetterTokenizer implemented with a RegExpTokenizer: a token is a sequence of alphabetic characters separated by one or more non-alphabetic characters.

class LetterTokenizer < RegExpTokenizer
  def token_re()
    /[[:alpha:]]+/
  end
end
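
To see such a tokenizer in action, here is a minimal usage sketch, assuming the LetterTokenizer defined above and the #next behaviour documented below (each call returns a Token, or nil at the end of the stream); the sample text is purely illustrative.

tokenizer = LetterTokenizer.new("One, two THREE!")
while token = tokenizer.next()
  p token  # each Token carries the normalized term and its start/end offsets
end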

Instance Method Summary

Methods inherited from TokenStream

#each

Constructor Details

#initialize(input) ⇒ RegExpTokenizer

Initialize the tokenizer with its input, which can be a String or an IO-like object such as a File.

input

must be a String, or must respond to read(), which is used to read the full contents into a StringScanner.



# File 'lib/ferret/analysis/tokenizers.rb', line 38

def initialize(input)
  if input.is_a? String
    # scan the string directly
    @ss = StringScanner.new(input)
  else
    # read the full contents of the IO-like object into a scanner
    @ss = StringScanner.new(input.read())
  end
end
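
As the source shows, either a String or anything that responds to read can be passed in. Below is a brief sketch, assuming the LetterTokenizer from the overview; the file name is purely illustrative.

# tokenize a String directly
tokenizer = LetterTokenizer.new("some text to index")

# or tokenize the contents of an IO-like object such as a File
File.open("sample.txt") do |file|
  tokenizer = LetterTokenizer.new(file)
end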

Instance Method Details

#close ⇒ Object



# File 'lib/ferret/analysis/tokenizers.rb', line 59

def close()
  @ss = nil
end

#next ⇒ Object

Returns the next token in the stream, or nil at the end of the stream.



# File 'lib/ferret/analysis/tokenizers.rb', line 47

def next()
  if @ss.scan_until(token_re)
    # the scanner's match data gives the term and its end offset;
    # the start offset is recovered by subtracting the term's length
    term = @ss.matched
    term_end = @ss.pos
    term_start = term_end - term.size
  else
    return nil
  end

  return Token.new(normalize(term), term_start, term_end)
end
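
The offset bookkeeping above can be observed in isolation with StringScanner itself; this sketch simply mirrors the body of #next for a single match, and the sample text is illustrative.

require 'strscan'

ss = StringScanner.new("one two")
ss.scan_until(/[[:alpha:]]+/)        # advances the scan position past "one"
term       = ss.matched              # => "one"
term_end   = ss.pos                  # => 3
term_start = term_end - term.size    # => 0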