Class: Ferret::Analysis::Analyzer

Inherits:
Object
Defined in:
ext/r_analysis.c

Overview

Summary

An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer.
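The tokenizer-plus-filters layering described above can be sketched in plain Ruby, without the ferret gem. WhitespaceSketchTokenizer, DowncaseSketchFilter and drain are hypothetical stand-ins written for this illustration, not Ferret classes. The sketch assumes only that a stream exposes a next method that returns the next token, or nil once the input is exhausted, and it uses bare strings where a real stream would return Token objects.

```ruby
# Hypothetical stand-in for a Tokenizer: splits a string on whitespace.
class WhitespaceSketchTokenizer
  def initialize(str)
    @words = str.split
    @pos = 0
  end

  # Returns the next raw token, or nil when the input is exhausted.
  def next
    word = @words[@pos]
    @pos += 1 if word
    word
  end
end

# Hypothetical stand-in for a TokenFilter: wraps another stream and
# lowercases each token that passes through it.
class DowncaseSketchFilter
  def initialize(stream)
    @stream = stream
  end

  def next
    token = @stream.next
    token && token.downcase
  end
end

# Drains a stream into an array (helper for illustration only).
def drain(stream)
  tokens = []
  while (token = stream.next)
    tokens << token
  end
  tokens
end

stream = DowncaseSketchFilter.new(WhitespaceSketchTokenizer.new("Quick BROWN Fox"))
p drain(stream) # => ["quick", "brown", "fox"]
```

Because each filter only depends on the next method of the stream it wraps, further filters can be stacked around the tokenizer in any number.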

The default Analyzer simply creates a LowerCaseTokenizer, which converts all text to lowercase tokens. See LowerCaseTokenizer for more details.

Example

To create a custom Analyzer, implement a token_stream method that takes the field name and the data to be tokenized as parameters and returns a TokenStream. Most analyzers ignore the field name.

Here we’ll create a stemming analyzer:

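require 'ferret'
include Ferret::Analysis

# Lowercases and stems the tokens produced by StandardTokenizer.
class MyAnalyzer < Analyzer
  def token_stream(field, str)
    # The field name is ignored; every field gets the same pipeline.
    StemFilter.new(LowerCaseFilter.new(StandardTokenizer.new(str)))
  end
end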
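
The field name passed to token_stream does not have to be ignored. As a minimal sketch of the token_stream(field, str) contract, the plain-Ruby example below picks a different pipeline per field. ArraySketchStream, FieldAwareSketchAnalyzer and tokens_for are hypothetical names introduced for this illustration and are not part of Ferret. The :id and :title field names are also assumptions, and the stream returns bare strings where a real stream would return Token objects.

```ruby
# Hypothetical stand-in for a TokenStream over pre-split words.
class ArraySketchStream
  def initialize(words)
    @words = words.dup
  end

  # Returns the next token, or nil when none remain.
  def next
    @words.shift
  end
end

# Hypothetical analyzer: keeps an :id field as one exact token and
# lowercases and splits every other field on whitespace.
class FieldAwareSketchAnalyzer
  def token_stream(field, str)
    if field == :id
      ArraySketchStream.new([str])
    else
      ArraySketchStream.new(str.downcase.split)
    end
  end
end

# Collects every token an analyzer produces for one field value.
def tokens_for(analyzer, field, str)
  stream = analyzer.token_stream(field, str)
  tokens = []
  while (token = stream.next)
    tokens << token
  end
  tokens
end

analyzer = FieldAwareSketchAnalyzer.new
p tokens_for(analyzer, :id, "Doc 42")    # => ["Doc 42"]
p tokens_for(analyzer, :title, "Doc 42") # => ["doc", "42"]
```

The caller only ever sees token_stream and next, so the per-field decision stays inside the analyzer.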