Module: Ferret::Analysis

Defined in:
ext/r_analysis.c

Overview

Summary

The Analysis module contains all the classes used to analyze and tokenize the data to be indexed. There are three main classes you need to know about when dealing with analysis: Analyzer, TokenStream and Token.

Classes

Analyzer

Analyzers handle all of your tokenizing needs. You pass an Analyzer to the indexing class when you create it, and the Analyzer creates the TokenStreams necessary to tokenize the fields in the documents. Most of the time you won’t need to worry about TokenStreams and Tokens; one of the Analyzers distributed with Ferret will do exactly what you need. Otherwise you’ll need to implement a custom analyzer.

TokenStream

A TokenStream is an enumeration of Tokens. There are two standard types of TokenStream: Tokenizer and TokenFilter. A Tokenizer takes a String and turns it into a list of Tokens. A TokenFilter takes another TokenStream and post-processes its Tokens. You can chain as many TokenFilters together as you like, but the innermost TokenStream in the chain must always be a Tokenizer, since it is the one that reads the String.
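
The chaining described above can be sketched in plain Ruby. This is an illustration of the pattern only, not Ferret's implementation: every class and method name below (SketchToken, SketchWhitespaceTokenizer, SketchLowerCaseFilter, SketchStopFilter, drain) is hypothetical, and the only assumption borrowed from the text is that a stream hands out Tokens one at a time until it is exhausted.

```ruby
# Hypothetical stand-ins for illustration; these are NOT Ferret's classes.
SketchToken = Struct.new(:text, :start_offset, :end_offset)

# A tokenizer sits at the bottom of the chain: it reads the String itself.
class SketchWhitespaceTokenizer
  def initialize(input)
    @input = input
    @pos = 0
  end

  # Returns the next token, or nil once the input is exhausted.
  def next
    match = /\S+/.match(@input, @pos)
    return nil unless match

    @pos = match.end(0)
    SketchToken.new(match[0], match.begin(0), match.end(0))
  end
end

# A filter wraps another stream and post-processes each token it yields.
class SketchLowerCaseFilter
  def initialize(stream)
    @stream = stream
  end

  def next
    token = @stream.next
    return nil unless token

    token.text = token.text.downcase
    token
  end
end

# A filter may also drop tokens, here any token found in a stop-word list.
class SketchStopFilter
  def initialize(stream, stop_words)
    @stream = stream
    @stop_words = stop_words
  end

  def next
    while (token = @stream.next)
      return token unless @stop_words.include?(token.text)
    end
    nil
  end
end

# Pulls every remaining token out of a stream into an Array.
def drain(stream)
  tokens = []
  while (token = stream.next)
    tokens << token
  end
  tokens
end

tokenizer = SketchWhitespaceTokenizer.new("The Quick brown Fox")
chain = SketchStopFilter.new(SketchLowerCaseFilter.new(tokenizer), %w[the])
puts drain(chain).map(&:text).inspect
```

Each filter only knows about the stream it wraps, so the filters can be stacked in any order; the order still matters for the result, since the stop filter here relies on the lowercase filter having run first.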

Token

A Token is a single term from a document field. A token contains the text representing the term as well as the start and end offsets of the token. The start and end offsets locate the token as it appears in the source field. Some TokenFilters may change the text in the Token, but the start and end offsets should stay the same, so (end - start) won’t necessarily equal the length of the text in the token. For example, with a stemming TokenFilter, the term “Beginning” might have start and end offsets of 10 and 19 respectively (“Beginning”.length == 9) but Token#text might be “begin” (after stemming).
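
The “Beginning” example can be worked through in a few lines of plain Ruby. This is a sketch under stated assumptions: OffsetToken and toy_stem are hypothetical names, and the suffix rule is a toy stand-in for a real stemmer. The point is only that rewriting the text leaves the offsets pointing at the original span in the source field.

```ruby
# Hypothetical token type for illustration; NOT Ferret's Token class.
OffsetToken = Struct.new(:text, :start_offset, :end_offset)

# Toy "stemmer": lowercases the text and strips a trailing "ning".
# A real stemming algorithm is far more involved; this rule is an assumption
# chosen so that "Beginning" becomes "begin".
# The offsets are copied over unchanged.
def toy_stem(token)
  stemmed_text = token.text.downcase.sub(/ning\z/, "")
  OffsetToken.new(stemmed_text, token.start_offset, token.end_offset)
end

source = "She said: Beginning is hard."
start_offset = source.index("Beginning")
original = OffsetToken.new("Beginning", start_offset, start_offset + "Beginning".length)
stemmed = toy_stem(original)

puts stemmed.text                                        # begin
puts stemmed.end_offset - stemmed.start_offset           # 9
puts stemmed.text.length                                 # 5
puts source[stemmed.start_offset...stemmed.end_offset]   # Beginning
```

Because the offsets still cover the original characters, a caller can use them to highlight “Beginning” in the source text even though the indexed term is “begin”.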

Constant Summary

ENGLISH_STOP_WORDS =
get_rstopwords(ENGLISH_STOP_WORDS)
FULL_ENGLISH_STOP_WORDS =
get_rstopwords(FULL_ENGLISH_STOP_WORDS)
EXTENDED_ENGLISH_STOP_WORDS =
get_rstopwords(EXTENDED_ENGLISH_STOP_WORDS)
FULL_FRENCH_STOP_WORDS =
get_rstopwords(FULL_FRENCH_STOP_WORDS)
FULL_SPANISH_STOP_WORDS =
get_rstopwords(FULL_SPANISH_STOP_WORDS)
FULL_PORTUGUESE_STOP_WORDS =
get_rstopwords(FULL_PORTUGUESE_STOP_WORDS)
FULL_ITALIAN_STOP_WORDS =
get_rstopwords(FULL_ITALIAN_STOP_WORDS)
FULL_GERMAN_STOP_WORDS =
get_rstopwords(FULL_GERMAN_STOP_WORDS)
FULL_DUTCH_STOP_WORDS =
get_rstopwords(FULL_DUTCH_STOP_WORDS)
FULL_SWEDISH_STOP_WORDS =
get_rstopwords(FULL_SWEDISH_STOP_WORDS)
FULL_NORWEGIAN_STOP_WORDS =
get_rstopwords(FULL_NORWEGIAN_STOP_WORDS)
FULL_DANISH_STOP_WORDS =
get_rstopwords(FULL_DANISH_STOP_WORDS)
FULL_RUSSIAN_STOP_WORDS =
get_rstopwords(FULL_RUSSIAN_STOP_WORDS)
FULL_FINNISH_STOP_WORDS =
get_rstopwords(FULL_FINNISH_STOP_WORDS)
FULL_HUNGARIAN_STOP_WORDS =
get_rstopwords(FULL_HUNGARIAN_STOP_WORDS)