Class: Ferret::Index::IndexWriter

Inherits:
Object
Defined in:
ext/r_index.c

Overview

Summary

The IndexWriter is the class used to add documents to an index. You can also delete documents from the index using this class. The indexing process is highly customizable and the IndexWriter has the following parameters:

dir

This is a Ferret::Store::Directory object. You should pass either a :dir or a :path when creating an index.

path

A string representing the path to the index directory. If you are creating the index for the first time, the directory will be created if it’s missing. You should not choose a directory which contains other files, as they could be overwritten. To protect against this, set :create_if_missing to false.

create_if_missing

Default: true. Create the index if no index is found in the specified directory. Otherwise, use the existing index.

create

Default: false. Creates the index, even if one already exists. That means any existing index will be deleted. It is probably better to use the create_if_missing option so that the index is only created the first time when it doesn’t exist.

field_infos

Default: FieldInfos.new. The FieldInfos object to use when creating a new index if :create_if_missing or :create is set to true. If an existing index is opened then this parameter is ignored.

analyzer

Default: Ferret::Analysis::StandardAnalyzer. Sets the default analyzer for the index. This is used by both the IndexWriter and the QueryParser to tokenize the input.

chunk_size

Default: 0x100000 or 1 MB. Memory performance tuning parameter. Sets the default size of the chunks of memory allocated with malloc for use during indexing. You can usually leave this parameter as is.

max_buffer_memory

Default: 0x1000000 or 16 MB. Memory performance tuning parameter. Sets the amount of memory to be used by the indexing process. Set to a larger value to increase indexing speed. Note that this only includes memory used by the indexing process, not the rest of your Ruby application.

term_index_interval

Default: 128. The skip interval between terms in the term dictionary. A smaller value may increase search performance, while also increasing memory usage and negatively impacting indexing performance.

doc_skip_interval

Default: 16. The skip interval for document numbers in the index. As with :term_index_interval, there is a trade-off. A smaller number may increase search performance, while also increasing memory usage and negatively impacting indexing performance.

merge_factor

Default: 10. This must never be less than 2. Specifies the number of segments of a certain size that must exist before they are merged. A larger value will improve indexing performance while slowing search performance.

max_buffered_docs

Default: 10000. The maximum number of documents that may be stored in memory before being written to the index. If you have a lot of memory and are indexing a large number of small documents (like products in a product database, for example) you may want to set this to a much higher number (like Ferret::FIX_INT_MAX). If you are worried about your application crashing in the middle of indexing, you might set this to a smaller number so that the index is committed more often. This is like having an auto-save in a word processor application.

max_merge_docs

Set this value to limit the number of documents that go into a single segment. Use this to avoid extremely long merge times during indexing which can make your application seem unresponsive. This is only necessary for very large indexes (millions of documents).

max_field_length

Default: 10000. The maximum number of terms added to a single field. This can be useful to protect the indexer when indexing documents from the web, for example. Usually the most important terms will occur early on in a document, so you can often safely ignore the terms in a field after a certain number of them. If you want to speed up indexing and save space in your index, you may only want to index the first 1000 terms in a field. On the other hand, if you want to be more thorough and you are indexing documents from your file-system, you may set this parameter to Ferret::FIX_INT_MAX.

use_compound_file

Default: true. Uses a compound file to store the index. This prevents an error being raised for having too many files open at the same time. Performance is better if this is set to false.
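
The parameters above are passed to IndexWriter.new as a single options hash. The following is a minimal sketch of that, assuming the ferret gem is installed; the path and the specific values are hypothetical and chosen only for illustration, not as recommendations. If the gem cannot be loaded, the indexing step is skipped.

```ruby
# Options for an indexing run. The keys are parameters documented above;
# the path and values are illustrative assumptions only.
WRITER_OPTIONS = {
  :path              => "/tmp/example_index", # hypothetical index location
  :create_if_missing => true,                 # reuse an existing index if one is found
  :max_buffer_memory => 0x4000000,            # 64 MB instead of the 16 MB default
  :merge_factor      => 20,                   # must never be less than 2
  :max_field_length  => 1000,                 # index only the first 1000 terms per field
  :use_compound_file => true                  # avoid "too many open files" errors
}

begin
  require 'ferret'

  writer = Ferret::Index::IndexWriter.new(WRITER_OPTIONS)
  # Documents can be added as plain hashes of field name to value.
  writer << { :id => "1", :title => "An example document" }
  writer.close
rescue LoadError
  warn "ferret is not installed; skipping the indexing step"
end
```

Any parameter left out of the hash keeps the default listed above.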

Deleting Documents

Both IndexReader and IndexWriter allow you to delete documents. Use the IndexReader to delete documents by document id and the IndexWriter to delete documents by term, which is explained here. It is preferable to delete documents from an index using IndexWriter for performance reasons. To delete documents using the IndexWriter you should give each document in the index a unique ID. If you are indexing documents from the file-system, this unique ID will be the full file path. If you are indexing documents from a database, you should use the primary key as the ID field. You can then use the delete method to delete the document referenced by the ID. For example:

index_writer.delete(:id, "/path/to/indexed/file")
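
A common use of delete-by-term is replacing a document that has changed: delete the old version by its unique ID, then add the new one. The helper below is a hypothetical sketch, not part of Ferret. It assumes each document is a hash with an :id key, and that the :id field is indexed so that the whole ID forms a single term (for example, an untokenized field); otherwise the term passed to delete may not match.

```ruby
# Hypothetical helper: replace a document identified by its unique :id.
# `writer` is expected to respond to #delete(field, term) and #<<(doc),
# as Ferret::Index::IndexWriter does.
def replace_document(writer, doc)
  # Remove any existing document whose :id term equals this ID.
  writer.delete(:id, doc[:id])
  # Add the new version of the document.
  writer << doc
end
```

With an open writer this would be called as `replace_document(index_writer, { :id => "/path/to/indexed/file", :content => "new text" })`.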