Class: Ferret::Index::IndexReader

Inherits: Object

Includes: MonitorMixin

Defined in: lib/ferret/index/index_reader.rb

Overview

IndexReader is an abstract class, providing an interface for accessing an index. Search of an index is done entirely through this abstract interface, so that any class which implements it is searchable.

Concrete subclasses of IndexReader are usually constructed with a call to the class method IndexReader.open.

For efficiency, documents are often referred to in this API via _document numbers_, non-negative integers which each name a unique document in the index. These document numbers are ephemeral, i.e. they may change as documents are added to and deleted from an index. Clients should thus not rely on a given document having the same number between sessions.

An IndexReader can be opened on a directory for which an IndexWriter is already open, but in that case it cannot be used to delete documents from the index.

Direct Known Subclasses

MultiReader, SegmentReader

Defined Under Namespace

Classes: FieldOption

Constant Summary

FILENAME_EXTENSIONS =

This array contains all filename extensions used by Lucene’s index files, with one exception, namely the extension made up from .f + a number. Also note that two of Lucene’s files (deletable and segments) don’t have any filename extension.

["cfs",
"fnm",
"fdx",
"fdt",
"tii",
"tis",
"frq",
"prx",
"del",
"tvx",
"tvd",
"tvf",
"tvp"]

Instance Attribute Summary

#directory ⇒ Object readonly

Returns the value of attribute directory.

Class Method Summary

.get_current_version(directory) ⇒ Object

Reads version number from segments files.
.index_exists?(directory) ⇒ Boolean

Returns true if an index exists at the specified directory.
.open(directory, close_directory = true, infos = nil) ⇒ Object

Returns an index reader to read the index in the directory.

Instance Method Summary

#acquire_write_lock ⇒ Object

Tries to acquire the WriteLock on this directory.
#close ⇒ Object

Closes files associated with this index.
#commit ⇒ Object

Commit changes resulting from delete, undelete_all, or set_norm operations.
#delete(doc_num) ⇒ Object

Deletes the document numbered doc_num.
#delete_docs_with_term(term) ⇒ Object

Deletes all documents containing term.
#deleted?(n) ⇒ Boolean

Returns true if document n has been deleted.
#do_delete(doc_num) ⇒ Object

Implements deletion of the document numbered doc_num.
#do_set_norm(doc, field, value) ⇒ Object

Implements set_norm in subclass.
#doc_freq(t) ⇒ Object

Returns the number of documents containing the term t.
#get_document(n) ⇒ Object

Returns the stored fields of the nth Document in this index.
#get_document_with_term(term) ⇒ Object

Returns the first document with the term term.
#get_norms(field, bytes = nil, offset = nil) ⇒ Object

Returns the byte-encoded normalization factor for the named field of every document.
#get_term_vector(doc_number, field) ⇒ Object

Return a term vector for the specified document and field.
#get_term_vectors(doc_number) ⇒ Object

Return an array of term vectors for the specified document.
#has_deletions? ⇒ Boolean

Returns true if any documents have been deleted.
#initialize(directory, segment_infos = nil, close_directory = false, directory_owner = false) ⇒ IndexReader constructor

To create an IndexReader use the IndexReader.open method.
#latest? ⇒ Boolean

Returns true if the reader is reading from the latest version of the index.
#max_doc ⇒ Object

Returns one greater than the largest possible document number.
#num_docs ⇒ Object

Returns the number of documents in this index.
#set_norm(doc, field, value) ⇒ Object

Expert: Resets the normalization factor for the named field of the named document.
#term_docs ⇒ Object

Returns an unpositioned TermDocEnum enumerator.
#term_docs_for(term) ⇒ Object

Returns an enumeration of all the documents which contain term.
#term_positions ⇒ Object

Returns an unpositioned TermDocPosEnum enumerator.
#term_positions_for(term) ⇒ Object

Returns an enumeration of all the documents which contain term.
#terms ⇒ Object

Returns an enumeration of all the terms in the index.
#terms_from(t) ⇒ Object

Returns an enumeration of all terms after a given term.
#undelete_all ⇒ Object

Undeletes all documents currently marked as deleted in this index.

Constructor Details

#initialize(directory, segment_infos = nil, close_directory = false, directory_owner = false) ⇒ `IndexReader`

To create an IndexReader use the IndexReader.open method. This method should only be used by subclasses.

directory: Directory where IndexReader files reside.
segment_infos: Used for write-l
close_directory: close the directory when the index reader is closed

# File 'lib/ferret/index/index_reader.rb', line 71

def initialize(directory, segment_infos = nil,
               close_directory = false, directory_owner = false)
  super()
  @directory = directory
  @close_directory = close_directory
  @segment_infos = segment_infos
  @directory_owner = directory_owner

  @has_changes = false
  @stale = false
  @write_lock = nil

  #ObjectSpace.define_finalizer(self, lambda { |id| @write_lock.release() if @write_lock})
end

Instance Attribute Details

#directory ⇒ `Object` (readonly)

Returns the value of attribute directory.




# File 'lib/ferret/index/index_reader.rb', line 40

def directory
  @directory
end

Class Method Details

.get_current_version(directory) ⇒ `Object`

Reads version number from segments files. The version number counts the number of changes of the index.

directory: where the index resides.
returns: version number.
raises: IOError if segments file cannot be read.




# File 'lib/ferret/index/index_reader.rb', line 130

def IndexReader.get_current_version(directory)
  return SegmentInfos.read_current_version(directory)
end

.index_exists?(directory) ⇒ `Boolean`

Returns true if an index exists at the specified directory. Returns false if the directory does not exist or if there is no index in it.

directory: the directory to check for an index
returns: true if an index exists; false otherwise
raises: IOError if there is a problem with accessing the index

Returns:

(Boolean)




# File 'lib/ferret/index/index_reader.rb', line 176

def IndexReader.index_exists?(directory)
  return directory.exists?("segments")
end

.open(directory, close_directory = true, infos = nil) ⇒ `Object`

Returns an index reader to read the index in the directory

directory: This can either be a Directory object or you can pass nil (RamDirectory is created) or a path (FSDirectory is created). If you choose the second or third option, you should leave close_directory as true and infos as nil.
close_directory: True if you want the IndexReader to close the directory when the IndexReader is closed. You’ll want to set this to false if other objects are using the same directory object.
infos: Expert: This can be used to read a different version of the index but should really be left alone.

# File 'lib/ferret/index/index_reader.rb', line 99

def IndexReader.open(directory, close_directory = true, infos = nil)
  if directory.nil?
    directory = Ferret::Store::RAMDirectory.new
  elsif directory.is_a?(String)
    directory = Ferret::Store::FSDirectory.new(directory, true)
  end
  directory.synchronize do # in- & inter-process sync
    commit_lock = directory.make_lock(IndexWriter::COMMIT_LOCK_NAME)
    commit_lock.while_locked() do
      if infos.nil?
        infos = SegmentInfos.new()
        infos.read(directory)
      end
      if (infos.size() == 1) # index is optimized
        return SegmentReader.get(infos[0], infos, close_directory)
      end
      readers = Array.new(infos.size)
      infos.size.times do |i|
        readers[i] = SegmentReader.get(infos[i])
      end
      return MultiReader.new(readers, directory, infos, close_directory)
    end
  end
end

Instance Method Details

#acquire_write_lock ⇒ `Object`

Tries to acquire the WriteLock on this directory.

This method is only valid if this IndexReader is directory owner.

raises: IOError If WriteLock cannot be acquired.

# File 'lib/ferret/index/index_reader.rb', line 325

def acquire_write_lock()
  if @stale
    raise IOError, "IndexReader out of date and no longer valid for delete, undelete, or set_norm operations"
  end

  if (@write_lock == nil) 
    @write_lock = @directory.make_lock(IndexWriter::WRITE_LOCK_NAME)
    if not @write_lock.obtain(IndexWriter::WRITE_LOCK_TIMEOUT) # obtain write lock
      raise IOError, "Index locked for write: " + @write_lock
    end

    # we have to check whether index has changed since this reader was opened.
    # if so, this reader is no longer valid for deletion
    if (SegmentInfos.read_current_version(@directory) > @segment_infos.version()) 
      @stale = true
      @write_lock.release()
      @write_lock = nil
      raise IOError, "IndexReader out of date and no longer valid for delete, undelete, or set_norm operations"
    end
  end
end
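
The staleness check above can be illustrated without Ferret itself. The following is a minimal sketch, assuming a toy directory that carries a version counter; ToyDirectory, ToyVersionedReader, bump_version and delete_with_reopen are hypothetical names invented for this illustration and are not part of the Ferret API. It shows a reader refusing to modify an index that has changed since it was opened, and a caller recovering by opening a fresh reader.

```ruby
# Hypothetical stand-ins; none of these classes exist in Ferret.
class ToyDirectory
  attr_reader :version

  def initialize
    @version = 0
  end

  # Simulates some other writer committing a change to the index.
  def bump_version
    @version += 1
  end
end

class ToyVersionedReader
  def initialize(directory)
    @directory = directory
    @opened_version = directory.version # snapshot taken at open time
    @stale = false
  end

  def latest?
    @directory.version == @opened_version
  end

  # Mirrors the check above: refuse to modify once the index has moved on.
  def delete(doc_num)
    raise IOError, "reader out of date" if @stale
    if @directory.version > @opened_version
      @stale = true
      raise IOError, "reader out of date"
    end
    doc_num
  end
end

# Runs a deletion, reopening the reader once if it turns out to be stale.
def delete_with_reopen(directory, reader, doc_num)
  reader.delete(doc_num)
  reader
rescue IOError
  fresh = ToyVersionedReader.new(directory)
  fresh.delete(doc_num)
  fresh
end
```

Once a reader has been marked stale it stays stale, so the only way forward in this sketch is to open a new reader, which matches the error message raised by the method above.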

#close ⇒ `Object`

Closes files associated with this index. Also saves any new deletions to disk. No other methods should be called after this has been called.

# File 'lib/ferret/index/index_reader.rb', line 433

def close()
  synchronize do
    commit()
    do_close()
    @directory.close() if @close_directory
  end
end

#commit ⇒ `Object`

Commit changes resulting from delete, undelete_all, or set_norm operations

raises: IOError

# File 'lib/ferret/index/index_reader.rb', line 407

def commit()
  synchronize do
    if @has_changes
      if @directory_owner
        @directory.synchronize do # in- & inter-process sync
          commit_lock = @directory.make_lock(IndexWriter::COMMIT_LOCK_NAME)
          commit_lock.while_locked do
            do_commit()
            @segment_infos.write(@directory)
          end
        end
        if (@write_lock != nil) 
          @write_lock.release()  # release write lock
          @write_lock = nil
        end
      else
        do_commit()
      end
    end
    @has_changes = false
  end
end

#delete(doc_num) ⇒ `Object`

Deletes the document numbered doc_num. Once a document is deleted it will not appear in TermDocEnum or TermPositions enumerations. Attempts to read its fields with the #get_document method will result in an error. The presence of this document may still be reflected in the #doc_freq statistic, though this will be corrected eventually as the index is further modified.

# File 'lib/ferret/index/index_reader.rb', line 359

def delete(doc_num)
  synchronize do
    acquire_write_lock() if @directory_owner
    do_delete(doc_num)
    @has_changes = true
  end
  return 1
end

#delete_docs_with_term(term) ⇒ `Object`

Deletes all documents containing term. This is useful if one uses a document field to hold a unique ID string for the document. Then to delete such a document, one merely constructs a term with the appropriate field and the unique ID string as its text and passes it to this method. Returns the number of documents deleted. See #delete for information about when this deletion will become effective.

# File 'lib/ferret/index/index_reader.rb', line 380

def delete_docs_with_term(term)
  docs = term_docs_for(term)
  if (docs == nil) then return 0 end
  n = 0
  begin 
    while (docs.next?) 
      delete(docs.doc)
      n += 1
    end
  ensure 
    docs.close()
  end
  return n
end
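
The unique-ID pattern described above is commonly used to update a document: delete the old version by its ID, then add the new one. Below is a minimal sketch of that idea using a toy in-memory index. ToyIdIndex and its methods are hypothetical stand-ins rather than Ferret classes, and the toy takes a field name and text directly instead of a Term object.

```ruby
# Hypothetical in-memory stand-in for an index; not a Ferret class.
class ToyIdIndex
  def initialize
    @docs = []     # stored fields, indexed by document number
    @deleted = []  # deletion marks, parallel to @docs
  end

  # Adds a document and returns its document number.
  def add(fields)
    @docs << fields
    @deleted << false
    @docs.size - 1
  end

  def delete(doc_num)
    @deleted[doc_num] = true
  end

  # Same loop shape as above: delete every live match and count them.
  def delete_docs_with_term(field, text)
    n = 0
    @docs.each_with_index do |fields, doc_num|
      next if @deleted[doc_num] || fields[field] != text
      delete(doc_num)
      n += 1
    end
    n
  end

  def num_docs
    @deleted.count(false)
  end

  # Update = delete the old version by its ID, then add the new one.
  def update(id, fields)
    delete_docs_with_term(:id, id)
    add(fields.merge(id: id))
  end
end

index = ToyIdIndex.new
index.add(id: "42", title: "draft")
index.add(id: "43", title: "other")
new_num = index.update("42", title: "final")
```

Note that the updated document receives a new document number in this sketch, which is one reason clients should look documents up by an ID field rather than by a remembered number.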

#deleted?(n) ⇒ `Boolean`

Returns true if document n has been deleted

Returns:

(Boolean)

Raises:

(NotImplementedError)




# File 'lib/ferret/index/index_reader.rb', line 218

def deleted?(n)
  raise NotImplementedError
end

#do_delete(doc_num) ⇒ `Object`

Implements deletion of the document numbered doc_num. Applications should call #delete or #delete_docs_with_term.

Raises:

(NotImplementedError)




# File 'lib/ferret/index/index_reader.rb', line 370

def do_delete(doc_num)
  raise NotImplementedError
end

#do_set_norm(doc, field, value) ⇒ `Object`

Implements set_norm in subclass.

Raises:

(NotImplementedError)




# File 'lib/ferret/index/index_reader.rb', line 255

def do_set_norm(doc, field, value) 
  raise NotImplementedError
end

#doc_freq(t) ⇒ `Object`

Returns the number of documents containing the term t.

Raises:

(NotImplementedError)




# File 'lib/ferret/index/index_reader.rb', line 273

def doc_freq(t)
  raise NotImplementedError
end

#get_document(n) ⇒ `Object`

Returns the stored fields of the nth Document in this index.

Raises:

(NotImplementedError)




# File 'lib/ferret/index/index_reader.rb', line 195

def get_document(n)
  raise NotImplementedError
end

#get_document_with_term(term) ⇒ `Object`

Returns the first document with the term term. This is useful, for example, if we are indexing rows from a database. We can store the id of each row in a field in the index and use this method to get the document by the id. Hence, only one document is returned.

term: The term we are searching for.

# File 'lib/ferret/index/index_reader.rb', line 205

def get_document_with_term(term)
  docs = term_docs_for(term)
  if (docs == nil) then return nil end
  document = nil
  begin 
    document = get_document(docs.doc) if docs.next?
  ensure 
    docs.close()
  end
  return document
end

#get_norms(field, bytes = nil, offset = nil) ⇒ `Object`

Returns the byte-encoded normalization factor for the named field of every document. This is used by the search code to score documents.

See Field#boost

Raises:

(NotImplementedError)




# File 'lib/ferret/index/index_reader.rb', line 231

def get_norms(field, bytes=nil, offset=nil)
  raise NotImplementedError
end

#get_term_vector(doc_number, field) ⇒ `Object`

Return a term vector for the specified document and field. The returned vector contains terms and frequencies for the terms in the specified field of this document, if the field had the storeTermVector flag set. If termvectors had been stored with positions or offsets, a TermDocPosEnumVector is returned.

doc_number: document for which the term vector is returned
field: field for which the term vector is returned.
returns: term vector. May be nil if the field does not exist in the specified document or the term vector was not stored.
raises: IOError if index cannot be accessed

See Field.TermVector

Raises:

(NotImplementedError)




# File 'lib/ferret/index/index_reader.rb', line 165

def get_term_vector(doc_number, field)
  raise NotImplementedError
end

#get_term_vectors(doc_number) ⇒ `Object`

Return an array of term vectors for the specified document. The array contains a vector for each vectorized field in the document. Each vector contains terms and frequencies for all terms in a given vectorized field. If no such fields exist, the method returns nil. The term vectors that are returned may either be of type TermFreqVector or of type TermDocPosEnumVector if positions or offsets have been stored.

doc_number: document for which term vectors are returned
returns: array of term vectors. May be nil if no term vectors have been stored for the specified document.
raises: IOError if index cannot be accessed

See Field.TermVector

Raises:

(NotImplementedError)




# File 'lib/ferret/index/index_reader.rb', line 147

def get_term_vectors(doc_number)
  raise NotImplementedError
end

#has_deletions? ⇒ `Boolean`

Returns true if any documents have been deleted

Returns:

(Boolean)

Raises:

(NotImplementedError)




# File 'lib/ferret/index/index_reader.rb', line 223

def has_deletions?()
  raise NotImplementedError
end

#latest? ⇒ `Boolean`

Returns true if the reader is reading from the latest version of the index.

Returns:

(Boolean)




# File 'lib/ferret/index/index_reader.rb', line 349

def latest?()
  SegmentInfos.read_current_version(@directory) == @segment_infos.version()
end

#max_doc ⇒ `Object`

Returns one greater than the largest possible document number.

This may be used to, e.g., determine how big to allocate an array which will have an element for every document number in an index.

Raises:

(NotImplementedError)




# File 'lib/ferret/index/index_reader.rb', line 189

def max_doc()
  raise NotImplementedError
end
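
As an illustration of the array-sizing use mentioned above, here is a minimal sketch with a toy reader. ToyDocReader is a hypothetical stand-in, not a Ferret class, and the sketch assumes (as in Lucene) that a deleted document keeps its number until the index is reorganised, so max_doc can be larger than num_docs.

```ruby
# Hypothetical stand-in for a concrete reader; not part of Ferret.
class ToyDocReader
  def initialize(docs)
    @docs = docs                            # index in the array = document number
    @deleted = Array.new(docs.size, false)  # deletion marks
  end

  # One greater than the largest possible document number.
  def max_doc
    @docs.size
  end

  # Number of documents that are not marked as deleted.
  def num_docs
    @deleted.count(false)
  end

  def deleted?(n)
    @deleted[n]
  end

  def delete(n)
    @deleted[n] = true
  end
end

reader = ToyDocReader.new(%w[alpha beta gamma delta])
reader.delete(1)

# Size per-document arrays by max_doc, not num_docs, so that every
# document number is a valid index even when there are deletions.
scores = Array.new(reader.max_doc, 0.0)
live_doc_nums = (0...reader.max_doc).reject { |n| reader.deleted?(n) }
```

Sizing the array by num_docs instead would leave the highest document numbers without a slot as soon as any document has been deleted.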

#num_docs ⇒ `Object`

Returns the number of documents in this index.

Raises:

(NotImplementedError)




# File 'lib/ferret/index/index_reader.rb', line 181

def num_docs()
  raise NotImplementedError
end

#set_norm(doc, field, value) ⇒ `Object`

Expert: Resets the normalization factor for the named field of the named document. The norm represents the product of the field’s Field#boost and its Similarity#length_norm length normalization. Thus, to preserve the length normalization values when resetting this, one should base the new value upon the old.

See #get_norms and Similarity#decode_norm.

# File 'lib/ferret/index/index_reader.rb', line 243

def set_norm(doc, field, value)
  synchronize do
    value = Similarity.encode_norm(value) if value.is_a? Float
    if(@directory_owner)
      acquire_write_lock()
    end
    do_set_norm(doc, field, value)
    @has_changes = true
  end
end
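
The advice to base the new norm on the old one comes down to simple arithmetic. The sketch below assumes the norm is handled as a plain Float equal to boost times length normalization, and it ignores the byte encoding applied by Similarity.encode_norm. rescale_norm is a hypothetical helper written for this illustration, not a Ferret method.

```ruby
# Assumption: norm = boost * length_norm, handled here as plain Floats.
# Changing the boost while keeping length_norm means scaling the old norm
# by the ratio of the new boost to the old boost.
def rescale_norm(old_norm, old_boost, new_boost)
  raise ArgumentError, "old_boost must be non-zero" if old_boost.zero?
  old_norm * (new_boost.to_f / old_boost)
end

length_norm = 0.5                   # unchanged by the update
old_norm = 1.0 * length_norm        # old boost was 1.0
new_norm = rescale_norm(old_norm, 1.0, 2.0)
```

Passing a fresh boost value on its own, without the old norm, would discard the length normalization that was computed at indexing time.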

#term_docs ⇒ `Object`

Returns an unpositioned TermDocEnum enumerator.

Raises:

(NotImplementedError)




# File 'lib/ferret/index/index_reader.rb', line 293

def term_docs()
  raise NotImplementedError
end

#term_docs_for(term) ⇒ `Object`

Returns an enumeration of all the documents which contain term. For each document, the document number and the frequency of the term in that document are provided, for use in search scoring. Thus, this method implements the mapping:

Term => <doc_num, freq>*

The enumeration is ordered by document number. Each document number is greater than all that precede it in the enumeration.

# File 'lib/ferret/index/index_reader.rb', line 286

def term_docs_for(term)
  term_docs = term_docs()
  term_docs.seek(term)
  return term_docs
end
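
To make the Term => <doc_num, freq>* mapping concrete, here is a minimal sketch that builds such a mapping for a handful of in-memory strings. build_postings is a hypothetical helper written for this illustration; it splits on whitespace only and says nothing about how Ferret stores or analyses terms.

```ruby
# Builds a toy postings table: term text => array of [doc_num, freq] pairs.
# Documents are visited in increasing doc_num order, so each term's list
# ends up ordered by document number, as described above.
def build_postings(docs)
  postings = Hash.new { |hash, term| hash[term] = [] }
  docs.each_with_index do |text, doc_num|
    freqs = Hash.new(0)
    text.split.each { |term| freqs[term] += 1 }
    freqs.each { |term, freq| postings[term] << [doc_num, freq] }
  end
  postings
end

postings = build_postings(["the quick fox", "the lazy dog", "quick quick cat"])

# Walk the enumeration for one term, much as a caller would walk a TermDocEnum.
postings["quick"].each do |doc_num, freq|
  puts "doc #{doc_num}: freq #{freq}"
end
```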

#term_positions ⇒ `Object`

Returns an unpositioned @link TermDocPosEnumendenumerator.

Raises:

(NotImplementedError)




# File 'lib/ferret/index/index_reader.rb', line 316

def term_positions()
  raise NotImplementedError
end

#term_positions_for(term) ⇒ `Object`

Returns an enumeration of all the documents which contain term. For each document, in addition to the document number and frequency of the term in that document, a list of all of the ordinal positions of the term in the document is available. Thus, this method implements the mapping:

Term => <doc_num, freq, <pos_1, pos_2, ... pos_freq-1>>*

This positional information facilitates phrase and proximity searching. The enumeration is ordered by document number. Each document number is greater than all that precede it in the enumeration.

# File 'lib/ferret/index/index_reader.rb', line 309

def term_positions_for(term)
  term_positions = term_positions()
  term_positions.seek(term)
  return term_positions
end
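
The following sketch shows why positions make phrase search possible. It builds a toy positional mapping for in-memory strings and then finds documents in which one term is immediately followed by another. build_positional_postings and phrase_docs are hypothetical helpers written for this illustration and are not part of Ferret.

```ruby
# Builds term text => array of [doc_num, freq, positions], ordered by doc_num.
def build_positional_postings(docs)
  postings = Hash.new { |hash, term| hash[term] = [] }
  docs.each_with_index do |text, doc_num|
    positions = Hash.new { |hash, term| hash[term] = [] }
    text.split.each_with_index { |term, pos| positions[term] << pos }
    positions.each do |term, pos_list|
      postings[term] << [doc_num, pos_list.size, pos_list]
    end
  end
  postings
end

# Returns the numbers of documents in which `second` directly follows `first`.
def phrase_docs(postings, first, second)
  second_positions = {}
  postings.fetch(second, []).each do |doc_num, _freq, pos_list|
    second_positions[doc_num] = pos_list
  end
  matches = postings.fetch(first, []).select do |doc_num, _freq, pos_list|
    (second_positions[doc_num] || []).any? { |pos| pos_list.include?(pos - 1) }
  end
  matches.map(&:first)
end

positional = build_positional_postings(
  ["the quick fox", "quick the fox", "a quick quick fox"]
)
hits = phrase_docs(positional, "quick", "fox")
```

With frequencies alone, all three toy documents would look like candidates for the phrase; the positions are what rule out the second one.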

#terms ⇒ `Object`

Returns an enumeration of all the terms in the index. Each term is greater than all that precede it in the enumeration.

Raises:

(NotImplementedError)




# File 'lib/ferret/index/index_reader.rb', line 261

def terms()
  raise NotImplementedError
end

#terms_from(t) ⇒ `Object`

Returns an enumeration of all terms after a given term.

Each term is greater than all that precede it in the enumeration.

Raises:

(NotImplementedError)




# File 'lib/ferret/index/index_reader.rb', line 268

def terms_from(t)
  raise NotImplementedError
end

#undelete_all ⇒ `Object`

Undeletes all documents currently marked as deleted in this index.

# File 'lib/ferret/index/index_reader.rb', line 396

def undelete_all()
  synchronize do
    acquire_write_lock() if @directory_owner
    do_undelete_all()
    @has_changes = true
  end
end
