Class: Ferret::Index::IndexReader

Inherits:
Object
  • Object
Includes:
MonitorMixin
Defined in:
lib/ferret/index/index_reader.rb

Overview

IndexReader is an abstract class that provides an interface for accessing an index. Searching an index is done entirely through this abstract interface, so any class that implements it is searchable.

Concrete subclasses of IndexReader are usually constructed with a call to the class method IndexReader.open.

For efficiency, documents in this API are often referred to via _document numbers_, non-negative integers which each name a unique document in the index. These document numbers are ephemeral, i.e. they may change as documents are added to and deleted from an index. Clients should thus not rely on a given document having the same number between sessions.
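The following is a minimal sketch of why document numbers are ephemeral. ToyIndex and its methods are hypothetical and are not part of Ferret; the sketch assumes that deleted documents leave a gap until a merge compacts the index, at which point later documents are renumbered.

```ruby
# Toy model, not Ferret code: array positions act as document numbers, and a
# deleted document leaves a hole until the index is merged.
class ToyIndex
  def initialize
    @slots = []
  end

  # Appends a document and returns its document number.
  def add(doc)
    @slots << doc
    @slots.size - 1
  end

  # Marks a document as deleted without renumbering the others.
  def delete(doc_num)
    @slots[doc_num] = nil
  end

  # Drops deleted slots, as a merge would; later documents get new numbers.
  def merge!
    @slots.compact!
  end

  def doc_num_of(doc)
    @slots.index(doc)
  end
end

index = ToyIndex.new
index.add("a")
index.add("b")
index.add("c")
before = index.doc_num_of("c")
index.delete(1)
index.merge!
after = index.doc_num_of("c")
puts "c was #{before}, is now #{after}" # prints "c was 2, is now 1"
```

A client that needs a stable handle on a document would typically store its own unique ID in a field rather than remembering the document number.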

An IndexReader can be opened on a directory for which an IndexWriter is already open, but it cannot then be used to delete documents from the index.

Direct Known Subclasses

MultiReader, SegmentReader

Defined Under Namespace

Classes: FieldOption

Constant Summary

FILENAME_EXTENSIONS =

This array contains all filename extensions used by Lucene’s index files, with one exception, namely the extension made up from .f + a number. Also note that two of Lucene’s files (deletable and segments) don’t have any filename extension.

["cfs",
"fnm",
"fdx",
"fdt",
"tii",
"tis",
"frq",
"prx",
"del",
"tvx",
"tvd",
"tvf",
"tvp"]

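As a rough illustration of how this list and its exceptions fit together, the sketch below classifies a filename as an index file or not. The helper index_file? is hypothetical and not part of Ferret; the extension list is copied from the constant above, and the pattern for ".f + a number" is an assumption based on the description.

```ruby
# Extensions copied from the FILENAME_EXTENSIONS constant documented above.
FILENAME_EXTENSIONS = %w[cfs fnm fdx fdt tii tis frq prx del tvx tvd tvf tvp].freeze

# Hypothetical helper, not a Ferret method: true if +name+ looks like an index file.
def index_file?(name)
  # The two files that have no extension at all.
  return true if %w[deletable segments].include?(name)

  ext = File.extname(name).delete_prefix(".")
  # Either a listed extension or the exception: "f" followed by a number.
  FILENAME_EXTENSIONS.include?(ext) || ext.match?(/\Af\d+\z/)
end

puts index_file?("_1.cfs")    # prints true
puts index_file?("segments")  # prints true
puts index_file?("_2.f3")     # prints true
puts index_file?("notes.txt") # prints false
```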
Constructor Details

#initialize(directory, segment_infos = nil, close_directory = false, directory_owner = false) ⇒ IndexReader

To create an IndexReader use the IndexReader.open method. This method should only be used by subclasses.

directory

Directory where IndexReader files reside.

segment_infos

Used for write-l

close_directory

close the directory when the index reader is closed



# File 'lib/ferret/index/index_reader.rb', line 71

def initialize(directory, segment_infos = nil,
               close_directory = false, directory_owner = false)
  super()
  @directory = directory
  @close_directory = close_directory
  @segment_infos = segment_infos
  @directory_owner = directory_owner

  @has_changes = false
  @stale = false
  @write_lock = nil

  #ObjectSpace.define_finalizer(self, lambda { |id| @write_lock.release() if @write_lock})
end

Instance Attribute Details

#directory ⇒ Object (readonly)

Returns the value of attribute directory.



# File 'lib/ferret/index/index_reader.rb', line 40

def directory
  @directory
end

Class Method Details

.get_current_version(directory) ⇒ Object

Reads the version number from the segments file. The version number counts the number of changes made to the index.

directory

where the index resides.

returns

version number.

raises

IOError if segments file cannot be read.



# File 'lib/ferret/index/index_reader.rb', line 130

def IndexReader.get_current_version(directory)
  return SegmentInfos.read_current_version(directory)
end

.index_exists?(directory) ⇒ Boolean

Returns true if an index exists at the specified directory. Returns false if the directory does not exist or if there is no index in it.

directory

the directory to check for an index

returns

true if an index exists; false otherwise

raises

IOError if there is a problem with accessing the index

Returns:

  • (Boolean)


# File 'lib/ferret/index/index_reader.rb', line 176

def IndexReader.index_exists?(directory)
  return directory.exists?("segments")
end

.open(directory, close_directory = true, infos = nil) ⇒ Object

Returns an index reader to read the index in the directory

directory

This can be a Directory object, nil (a RAMDirectory is created), or a path (an FSDirectory is created). If you choose the second or third option, you should leave close_directory as true and infos as nil.

close_directory

True if you want the IndexReader to close the directory when the IndexReader is closed. You’ll want to set this to false if other objects are using the same directory object.

infos

Expert: This can be used to read a different version of the index but should really be left alone.



# File 'lib/ferret/index/index_reader.rb', line 99

def IndexReader.open(directory, close_directory = true, infos = nil)
  if directory.nil?
    directory = Ferret::Store::RAMDirectory.new
  elsif directory.is_a?(String)
    directory = Ferret::Store::FSDirectory.new(directory, true)
  end
  directory.synchronize do # in- & inter-process sync
    commit_lock = directory.make_lock(IndexWriter::COMMIT_LOCK_NAME)
    commit_lock.while_locked() do
      if infos.nil?
        infos = SegmentInfos.new()
        infos.read(directory)
      end
      if (infos.size() == 1) # index is optimized
        return SegmentReader.get(infos[0], infos, close_directory)
      end
      readers = Array.new(infos.size)
      infos.size.times do |i|
        readers[i] = SegmentReader.get(infos[i])
      end
      return MultiReader.new(readers, directory, infos, close_directory)
    end
  end
end

Instance Method Details

#acquire_write_lock ⇒ Object

Tries to acquire the WriteLock on this directory.

This method is only valid if this IndexReader is directory owner.

raises

IOError If WriteLock cannot be acquired.



# File 'lib/ferret/index/index_reader.rb', line 325

def acquire_write_lock()
  if @stale
    raise IOError, "IndexReader out of date and no longer valid for delete, undelete, or set_norm operations"
  end

  if (@write_lock == nil) 
    @write_lock = @directory.make_lock(IndexWriter::WRITE_LOCK_NAME)
    if not @write_lock.obtain(IndexWriter::WRITE_LOCK_TIMEOUT) # obtain write lock
      raise IOError, "Index locked for write: #{@write_lock}"
    end

    # we have to check whether index has changed since this reader was opened.
    # if so, this reader is no longer valid for deletion
    if (SegmentInfos.read_current_version(@directory) > @segment_infos.version()) 
      @stale = true
      @write_lock.release()
      @write_lock = nil
      raise IOError, "IndexReader out of date and no longer valid for delete, undelete, or set_norm operations"
    end
  end
end
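The staleness check in the listing above can be illustrated in isolation. This is a minimal sketch under stated assumptions: ToyDirectory, ToyReader, commit_change and ensure_current! are hypothetical names, and a plain integer counter stands in for the version that SegmentInfos reads from the segments file. Lock handling is left out entirely.

```ruby
# Toy stand-ins, not Ferret classes: the directory carries a version counter
# in place of the version stored in the segments file.
class ToyDirectory
  attr_reader :version

  def initialize
    @version = 0
  end

  # Simulates another writer committing a change to the index.
  def commit_change
    @version += 1
  end
end

class ToyReader
  def initialize(directory)
    @directory = directory
    @opened_version = directory.version # version seen when the reader was opened
  end

  def latest?
    @directory.version == @opened_version
  end

  # Same comparison acquire_write_lock makes after obtaining the lock.
  def ensure_current!
    raise IOError, "reader out of date" if @directory.version > @opened_version
    true
  end
end

directory = ToyDirectory.new
reader = ToyReader.new(directory)
reader.ensure_current!  # passes: nothing has changed yet
directory.commit_change
puts reader.latest?     # prints false
```

Once the index has moved on, the usual remedy is to open a fresh reader instead of retrying with the stale one.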

#close ⇒ Object

Closes files associated with this index. Also saves any new deletions to disk. No other methods should be called after this has been called.



# File 'lib/ferret/index/index_reader.rb', line 433

def close()
  synchronize do
    commit()
    do_close()
    @directory.close() if @close_directory
  end
end

#commit ⇒ Object

Commits changes resulting from delete, undelete_all, or set_norm operations.

raises

IOError



# File 'lib/ferret/index/index_reader.rb', line 407

def commit()
  synchronize do
    if @has_changes
      if @directory_owner
        @directory.synchronize do # in- & inter-process sync
          commit_lock = @directory.make_lock(IndexWriter::COMMIT_LOCK_NAME)
          commit_lock.while_locked do
            do_commit()
            @segment_infos.write(@directory)
          end
        end
        if (@write_lock != nil) 
          @write_lock.release()  # release write lock
          @write_lock = nil
        end
      else
        do_commit()
      end
    end
    @has_changes = false
  end
end

#delete(doc_num) ⇒ Object

Deletes the document numbered doc_num. Once a document is deleted it will not appear in TermDocEnum or TermDocPosEnum enumerations. Attempts to read its fields with the #get_document method will result in an error. The presence of this document may still be reflected in the #doc_freq statistic, though this will be corrected eventually as the index is further modified.



# File 'lib/ferret/index/index_reader.rb', line 359

def delete(doc_num)
  synchronize do
    acquire_write_lock() if @directory_owner
    do_delete(doc_num)
    @has_changes = true
  end
  return 1
end

#delete_docs_with_term(term) ⇒ Object

Deletes all documents containing term. This is useful if one uses a document field to hold a unique ID string for the document. Then to delete such a document, one merely constructs a term with the appropriate field and the unique ID string as its text and passes it to this method. Returns the number of documents deleted. See #delete for information about when this deletion will become effective.



# File 'lib/ferret/index/index_reader.rb', line 380

def delete_docs_with_term(term)
  docs = term_docs_for(term)
  if (docs == nil) then return 0 end
  n = 0
  begin 
    while (docs.next?) 
      delete(docs.doc)
      n += 1
    end
  ensure 
    docs.close()
  end
  return n
end
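The unique-ID pattern described above can be sketched without Ferret. Everything here is a toy: the postings hash, the deleted set and the lambda are hypothetical stand-ins for the index, and a [field, text] pair stands in for a Term.

```ruby
require "set"

# Toy inverted index, not Ferret: [field, text] => sorted document numbers.
postings = {
  [:id, "row-7"]  => [2],
  [:tag, "draft"] => [0, 2, 3]
}
deleted = Set.new

# Mirrors the loop in delete_docs_with_term: mark every match, count as you go.
delete_docs_with_term = lambda do |field, text|
  count = 0
  postings.fetch([field, text], []).each do |doc_num|
    deleted << doc_num
    count += 1
  end
  count
end

puts delete_docs_with_term.call(:id, "row-7")   # prints 1
puts delete_docs_with_term.call(:id, "missing") # prints 0
```

Because the ID term matches exactly one document, the call removes exactly that document; a term shared by several documents would remove all of them.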

#deleted?(n) ⇒ Boolean

Returns true if document n has been deleted.

Returns:

  • (Boolean)

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 218

def deleted?(n)
  raise NotImplementedError
end

#do_delete(doc_num) ⇒ Object

Implements deletion of the document numbered doc_num. Applications should call #delete or #delete_docs_with_term instead.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 370

def do_delete(doc_num)
  raise NotImplementedError
end

#do_set_norm(doc, field, value) ⇒ Object

Implements set_norm in subclass.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 255

def do_set_norm(doc, field, value) 
  raise NotImplementedError
end

#doc_freq(t) ⇒ Object

Returns the number of documents containing the term t.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 273

def doc_freq(t)
  raise NotImplementedError
end

#get_document(n) ⇒ Object

Returns the stored fields of the nth Document in this index.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 195

def get_document(n)
  raise NotImplementedError
end

#get_document_with_term(term) ⇒ Object

Returns the first document containing term. This is useful, for example, if we are indexing rows from a database. We can store the id of each row in a field in the index and use this method to get the document by that id. Hence, only one document is returned.

term: The term we are searching for.



# File 'lib/ferret/index/index_reader.rb', line 205

def get_document_with_term(term)
  docs = term_docs_for(term)
  if (docs == nil) then return nil end
  document = nil
  begin 
    document = get_document(docs.doc) if docs.next?
  ensure 
    docs.close()
  end
  return document
end

#get_norms(field, bytes = nil, offset = nil) ⇒ Object

Returns the byte-encoded normalization factor for the named field of every document. This is used by the search code to score documents.

See Field#boost

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 231

def get_norms(field, bytes=nil, offset=nil)
  raise NotImplementedError
end

#get_term_vector(doc_number, field) ⇒ Object

Returns a term vector for the specified document and field. The returned vector contains terms and frequencies for the terms in the specified field of this document, if the field had the storeTermVector flag set. If term vectors were stored with positions or offsets, a TermDocPosEnumVector is returned.

doc_number

document for which the term vector is returned

field

field for which the term vector is returned.

returns

The term vector. May be nil if the field does not exist in the specified document or the term vector was not stored.

raises

IOError if index cannot be accessed

See Field.TermVector

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 165

def get_term_vector(doc_number, field)
  raise NotImplementedError
end

#get_term_vectors(doc_number) ⇒ Object

Returns an array of term vectors for the specified document. The array contains a vector for each vectorized field in the document. Each vector contains terms and frequencies for all terms in a given vectorized field. If no such fields exist, the method returns nil. The term vectors that are returned may be of type TermFreqVector, or of type TermDocPosEnumVector if positions or offsets have been stored.

doc_number

document for which term vectors are returned

returns

array of term vectors. May be nil if no term vectors have been stored for the specified document.

raises

IOError if index cannot be accessed

See Field.TermVector

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 147

def get_term_vectors(doc_number)
  raise NotImplementedError
end

#has_deletions? ⇒ Boolean

Returns true if any documents have been deleted

Returns:

  • (Boolean)

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 223

def has_deletions?()
  raise NotImplementedError
end

#latest? ⇒ Boolean

Returns true if the reader is reading from the latest version of the index.

Returns:

  • (Boolean)


# File 'lib/ferret/index/index_reader.rb', line 349

def latest?()
  SegmentInfos.read_current_version(@directory) == @segment_infos.version()
end

#max_doc ⇒ Object

Returns one greater than the largest possible document number.

This may be used to, e.g., determine how big to allocate an array which will have an element for every document number in an index.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 189

def max_doc()
  raise NotImplementedError
end
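A small sketch of the array-sizing use mentioned above. The numbers are invented rather than read from an index, and the sketch assumes (as in Lucene, which this class mirrors) that num_docs counts only live documents while max_doc also covers the slots of deleted ones.

```ruby
# Toy values, not read from a real index: 5 slots, documents 1 and 3 deleted.
max_doc  = 5
deleted  = [1, 3]
num_docs = max_doc - deleted.size # live documents only (assumption, see above)

# One element per possible document number, including deleted ones.
scores = Array.new(max_doc, 0.0)
scores[max_doc - 1] = 1.5         # the largest valid document number

puts num_docs    # prints 3
puts scores.size # prints 5
```

Sizing by max_doc rather than num_docs keeps every valid document number inside the array bounds even when the index contains deletions.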

#num_docs ⇒ Object

Returns the number of documents in this index.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 181

def num_docs()
  raise NotImplementedError
end

#set_norm(doc, field, value) ⇒ Object

Expert: Resets the normalization factor for the named field of the named document. The norm represents the product of the field’s Field#boost and its Similarity#length_norm length normalization. Thus, to preserve the length normalization values when resetting this, one should base the new value upon the old.

See #get_norms and Similarity#decode_norm.



# File 'lib/ferret/index/index_reader.rb', line 243

def set_norm(doc, field, value)
  synchronize do
    value = Similarity.encode_norm(value) if value.is_a? Float
    if(@directory_owner)
      acquire_write_lock()
    end
    do_set_norm(doc, field, value)
    @has_changes = true
  end
end
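The advice to base the new norm on the old one can be shown with plain arithmetic. This is a sketch with invented values: the norm is kept as a float, and the byte encoding that Similarity.encode_norm applies in the listing above is left out.

```ruby
# Toy numbers: norm = boost * length_norm, kept as plain floats here.
old_boost   = 1.0
length_norm = 0.5
old_norm    = old_boost * length_norm

# Raising the boost to 2.0 while keeping the length normalization:
# scale the old norm by the ratio of the boosts instead of overwriting it.
new_boost = 2.0
new_norm  = old_norm * (new_boost / old_boost)

puts new_norm # prints 1.0, which equals new_boost * length_norm
```

Passing new_boost on its own as the value would discard the length normalization component.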

#term_docs ⇒ Object

Returns an unpositioned TermDocEnum enumerator.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 293

def term_docs()
  raise NotImplementedError
end

#term_docs_for(term) ⇒ Object

Returns an enumeration of all the documents which contain term. For each document, the document number and the frequency of the term in that document are provided, for use in search scoring. Thus, this method implements the mapping:

Term => <doc_num, freq>*

The enumeration is ordered by document number. Each document number is greater than all that precede it in the enumeration.



# File 'lib/ferret/index/index_reader.rb', line 286

def term_docs_for(term)
  term_docs = term_docs()
  term_docs.seek(term)
  return term_docs
end
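To make the mapping concrete, here is a minimal sketch that builds the doc_num and freq pairs for a term over a toy corpus. The function postings_for and the corpus are hypothetical, and whitespace splitting stands in for real analysis; none of this uses Ferret's TermDocEnum.

```ruby
# Toy corpus; array positions act as document numbers.
docs = ["red fish blue fish", "one fish", "no match here"]

# Builds the <doc_num, freq> pairs for a term, ordered by document number.
def postings_for(docs, term)
  docs.each_with_index.map { |text, doc_num|
    freq = text.split.count(term)
    [doc_num, freq] if freq > 0
  }.compact
end

p postings_for(docs, "fish") # prints [[0, 2], [1, 1]]
```

The pairs come out in increasing document number because the corpus is scanned in order, matching the ordering guarantee stated above.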

#term_positions ⇒ Object

Returns an unpositioned TermDocPosEnum enumerator.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 316

def term_positions()
  raise NotImplementedError
end

#term_positions_for(term) ⇒ Object

Returns an enumeration of all the documents which contain term. For each document, in addition to the document number and frequency of the term in that document, a list of all of the ordinal positions of the term in the document is available. Thus, this method implements the mapping:

Term => <doc_num, freq, <pos_1, pos_2, ... pos_freq-1>>*

This positional information facilitates phrase and proximity searching. The enumeration is ordered by document number. Each document number is greater than all that precede it in the enumeration.



# File 'lib/ferret/index/index_reader.rb', line 309

def term_positions_for(term)
  term_positions = term_positions()
  term_positions.seek(term)
  return term_positions
end
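The sketch below shows how positional postings can support a two-word phrase search. It is a toy under stated assumptions: positions_for and phrase_docs are hypothetical helpers, positions are zero-based word offsets from whitespace splitting, and nothing here uses Ferret's TermDocPosEnum.

```ruby
# Toy corpus; array positions act as document numbers.
docs = ["red fish blue fish", "fish red"]

# Builds <doc_num, freq, positions> triples for a term, ordered by doc number.
def positions_for(docs, term)
  docs.each_with_index.map { |text, doc_num|
    words = text.split
    positions = words.each_index.select { |i| words[i] == term }
    [doc_num, positions.size, positions] unless positions.empty?
  }.compact
end

# A document matches the phrase "first second" when some position of +second+
# comes directly after a position of +first+.
def phrase_docs(docs, first, second)
  second_by_doc = positions_for(docs, second)
                  .map { |doc_num, _freq, pos| [doc_num, pos] }.to_h
  positions_for(docs, first).select { |doc_num, _freq, pos|
    following = second_by_doc.fetch(doc_num, [])
    pos.any? { |i| following.include?(i + 1) }
  }.map(&:first)
end

p positions_for(docs, "fish")      # prints [[0, 2, [1, 3]], [1, 1, [0]]]
p phrase_docs(docs, "red", "fish") # prints [0]
```

Both documents contain both words, but only document 0 has them adjacent and in order, which is the distinction that positions make possible.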

#terms ⇒ Object

Returns an enumeration of all the terms in the index. Each term is greater than all that precede it in the enumeration.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 261

def terms()
  raise NotImplementedError
end

#terms_from(t) ⇒ Object

Returns an enumeration of all terms after a given term.

Each term is greater than all that precede it in the enumeration.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 268

def terms_from(t)
  raise NotImplementedError
end

#undelete_all ⇒ Object

Undeletes all documents currently marked as deleted in this index.



# File 'lib/ferret/index/index_reader.rb', line 396

def undelete_all()
  synchronize do
    acquire_write_lock() if @directory_owner
    do_undelete_all()
    @has_changes = true
  end
end