Class: Ferret::Index::IndexReader

Inherits:
Object
Includes:
MonitorMixin
Defined in:
lib/ferret/index/index_reader.rb

Overview

IndexReader is an abstract class that provides an interface for accessing an index. Searching an index is done entirely through this abstract interface, so any class that implements it is searchable.

Concrete subclasses of IndexReader are usually constructed with a call to the class method IndexReader.open.

For efficiency, documents in this API are often referred to via _document numbers_, non-negative integers which each name a unique document in the index. These document numbers are ephemeral, i.e. they may change as documents are added to and deleted from an index. Clients should thus not rely on a given document having the same number between sessions.

An IndexReader can be opened on a directory for which an IndexWriter is already open, but in that case it cannot be used to delete documents from the index.

Direct Known Subclasses

MultiReader, SegmentReader

Defined Under Namespace

Classes: FieldOption

Constant Summary

FILENAME_EXTENSIONS =

This array contains all filename extensions used by Lucene’s index files, with one exception, namely the extension made up from .f + a number. Also note that two of Lucene’s files (deletable and segments) don’t have any filename extension.

["cfs",
"fnm",
"fdx",
"fdt",
"tii",
"tis",
"frq",
"prx",
"del",
"tvx",
"tvd",
"tvf",
"tvp"]
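
As a rough illustration of how this list might be used, the sketch below classifies file names according to the rules stated above. The helper `index_file?` is hypothetical and not part of Ferret; the array is copied from the constant, and the two extension-less names and the `.f` + number pattern come from the description.

```ruby
# Copied from the constant documented above.
FILENAME_EXTENSIONS = %w[cfs fnm fdx fdt tii tis frq prx del tvx tvd tvf tvp].freeze

# Hypothetical helper, not part of Ferret: guesses whether a file name
# looks like an index file, based only on the rules stated above.
def index_file?(name)
  return true if %w[deletable segments].include?(name) # extension-less files
  ext = File.extname(name).delete_prefix(".")
  return true if FILENAME_EXTENSIONS.include?(ext)
  ext.match?(/\Af\d+\z/) # the ".f + a number" exception
end
```

With this helper, `index_file?("_1.cfs")`, `index_file?("segments")` and `index_file?("_1.f0")` are all true, while `index_file?("notes.txt")` is false.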

Constructor Details

#initialize(directory, segment_infos = nil, close_directory = false, directory_owner = false) ⇒ IndexReader

To create an IndexReader use the IndexReader.open method. This method should only be used by subclasses.

directory

Directory where IndexReader files reside.

segment_infos

Used for write-l

close_directory

close the directory when the index reader is closed



# File 'lib/ferret/index/index_reader.rb', line 71

def initialize(directory, segment_infos = nil,
               close_directory = false, directory_owner = false)
  super()
  @directory = directory
  @close_directory = close_directory
  @segment_infos = segment_infos
  @directory_owner = directory_owner

  @has_changes = false
  @stale = false
  @write_lock = nil

  #ObjectSpace.define_finalizer(self, lambda { |id| @write_lock.release() if @write_lock})
end

Instance Attribute Details

#directory ⇒ Object (readonly)

Returns the value of attribute directory.



# File 'lib/ferret/index/index_reader.rb', line 40

def directory
  @directory
end

Class Method Details

.get_current_version(directory) ⇒ Object

Reads version number from segments files. The version number counts the number of changes of the index.

directory

where the index resides.

returns

version number.

raises

IOError if segments file cannot be read.



# File 'lib/ferret/index/index_reader.rb', line 130

def IndexReader.get_current_version(directory)
  return SegmentInfos.read_current_version(directory)
end

.index_exists?(directory) ⇒ Boolean

Returns true if an index exists at the specified directory. Returns false if the directory does not exist or if there is no index in it.

directory

the directory to check for an index

returns

true if an index exists; false otherwise

raises

IOError if there is a problem with accessing the index

Returns:

  • (Boolean)


# File 'lib/ferret/index/index_reader.rb', line 176

def IndexReader.index_exists?(directory)
  return directory.exists?("segments")
end

.open(directory, close_directory = true, infos = nil) ⇒ Object

Returns an index reader to read the index in the directory

directory

This can be a Directory object, nil (a RAMDirectory is created), or a path (an FSDirectory is created). If you choose the second or third option, you should leave close_directory as true and infos as nil.

close_directory

True if you want the IndexReader to close the directory when the IndexReader is closed. You’ll want to set this to false if other objects are using the same directory object.

infos

Expert: This can be used to read a different version of the index but should really be left alone.



# File 'lib/ferret/index/index_reader.rb', line 99

def IndexReader.open(directory, close_directory = true, infos = nil)
  if directory.nil?
    directory = Ferret::Store::RAMDirectory.new
  elsif directory.is_a?(String)
    directory = Ferret::Store::FSDirectory.new(directory, false)
  end
  directory.synchronize do # in- & inter-process sync
    commit_lock = directory.make_lock(IndexWriter::COMMIT_LOCK_NAME)
    commit_lock.while_locked() do
      if infos.nil?
        infos = SegmentInfos.new()
        infos.read(directory)
      end
      if (infos.size() == 1) # index is optimized
        return SegmentReader.get(infos[0], infos, close_directory)
      end
      readers = Array.new(infos.size)
      infos.size.times do |i|
        readers[i] = SegmentReader.get(infos[i])
      end
      return MultiReader.new(readers, directory, infos, close_directory)
    end
  end
end
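
The choice of reader type in the listing above depends only on the number of segments. The following is a minimal sketch of that branching, using stand-in structs rather than Ferret's real SegmentReader and MultiReader; `pick_reader` and both struct names are hypothetical, and locking and directory handling are left out.

```ruby
# Stand-ins for illustration only; the real classes live in Ferret.
SegmentReaderSketch = Struct.new(:segment)
MultiReaderSketch   = Struct.new(:readers)

# Mirrors the branching in IndexReader.open: one segment yields a single
# segment reader, any other count is wrapped in a multi reader.
def pick_reader(segment_names)
  if segment_names.size == 1 # index is optimized
    SegmentReaderSketch.new(segment_names[0])
  else
    MultiReaderSketch.new(segment_names.map { |s| SegmentReaderSketch.new(s) })
  end
end
```

For example, `pick_reader(["_1"])` gives a single segment reader, while `pick_reader(["_1", "_2"])` gives a multi reader wrapping two segment readers.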

Instance Method Details

#acquire_write_lock ⇒ Object

Tries to acquire the WriteLock on this directory.

This method is only valid if this IndexReader is the directory owner.

raises

IOError If WriteLock cannot be acquired.



# File 'lib/ferret/index/index_reader.rb', line 340

def acquire_write_lock()
  if @stale
    raise IOError, "IndexReader out of date and no longer valid for delete, undelete, or set_norm operations"
  end

  if (@write_lock == nil) 
    @write_lock = @directory.make_lock(IndexWriter::WRITE_LOCK_NAME)
    if not @write_lock.obtain(IndexWriter::WRITE_LOCK_TIMEOUT) # obtain write lock
      raise IOError, "Index locked for write: #{@write_lock}"
    end

    # we have to check whether index has changed since this reader was opened.
    # if so, this reader is no longer valid for deletion
    if (SegmentInfos.read_current_version(@directory) > @segment_infos.version()) 
      @stale = true
      @write_lock.release()
      @write_lock = nil
      raise IOError, "IndexReader out of date and no longer valid for delete, undelete, or set_norm operations"
    end
  end
end
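
The staleness rule in the listing above can be reduced to a version comparison. The sketch below is an assumption-laden simplification: `StaleCheckSketch` is a hypothetical class that tracks only version numbers and leaves out the write lock entirely.

```ruby
# Hypothetical stand-in: tracks only the version numbers involved in the check.
class StaleCheckSketch
  def initialize(opened_at_version)
    @opened_at_version = opened_at_version
    @stale = false
  end

  # Raises IOError when the index is newer than this reader's snapshot.
  # Once a reader has been marked stale it stays stale.
  def check!(current_version)
    @stale = true if current_version > @opened_at_version
    raise IOError, "IndexReader out of date" if @stale
    true
  end
end
```

A reader opened at version 3 passes `check!(3)`, fails `check!(4)`, and keeps failing afterwards, which matches how `@stale` is never reset in the original method.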

#close ⇒ Object

Closes files associated with this index. Also saves any new deletions to disk. No other methods should be called after this has been called.



# File 'lib/ferret/index/index_reader.rb', line 448

def close()
  synchronize do
    commit()
    do_close()
    @directory.close() if @close_directory
  end
end

#commit ⇒ Object

Commit changes resulting from delete, undelete_all, or set_norm operations

raises

IOError



# File 'lib/ferret/index/index_reader.rb', line 422

def commit()
  synchronize do
    if @has_changes
      if @directory_owner
        @directory.synchronize do # in- & inter-process sync
          commit_lock = @directory.make_lock(IndexWriter::COMMIT_LOCK_NAME)
          commit_lock.while_locked do
            do_commit()
            @segment_infos.write(@directory)
          end
        end
        if (@write_lock != nil) 
          @write_lock.release()  # release write lock
          @write_lock = nil
        end
      else
        do_commit()
      end
    end
    @has_changes = false
  end
end
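
The interplay between the `@has_changes` flag, commit and close may be easier to see in isolation. This is a minimal sketch with a hypothetical class, `CommitSketch`, that only counts how often pending changes are flushed; it omits locking, synchronization and the directory-owner branch.

```ruby
# Hypothetical sketch: counts how often pending changes are flushed.
class CommitSketch
  attr_reader :flushes

  def initialize
    @has_changes = false
    @flushes = 0
  end

  def delete(_doc_num)
    @has_changes = true # a pending change, not yet written
  end

  def commit
    do_commit if @has_changes
    @has_changes = false
  end

  def close
    commit # close commits first, as in the source above
  end

  private

  # Stand-in for the subclass hook that writes changes out.
  def do_commit
    @flushes += 1
  end
end
```

Calling commit on an unchanged reader does nothing, and a second commit after close does not flush again because the flag has been cleared.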

#delete(doc_num) ⇒ Object

Deletes the document numbered doc_num. Once a document is deleted it will not appear in TermDocEnum or TermPositions enumerations. Attempts to read its fields with the #get_document method will result in an error. The presence of this document may still be reflected in the #doc_freq statistic, though this will be corrected eventually as the index is further modified.



# File 'lib/ferret/index/index_reader.rb', line 374

def delete(doc_num)
  synchronize do
    acquire_write_lock() if @directory_owner
    do_delete(doc_num)
    @has_changes = true
  end
  return 1
end

#delete_docs_with_term(term) ⇒ Object

Deletes all documents containing term. This is useful if one uses a document field to hold a unique ID string for the document. Then to delete such a document, one merely constructs a term with the appropriate field and the unique ID string as its text and passes it to this method. Returns the number of documents deleted. See #delete for information about when this deletion will become effective.



# File 'lib/ferret/index/index_reader.rb', line 395

def delete_docs_with_term(term)
  docs = term_docs_for(term)
  if (docs == nil) then return 0 end
  n = 0
  begin 
    while (docs.next?) 
      delete(docs.doc)
      n += 1
    end
  ensure 
    docs.close()
  end
  return n
end
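
To show the unique-ID use case described above without a real index, here is a small in-memory sketch. `TermDocsSketch` and `TinyReader` are hypothetical stand-ins, terms are plain strings rather than Term objects, and deletion simply records the document number.

```ruby
# Hypothetical in-memory stand-ins; none of these classes are Ferret's.
class TermDocsSketch
  attr_reader :doc

  def initialize(doc_nums)
    @doc_nums = doc_nums
    @pos = -1
  end

  # Advances to the next match; returns false when there are no more.
  def next?
    @pos += 1
    @doc = @doc_nums[@pos]
    !@doc.nil?
  end

  def close; end
end

class TinyReader
  attr_reader :deleted

  # postings: { "field:text" => [doc_num, ...] }
  def initialize(postings)
    @postings = postings
    @deleted = []
  end

  def term_docs_for(term)
    TermDocsSketch.new(@postings.fetch(term, []))
  end

  def delete(doc_num)
    @deleted << doc_num
  end

  # Same loop shape as the method above: delete each match, always close.
  def delete_docs_with_term(term)
    docs = term_docs_for(term)
    n = 0
    begin
      while docs.next?
        delete(docs.doc)
        n += 1
      end
    ensure
      docs.close
    end
    n
  end
end

reader = TinyReader.new({ "id:42" => [7], "tag:ruby" => [1, 7, 9] })
reader.delete_docs_with_term("id:42") # => 1
```

Because the `id` field is unique per document in this toy data, exactly one document is removed; a term that matches nothing returns 0.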

#deleted?(n) ⇒ Boolean

Returns true if document n has been deleted

Returns:

  • (Boolean)

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 218

def deleted?(n)
  raise NotImplementedError
end

#do_delete(doc_num) ⇒ Object

Implements deletion of the document numbered doc_num. Applications should call #delete or #delete_docs_with_term instead.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 385

def do_delete(doc_num)
  raise NotImplementedError
end

#do_set_norm(doc, field, value) ⇒ Object

Implements set_norm in subclass.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 270

def do_set_norm(doc, field, value) 
  raise NotImplementedError
end

#doc_freq(t) ⇒ Object

Returns the number of documents containing the term t.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 288

def doc_freq(t)
  raise NotImplementedError
end

#get_document(n) ⇒ Object

Returns the stored fields of the nth Document in this index.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 195

def get_document(n)
  raise NotImplementedError
end

#get_document_with_term(term) ⇒ Object

Returns the first document containing the given term. This is useful, for example, when indexing rows from a database: store the id of each row in a field in the index and use this method to get the document by that id. Only one document is returned.

term: The term we are searching for.



# File 'lib/ferret/index/index_reader.rb', line 205

def get_document_with_term(term)
  docs = term_docs_for(term)
  if (docs == nil) then return nil end
  document = nil
  begin 
    document = get_document(docs.doc) if docs.next?
  ensure 
    docs.close()
  end
  return document
end

#get_norms(field) ⇒ Object

Returns the byte-encoded normalization factor for the named field of every document. This is used by the search code to score documents.

See Field#boost

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 238

def get_norms(field)
  raise NotImplementedError
end

#get_norms_into(field, bytes, offset) ⇒ Object

Read norms into a pre-allocated array. This is used as an optimization of get_norms.

See Field#boost

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 246

def get_norms_into(field, bytes, offset)
  raise NotImplementedError
end

#get_term_vector(doc_number, field) ⇒ Object

Return a term vector for the specified document and field. The returned vector contains terms and frequencies for the terms in the specified field of this document, if the field had the storeTermVector flag set. If termvectors had been stored with positions or offsets, a TermDocPosEnumVector is returned.

doc_number

document for which the term vector is returned

field

field for which the term vector is returned.

returns

term vector. May be nil if the field does not exist in the specified document or the term vector was not stored.

raises

IOError if index cannot be accessed

See Field::TermVector

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 165

def get_term_vector(doc_number, field)
  raise NotImplementedError
end

#get_term_vectors(doc_number) ⇒ Object

Return an array of term vectors for the specified document. The array contains a vector for each vectorized field in the document. Each vector contains terms and frequencies for all terms in a given vectorized field. If no such fields exist, the method returns nil. The term vectors that are returned may be of type TermFreqVector, or of type TermDocPosEnumVector if positions or offsets have been stored.

doc_number

document for which term vectors are returned

returns

array of term vectors. May be nil if no term vectors have been stored for the specified document.

raises

IOError if index cannot be accessed

See Field::TermVector

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 147

def get_term_vectors(doc_number)
  raise NotImplementedError
end

#has_deletions? ⇒ Boolean

Returns true if any documents have been deleted

Returns:

  • (Boolean)

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 223

def has_deletions?()
  raise NotImplementedError
end

#has_norms?(field) ⇒ Boolean

Returns true if there are norms stored for this field.

Returns:

  • (Boolean)


# File 'lib/ferret/index/index_reader.rb', line 228

def has_norms?(field)
  # backward compatible implementation.
  # SegmentReader has an efficient implementation.
  return (get_norms(field) != nil)
end

#latest? ⇒ Boolean

Returns true if the reader is reading from the latest version of the index.

Returns:

  • (Boolean)


# File 'lib/ferret/index/index_reader.rb', line 364

def latest?()
  SegmentInfos.read_current_version(@directory) == @segment_infos.version()
end

#max_doc ⇒ Object

Returns one greater than the largest possible document number.

This may be used to, e.g., determine how big to allocate an array which will have an element for every document number in an index.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 189

def max_doc()
  raise NotImplementedError
end
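
The array-sizing advice above matters once documents have been deleted. The sketch below uses made-up numbers and assumes, as in Lucene, that num_docs excludes deleted documents while max_doc does not; that assumption is not stated on this page.

```ruby
# Hypothetical toy index: document numbers 0..4, with 1 and 3 deleted.
max_doc = 5
deleted = [1, 3]
num_docs = max_doc - deleted.size # assumption: deleted docs are not counted

# Size per-document arrays by max_doc, not num_docs, so that every
# document number (including the gaps left by deletions) is a valid index.
scores = Array.new(max_doc, 0.0)
scores[4] = 1.5 # the highest document number still fits
```

An array sized by num_docs would have only three slots here, so writing to document number 4 would silently grow it and leave nil gaps.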

#num_docs ⇒ Object

Returns the number of documents in this index.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 181

def num_docs()
  raise NotImplementedError
end

#set_norm(doc, field, value) ⇒ Object

Expert: Resets the normalization factor for the named field of the named document. The norm represents the product of the field’s Field#boost and its Similarity#length_norm length normalization. Thus, to preserve the length normalization values when resetting this, one should base the new value upon the old.

See #get_norms and Similarity#decode_norm.



# File 'lib/ferret/index/index_reader.rb', line 258

def set_norm(doc, field, value)
  synchronize do
    value = Similarity.encode_norm(value) if value.is_a? Float
    if(@directory_owner)
      acquire_write_lock()
    end
    do_set_norm(doc, field, value)
    @has_changes = true
  end
end

#term_docs ⇒ Object

Returns an unpositioned TermDocEnum enumerator.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 308

def term_docs()
  raise NotImplementedError
end

#term_docs_for(term) ⇒ Object

Returns an enumeration of all the documents which contain term. For each document, the document number and the frequency of the term in that document are provided, for use in search scoring. Thus, this method implements the mapping:

Term => <doc_num, freq>*

The enumeration is ordered by document number. Each document number is greater than all that precede it in the enumeration.



# File 'lib/ferret/index/index_reader.rb', line 301

def term_docs_for(term)
  term_docs = term_docs()
  term_docs.seek(term)
  return term_docs
end
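
The mapping from a term to (doc_num, freq) pairs can be illustrated on a toy corpus. This sketch does not use Ferret; `TOY_DOCS` and `postings_for` are hypothetical names, and terms are plain whitespace-separated words.

```ruby
# Hypothetical toy corpus: document number => text.
TOY_DOCS = { 0 => "red fox", 1 => "red red wine", 2 => "blue sky" }.freeze

# Builds the <doc_num, freq> pairs for one term, ordered by document number.
def postings_for(term)
  TOY_DOCS.sort.filter_map do |doc_num, text|
    freq = text.split.count(term)
    [doc_num, freq] if freq > 0
  end
end

postings_for("red") # => [[0, 1], [1, 2]]
```

Document 2 does not contain the term, so it does not appear in the result, and the pairs come back in ascending document-number order as described above.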

#term_positions ⇒ Object

Returns an unpositioned TermDocPosEnum enumerator.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 331

def term_positions()
  raise NotImplementedError
end

#term_positions_for(term) ⇒ Object

Returns an enumeration of all the documents which contain term. For each document, in addition to the document number and frequency of the term in that document, a list of all of the ordinal positions of the term in the document is available. Thus, this method implements the mapping:

Term => <doc_num, freq, <pos_1, pos_2, ... pos_freq-1>>*

This positional information facilitates phrase and proximity searching. The enumeration is ordered by document number. Each document number is greater than all that precede it in the enumeration.



# File 'lib/ferret/index/index_reader.rb', line 324

def term_positions_for(term)
  term_positions = term_positions()
  term_positions.seek(term)
  return term_positions
end
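
To see why positions help with phrase searching, consider this minimal sketch over hand-written positional postings. `TOY_POSITIONS` and `phrase_docs` are hypothetical; a real search would walk two positional enumerators rather than hashes.

```ruby
# Hypothetical positional postings: term => { doc_num => [positions] }.
TOY_POSITIONS = {
  "quick" => { 0 => [1], 1 => [0, 4] },
  "fox"   => { 0 => [3], 1 => [5] }
}.freeze

# Returns document numbers where `second` directly follows `first`.
def phrase_docs(first, second)
  a = TOY_POSITIONS.fetch(first, {})
  b = TOY_POSITIONS.fetch(second, {})
  (a.keys & b.keys).sort.select do |doc|
    a[doc].any? { |pos| b[doc].include?(pos + 1) }
  end
end

phrase_docs("quick", "fox") # => [1]
```

Both documents contain both words, but only in document 1 does "fox" sit at the position immediately after "quick" (4 and 5), which frequencies alone could not reveal.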

#terms ⇒ Object

Returns an enumeration of all the terms in the index. Each term is greater than all that precede it in the enumeration.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 276

def terms()
  raise NotImplementedError
end

#terms_from(t) ⇒ Object

Returns an enumeration of all terms after a given term.

Each term is greater than all that precede it in the enumeration.

Raises:

  • (NotImplementedError)


# File 'lib/ferret/index/index_reader.rb', line 283

def terms_from(t)
  raise NotImplementedError
end

#undelete_all ⇒ Object

Undeletes all documents currently marked as deleted in this index.



# File 'lib/ferret/index/index_reader.rb', line 411

def undelete_all()
  synchronize do
    acquire_write_lock() if @directory_owner
    do_undelete_all()
    @has_changes = true
  end
end