Class: Ferret::Index::IndexReader
- Inherits:
-
Object
- Object
- Ferret::Index::IndexReader
- Includes:
- MonitorMixin
- Defined in:
- lib/ferret/index/index_reader.rb
Overview
IndexReader is an abstract class that provides an interface for accessing an index. Searching an index is done entirely through this abstract interface, so any class that implements it is searchable.
Concrete subclasses of IndexReader are usually constructed with a call to the class method IndexReader.open.
For efficiency, this API often refers to documents via _document numbers_, non-negative integers which each name a unique document in the index. These document numbers are ephemeral, i.e. they may change as documents are added to and deleted from an index. Clients should thus not rely on a given document having the same number between sessions.
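The following is a minimal sketch of why document numbers are ephemeral. ToyIndex is a hypothetical stand-in, not a Ferret class, and it assumes that renumbering happens when deleted slots are compacted away (as in a segment merge).

```ruby
# Hypothetical stand-in for an index; not part of Ferret.
class ToyIndex
  def initialize
    @docs = [] # the position in this array plays the role of the document number
  end

  def add(doc)
    @docs << doc
    @docs.size - 1
  end

  # Marks a document as deleted; its slot stays until the next merge.
  def delete(doc_num)
    @docs[doc_num] = nil
  end

  # Simulates a merge, which drops deleted slots and renumbers the rest.
  def merge!
    @docs.compact!
  end

  def doc_num_of(id)
    @docs.index { |doc| doc && doc[:id] == id }
  end
end

index = ToyIndex.new
index.add({ id: "a" })
index.add({ id: "b" })
index.add({ id: "c" })

before = index.doc_num_of("c")
index.delete(1)
index.merge!
after = index.doc_num_of("c")
```

In this sketch the same document is found under a different number after the merge, which is why a stable identifier should be stored in a field rather than derived from the document number.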
An IndexReader can be opened on a directory for which an IndexWriter is already open, but in that case it cannot be used to delete documents from the index.
Direct Known Subclasses
Defined Under Namespace
Classes: FieldOption
Constant Summary
- FILENAME_EXTENSIONS =
This array contains all filename extensions used by Lucene’s index files, with one exception, namely the extension made up from .f + a number. Also note that two of Lucene’s files (deletable and segments) don’t have any filename extension.
["cfs", "fnm", "fdx", "fdt", "tii", "tis", "frq", "prx", "del", "tvx", "tvd", "tvf", "tvp"]
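As a sketch of how the constant and its two stated exceptions could be combined, the helper below decides whether a filename looks like an index file. The helper name index_file? and the sample filenames are hypothetical and not part of Ferret.

```ruby
FILENAME_EXTENSIONS = %w[cfs fnm fdx fdt tii tis frq prx del tvx tvd tvf tvp]

# Hypothetical helper: true if the name looks like one of the index files
# described above.
def index_file?(name)
  # The two files that have no extension at all.
  return true if %w[deletable segments].include?(name)

  ext = File.extname(name).delete_prefix(".")
  # Either a listed extension, or "f" followed by a number.
  FILENAME_EXTENSIONS.include?(ext) || ext.match?(/\Af\d+\z/)
end

cfs_file = index_file?("_1.cfs")
norm_file = index_file?("_1.f0")
segments_file = index_file?("segments")
other_file = index_file?("notes.txt")
```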
Instance Attribute Summary
-
#directory ⇒ Object
readonly
Returns the value of attribute directory.
Class Method Summary
-
.get_current_version(directory) ⇒ Object
Reads version number from segments files.
-
.index_exists?(directory) ⇒ Boolean
Returns true if an index exists at the specified directory.
-
.open(directory, close_directory = true, infos = nil) ⇒ Object
Returns an index reader to read the index in the directory.
Instance Method Summary
-
#acquire_write_lock ⇒ Object
Tries to acquire the WriteLock on this directory.
-
#close ⇒ Object
Closes files associated with this index.
-
#commit ⇒ Object
Commit changes resulting from delete, undelete_all, or set_norm operations.
-
#delete(doc_num) ⇒ Object
Deletes the document numbered doc_num.
-
#delete_docs_with_term(term) ⇒ Object
Deletes all documents containing term.
-
#deleted?(n) ⇒ Boolean
Returns true if document n has been deleted.
-
#do_delete(doc_num) ⇒ Object
Implements deletion of the document numbered doc_num.
-
#do_set_norm(doc, field, value) ⇒ Object
Implements set_norm in subclass.
-
#doc_freq(t) ⇒ Object
Returns the number of documents containing the term t.
-
#get_document(n) ⇒ Object
Returns the stored fields of the nth Document in this index.
-
#get_document_with_term(term) ⇒ Object
Returns the first document with the term term.
-
#get_norms(field, bytes = nil, offset = nil) ⇒ Object
Returns the byte-encoded normalization factor for the named field of every document.
-
#get_term_vector(doc_number, field) ⇒ Object
Return a term vector for the specified document and field.
-
#get_term_vectors(doc_number) ⇒ Object
Return an array of term vectors for the specified document.
-
#has_deletions? ⇒ Boolean
Returns true if any documents have been deleted.
-
#initialize(directory, segment_infos = nil, close_directory = false, directory_owner = false) ⇒ IndexReader
constructor
To create an IndexReader use the IndexReader.open method.
-
#latest? ⇒ Boolean
Returns true if the reader is reading from the latest version of the index.
-
#max_doc ⇒ Object
Returns one greater than the largest possible document number.
-
#num_docs ⇒ Object
Returns the number of documents in this index.
-
#set_norm(doc, field, value) ⇒ Object
Expert: Resets the normalization factor for the named field of the named document.
-
#term_docs ⇒ Object
Returns an unpositioned TermDocEnum enumerator.
-
#term_docs_for(term) ⇒ Object
Returns an enumeration of all the documents which contain term.
-
#term_positions ⇒ Object
Returns an unpositioned TermDocPosEnum enumerator.
-
#term_positions_for(term) ⇒ Object
Returns an enumeration of all the documents which contain term.
-
#terms ⇒ Object
Returns an enumeration of all the terms in the index.
-
#terms_from(t) ⇒ Object
Returns an enumeration of all terms after a given term.
-
#undelete_all ⇒ Object
Undeletes all documents currently marked as deleted in this index.
Constructor Details
#initialize(directory, segment_infos = nil, close_directory = false, directory_owner = false) ⇒ IndexReader
To create an IndexReader use the IndexReader.open method. This method should only be used by subclasses.
- directory
-
Directory where IndexReader files reside.
- segment_infos
-
Used for write-l
- close_directory
-
close the directory when the index reader is closed
# File 'lib/ferret/index/index_reader.rb', line 71
def initialize(directory, segment_infos = nil, close_directory = false, directory_owner = false)
  super()
  @directory = directory
  @close_directory = close_directory
  @segment_infos = segment_infos
  @directory_owner = directory_owner
  @has_changes = false
  @stale = false
  @write_lock = nil
  #ObjectSpace.define_finalizer(self, lambda { |id| @write_lock.release() if @write_lock})
end
Instance Attribute Details
#directory ⇒ Object (readonly)
Returns the value of attribute directory.
# File 'lib/ferret/index/index_reader.rb', line 40
def directory
  @directory
end
Class Method Details
.get_current_version(directory) ⇒ Object
Reads version number from segments files. The version number counts the number of changes of the index.
- directory
-
where the index resides.
- returns
-
version number.
- raises
-
IOError if segments file cannot be read.
# File 'lib/ferret/index/index_reader.rb', line 130
def IndexReader.get_current_version(directory)
  return SegmentInfos.read_current_version(directory)
end
.index_exists?(directory) ⇒ Boolean
Returns true if an index exists at the specified directory. Returns false if the directory does not exist or if there is no index in it.
- directory
-
the directory to check for an index
- returns
-
true if an index exists; false otherwise
- raises
-
IOError if there is a problem with accessing the index
# File 'lib/ferret/index/index_reader.rb', line 176
def IndexReader.index_exists?(directory)
  return directory.exists?("segments")
end
.open(directory, close_directory = true, infos = nil) ⇒ Object
Returns an index reader to read the index in the directory.
- directory
-
This can be a Directory object, nil (a RAMDirectory is created), or a path given as a String (an FSDirectory is created). If you choose the second or third option, you should leave close_directory as true and infos as nil.
- close_directory
-
True if you want the IndexReader to close the directory when the IndexReader is closed. You’ll want to set this to false if other objects are using the same directory object.
- infos
-
Expert: This can be used to read a different version of the index but should really be left alone.
# File 'lib/ferret/index/index_reader.rb', line 99
def IndexReader.open(directory, close_directory = true, infos = nil)
  if directory.nil?
    directory = Ferret::Store::RAMDirectory.new
  elsif directory.is_a?(String)
    directory = Ferret::Store::FSDirectory.new(directory, true)
  end
  directory.synchronize do # in- & inter-process sync
    commit_lock = directory.make_lock(IndexWriter::COMMIT_LOCK_NAME)
    commit_lock.while_locked() do
      if infos.nil?
        infos = SegmentInfos.new()
        infos.read(directory)
      end
      if (infos.size() == 1) # index is optimized
        return SegmentReader.get(infos[0], infos, close_directory)
      end
      readers = Array.new(infos.size)
      infos.size.times do |i|
        readers[i] = SegmentReader.get(infos[i])
      end
      return MultiReader.new(readers, directory, infos, close_directory)
    end
  end
end
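The two decisions made by .open can be sketched in isolation: how the directory argument is normalized, and which kind of reader is chosen from the segment count. ToyRAMDirectory, ToyFSDirectory, resolve_directory and reader_kind are hypothetical names used only for illustration; they are not Ferret classes or methods, and the locking is left out.

```ruby
# Hypothetical stand-ins for Ferret::Store::RAMDirectory and FSDirectory.
class ToyRAMDirectory; end

class ToyFSDirectory
  attr_reader :path

  def initialize(path)
    @path = path
  end
end

# Mirrors the argument handling at the top of IndexReader.open.
def resolve_directory(directory)
  if directory.nil?
    ToyRAMDirectory.new
  elsif directory.is_a?(String)
    ToyFSDirectory.new(directory)
  else
    directory # assumed to already be a directory object
  end
end

# Mirrors the choice between a single-segment and a multi-segment reader.
def reader_kind(segment_count)
  segment_count == 1 ? :segment_reader : :multi_reader
end

shared = ToyRAMDirectory.new
in_memory = resolve_directory(nil)
on_disk = resolve_directory("/tmp/toy_index")
passed_through = resolve_directory(shared)
```

Only the third case hands an existing object to the reader, which is the situation where passing close_directory as false matters.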
Instance Method Details
#acquire_write_lock ⇒ Object
Tries to acquire the WriteLock on this directory.
This method is only valid if this IndexReader is directory owner.
- raises
-
IOError If WriteLock cannot be acquired.
# File 'lib/ferret/index/index_reader.rb', line 325
def acquire_write_lock()
  if @stale
    raise IOError, "IndexReader out of date and no longer valid for delete, undelete, or set_norm operations"
  end

  if (@write_lock == nil)
    @write_lock = @directory.make_lock(IndexWriter::WRITE_LOCK_NAME)
    if not @write_lock.obtain(IndexWriter::WRITE_LOCK_TIMEOUT) # obtain write lock
      raise IOError, "Index locked for write: " + @write_lock
    end

    # we have to check whether index has changed since this reader was opened.
    # if so, this reader is no longer valid for deletion
    if (SegmentInfos.read_current_version(@directory) > @segment_infos.version())
      @stale = true
      @write_lock.release()
      @write_lock = nil
      raise IOError, "IndexReader out of date and no longer valid for delete, undelete, or set_norm operations"
    end
  end
end
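The staleness check in the source above can be illustrated with a small simulation. VersionedDirectory and LockingReader are hypothetical stand-ins; a boolean flag takes the place of the real lock object and an integer takes the place of the segments version.

```ruby
# Hypothetical directory that tracks an index version and a write lock flag.
class VersionedDirectory
  attr_accessor :version, :locked

  def initialize
    @version = 0
    @locked = false
  end
end

class LockingReader
  def initialize(directory)
    @directory = directory
    @opened_version = directory.version # version seen when the reader was opened
    @stale = false
    @has_lock = false
  end

  def acquire_write_lock
    raise IOError, "reader out of date" if @stale
    return if @has_lock
    raise IOError, "index locked for write" if @directory.locked

    @directory.locked = true
    # If the index changed after this reader was opened, give the lock back.
    if @directory.version > @opened_version
      @stale = true
      @directory.locked = false
      raise IOError, "reader out of date"
    end
    @has_lock = true
  end
end

dir = VersionedDirectory.new
old_reader = LockingReader.new(dir)
dir.version += 1 # some other writer committed in the meantime

error = begin
  old_reader.acquire_write_lock
  nil
rescue IOError => e
  e.message
end

new_reader = LockingReader.new(dir)
new_reader.acquire_write_lock
```

The reader opened before the version change is rejected and releases the lock, so a reader opened afterwards can obtain it.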
#close ⇒ Object
Closes files associated with this index. Also saves any new deletions to disk. No other methods should be called after this has been called.
# File 'lib/ferret/index/index_reader.rb', line 433
def close()
  synchronize do
    commit()
    do_close()
    @directory.close() if @close_directory
  end
end
#commit ⇒ Object
Commits changes resulting from delete, undelete_all, or set_norm operations.
- raises
-
IOError
# File 'lib/ferret/index/index_reader.rb', line 407
def commit()
  synchronize do
    if @has_changes
      if @directory_owner
        @directory.synchronize do # in- & inter-process sync
          commit_lock = @directory.make_lock(IndexWriter::COMMIT_LOCK_NAME)
          commit_lock.while_locked do
            do_commit()
            @segment_infos.write(@directory)
          end
        end
        if (@write_lock != nil)
          @write_lock.release() # release write lock
          @write_lock = nil
        end
      else
        do_commit()
      end
    end
    @has_changes = false
  end
end
#delete(doc_num) ⇒ Object
Deletes the document numbered doc_num. Once a document is deleted it will not appear in TermDocEnum or TermDocPosEnum enumerations. Attempts to read its fields with the #get_document method will result in an error. The presence of this document may still be reflected in the #doc_freq statistic, though this will be corrected eventually as the index is further modified.
# File 'lib/ferret/index/index_reader.rb', line 359
def delete(doc_num)
  synchronize do
    acquire_write_lock() if @directory_owner
    do_delete(doc_num)
    @has_changes = true
  end
  return 1
end
#delete_docs_with_term(term) ⇒ Object
Deletes all documents containing term. This is useful if one uses a document field to hold a unique ID string for the document. Then to delete such a document, one merely constructs a term with the appropriate field and the unique ID string as its text and passes it to this method. Returns the number of documents deleted. See #delete for information about when this deletion will become effective.
# File 'lib/ferret/index/index_reader.rb', line 380
def delete_docs_with_term(term)
  docs = term_docs_for(term)
  if (docs == nil) then return 0 end
  n = 0
  begin
    while (docs.next?)
      delete(docs.doc)
      n += 1
    end
  ensure
    docs.close()
  end
  return n
end
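The unique-ID pattern described above can be sketched with an in-memory reader. IdReader is hypothetical, and its delete_docs_with_term takes a field and a text directly instead of a Term object, which is a simplification for this example rather than the Ferret signature.

```ruby
# Hypothetical in-memory reader; only the shape of the calls follows the text.
class IdReader
  def initialize(docs)
    @docs = docs # array of hashes; the array index is the document number
    @deleted = Array.new(docs.size, false)
  end

  def deleted?(n)
    @deleted[n]
  end

  def delete(doc_num)
    @deleted[doc_num] = true
  end

  # Deletes every live document whose field equals text; returns the count.
  def delete_docs_with_term(field, text)
    n = 0
    @docs.each_with_index do |doc, doc_num|
      next if @deleted[doc_num] || doc[field] != text

      delete(doc_num)
      n += 1
    end
    n
  end
end

reader = IdReader.new([{ id: "row-1" }, { id: "row-2" }, { id: "row-1" }])
removed = reader.delete_docs_with_term(:id, "row-1")
removed_again = reader.delete_docs_with_term(:id, "row-1")
```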
#deleted?(n) ⇒ Boolean
Returns true if document n has been deleted.
# File 'lib/ferret/index/index_reader.rb', line 218
def deleted?(n)
  raise NotImplementedError
end
#do_delete(doc_num) ⇒ Object
Implements deletion of the document numbered doc_num. Applications should call #delete or #delete_docs_with_term instead.
# File 'lib/ferret/index/index_reader.rb', line 370
def do_delete(doc_num)
  raise NotImplementedError
end
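The split between #delete and #do_delete is a template-method arrangement: the public method does the bookkeeping and the subclass supplies the actual deletion. A minimal sketch, with TemplateReader and BitmapReader as hypothetical classes and with locking omitted:

```ruby
# Hypothetical abstract base: the public method wraps a subclass hook.
class TemplateReader
  def delete(doc_num)
    do_delete(doc_num)
    @has_changes = true
    1
  end

  def has_changes?
    @has_changes == true
  end

  # Hook that concrete readers are expected to override.
  def do_delete(doc_num)
    raise NotImplementedError
  end
end

# Hypothetical concrete reader that tracks deletions in an array of flags.
class BitmapReader < TemplateReader
  def initialize(size)
    @deleted = Array.new(size, false)
  end

  def deleted?(n)
    @deleted[n]
  end

  def do_delete(doc_num)
    @deleted[doc_num] = true
  end
end

concrete = BitmapReader.new(3)
concrete.delete(1)

abstract_failed = begin
  TemplateReader.new.delete(0)
  false
rescue NotImplementedError
  true
end
```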
#do_set_norm(doc, field, value) ⇒ Object
Implements set_norm in subclass.
# File 'lib/ferret/index/index_reader.rb', line 255
def do_set_norm(doc, field, value)
  raise NotImplementedError
end
#doc_freq(t) ⇒ Object
Returns the number of documents containing the term t.
# File 'lib/ferret/index/index_reader.rb', line 273
def doc_freq(t)
  raise NotImplementedError
end
#get_document(n) ⇒ Object
Returns the stored fields of the nth Document in this index.
# File 'lib/ferret/index/index_reader.rb', line 195
def get_document(n)
  raise NotImplementedError
end
#get_document_with_term(term) ⇒ Object
Returns the first document with the term term. This is useful, for example, if we are indexing rows from a database. We can store the id of each row in a field in the index and use this method to get the document by that id. Hence, only one document is returned.
term: The term we are searching for.
# File 'lib/ferret/index/index_reader.rb', line 205
def get_document_with_term(term)
  docs = term_docs_for(term)
  if (docs == nil) then return nil end
  document = nil
  begin
    document = get_document(docs.doc) if docs.next?
  ensure
    docs.close()
  end
  return document
end
#get_norms(field, bytes = nil, offset = nil) ⇒ Object
Returns the byte-encoded normalization factor for the named field of every document. This is used by the search code to score documents.
See Field#boost
# File 'lib/ferret/index/index_reader.rb', line 231
def get_norms(field, bytes=nil, offset=nil)
  raise NotImplementedError
end
#get_term_vector(doc_number, field) ⇒ Object
Returns a term vector for the specified document and field. The returned vector contains terms and frequencies for the terms in the specified field of this document, if the field had the storeTermVector flag set. If term vectors had been stored with positions or offsets, a TermDocPosEnumVector is returned.
- doc_number
-
document for which the term vector is returned
- field
-
field for which the term vector is returned.
- returns
-
term vector. May be nil if the field does not exist in the specified document or the term vector was not stored.
- raises
-
IOError if index cannot be accessed
See Field.TermVector
# File 'lib/ferret/index/index_reader.rb', line 165
def get_term_vector(doc_number, field)
  raise NotImplementedError
end
#get_term_vectors(doc_number) ⇒ Object
Returns an array of term vectors for the specified document. The array contains a vector for each vectorized field in the document. Each vector contains terms and frequencies for all terms in a given vectorized field. If no such fields exist, the method returns nil. The term vectors that are returned may either be of type TermFreqVector or of type TermDocPosEnumVector if positions or offsets have been stored.
- doc_number
-
document for which term vectors are returned
- returns
-
array of term vectors. May be nil if no term vectors have been stored for the specified document.
- raises
-
IOError if index cannot be accessed
See Field.TermVector
# File 'lib/ferret/index/index_reader.rb', line 147
def get_term_vectors(doc_number)
  raise NotImplementedError
end
#has_deletions? ⇒ Boolean
Returns true if any documents have been deleted.
# File 'lib/ferret/index/index_reader.rb', line 223
def has_deletions?()
  raise NotImplementedError
end
#latest? ⇒ Boolean
Returns true if the reader is reading from the latest version of the index.
# File 'lib/ferret/index/index_reader.rb', line 349
def latest?()
  SegmentInfos.read_current_version(@directory) == @segment_infos.version()
end
#max_doc ⇒ Object
Returns one greater than the largest possible document number.
This may be used to, e.g., determine how big to allocate an array which will have an element for every document number in an index.
# File 'lib/ferret/index/index_reader.rb', line 189
def max_doc()
  raise NotImplementedError
end
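A short sketch of the array-sizing use mentioned above. The deletion flags are made-up data, and the sketch assumes that num_docs counts only documents not marked as deleted, which the text here does not state explicitly.

```ruby
# Hypothetical deletion flags, indexed by document number.
deleted = [false, true, false, false, true]

max_doc = deleted.size          # one greater than the largest document number (4)
num_docs = deleted.count(false) # assumption: only live documents are counted

# One slot per document number, including numbers of deleted documents.
scores = Array.new(max_doc, 0.0)
deleted.each_index { |doc_num| scores[doc_num] = 1.0 unless deleted[doc_num] }
```

Sizing by max_doc rather than num_docs means any document number can be used as an index without going out of bounds.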
#num_docs ⇒ Object
Returns the number of documents in this index.
# File 'lib/ferret/index/index_reader.rb', line 181
def num_docs()
  raise NotImplementedError
end
#set_norm(doc, field, value) ⇒ Object
Expert: Resets the normalization factor for the named field of the named document. The norm represents the product of the field’s Field#boost and its Similarity#length_norm length normalization. Thus, to preserve the length normalization values when resetting this, one should base the new value upon the old.
See #get_norms
See Similarity#decode_norm
# File 'lib/ferret/index/index_reader.rb', line 243
def set_norm(doc, field, value)
  synchronize do
    value = Similarity.encode_norm(value) if value.is_a? Float
    if(@directory_owner)
      acquire_write_lock()
    end
    do_set_norm(doc, field, value)
    @has_changes = true
  end
end
#term_docs ⇒ Object
Returns an unpositioned TermDocEnum enumerator.
# File 'lib/ferret/index/index_reader.rb', line 293
def term_docs()
  raise NotImplementedError
end
#term_docs_for(term) ⇒ Object
Returns an enumeration of all the documents which contain term. For each document, the document number and the frequency of the term in that document are provided, for use in search scoring. Thus, this method implements the mapping:
Term => <doc_num, freq>*
The enumeration is ordered by document number. Each document number is greater than all that precede it in the enumeration.
# File 'lib/ferret/index/index_reader.rb', line 286
def term_docs_for(term)
  term_docs = term_docs()
  term_docs.seek(term)
  return term_docs
end
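The mapping from a term to pairs of document number and frequency can be sketched with a plain hash built from a few strings. This is an illustration of the data shape only; the sample documents are made up, and it does not use or imitate the TermDocEnum interface.

```ruby
# Hypothetical documents; the array index is the document number.
docs = ["the quick fox", "the lazy dog", "the fox and the fox"]

# term => [[doc_num, freq], ...], filled in document-number order.
postings = Hash.new { |hash, term| hash[term] = [] }

docs.each_with_index do |text, doc_num|
  # Count how often each term occurs in this document.
  counts = text.split.each_with_object(Hash.new(0)) { |term, c| c[term] += 1 }
  counts.each { |term, freq| postings[term] << [doc_num, freq] }
end

fox_postings = postings["fox"]
the_postings = postings["the"]
```

Because documents are processed in order, each list is already sorted by document number, matching the ordering described above.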
#term_positions ⇒ Object
Returns an unpositioned TermDocPosEnum enumerator.
# File 'lib/ferret/index/index_reader.rb', line 316
def term_positions()
  raise NotImplementedError
end
#term_positions_for(term) ⇒ Object
Returns an enumeration of all the documents which contain term. For each document, in addition to the document number and frequency of the term in that document, a list of all of the ordinal positions of the term in the document is available. Thus, this method implements the mapping:
Term => <doc_num, freq, <pos_1, pos_2, ..., pos_freq-1>>*
This positional information facilitates phrase and proximity searching. The enumeration is ordered by document number. Each document number is greater than all that precede it in the enumeration.
# File 'lib/ferret/index/index_reader.rb', line 309
def term_positions_for(term)
  term_positions = term_positions()
  term_positions.seek(term)
  return term_positions
end
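To show why positions help with phrase searching, the sketch below builds a positional map from a few made-up strings and finds documents in which one term directly follows another. The helper phrase_docs is hypothetical and is not how Ferret implements phrase queries.

```ruby
# Hypothetical documents; the array index is the document number.
docs = ["the quick brown fox", "brown the quick", "quick the brown"]

# term => { doc_num => [positions] }
positions = Hash.new { |h, term| h[term] = Hash.new { |hh, doc| hh[doc] = [] } }

docs.each_with_index do |text, doc_num|
  text.split.each_with_index { |term, pos| positions[term][doc_num] << pos }
end

# Document numbers where `second` occurs immediately after `first`.
def phrase_docs(positions, first, second)
  positions[first].select do |doc_num, first_pos|
    second_pos = positions[second].fetch(doc_num, [])
    first_pos.any? { |pos| second_pos.include?(pos + 1) }
  end.keys
end

quick_brown = phrase_docs(positions, "quick", "brown")
the_quick = phrase_docs(positions, "the", "quick")
```

All three documents contain both "quick" and "brown", so frequencies alone cannot separate them; only the positions show that the phrase occurs in the first document.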
#terms ⇒ Object
Returns an enumeration of all the terms in the index. Each term is greater than all that precede it in the enumeration.
# File 'lib/ferret/index/index_reader.rb', line 261
def terms()
  raise NotImplementedError
end
#terms_from(t) ⇒ Object
Returns an enumeration of all terms after a given term.
Each term is greater than all that precede it in the enumeration.
# File 'lib/ferret/index/index_reader.rb', line 268
def terms_from(t)
  raise NotImplementedError
end
#undelete_all ⇒ Object
Undeletes all documents currently marked as deleted in this index.
# File 'lib/ferret/index/index_reader.rb', line 396
def undelete_all()
  synchronize do
    acquire_write_lock() if @directory_owner
    do_undelete_all()
    @has_changes = true
  end
end