Class: XapianFu::XapianDb

Inherits:
Object
Defined in:
lib/xapian_fu/xapian_db.rb

Overview

XapianFu::XapianDb encapsulates a Xapian database and handles setting up stemmers, stoppers, query parsers and so on. It is the core of XapianFu.

Opening and creating the database

The :dir option specifies where the Xapian database is read from and written to. Without it, an in-memory Xapian database is used. By default, the on-disk database is not created if it doesn’t already exist; see the :create option.

Setting the :create option to true will allow XapianDb to create a new Xapian database on-disk. If one already exists, it is just opened. The default is false.

Setting the :overwrite option to true will force XapianDb to wipe the current on-disk database and start afresh. The default is false.

Setting the :type option to either :glass or :chert will force that database backend, if supported. Leave it as nil (recommended) to auto-detect existing databases and to create new databases with the library default. Requires Xapian >= 1.4.

db = XapianDb.new(:dir => '/tmp/mydb', :create => true)
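
How :create and :overwrite combine follows from the order in which the constructor assigns its open flag: :overwrite is checked last, so it wins when both are set. A minimal sketch of that precedence, using plain symbols as stand-ins for the Xapian::DB_* constants (the open_mode helper is hypothetical and not part of XapianFu):

```ruby
# Stand-in symbols replace Xapian::DB_OPEN, Xapian::DB_CREATE_OR_OPEN and
# Xapian::DB_CREATE_OR_OVERWRITE so the sketch runs without Xapian installed.
def open_mode(options = {})
  mode = :open                                       # default: database must already exist
  mode = :create_or_open if options[:create]         # create only when missing
  mode = :create_or_overwrite if options[:overwrite] # checked last, so it wins
  mode
end

puts open_mode({}).inspect                                   # => :open
puts open_mode(:create => true).inspect                      # => :create_or_open
puts open_mode(:create => true, :overwrite => true).inspect  # => :create_or_overwrite
```

In practice this means :overwrite => true is enough on its own to start from an empty database.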

Language, Stemmers and Stoppers

The :language option specifies the default document language, and controls the default type of stemmer and stopper that will be used when indexing. The stemmer and stopper can be overridden with the :stemmer and :stopper options.

The :language, :stemmer and :stopper options can be set to one of the following: :danish, :dutch, :english, :finnish, :french, :german, :hungarian, :italian, :norwegian, :portuguese, :romanian, :russian, :spanish, :swedish, :turkish. Set it to false to specify none.

There are more stoppers available than stemmers. See lib/xapian_fu/stopwords/*.txt for a complete list.

The default for all is :english.

db = XapianDb.new(:language => :italian, :stopper => false)
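
The constructor resolves these three options with Hash#fetch, which is why an explicit false survives instead of falling back to the language. A standalone sketch of that resolution (the resolve_language_options helper is hypothetical; only the fetch calls mirror the constructor):

```ruby
# Mirrors the fetch-based defaults in the constructor. fetch only falls back
# when the key is absent, so :stopper => false is kept as false.
def resolve_language_options(options = {})
  language = options.fetch(:language, :english)
  {
    :language => language,
    :stemmer  => options.fetch(:stemmer, language),
    :stopper  => options.fetch(:stopper, language)
  }
end

p resolve_language_options(:language => :italian, :stopper => false)
p resolve_language_options({})
```

Had the defaults been written with ||, a false stopper would have been silently replaced by the language default.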

The :stopper_strategy option specifies the default stop strategy that will be used when indexing and can be: :none, :all or :stemmed. Defaults to :stemmed.

Spelling suggestions

The :spelling option controls generation of a spelling dictionary during indexing and its use during searches. When enabled, Xapian will build a dictionary of words for the database whilst indexing documents and will enable spelling suggestion by default for searches. Building the dictionary will impact indexing performance and database size. It is enabled by default. See the search section for information on getting spelling correction information during searches.

Fields and values

The :store option specifies which document fields should be stored in the database. By default, fields are only indexed - the original values cannot be retrieved.

The :sortable option specifies which document fields will be available for sorting results on. It does the same thing as :store and exists only so that the intent can be made explicit.

The :collapsible option specifies which document fields can be used to group (“collapse”) results. It also does the same thing as :store and exists only so that the intent can be made explicit.
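
Because all three options feed the same list of stored values, each may be a single field or an array, and duplicates are harmless. A plain-Ruby sketch of the merge the constructor performs (merged_store_values and its from_fields argument are hypothetical; from_fields stands for whatever the :fields definitions already marked as stored):

```ruby
# Same append / flatten / uniq / compact sequence as the constructor.
def merged_store_values(options = {}, from_fields = [])
  values = from_fields.dup
  values << options[:store]
  values << options[:sortable]
  values << options[:collapsible]
  values.flatten.uniq.compact # nested arrays, repeats and nils all disappear
end

p merged_store_values(:store => [:title, :author],
                      :sortable => :created_at,
                      :collapsible => [:author])
# => [:title, :author, :created_at]
```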

A more complete way of defining fields is available:

XapianDb.new(:fields => { :title => { :type => String },
                          :slug => { :type => String, :index => false },
                          :created_at => { :type => Time, :store => true },
                          :votes => { :type => Fixnum, :store => true },
                        })

XapianFu will use the :type option when instantiating a stored value, so you’ll get back a Time object rather than the result of Time’s to_s method, as is the default. Defining the type for numerical classes (such as Time, Fixnum and Bignum) allows XapianFu to store them on-disk in a much more efficient way, and to sort them efficiently (without having to resort to storing leading zeros or anything like that).
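
The remark about leading zeros refers to a general problem: values compared as text do not sort in numeric order. The plain-Ruby sketch below shows the problem and one order-preserving binary encoding. The fixed-width big-endian packing is only an illustration and is not claimed to be XapianFu's actual storage format; it also only handles non-negative integers below 2**64.

```ruby
# Sorting numbers by their text form puts "100" before "9".
def sort_as_text(numbers)
  numbers.map(&:to_s).sort
end

# Packing each number as 8 big-endian bytes makes byte-wise string order
# agree with numeric order, with no zero padding in a text field.
def sort_as_packed(numbers)
  packed = numbers.map { |n| [n].pack('Q>') }
  packed.sort.map { |bytes| bytes.unpack1('Q>') }
end

p sort_as_text([10, 9, 100])   # => ["10", "100", "9"]
p sort_as_packed([10, 9, 100]) # => [9, 10, 100]
```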

Indexing options

If :index is false, the field will not be tokenized, stemmed or stopped. It will only be searchable by its entire exact contents. This is useful for fields where only exact matches make sense, like slugs, identifiers or keys.

If :index is true (the default), the field will be tokenized, stemmed and stopped twice: once with the field name and once without. This allows searches like both “name:lily” and simply “lily”, but it requires the full text of the field to be indexed twice and will increase the size of your index on-disk.

If you know you will never need to search the field using its field name, then you can set :index to :without_field_names and only one tokenization pass will be done, without the field names as token prefixes.

If you know you will only ever search the field using its field name, then you can set :index to :with_field_names_only and only one tokenization pass will be done, with only the field names as token prefixes.
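
The four :index settings can be summarised side by side. In the sketch below the field names :body and :tag are made up for illustration, and tokenization_passes is a hypothetical helper that simply restates the descriptions above as a number; it is not part of XapianFu.

```ruby
# Field definitions in the shape accepted by the :fields option.
FIELDS = {
  :title => { :type => String },                                  # :index defaults to true
  :slug  => { :type => String, :index => false },                 # exact match only
  :body  => { :type => String, :index => :without_field_names },  # "lily" only
  :tag   => { :type => String, :index => :with_field_names_only } # "tag:lily" only
}

# Number of tokenization passes each setting implies, per the text above.
def tokenization_passes(field_options)
  case field_options.fetch(:index, true)
  when false then 0
  when :without_field_names, :with_field_names_only then 1
  else 2
  end
end

FIELDS.each { |name, opts| puts "#{name}: #{tokenization_passes(opts)} pass(es)" }
```

A hash like FIELDS would be handed to the constructor as XapianDb.new(:fields => FIELDS).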

Term Weights

The :weights option accepts a Proc or Lambda that sets custom term weights.

Your function will receive the term key and value and the full list of fields, and should return an integer weight to be applied for that term when the document is indexed.

In this example,

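XapianDb.new(:weights => Proc.new do |key, value, fields|
  # Use next, not return: return inside a Proc.new block raises LocalJumpError
  # or returns from the enclosing method instead of yielding a weight.
  next 10 if fields.keys.include?('culturally_important')
  next 3  if key == 'title'
  1
end)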

terms in the title will be weighted three times as heavily as other terms, and all terms in ‘culturally important’ items will be weighted ten times as heavily.
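
A weights function is ordinary Ruby, so it can be exercised on its own before being passed to XapianDb. The sketch below writes the same rules as a lambda, where return simply ends the lambda, and calls it with hand-made arguments. The string keys follow the example above; whether XapianFu passes strings or symbols is not stated here, so treat that as an assumption.

```ruby
# Same rules as the example, as a lambda so that return is safe to use.
WEIGHTS = lambda do |key, _value, fields|
  return 10 if fields.keys.include?('culturally_important')
  return 3 if key == 'title'
  1
end

plain   = { 'title' => 'Moby Dick', 'body' => 'Call me Ishmael.' }
flagged = plain.merge('culturally_important' => true)

p WEIGHTS.call('title', plain['title'], plain)   # => 3
p WEIGHTS.call('body', plain['body'], plain)     # => 1
p WEIGHTS.call('body', flagged['body'], flagged) # => 10
```

A lambda checks its argument count strictly, so it only works if it is called with exactly three arguments, as the description above says it will be.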

Constructor Details

#initialize(options = { }) ⇒ XapianDb

Returns a new instance of XapianDb.



# File 'lib/xapian_fu/xapian_db.rb', line 185

def initialize( options = { } )
  @options = { :index_positions => true, :spelling => true }.merge(options)
  @dir = @options[:dir]
  @index_positions = @options[:index_positions]
  @db_flag = Xapian::DB_OPEN
  @db_flag = Xapian::DB_CREATE_OR_OPEN if @options[:create]
  @db_flag = Xapian::DB_CREATE_OR_OVERWRITE if @options[:overwrite]
  case @options[:type]
  when :glass
    raise XapianFuError.new("type glass not recognised") unless defined?(Xapian::DB_BACKEND_GLASS)
    @db_flag |= Xapian::DB_BACKEND_GLASS
  when :chert
    raise XapianFuError.new("type chert not recognised") unless defined?(Xapian::DB_BACKEND_CHERT)
    @db_flag |= Xapian::DB_BACKEND_CHERT
  when nil
    # use library defaults
  else
    raise XapianFuError.new("type #{@options[:type].inspect} not recognised")
  end
  @tx_mutex = Mutex.new
  @language = @options.fetch(:language, :english)
  @stemmer = @options.fetch(:stemmer, @language)
  @stopper = @options.fetch(:stopper, @language)
  @stopper_strategy = @options.fetch(:stopper_strategy, :stemmed)
  @field_options = {}
  setup_fields(@options[:fields])
  @store_values << @options[:store]
  @store_values << @options[:sortable]
  @store_values << @options[:collapsible]
  @store_values = @store_values.flatten.uniq.compact
  @spelling = @options[:spelling]
  @weights_function = @options[:weights]
end

Instance Attribute Details

#boolean_fields ⇒ Object (readonly)

An array of fields that will be treated as boolean terms



# File 'lib/xapian_fu/xapian_db.rb', line 175

def boolean_fields
  @boolean_fields
end

#db_flag ⇒ Object (readonly)

:nodoc:



# File 'lib/xapian_fu/xapian_db.rb', line 158

def db_flag
  @db_flag
end

#dir ⇒ Object (readonly)

Path to the on-disk database. Nil if the database is in-memory.



# File 'lib/xapian_fu/xapian_db.rb', line 157

def dir
  @dir
end

#field_options ⇒ Object (readonly)

Returns the value of attribute field_options.



# File 'lib/xapian_fu/xapian_db.rb', line 179

def field_options
  @field_options
end

#field_weights ⇒ Object (readonly)

Returns the value of attribute field_weights.



# File 'lib/xapian_fu/xapian_db.rb', line 181

def field_weights
  @field_weights
end

#fields ⇒ Object (readonly)

A hash of field names and their types



# File 'lib/xapian_fu/xapian_db.rb', line 166

def fields
  @fields
end

#fields_with_field_names_only ⇒ Object (readonly)

An array of fields to be indexed only with their field names



# File 'lib/xapian_fu/xapian_db.rb', line 172

def fields_with_field_names_only
  @fields_with_field_names_only
end

#fields_without_field_names ⇒ Object (readonly)

An array of fields to be indexed without their field names



# File 'lib/xapian_fu/xapian_db.rb', line 170

def fields_without_field_names
  @fields_without_field_names
end

#index_positions ⇒ Object (readonly)

True if term positions will be stored



# File 'lib/xapian_fu/xapian_db.rb', line 162

def index_positions
  @index_positions
end

#language ⇒ Object (readonly)

The default document language. Used for setting up stoppers and stemmers.



# File 'lib/xapian_fu/xapian_db.rb', line 164

def language
  @language
end

#sortable_fields ⇒ Object (readonly)

Returns the value of attribute sortable_fields.



# File 'lib/xapian_fu/xapian_db.rb', line 178

def sortable_fields
  @sortable_fields
end

#spelling ⇒ Object (readonly)

Whether this db will generate a spelling dictionary during indexing



# File 'lib/xapian_fu/xapian_db.rb', line 177

def spelling
  @spelling
end

#stopper_strategy ⇒ Object

The default stopper strategy



# File 'lib/xapian_fu/xapian_db.rb', line 183

def stopper_strategy
  @stopper_strategy
end

#store_values ⇒ Object (readonly)

An array of the fields that will be stored in the Xapian database



# File 'lib/xapian_fu/xapian_db.rb', line 160

def store_values
  @store_values
end

#unindexed_fields ⇒ Object (readonly)

An array of fields that will not be indexed



# File 'lib/xapian_fu/xapian_db.rb', line 168

def unindexed_fields
  @unindexed_fields
end

#weights_function ⇒ Object

Returns the value of attribute weights_function.



# File 'lib/xapian_fu/xapian_db.rb', line 180

def weights_function
  @weights_function
end

Instance Method Details

#add_doc(doc) ⇒ Object Also known as: <<

Short-cut to documents.add



# File 'lib/xapian_fu/xapian_db.rb', line 250

def add_doc(doc)
  documents.add(doc)
end

#add_synonym(term, synonym) ⇒ Object

Add a synonym to the database.

If you want to search with synonym support, remember to add the option:

db.search("foo", :synonyms => true)

Note that in-memory databases don’t support synonyms.



# File 'lib/xapian_fu/xapian_db.rb', line 264

def add_synonym(term, synonym)
  rw.add_synonym(term, synonym)
end

#close ⇒ Object

Closes the database.

Raises:

ConcurrencyError, if a transaction is in progress



# File 'lib/xapian_fu/xapian_db.rb', line 400

def close
  raise ConcurrencyError if @tx_mutex.locked?

  @rw.close if @rw
  @rw = nil

  @ro.close if @ro
  @ro = nil
end

#documents ⇒ Object

The XapianFu::XapianDocumentsAccessor for this database



# File 'lib/xapian_fu/xapian_db.rb', line 245

def documents
  @documents_accessor ||= XapianDocumentsAccessor.new(self)
end

#flush ⇒ Object

Flush any changes to disk and reopen the read-only database.

Raises:

ConcurrencyError, if a transaction is in progress



# File 'lib/xapian_fu/xapian_db.rb', line 393

def flush
  raise ConcurrencyError if @tx_mutex.locked?
  rw.flush
  ro.reopen
end

#ro ⇒ Object

The read-only Xapian::Database



# File 'lib/xapian_fu/xapian_db.rb', line 235

def ro
  @ro ||= setup_ro_db
end

#rw ⇒ Object

The writable Xapian::WritableDatabase



# File 'lib/xapian_fu/xapian_db.rb', line 230

def rw
  @rw ||= setup_rw_db
end

#search(q, options = {}) ⇒ Object

Conduct a search on the Xapian database, returning a XapianFu::ResultSet that wraps an array of XapianFu::XapianDoc objects for the matches.

The :limit option sets how many results to return. For compatibility with the will_paginate plugin, the :per_page option does the same thing and takes precedence over :limit when both are given. Defaults to 10.

The :page option sets which page of results to return. Defaults to 1.

The :order option specifies the stored field to order the results by (instead of the default search result weight).

The :reverse option reverses the order of the results, so lowest search weight first (or lowest stored field value first).

The :collapse option specifies which stored field value to collapse (group) the results on. It works a bit like SQL’s GROUP BY.

The :spelling option controls whether spelling suggestions will be made for queries. It defaults to whatever the database spelling setting is (true by default). When enabled, spelling suggestions are available using the XapianFu::ResultSet corrected_query method.

The :check_at_least option controls how many documents will be sampled. This allows for accurate page and facet counts. Specifying the special value of :all will make Xapian sample every document in the database. Be aware that this can hurt your query performance.

The :query_builder option allows you to pass a proc that will return the final query to be run. The proc receives the parsed query as its only argument.

The first parameter can also be :all or :nothing, to match all documents or no documents respectively.

For additional options on how the query is parsed, see XapianFu::QueryParser



# File 'lib/xapian_fu/xapian_db.rb', line 314

def search(q, options = {})
  defaults = { :page => 1, :reverse => false,
    :boolean => true, :boolean_anycase => true, :wildcards => true,
    :lovehate => true, :spelling => spelling, :pure_not => false }
  options = defaults.merge(options)
  page = options[:page].to_i rescue 1
  page = page > 1 ? page - 1 : 0
  per_page = options[:per_page] || options[:limit] || 10
  per_page = per_page.to_i rescue 10
  offset = page * per_page

  check_at_least = options.include?(:check_at_least) ? options[:check_at_least] : 0
  check_at_least = self.size if check_at_least == :all

  qp = XapianFu::QueryParser.new({ :database => self }.merge(options))
  query = qp.parse_query(q.is_a?(Symbol) ? q : q.to_s)

  if options.include?(:query_builder)
    query = options[:query_builder].call(query)
  end

  query = filter_query(query, options[:filter]) if options[:filter]

  enquiry = Xapian::Enquire.new(ro)
  setup_ordering(enquiry, options[:order], options[:reverse])
  if options[:collapse]
    enquiry.collapse_key = XapianDocValueAccessor.value_key(options[:collapse])
  end
  if options[:facets]
    spies = options[:facets].inject({}) do |accum, name|
      accum[name] = spy = Xapian::ValueCountMatchSpy.new(XapianDocValueAccessor.value_key(name))
      enquiry.add_matchspy(spy)
      accum
    end
  end

  if options.include?(:posting_source)
    query = Xapian::Query.new(Xapian::Query::OP_AND_MAYBE, query, Xapian::Query.new(options[:posting_source]))
  end

  enquiry.query = query

  ResultSet.new(:mset => enquiry.mset(offset, per_page, check_at_least),
                :current_page => page + 1,
                :per_page => per_page,
                :corrected_query => qp.corrected_query,
                :spies => spies,
                :xapian_db => self
               )
end
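
The paging arithmetic at the top of #search can be tried in isolation. The sketch below restates it in plain Ruby; the pagination helper is hypothetical and leaves out the rescue fallbacks of the original, but the precedence of :per_page over :limit and the clamping of the page number follow the listing.

```ruby
# Turns :page / :per_page / :limit into the offset handed to Xapian.
def pagination(options = {})
  page = options.fetch(:page, 1).to_i
  page_index = page > 1 ? page - 1 : 0 # pages below 1 are treated as page 1
  per_page = (options[:per_page] || options[:limit] || 10).to_i
  { :offset => page_index * per_page,
    :per_page => per_page,
    :current_page => page_index + 1 }
end

p pagination(:page => 3, :limit => 20)
# => {:offset=>40, :per_page=>20, :current_page=>3} (Ruby 3.4 prints {offset: 40, ...})
p pagination(:limit => 5, :per_page => 25)[:per_page] # => 25
```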

#serialize_value(field, value, type = nil) ⇒ Object



# File 'lib/xapian_fu/xapian_db.rb', line 410

def serialize_value(field, value, type = nil)
  if sortable_fields.include?(field)
    Xapian.sortable_serialise(value)
  else
    (type || fields[field] || Object).to_xapian_fu_storage_value(value)
  end
end

#size ⇒ Object

The number of docs in the Xapian database



# File 'lib/xapian_fu/xapian_db.rb', line 240

def size
  ro.doccount
end

#stemmer ⇒ Object

Return a new stemmer object for this database



# File 'lib/xapian_fu/xapian_db.rb', line 220

def stemmer
  StemFactory.stemmer_for(@stemmer)
end

#stopper ⇒ Object

The stopper object for this database



# File 'lib/xapian_fu/xapian_db.rb', line 225

def stopper
  StopperFactory.stopper_for(@stopper)
end

#transaction(flush_on_commit = true) ⇒ Object

Run the given block in a XapianDB transaction. Any changes to the Xapian database made in the block will be atomically committed at the end.

If an exception is raised by the block, all changes are discarded and the exception re-raised.

Xapian does not support multiple concurrent transactions on the same Xapian database. Any attempts at this will be serialized by XapianFu, which is not perfect but probably better than just kicking up an exception.



# File 'lib/xapian_fu/xapian_db.rb', line 376

def transaction(flush_on_commit = true)
  @tx_mutex.synchronize do
    begin
      rw.begin_transaction(flush_on_commit)
      yield
    rescue Exception => e
      rw.cancel_transaction
      ro.reopen
      raise e
    end
    rw.commit_transaction
    ro.reopen
  end
end
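
The commit, rollback and locking behaviour described for #transaction, together with the ConcurrencyError raised by #flush and #close, can be imitated without Xapian. SketchDb below is a hypothetical stand-in: an in-memory array plays the part of the database and only the mutex handling mirrors the listing. ConcurrencyError is redefined locally so the sketch is self-contained.

```ruby
class ConcurrencyError < StandardError; end

# Staged changes become visible only when the block finishes cleanly.
class SketchDb
  attr_reader :committed

  def initialize
    @tx_mutex = Mutex.new
    @committed = []
    @staged = []
  end

  def add(item)
    @staged << item
  end

  def transaction
    @tx_mutex.synchronize do
      begin
        yield
      rescue Exception # as in the listing, so nothing escapes without a rollback
        @staged.clear  # discard everything staged inside the block
        raise          # re-raise the original exception
      end
      @committed.concat(@staged)
      @staged.clear
    end
  end

  def flush
    raise ConcurrencyError if @tx_mutex.locked?
    :flushed
  end
end

db = SketchDb.new
db.transaction { db.add(:a) }
begin
  db.transaction { db.add(:b); raise ArgumentError, 'boom' }
rescue ArgumentError
  # :b was discarded; :a is still committed
end
p db.committed # => [:a]
```

Calling flush from inside a transaction block raises ConcurrencyError in this sketch, because the mutex is held for the duration of the block.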

#unserialize_value(field, value, type = nil) ⇒ Object



# File 'lib/xapian_fu/xapian_db.rb', line 418

def unserialize_value(field, value, type = nil)
  if sortable_fields.include?(field)
    Xapian.sortable_unserialise(value)
  else
    (type || fields[field] || Object).from_xapian_fu_storage_value(value)
  end
end
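
#serialize_value and #unserialize_value choose their converter in the same way: sortable fields go through Xapian's sortable serialisation, and everything else is delegated to a class picked from the explicit type argument, then the declared field type, then Object. A plain-Ruby sketch of that selection only (value_codec is a hypothetical helper and performs no conversion; the :sortable_serialise symbol stands in for the Xapian calls):

```ruby
# Returns which converter the two methods above would use for a field.
def value_codec(field, fields, sortable_fields, type = nil)
  return :sortable_serialise if sortable_fields.include?(field)
  type || fields[field] || Object
end

fields = { :created_at => Time, :title => String }

p value_codec(:votes, fields, [:votes])           # => :sortable_serialise
p value_codec(:created_at, fields, [:votes])      # => Time
p value_codec(:title, fields, [:votes], Integer)  # => Integer
p value_codec(:unknown, fields, [:votes])         # => Object
```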