Class: XapianFu::XapianDb
Overview
XapianFu::XapianDb encapsulates a Xapian database and handles the setup of stemmers, stoppers, query parsers and so on. It is the core of XapianFu.
Opening and creating the database
The :dir option specifies where the Xapian database is to be read from and written to. Without this, an in-memory Xapian database will be used. By default, the on-disk database will not be created if it doesn’t already exist. See the :create option.
Setting the :create option to true will allow XapianDb to create a new Xapian database on-disk. If one already exists, it is just opened. The default is false.
Setting the :overwrite option to true will force XapianDb to wipe the current on-disk database and start afresh. The default is false.
Setting the :type option to either :glass or :chert will force that database backend, if supported. Leave as nil to auto-detect existing databases and create new databases with the library default (recommended). Requires Xapian 1.4 or later.
db = XapianDb.new(:dir => '/tmp/mydb', :create => true)
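How :create and :overwrite combine can be sketched in plain Ruby. The constructor source on this page starts from Xapian::DB_OPEN, switches to DB_CREATE_OR_OPEN when :create is set, and to DB_CREATE_OR_OVERWRITE when :overwrite is set, so :overwrite wins if both are given. The symbols below are stand-ins for those Xapian constants, and open_mode is a hypothetical helper that is not part of XapianFu.

```ruby
# Stand-in for the flag selection described above. The symbols mirror the
# Xapian constants of the same name; open_mode is a hypothetical helper.
def open_mode(options = {})
  mode = :DB_OPEN                                        # default: open only
  mode = :DB_CREATE_OR_OPEN if options[:create]          # create when missing
  mode = :DB_CREATE_OR_OVERWRITE if options[:overwrite]  # wipe and recreate
  mode
end

puts open_mode(:dir => '/tmp/mydb')
puts open_mode(:dir => '/tmp/mydb', :create => true)
puts open_mode(:create => true, :overwrite => true)
```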
Language, Stemmers and Stoppers
The :language option specifies the default document language, and controls the default type of stemmer and stopper that will be used when indexing. The stemmer and stopper can be overridden with the :stemmer and :stopper options.
The :language, :stemmer and :stopper options can be set to one of the following: :danish, :dutch, :english, :finnish, :french, :german, :hungarian, :italian, :norwegian, :portuguese, :romanian, :russian, :spanish, :swedish, :turkish. Set it to false to specify none.
There are more stoppers available than stemmers. See lib/xapian_fu/stopwords/*.txt for a complete list.
The default for all is :english.
db = XapianDb.new(:language => :italian, :stopper => false)
The :stopper_strategy option specifies the default stop strategy that will be used when indexing and can be :none, :all or :stemmed. Defaults to :stemmed.
Spelling suggestions
The :spelling option controls generation of a spelling dictionary during indexing and its use during searches. When enabled, Xapian will build a dictionary of words for the database whilst indexing documents and will enable spelling suggestion by default for searches. Building the dictionary will impact indexing performance and database size. It is enabled by default. See the search section for information on getting spelling correction information during searches.
Fields and values
The :store option specifies which document fields should be stored in the database. By default, fields are only indexed - the original values cannot be retrieved.
The :sortable option specifies which document fields will be available for sorting results on. This really just does the same thing as :store and is available only to be explicit.
The :collapsible option specifies which document fields can be used to group (“collapse”) results. This also just does the same thing as :store and is just available to be explicit.
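The constructor source on this page appends the :store, :sortable and :collapsible values to one list and then flattens, de-duplicates and compacts it. A minimal sketch of that merging step follows; the option values are made up, and the list starts empty here, whereas the real constructor may already have entries from the :fields definitions.

```ruby
# Hypothetical option values; only the merging logic follows the constructor.
options = {
  :store       => [:title, :created_at],
  :sortable    => :created_at,
  :collapsible => nil
}

store_values = []
store_values << options[:store]
store_values << options[:sortable]
store_values << options[:collapsible]
# Arrays, single symbols and nils are all accepted and end up as one flat list.
store_values = store_values.flatten.uniq.compact

p store_values
```

Each option can therefore be a single field name, an array of names, or left out entirely.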
A more complete way of defining fields is available:
XapianDb.new(:fields => { :title => { :type => String },
                          :slug => { :type => String, :index => false },
                          :created_at => { :type => Time, :store => true },
                          :votes => { :type => Fixnum, :store => true },
                        })
XapianFu will use the :type option when instantiating a stored value, so you’ll get back a Time object rather than the result of Time’s to_s method, as is the default. Defining the type for numerical classes (such as Time, Fixnum and Bignum) allows XapianFu to store them on-disk in a much more efficient way, and sort them efficiently (without having to resort to storing leading zeros or anything like that).
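The sorting problem that typed storage avoids can be seen in plain Ruby. The sketch below does not use XapianFu's on-disk encoding; it only shows why numbers stored as plain strings sort in the wrong order, and what the leading-zeros workaround looks like. The sample values are made up.

```ruby
votes = [9, 10, 100]

# Stored as plain strings, values compare character by character.
as_strings = votes.map(&:to_s).sort

# Zero-padding to a fixed width restores numeric order, at the cost of
# extra space and a hard upper limit on the number of digits.
zero_padded = votes.map { |v| format('%03d', v) }.sort

p as_strings
p zero_padded
```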
Indexing options
If :index is false, then the field will not be tokenized, stemmed or stopped. It will only be searchable by its entire exact contents. This is useful for fields where only exact matches make sense, such as slugs, identifiers or keys.
If :index is true (the default) then the field will be tokenized, stemmed and stopped twice, once with the field name and once without. This allows you to run searches both like “name:lily” and simply “lily”, but it does require that the full text of the field content is indexed twice and will increase the size of your index on-disk.
If you know you will never need to search the field using its field name, then you can set :index to :without_field_names and only one tokenization pass will be done, without the field names as token prefixes.
If you know you will only ever search the field using its field name, then you can set :index to :with_field_names_only and only one tokenization pass will be done, with only the field names as token prefixes.
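The four :index settings can be summarised as a small lookup. This is a sketch of the behaviour described above, not XapianFu code: index_passes and the pass names are hypothetical, and the :body and :tag fields are invented for illustration.

```ruby
# Hypothetical helper: which tokenization passes each :index setting implies,
# following the descriptions above. Not part of XapianFu.
def index_passes(index_option)
  case index_option
  when false                  then [] # exact contents only, no tokenization
  when :without_field_names   then [:without_field_name]
  when :with_field_names_only then [:with_field_name]
  else                             [:with_field_name, :without_field_name]
  end
end

fields = {
  :title => { :type => String }, # :index defaults to true
  :slug  => { :type => String, :index => false },
  :body  => { :type => String, :index => :without_field_names },
  :tag   => { :type => String, :index => :with_field_names_only }
}

fields.each do |name, opts|
  puts "#{name}: #{index_passes(opts.fetch(:index, true)).inspect}"
end
```

A hash shaped like fields above is what the :fields option of XapianDb.new expects.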
Term Weights
The :weights option accepts a Proc or Lambda that sets custom term weights.
Your function will receive the term key and value and the full list of fields, and should return an integer weight to be applied for that term when the document is indexed.
In this example,
XapianDb.new(:weights => Proc.new do |key, value, fields|
  next 10 if fields.keys.include?('culturally_important')
  next 3 if key == 'title'
  1
end)
terms in the title will be weighted three times greater than other terms, and all terms in ‘culturally important’ items will be weighted 10 times more.
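A weights callable can be exercised on its own before it is handed to XapianDb. The sketch below uses a lambda, where return only leaves the lambda itself; in a plain Proc, next is the safer way to hand back a value. The sample documents are made up, and the three-argument call mirrors the description above, not XapianFu's internals.

```ruby
weights = lambda do |key, value, fields|
  # Any term in a flagged document gets the highest weight.
  return 10 if fields.keys.include?('culturally_important')
  # Title terms in ordinary documents get a moderate boost.
  return 3 if key == 'title'
  1
end

plain   = { 'title' => 'Lily', 'body' => 'A flower' }
flagged = plain.merge('culturally_important' => true)

p weights.call('title', 'Lily', plain)
p weights.call('body', 'A flower', plain)
p weights.call('body', 'A flower', flagged)
```

The order of the checks matters: the flagged-document rule comes first, so it also applies to title terms.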
Instance Attribute Summary
- #boolean_fields ⇒ Object (readonly)
  An array of fields that will be treated as boolean terms.
- #db_flag ⇒ Object (readonly)
  :nodoc:
- #dir ⇒ Object (readonly)
  Path to the on-disk database.
- #field_options ⇒ Object (readonly)
  Returns the value of attribute field_options.
- #field_weights ⇒ Object (readonly)
  Returns the value of attribute field_weights.
- #fields ⇒ Object (readonly)
  A hash of field names and their types.
- #fields_with_field_names_only ⇒ Object (readonly)
  An array of fields to be indexed only with their field names.
- #fields_without_field_names ⇒ Object (readonly)
  An array of fields to be indexed without their field names.
- #index_positions ⇒ Object (readonly)
  True if term positions will be stored.
- #language ⇒ Object (readonly)
  The default document language.
- #sortable_fields ⇒ Object (readonly)
  Returns the value of attribute sortable_fields.
- #spelling ⇒ Object (readonly)
  Whether this db will generate a spelling dictionary during indexing.
- #stopper_strategy ⇒ Object
  The default stopper strategy.
- #store_values ⇒ Object (readonly)
  An array of the fields that will be stored in the Xapian database.
- #unindexed_fields ⇒ Object (readonly)
  An array of fields that will not be indexed.
- #weights_function ⇒ Object
  Returns the value of attribute weights_function.
Instance Method Summary
- #add_doc(doc) ⇒ Object (also: #<<)
  Short-cut to documents.add.
- #add_synonym(term, synonym) ⇒ Object
  Add a synonym to the database.
- #close ⇒ Object
  Closes the database.
- #documents ⇒ Object
  The XapianFu::XapianDocumentsAccessor for this database.
- #flush ⇒ Object
  Flush any changes to disk and reopen the read-only database.
- #initialize(options = { }) ⇒ XapianDb (constructor)
  A new instance of XapianDb.
- #ro ⇒ Object
  The read-only Xapian::Database.
- #rw ⇒ Object
  The writable Xapian::WritableDatabase.
- #search(q, options = {}) ⇒ Object
  Conduct a search on the Xapian database, returning an array of XapianFu::XapianDoc objects for the matches wrapped in a XapianFu::ResultSet.
- #serialize_value(field, value, type = nil) ⇒ Object
- #size ⇒ Object
  The number of docs in the Xapian database.
- #stemmer ⇒ Object
  Return a new stemmer object for this database.
- #stopper ⇒ Object
  The stopper object for this database.
- #transaction(flush_on_commit = true) ⇒ Object
  Run the given block in a XapianDB transaction.
- #unserialize_value(field, value, type = nil) ⇒ Object
Constructor Details
#initialize(options = { }) ⇒ XapianDb
Returns a new instance of XapianDb.
# File 'lib/xapian_fu/xapian_db.rb', line 185
def initialize( options = { } )
  @options = { :index_positions => true, :spelling => true }.merge(options)
  @dir = @options[:dir]
  @index_positions = @options[:index_positions]
  @db_flag = Xapian::DB_OPEN
  @db_flag = Xapian::DB_CREATE_OR_OPEN if @options[:create]
  @db_flag = Xapian::DB_CREATE_OR_OVERWRITE if @options[:overwrite]
  case @options[:type]
  when :glass
    raise XapianFuError.new("type glass not recognised") unless defined?(Xapian::DB_BACKEND_GLASS)
    @db_flag |= Xapian::DB_BACKEND_GLASS
  when :chert
    raise XapianFuError.new("type chert not recognised") unless defined?(Xapian::DB_BACKEND_CHERT)
    @db_flag |= Xapian::DB_BACKEND_CHERT
  when nil
    # use library defaults
  else
    raise XapianFuError.new("type #{@options[:type].inspect} not recognised")
  end
  @tx_mutex = Mutex.new
  @language = @options.fetch(:language, :english)
  @stemmer = @options.fetch(:stemmer, @language)
  @stopper = @options.fetch(:stopper, @language)
  @stopper_strategy = @options.fetch(:stopper_strategy, :stemmed)
  @field_options = {}
  setup_fields(@options[:fields])
  @store_values << @options[:store]
  @store_values << @options[:sortable]
  @store_values << @options[:collapsible]
  @store_values = @store_values.flatten.uniq.compact
  @spelling = @options[:spelling]
  @weights_function = @options[:weights]
end
Instance Attribute Details
#boolean_fields ⇒ Object (readonly)
An array of fields that will be treated as boolean terms
# File 'lib/xapian_fu/xapian_db.rb', line 175
def boolean_fields
  @boolean_fields
end
#db_flag ⇒ Object (readonly)
:nodoc:
# File 'lib/xapian_fu/xapian_db.rb', line 158
def db_flag
  @db_flag
end
#dir ⇒ Object (readonly)
Path to the on-disk database. Nil if in-memory database
# File 'lib/xapian_fu/xapian_db.rb', line 157
def dir
  @dir
end
#field_options ⇒ Object (readonly)
Returns the value of attribute field_options.
# File 'lib/xapian_fu/xapian_db.rb', line 179
def field_options
  @field_options
end
#field_weights ⇒ Object (readonly)
Returns the value of attribute field_weights.
# File 'lib/xapian_fu/xapian_db.rb', line 181
def field_weights
  @field_weights
end
#fields ⇒ Object (readonly)
A hash of field names and their types
# File 'lib/xapian_fu/xapian_db.rb', line 166
def fields
  @fields
end
#fields_with_field_names_only ⇒ Object (readonly)
An array of fields to be indexed only with their field names
# File 'lib/xapian_fu/xapian_db.rb', line 172
def fields_with_field_names_only
  @fields_with_field_names_only
end
#fields_without_field_names ⇒ Object (readonly)
An array of fields to be indexed without their field names
# File 'lib/xapian_fu/xapian_db.rb', line 170
def fields_without_field_names
  @fields_without_field_names
end
#index_positions ⇒ Object (readonly)
True if term positions will be stored
# File 'lib/xapian_fu/xapian_db.rb', line 162
def index_positions
  @index_positions
end
#language ⇒ Object (readonly)
The default document language. Used for setting up stoppers and stemmers.
# File 'lib/xapian_fu/xapian_db.rb', line 164
def language
  @language
end
#sortable_fields ⇒ Object (readonly)
Returns the value of attribute sortable_fields.
# File 'lib/xapian_fu/xapian_db.rb', line 178
def sortable_fields
  @sortable_fields
end
#spelling ⇒ Object (readonly)
Whether this db will generate a spelling dictionary during indexing
# File 'lib/xapian_fu/xapian_db.rb', line 177
def spelling
  @spelling
end
#stopper_strategy ⇒ Object
The default stopper strategy
# File 'lib/xapian_fu/xapian_db.rb', line 183
def stopper_strategy
  @stopper_strategy
end
#store_values ⇒ Object (readonly)
An array of the fields that will be stored in the Xapian database
# File 'lib/xapian_fu/xapian_db.rb', line 160
def store_values
  @store_values
end
#unindexed_fields ⇒ Object (readonly)
An array of fields that will not be indexed
# File 'lib/xapian_fu/xapian_db.rb', line 168
def unindexed_fields
  @unindexed_fields
end
#weights_function ⇒ Object
Returns the value of attribute weights_function.
# File 'lib/xapian_fu/xapian_db.rb', line 180
def weights_function
  @weights_function
end
Instance Method Details
#add_doc(doc) ⇒ Object Also known as: <<
Short-cut to documents.add
# File 'lib/xapian_fu/xapian_db.rb', line 250
def add_doc(doc)
  documents.add(doc)
end
#add_synonym(term, synonym) ⇒ Object
Add a synonym to the database.
If you want to search with synonym support, remember to add the option:
db.search("foo", :synonyms => true)
Note that in-memory databases don’t support synonyms.
# File 'lib/xapian_fu/xapian_db.rb', line 264
def add_synonym(term, synonym)
  rw.add_synonym(term, synonym)
end
#close ⇒ Object
Closes the database.
# File 'lib/xapian_fu/xapian_db.rb', line 400
def close
  raise ConcurrencyError if @tx_mutex.locked?

  @rw.close if @rw
  @rw = nil

  @ro.close if @ro
  @ro = nil
end
#documents ⇒ Object
The XapianFu::XapianDocumentsAccessor for this database
# File 'lib/xapian_fu/xapian_db.rb', line 245
def documents
  @documents_accessor ||= XapianDocumentsAccessor.new(self)
end
#flush ⇒ Object
Flush any changes to disk and reopen the read-only database. Raises ConcurrencyError if a transaction is in progress.
# File 'lib/xapian_fu/xapian_db.rb', line 393
def flush
  raise ConcurrencyError if @tx_mutex.locked?
  rw.flush
  ro.reopen
end
#ro ⇒ Object
The read-only Xapian::Database
# File 'lib/xapian_fu/xapian_db.rb', line 235
def ro
  @ro ||= setup_ro_db
end
#rw ⇒ Object
The writable Xapian::WritableDatabase
# File 'lib/xapian_fu/xapian_db.rb', line 230
def rw
  @rw ||= setup_rw_db
end
#search(q, options = {}) ⇒ Object
Conduct a search on the Xapian database, returning an array of XapianFu::XapianDoc objects for the matches wrapped in a XapianFu::ResultSet.
The :limit option sets how many results to return. For compatibility with the will_paginate plugin, the :per_page option does the same thing (and overrides :limit if both are given). Defaults to 10.
The :page option sets which page of results to return. Defaults to 1.
The :order option specifies the stored field to order the results by (instead of the default search result weight).
The :reverse option reverses the order of the results, so lowest search weight first (or lowest stored field value first).
The :collapse option specifies which stored field value to collapse (group) the results on. It works a bit like SQL's GROUP BY.
The :spelling option controls whether spelling suggestions will be made for queries. It defaults to whatever the database spelling setting is (true by default). When enabled, spelling suggestions are available using the XapianFu::ResultSet corrected_query method.
The :check_at_least option controls how many documents will be sampled. This allows for accurate page and facet counts. Specifying the special value of :all will make Xapian sample every document in the database. Be aware that this can hurt your query performance.
The :query_builder option allows you to pass a proc that will return the final query to be run. The proc receives the parsed query as its only argument.
The first parameter can also be :all or :nothing, to match all documents or no documents respectively.
For additional options on how the query is parsed, see XapianFu::QueryParser
# File 'lib/xapian_fu/xapian_db.rb', line 314
def search(q, options = {})
  defaults = { :page => 1, :reverse => false,
    :boolean => true, :boolean_anycase => true, :wildcards => true,
    :lovehate => true, :spelling => spelling, :pure_not => false }
  options = defaults.merge(options)
  page = options[:page].to_i rescue 1
  page = page > 1 ? page - 1 : 0
  per_page = options[:per_page] || options[:limit] || 10
  per_page = per_page.to_i rescue 10
  offset = page * per_page

  check_at_least = options.include?(:check_at_least) ? options[:check_at_least] : 0
  check_at_least = self.size if check_at_least == :all

  qp = XapianFu::QueryParser.new({ :database => self }.merge(options))
  query = qp.parse_query(q.is_a?(Symbol) ? q : q.to_s)

  if options.include?(:query_builder)
    query = options[:query_builder].call(query)
  end

  query = filter_query(query, options[:filter]) if options[:filter]

  enquiry = Xapian::Enquire.new(ro)
  setup_ordering(enquiry, options[:order], options[:reverse])
  if options[:collapse]
    enquiry.collapse_key = XapianDocValueAccessor.value_key(options[:collapse])
  end
  if options[:facets]
    spies = options[:facets].inject({}) do |accum, name|
      accum[name] = spy = Xapian::ValueCountMatchSpy.new(XapianDocValueAccessor.value_key(name))
      enquiry.add_matchspy(spy)
      accum
    end
  end

  if options.include?(:posting_source)
    query = Xapian::Query.new(Xapian::Query::OP_AND_MAYBE, query, Xapian::Query.new(options[:posting_source]))
  end

  enquiry.query = query

  ResultSet.new(:mset => enquiry.mset(offset, per_page, check_at_least),
                :current_page => page + 1,
                :per_page => per_page,
                :corrected_query => qp.corrected_query,
                :spies => spies,
                :xapian_db => self
               )
end
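The paging arithmetic in #search can be tried out in isolation. The sketch below reproduces only that part of the method: page_window is a hypothetical helper, and the option values in the calls are made up.

```ruby
# Hypothetical helper reproducing the paging arithmetic from #search.
def page_window(options = {})
  page = options.fetch(:page, 1).to_i
  page = page > 1 ? page - 1 : 0 # convert to a zero-based page index
  # :per_page takes precedence over :limit; 10 is the fallback.
  per_page = (options[:per_page] || options[:limit] || 10).to_i
  { :offset => page * per_page, :per_page => per_page, :current_page => page + 1 }
end

p page_window
p page_window(:page => 3, :limit => 20)
p page_window(:page => 2, :limit => 20, :per_page => 5)
```

A :page of 0 or 1 both map to the first page, so the offset never goes negative for those inputs.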
#serialize_value(field, value, type = nil) ⇒ Object
# File 'lib/xapian_fu/xapian_db.rb', line 410
def serialize_value(field, value, type = nil)
  if sortable_fields.include?(field)
    Xapian.sortable_serialise(value)
  else
    (type || fields[field] || Object).to_xapian_fu_storage_value(value)
  end
end
#size ⇒ Object
The number of docs in the Xapian database
# File 'lib/xapian_fu/xapian_db.rb', line 240
def size
  ro.doccount
end
#stemmer ⇒ Object
Return a new stemmer object for this database
# File 'lib/xapian_fu/xapian_db.rb', line 220
def stemmer
  StemFactory.stemmer_for(@stemmer)
end
#stopper ⇒ Object
The stopper object for this database
# File 'lib/xapian_fu/xapian_db.rb', line 225
def stopper
  StopperFactory.stopper_for(@stopper)
end
#transaction(flush_on_commit = true) ⇒ Object
Run the given block in a XapianDB transaction. Any changes to the Xapian database made in the block will be atomically committed at the end.
If an exception is raised by the block, all changes are discarded and the exception re-raised.
Xapian does not support multiple concurrent transactions on the same Xapian database. Any attempts at this will be serialized by XapianFu, which is not perfect but probably better than just kicking up an exception.
# File 'lib/xapian_fu/xapian_db.rb', line 376
def transaction(flush_on_commit = true)
  @tx_mutex.synchronize do
    begin
      rw.begin_transaction(flush_on_commit)
      yield
    rescue Exception => e
      rw.cancel_transaction
      ro.reopen
      raise e
    end
    rw.commit_transaction
    ro.reopen
  end
end
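The commit and rollback behaviour can be sketched without a real database. In the example below, FakeWritableDb is a hypothetical stand-in for Xapian::WritableDatabase that only records which calls were made, and TinyTx is a hypothetical wrapper with the same shape as #transaction, minus the reopening of the read-only database.

```ruby
# Hypothetical stand-in for Xapian::WritableDatabase; it only logs calls.
class FakeWritableDb
  attr_reader :log

  def initialize
    @log = []
  end

  def begin_transaction(flush = true)
    @log << :begin
  end

  def commit_transaction
    @log << :commit
  end

  def cancel_transaction
    @log << :cancel
  end
end

# Hypothetical wrapper following the same shape as XapianDb#transaction.
class TinyTx
  def initialize(rw)
    @rw = rw
    @tx_mutex = Mutex.new
  end

  def transaction(flush_on_commit = true)
    @tx_mutex.synchronize do # one transaction at a time
      begin
        @rw.begin_transaction(flush_on_commit)
        yield
      rescue Exception => e
        @rw.cancel_transaction # discard everything done in the block
        raise e
      end
      @rw.commit_transaction
    end
  end
end

rw = FakeWritableDb.new
tx = TinyTx.new(rw)

tx.transaction { :ok }
begin
  tx.transaction { raise ArgumentError, 'boom' }
rescue ArgumentError
  # the error is re-raised after the rollback
end

p rw.log
```

Ruby's Mutex is not reentrant, so in this sketch a transaction started inside another transaction block on the same thread raises ThreadError instead of waiting.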
#unserialize_value(field, value, type = nil) ⇒ Object
# File 'lib/xapian_fu/xapian_db.rb', line 418
def unserialize_value(field, value, type = nil)
  if sortable_fields.include?(field)
    Xapian.sortable_unserialise(value)
  else
    (type || fields[field] || Object).from_xapian_fu_storage_value(value)
  end
end