Isomorfeus Ferret

Convenient and well-performing document store, indexing and search.

Community and Support

At the Isomorfeus Framework Project

About this project

Isomorfeus-Ferret is a revived version of the original ferret gem created by Dave Balmain, https://github.com/dbalmain/ferret. During the revival many things have been fixed: all tests now pass, there are no crashes, and it successfully compiles and runs with Ruby versions > 3. It is no longer a goal to have a C library available; instead, it is meant to be used as a Ruby gem with a C extension only.

It works on *nixes, *nuxes, *BSDs and also works on Windows and RaspberryPi.

Improvements and Changes in Version 0.14

Breaking

  • The API for LazyDocs has changed: they are now read-only. LazyDoc#to_h may be used to create a hash that can be changed and reindexed as a doc.

Performance

  • LazyDoc is now truly lazy: fields are retrieved automatically. LazyDoc#load is no longer required, but may be used to preload all fields.
  • Index#each is now multiple times faster, depending on use case.

Other

  • The Index class now includes Enumerable
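
As a rough illustration of what including Enumerable provides, the sketch below uses a hypothetical stand-in class, TinyStore, rather than Ferret's Index. It only assumes that a class defines #each; what Index#each actually yields is described in index.rb, not here.

```ruby
# Hypothetical stand-in for a document collection; NOT Ferret's Index class.
class TinyStore
  include Enumerable

  def initialize(docs)
    @docs = docs
  end

  # Enumerable only requires #each; select, map, max_by, first and so on
  # are all derived from it.
  def each(&block)
    return enum_for(:each) unless block
    @docs.each(&block)
    self
  end
end

store = TinyStore.new([
  { title: "Ferret basics", year: 2022 },
  { title: "Lucene notes",  year: 2021 },
  { title: "Ferret tuning", year: 2023 }
])

# Methods that come for free from Enumerable:
titles = store.select { |doc| doc[:title].start_with?("Ferret") }
              .map { |doc| doc[:title] }
newest = store.max_by { |doc| doc[:year] }

puts titles.inspect
puts newest[:title]
puts store.first[:title]
```

Any class that defines #each and includes Enumerable gains these methods in the same way, which is why code that iterates an index can be written with the usual collection idioms.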

Improvements and Changes in Version 0.13

Breaking

  • For version 0.13 the index file format has changed and is no longer compatible with previous versions. Indexes of older versions must be recreated with 0.13 (export all data from, and with, the previous version, then import all data with 0.13).
  • The :store option no longer accepts :compress; compression must now be specified with the separate :compression option (see below).
  • The ASCII-specific Tokenizers and Analyzers have been removed

String Encoding support

Input strings and stored fields

In versions prior to 0.13 the string encoding had to match the locale string encoding. In 0.13 the dependency on the locale setting has been removed: input strings are now correctly tokenized according to their source encoding, with positions correctly matching the input string. All Ruby string encodings are supported. When fields are stored, they are now stored with their encoding, so that when they are retrieved again they retain the original encoding, with positions matching the string in its original encoding.

Tokens, Terms, Filters and Queries

Tokens are internally converted to UTF-8, which may change their length compared to their original encoding, yet they retain position information according to the source in its original encoding. Terms are likewise stored in UTF-8 encoding, and queries are converted to UTF-8 as well. The benefit is that Filters, Stemmers or anything else working with Tokens and Terms only need to support UTF-8 encoding, which greatly simplifies things and ensures consistent query results, independent of the source encoding.
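
The length change mentioned above can be seen with plain Ruby strings, without Ferret involved. The following sketch uses an ISO-8859-1 sample string and a hypothetical helper, byte_offset, to show that both the byte length and the byte offset of a word differ between the source encoding and UTF-8. It makes no claim about how Ferret represents positions internally.

```ruby
# Build a Latin-1 (ISO-8859-1) string: one byte per character.
source = "caf\u00E9 cr\u00E8me".encode(Encoding::ISO_8859_1)

# The same text in UTF-8: each accented character takes two bytes.
utf8 = source.encode(Encoding::UTF_8)

# Hypothetical helper: byte offset at which a word starts within a string.
def byte_offset(str, word)
  char_index = str.index(word.encode(str.encoding))
  str[0...char_index].bytesize
end

puts source.length              # 10 characters
puts source.bytesize            # 10 bytes in ISO-8859-1
puts utf8.bytesize              # 12 bytes in UTF-8
puts byte_offset(source, "cr")  # 5
puts byte_offset(utf8, "cr")    # 6
```

Because the two encodings disagree on byte offsets, position information only stays meaningful for the caller if it refers to the string in its original encoding, which is the behaviour described above.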

Compression

Compression semantics have changed. The Brotli, BZip2 and LZ4 compression codecs are now supported:

  • BZip2: slow compression, slow decompression, high compression ratio
  • Brotli: slow compression, fast decompression, high compression ratio; recommended for general purpose use
  • LZ4: fast compression, fast decompression, low compression ratio

To see performance and compression ratios, run rake ferret_compression_bench from the cloned repo. It uses data and code within the misc/ferret_vs_others directory.

To compress a stored field, use the :compression option with one of :no, :brotli, :bz2 or :lz4. Example:

    # fis is assumed to be an existing FieldInfos instance
    fis.add_field(:compressed_field, :store => :yes, :compression => :brotli, :term_vector => :yes)

Performance

For version 0.13.7 the performance bottleneck has been identified and removed. Ferret now delivers excellent indexing performance on all platforms; see the numbers below. On Windows, performance is still not as good as on Linux, but that is equally true for Lucene and is due to how the Windows filesystem works.

Documentation

The documentation is currently scattered throughout the repo.

For a quick start it is best to read: https://github.com/isomorfeus/isomorfeus-ferret/blob/master/TUTORIAL.md

Further:

  • https://github.com/isomorfeus/isomorfeus-ferret/blob/master/lib/isomorfeus/ferret/index/index.rb
  • https://github.com/isomorfeus/isomorfeus-ferret/blob/master/lib/isomorfeus/ferret/document.rb

The query language and parser are documented here: https://github.com/isomorfeus/isomorfeus-ferret/blob/master/ext/isomorfeus_ferret_ext/frb_qparser.c

Examples can be found in the ‘test’ directory or in ‘misc/ferret_vs_others’.

Running Specs

  • clone repo
  • bundle install
  • rake

Ensure your locale is set to C.UTF-8, because the internal C tests don’t know how to handle localized output.

Benchmarks

Indexing and Searching

  • clone repo
  • bundle install
  • rake ferret_vs_others

A recent Java JDK must be installed to compile and run the Lucene benchmarks.

Results for Ferret 0.14.0 vs. Lucene 9.2.0 with WhitespaceAnalyzer. Linux (Ubuntu 22.04), FreeBSD 13.1 and Windows 10 ran on an old Intel Core i5 from 2015; LinuxPi ran on a RaspberryPi 400:

| OS      | Task       | Ferret          | Lucene*        |
|---------|------------|-----------------|----------------|
| Linux   | Indexing   | 5125 docs/s     | 4959 docs/s    |
| FreeBSD | Indexing   | 4537 docs/s     | 3831 docs/s    |
| Windows | Indexing   | 2488 docs/s     | 2588 docs/s    |
| LinuxPi | Indexing   | 1200 docs/s     | 755 docs/s     |
| Linux   | Searching  | 26610 queries/s | 7165 queries/s |
| FreeBSD | Searching  | 24167 queries/s | 4288 queries/s |
| Windows | Searching  | 3901 queries/s  | 1033 queries/s |
| LinuxPi | Searching  | 6194 queries/s  | 785 queries/s  |
|         | Index Size | 28 MB           | 35 MB          |

* JVM versions:

  • Linux: OpenJDK Runtime Environment (build 18-ea+36-Ubuntu-1)
  • LinuxPi: OpenJDK Runtime Environment (build 17.0.3+7-Raspbian-1deb11u1rpt1)
  • Windows: OpenJDK Runtime Environment Temurin-18.0.1+10 (build 18.0.1+10)
  • FreeBSD: OpenJDK Runtime Environment (build 17.0.2+8-1)
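
To make the results table easier to compare, the following small sketch computes the Ferret-to-Lucene throughput ratio for each row. The numbers are copied from the table; the constant and method names are made up for this illustration.

```ruby
# Throughput figures copied from the results table: [Ferret, Lucene].
INDEXING = {
  "Linux"   => [5125, 4959],
  "FreeBSD" => [4537, 3831],
  "Windows" => [2488, 2588],
  "LinuxPi" => [1200, 755]
}.freeze

SEARCHING = {
  "Linux"   => [26610, 7165],
  "FreeBSD" => [24167, 4288],
  "Windows" => [3901, 1033],
  "LinuxPi" => [6194, 785]
}.freeze

# Ferret throughput divided by Lucene throughput; > 1.0 means Ferret is faster.
def speedup(ferret, lucene)
  (ferret.to_f / lucene).round(2)
end

{ "Indexing" => INDEXING, "Searching" => SEARCHING }.each do |task, rows|
  rows.each do |os, (ferret, lucene)|
    puts format("%-9s %-8s %.2fx", task, os, speedup(ferret, lucene))
  end
end
```

With these figures the ratio is below 1.0 only for indexing on Windows (0.96), while the searching ratios range from about 3.7 to about 7.9.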

Storing Fields with Compression, Indexing and Retrieval

  • clone repo
  • bundle install
  • rake ferret_compression_benchmark

Results on Linux with 0.14.0, on an old Intel Core i5 from 2015:

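| Compression | Index & Store | Retrieve Title | Index size |
|-------------|---------------|----------------|------------|
| none        | 4862 docs/s   | 278827 docs/s  | 43 MB      |
| brotli      | 3559 docs/s   | 178170 docs/s  | 36 MB      |
| bzip2       | 2628 docs/s   | 81877 docs/s   | 38 MB      |
| lz4         | 4648 docs/s   | 232236 docs/s  | 41 MB      |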
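
The trade-off between speed and index size is easier to read when each codec is expressed relative to the uncompressed run. The sketch below does that arithmetic with the numbers from the table; RESULTS and percent_of_baseline are names chosen for this illustration only.

```ruby
# Figures copied from the results table (docs/s and MB).
RESULTS = {
  none:   { index: 4862, retrieve: 278_827, size_mb: 43 },
  brotli: { index: 3559, retrieve: 178_170, size_mb: 36 },
  bzip2:  { index: 2628, retrieve: 81_877,  size_mb: 38 },
  lz4:    { index: 4648, retrieve: 232_236, size_mb: 41 }
}.freeze

# A codec's value for one metric as a percentage of the uncompressed baseline.
def percent_of_baseline(codec, metric)
  (RESULTS[codec][metric] * 100.0 / RESULTS[:none][metric]).round(1)
end

(RESULTS.keys - [:none]).each do |codec|
  puts format("%-6s index speed %5.1f%%, retrieval speed %5.1f%%, index size %5.1f%%",
              codec,
              percent_of_baseline(codec, :index),
              percent_of_baseline(codec, :retrieve),
              percent_of_baseline(codec, :size_mb))
end
```

In this particular run brotli produces the smallest index, at roughly 84% of the uncompressed size, while indexing at roughly 73% of the uncompressed speed; lz4 stays close to the baseline on both counts.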

Future

Lots of things to do:

  • Bring documentation in order in a docs directory
  • Review code (especially for memory/stack issues, typical C issues)
  • Take care of the Ruby GVL and threading
  • See todo directory: https://github.com/isomorfeus/isomorfeus-ferret/tree/master/misc/todo

Any help or support is much appreciated!