Rika

Rika is a JRuby wrapper for the Apache Tika Java library, which extracts text and metadata from files and resources of many different formats.

Rika can be used as a library in your Ruby code, or on the command line.

For class- and method-level documentation, use YARD: run gem install yard, then yard doc from the project root, and open doc/index.html in a browser.

Requirements

This gem only works with JRuby.
The Apache Tika jar file must be installed on your system. See the Installation section below for more information.

Rika currently supports some basic and commonly used functions of Tika. Since it runs on JRuby, the Tika library's Java methods can be called directly from Ruby code for more advanced needs. See the Other Tika Resources section of this document for alternatives to Rika that may suit more demanding needs.

Rika can be used either as a gem in your own Ruby project, or on the command line using the provided executable.

Usage in Your Ruby Code

[!IMPORTANT]
It is necessary to call Rika.init before using Rika. This is because the loading of the Tika library has been put in an init method, rather than at load time, so that 'jar file not found or specified' errors do not prevent your application from loading. If you forget to call Rika.init, you may see seemingly unrelated error messages.

As a convenience, the Rika.init method is called automatically when you call the Rika module methods. However, if you access other Rika classes and methods, init may not have been called yet, so you should call it yourself.
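The deferred-loading idea described above can be sketched in plain Ruby. This is only an illustration of the pattern, not Rika's actual source: the module name LazyLoaderSketch and all of its methods are hypothetical, and the real jar loading (which needs JRuby) is replaced by simply remembering the filespec.

```ruby
# Hypothetical stand-in for a library that defers a failure-prone load step.
# None of these names come from Rika itself.
module LazyLoaderSketch
  @initialized = false
  @jar_filespec = nil

  class << self
    attr_reader :jar_filespec

    def initialized?
      @initialized
    end

    # Runs at most once, and only on request, so a missing setting cannot
    # break `require` of the application that uses this module.
    def init(env = ENV)
      return if @initialized

      filespec = env['TIKA_JAR_FILESPEC'].to_s.strip
      raise ArgumentError, 'TIKA_JAR_FILESPEC is not set' if filespec.empty?

      @jar_filespec = filespec # a real library would load the jar here
      @initialized = true
    end

    # Module-level entry points call init themselves, which mirrors the
    # convenience described above.
    def describe(env = ENV)
      init(env)
      "would load #{jar_filespec}"
    end
  end
end
```

Requiring a file that contains this module never fails because of a missing jar setting. The error only appears when init (or a method that calls it) runs.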

The Rika parse method returns a Rika::ParseResult object that contains the parsed text and various pieces of metadata. The main methods of ParseResult are:

content - the parsed text
metadata - a hash of metadata key/value pairs
content_type - the content type of the parsed data, e.g. "text/plain; charset=UTF-8"
language - the language of the parsed data, e.g. "en"
data_source - the data source, either a filespec or a URL

For example:

require 'rika'

parse_result = Rika.parse('x.pdf') # returns a Rika::ParseResult object
parse_result.content               # string containing all content text
parse_result.text                  # 'text' is an alias for 'content'
parse_result.metadata              # hash containing the document metadata
parse_result.content_type          # e.g. "application/pdf"
parse_result.language              # e.g. "en"
parse_result.data_source           # e.g. "x.pdf"

A URL can be used instead of a filespec wherever a data source is specified:

parse_result = Rika.parse('https://github.com/keithrbennett/rika')

The Rika module also has the following methods:

Rika.language("magnifique") # => "fr"
Rika.tika_version           # => "2.9.0"

Command Line Executable Usage

Rika can also be used on the command line via the rika executable. In its simplest form, specify one or more filespecs or URLs as arguments:

rika x.pdf https://github.com/keithrbennett/rika

Here is the help text:

Rika v2.0.2 (Tika v2.9.0) - https://github.com/keithrbennett/rika

Usage: rika [options] <file or url> [...file or url...]
Output formats are: [a]wesome_print, [t]o_s, [i]nspect, [j]son, [J] for pretty json, and [y]aml.
If a format contains two letters, the first will be used for metadata, the second for text.
Values for the text, metadata, and as_array boolean options may be specified as follows:
  Enable:  +, true,  yes, [empty]
  Disable: -, false, no, [long form option with no- prefix, e.g. --no-metadata]

    -f, --format FORMAT              Output format (default: at)
    -m, --[no-]metadata [FLAG]       Output metadata (default: true)
    -t, --[no-]text [FLAG]           Output text (default: true)
    -k, --[no-]key-sort [FLAG]       Sort metadata keys case insensitively (default: true)
    -s, --[no-]source [FLAG]         Output document source file or URL (default: true)
    -a, --[no-]as-array [FLAG]       Output all parsed results as an array (default: false)
    -v, --version                    Output version
    -h, --help                       Output help

Outputting Only Metadata or Only Parsed Text

The default setting is to output both metadata and text. To disable either, use the -m or -t options with a disabling flag, e.g. -m-, -m false, -m no, or --no-metadata to disable metadata.
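The enabling and disabling flag values listed in the help text can be modeled with a small helper. This is a sketch under assumptions, not Rika's implementation: the method name flag_to_boolean and the constants are hypothetical, and treating the values case insensitively is a choice made here for illustration.

```ruby
# Hypothetical helper; the names are illustrative and not taken from Rika.
ENABLE_FLAGS  = ['+', 'true', 'yes', ''].freeze
DISABLE_FLAGS = ['-', 'false', 'no'].freeze

# Converts a flag value such as '-', 'no', or nil (option given without a
# value) into a boolean, raising on anything unrecognized.
def flag_to_boolean(flag)
  value = flag.to_s.strip.downcase # case-insensitivity is an assumption
  return true  if ENABLE_FLAGS.include?(value)
  return false if DISABLE_FLAGS.include?(value)

  raise ArgumentError, "Unrecognized flag: #{flag.inspect}"
end
```

With this helper, flag_to_boolean('-') and flag_to_boolean('no') both return false, while flag_to_boolean(nil) returns true, matching the [empty] entry in the help text.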

Outputting the Document Source Identifier (Filespec or URL)

There are many times when it is useful to know the source of the document. For example, if you are processing a large number of documents, you may want to know which document a particular piece of output came from.

The document source identifier is output by default. To disable it, use the -s option with a disabling flag, e.g. -s-, -s false, -s no, or --no-source.

Output Formats

The -f option can be used to specify the output format. The default is at, which means that the metadata will be output in awesome_print format, and the text will be output using to_s (i.e. without any changes to the parsed string).

If the value passed to -f is a single letter, that format will be used for both metadata and text. If it contains two letters, the first will be used for metadata and the second for the parsed text.
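The one-letter versus two-letter rule can be expressed as a short function. This is a sketch, not Rika's own code: split_format and FORMAT_LETTERS are hypothetical names, and rejecting unknown letters with an exception is an assumption.

```ruby
# Format letters taken from the help text above.
FORMAT_LETTERS = %w[a t i j J y].freeze

# Hypothetical helper: returns the format letter to use for metadata and
# for text, given the value passed to -f.
def split_format(format)
  letters = format.to_s.chars
  valid = [1, 2].include?(letters.size) &&
          letters.all? { |letter| FORMAT_LETTERS.include?(letter) }
  raise ArgumentError, "Invalid format: #{format.inspect}" unless valid

  # With one letter, first and last are the same element.
  { metadata: letters.first, text: letters.last }
end
```

For the default value, split_format('at') returns { metadata: 'a', text: 't' }, and split_format('y') returns { metadata: 'y', text: 'y' }.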

Sorting of Metadata Keys

By default, metadata keys will be sorted case insensitively. To disable this, use the -k option with a disabling flag, e.g. -k-, -k false, -k no, or --no-key-sort.

The case insensitivity is implemented by using String#downcase. This may not sort correctly on some non-English systems.
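A minimal sketch of a downcase-based key sort, and of the limitation just mentioned, follows. The method name sort_keys_case_insensitively and the sample keys are made up for illustration, and this is not necessarily how Rika structures its own code.

```ruby
# Hypothetical helper: orders a metadata hash by downcased key.
def sort_keys_case_insensitively(metadata)
  metadata.sort_by { |key, _value| key.downcase }.to_h
end

# Invented sample metadata; a plain sort would put 'dc:title' last because
# uppercase ASCII letters sort before lowercase ones.
sample = { 'dc:title' => 'A', 'X-TIKA:Parsed-By' => 'B', 'Content-Type' => 'C' }
sort_keys_case_insensitively(sample).keys
# => ["Content-Type", "dc:title", "X-TIKA:Parsed-By"]

# The limitation: strings are compared by code point after downcasing, so an
# accented initial letter sorts after 'z' rather than next to its base letter.
accented = { 'Zebra' => 1, "\u00e9mile" => 2, 'apple' => 3 }
sort_keys_case_insensitively(accented).keys
# => ["apple", "Zebra", "\u00e9mile"]
```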

Specifying Command Line Options in the RIKA_OPTIONS Environment Variable

If you find yourself using the same options over and over again, you can put them in the RIKA_OPTIONS environment variable. For example, if the default behavior of sorting keys does not work for your language, you can disable it for all invocations of the rika command by specifying -k- in the RIKA_OPTIONS environment variable.
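One common way for a command line tool to honor such a variable is to split its value like a shell would and place the result ahead of the real arguments. The sketch below assumes that approach. effective_args is a hypothetical name, and the precedence shown (later command line options overriding the environment) is an assumption rather than documented Rika behavior.

```ruby
require 'shellwords'

# Hypothetical helper: combines options from RIKA_OPTIONS with the actual
# command line arguments. Environment options come first so that, with a
# typical option parser, options typed on the command line are seen last.
def effective_args(argv, env = ENV)
  Shellwords.split(env.fetch('RIKA_OPTIONS', '')) + argv
end

# Example with an explicit hash standing in for the environment:
effective_args(['-k+', 'x.pdf'], { 'RIKA_OPTIONS' => '-k- -fJ' })
# => ["-k-", "-fJ", "-k+", "x.pdf"]
```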

Machine Readable Data Support

If both metadata and text are output, and the same output format is used for both, and that format is JSON (plain or "pretty") or YAML, then the output per document will be a single JSON or YAML hash representation containing both the metadata and the text (whose keys are "metadata" and "text"). This enables piping the results of multiple documents to a file or to another program that can use it as a data source. In addition, when processing multiple files, this streaming approach will be more efficient than calling Rika separately for each file, since each invocation of the rika command requires starting up a Java Virtual Machine.

If the -a (--as-array) option is specified, then the output will be an array of such hashes, one for each file. This enables the output to be used as a data source for programs that can process an array of hashes, e.g. for analysis.
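As a sketch of what a consumer of this output might look like, the following Ruby code parses an array of per-document hashes with "metadata" and "text" keys. The sample JSON is invented for illustration (real output contains many more metadata keys), and summarize is a hypothetical name.

```ruby
require 'json'

# Invented sample in the shape described above: an array of hashes, each
# with "metadata" and "text" keys.
SAMPLE_OUTPUT = <<~JSON
  [
    { "metadata": { "Content-Type": "application/pdf" }, "text": "First document" },
    { "metadata": { "Content-Type": "text/plain; charset=UTF-8" }, "text": "Second document" }
  ]
JSON

# Hypothetical consumer: reduces each document to its content type and the
# length of its parsed text.
def summarize(json)
  JSON.parse(json).map do |doc|
    { content_type: doc.dig('metadata', 'Content-Type'), text_length: doc['text'].length }
  end
end

summarize(SAMPLE_OUTPUT)
# => [{ content_type: "application/pdf", text_length: 14 },
#     { content_type: "text/plain; charset=UTF-8", text_length: 15 }]
```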

For example, here is how to use Rika and rexe to get a tally of content types for a set of documents, sorted by content type:

$ rika -t- -s- -fy -a spec/fixtures/* | \
  rexe -iy -oa -mb "map { |r| r['metadata']['Content-Type'] }.tally.sort.to_h"
{
                                                         "application/msword" => 1,
                                                   "application/octet-stream" => 1,
                                                            "application/pdf" => 1,
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document" => 1,
                                                                 "image/jpeg" => 2,
                                             "text/plain; charset=ISO-8859-1" => 1,
                                                  "text/plain; charset=UTF-8" => 6,
                                          "text/x-matlab; charset=ISO-8859-1" => 1
}

Here is a breakdown of the above command:

rika
- -t- suppresses the output of text
- -s- suppresses the output of the source identifier
- -fy outputs the data in YAML format.
- -a option causes the output to be an array of hashes, one for each file
rexe
- -iy indicates that the input is YAML
- -oa indicates that the output should be done using awesome_print/amazing_print
- -mb indicates that all input should be ingested as a single string ("b" for "big string", as opposed to streamed)
Ruby code passed to rexe
- map is called on the array to extract the content type from each parsed document hash
- tally is called on the resulting array to get the count of each content type
- sort is called on the hash to sort it by key (content type) and return an array of 2-element arrays
- to_h is called on the array of 2-element arrays to convert it back to a hash
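The same map, tally, sort, to_h chain can be tried in plain Ruby (2.7 or later, for tally) without Rika or rexe. The records below are invented stand-ins for the parsed YAML, and content_type_tally is a hypothetical name.

```ruby
# Invented stand-ins for the array of per-document hashes.
SAMPLE_RECORDS = [
  { 'metadata' => { 'Content-Type' => 'image/jpeg' } },
  { 'metadata' => { 'Content-Type' => 'application/pdf' } },
  { 'metadata' => { 'Content-Type' => 'image/jpeg' } }
].freeze

# Same chain as the code passed to rexe above.
def content_type_tally(records)
  records
    .map { |record| record['metadata']['Content-Type'] } # extract each type
    .tally                                               # count occurrences
    .sort                                                # array of [type, count], by type
    .to_h                                                # back to a hash
end

content_type_tally(SAMPLE_RECORDS)
# => { "application/pdf" => 1, "image/jpeg" => 2 }
```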

Here is another example that prints out the 5 most common words in all the parsed text, and their counts, as "pretty" JSON:

$ rika -m- spec/fixtures/* | \
  rexe -in -oJ -mb 'downcase \
    .split \
    .tally \
    .sort_by { |word, count| [-count, word] } \
    .first(5) \
    .to_h'

{
  "the": 35,
  "to": 30,
  "and": 25,
  "i": 25,
  "woods": 25
}
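The word-count chain can also be run as ordinary Ruby to see how ties are ordered. In this sketch top_words is a hypothetical name and the sample sentence is invented. Note that split only breaks on whitespace, so punctuation attached to a word would be counted as part of it.

```ruby
# Same chain as the code passed to rexe above, wrapped in a method.
def top_words(text, limit = 5)
  text
    .downcase
    .split
    .tally
    .sort_by { |word, count| [-count, word] } # highest count first, then A-Z
    .first(limit)
    .to_h
end

top_words('The woods and the road and the woods', 3)
# => { "the" => 3, "and" => 2, "woods" => 2 }
```

Because the sort key is [-count, word], words with equal counts come out in alphabetical order.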

Installation

Install JRuby if you don't already have it. Ruby version managers such as rvm and rbenv can simplify this process.
Download the Apache Tika jar file from http://tika.apache.org/download.html (look for the "tika-app" jar file). Put it in a place that makes sense for your system, such as /usr/local/lib.
Set the TIKA_JAR_FILESPEC environment variable to point to the Tika jar file. For example, if you are using tika-app-2.9.0.jar and put it in /opt/jars, the setting should look like this:

  export TIKA_JAR_FILESPEC=/opt/jars/tika-app-2.9.0.jar

You can put this in your .bashrc or .zshrc file to make it persistent.

Install the gem:

  gem install rika

or, if you're using bundler, add this to your Gemfile:

  gem 'rika'

and then run bundle install.

Verify that it works by running (as an example) rika -m https://www.github.com. You should see key/value pairs representing the metadata of the GitHub home page.

This gem has been tested with JRuby managed by rvm. It should work with other Ruby version managers and without any version manager at all, but those configurations have not been tested.
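If the installation does not seem to work, a quick check of the environment variable can narrow things down. The following is a small diagnostic sketch that is not part of Rika. tika_jar_problem is a hypothetical name, and the script runs under any Ruby, not only JRuby.

```ruby
# Hypothetical diagnostic: returns a description of what is wrong with the
# TIKA_JAR_FILESPEC setting, or nil if it points at an existing file.
def tika_jar_problem(env = ENV)
  filespec = env['TIKA_JAR_FILESPEC'].to_s.strip
  return 'TIKA_JAR_FILESPEC is not set' if filespec.empty?
  return "No file found at #{filespec}" unless File.file?(filespec)

  nil
end

problem = tika_jar_problem
puts(problem || 'TIKA_JAR_FILESPEC looks fine')
```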

Other Tika Resources

The Apache Tika wiki is at https://cwiki.apache.org/confluence/display/tika.
Tika also provides another jar file containing a RESTful server that you can run on the command line. You can download this server jar from http://tika.apache.org/download.html (look for the "tika-server-standard" jar file). See the "Running the Tika Server as a Jar file" section of https://cwiki.apache.org/confluence/display/TIKA/TikaServer for more information.
@chrismattmann and others have provided "tika-python", a Python library and CLI that interfaces with the Tika server.
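For the Tika server mentioned above, a request from Ruby might look like the sketch below. It assumes a server listening on http://localhost:9998 and the PUT /tika endpoint with an Accept: text/plain header, as described in the Tika server documentation. Check that documentation for the version you run. The method names are hypothetical, and none of this involves Rika.

```ruby
require 'net/http'
require 'uri'

# Builds (but does not send) a PUT request asking a Tika server to return
# the plain text of the given document bytes. The base URL is an assumption.
def build_tika_text_request(document_bytes, base_url = 'http://localhost:9998')
  uri = URI.join(base_url, '/tika')
  request = Net::HTTP::Put.new(uri)
  request['Accept'] = 'text/plain'
  request.body = document_bytes
  [uri, request]
end

# Sends the request and returns the response body. This needs a running
# server, so it is defined here but not called.
def extract_text_via_server(path, base_url = 'http://localhost:9998')
  uri, request = build_tika_text_request(File.binread(path), base_url)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }.body
end
```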

Contributing

Fork it
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create new Pull Request