Module: Rika

Defined in: lib/rika.rb,
lib/rika/parser.rb,
lib/rika/version.rb,
lib/rika/formatters.rb,
lib/rika/tika_loader.rb,
lib/rika/parse_result.rb

Overview

Requires the Tika jar file, either from the default location (packaged with this gem) or from an override specified in the TIKA_JAR_FILESPEC environment variable.

Defined Under Namespace

Classes: Formatters, ParseResult, Parser, TikaLoadError, TikaLoader

Constant Summary

PROJECT_URL =

'https://github.com/keithrbennett/rika'

VERSION =

'2.0.0'

Class Method Summary

.init ⇒ Module

Loads the Tika jar file and imports the needed Java classes.
.language(text) ⇒ String

Language of passed text, as 2-character ISO 639-1 code.
.parse(data_source, key_sort: true, max_content_length: -1, detector: DefaultDetector.new) ⇒ ParseResult

Gets a ParseResult from parsing a document.
.parse_content(data_source, max_content_length: -1) ⇒ String

Deprecated. Instead, get a ParseResult and access the content field.
.parse_content_and_metadata(data_source, max_content_length: -1) ⇒ Array<String,Hash>

Deprecated. Instead, get a ParseResult and access the content and metadata fields.
.parse_content_and_metadata_as_hash(data_source, max_content_length: -1) ⇒ Hash

Deprecated. Instead, use a ParseResult or its to_h method.
.parse_metadata(data_source, max_content_length: -1) ⇒ Object

Deprecated. Instead, get a ParseResult and access the metadata field.
.raise_unless_jruby ⇒ Object

Raise an error if not running under JRuby.
.tika_language_detector ⇒ Detector

Tika language detector.
.tika_version ⇒ String

Version of loaded Tika jar file.

Class Method Details

.init ⇒ `Module`

Loads the Tika jar file and imports the needed Java classes.

Returns:

(Module) —

the Rika module, for chaining (on the first call; as written, later calls hit the early `return if @initialized` and return nil)

# File 'lib/rika.rb', line 16

def self.init
  return if @initialized

  Rika.raise_unless_jruby

  Rika::TikaLoader.require_tika
  import java.io.FileInputStream
  import java.net.URL
  import org.apache.tika.Tika
  import org.apache.tika.detect.DefaultDetector
  import org.apache.tika.io.TikaInputStream
  import org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
  import org.apache.tika.language.detect.LanguageDetector
  import org.apache.tika.language.detect.LanguageResult
  import org.apache.tika.metadata.Metadata

  @initialized = true
  self
end
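
The guard flag in the listing above makes repeated calls cheap, but its bare `return` yields nil on every call after the first. As a minimal sketch of the same pattern, the hypothetical `LazySetup` module below (not part of Rika, and loading no Java classes) returns `self` from the guard so that repeat calls stay chainable:

```ruby
# Hypothetical stand-in module; it does not load Tika or import any Java classes.
module LazySetup
  @initialized = false
  @setup_runs = 0

  class << self
    attr_reader :setup_runs

    # Runs the one-time setup on the first call only.
    def init
      return self if @initialized # returning self keeps repeat calls chainable

      @setup_runs += 1 # placeholder for the real one-time work
      @initialized = true
      self
    end
  end
end

LazySetup.init.init
puts LazySetup.setup_runs # prints 1
```

Whether Rika itself should return `self` from the guard is a design question for the gem; the sketch only illustrates the difference.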

.language(text) ⇒ `String`

Returns language of passed text, as 2-character ISO 639-1 code.

Parameters:

text (String) —

text to detect language of

Returns:

(String) —

language of passed text, as 2-character ISO 639-1 code

# File 'lib/rika.rb', line 57

def self.language(text)
  init
  tika_language_detector.detect(text.to_java_string).get_language
end

.parse(data_source, key_sort: true, max_content_length: -1, detector: DefaultDetector.new) ⇒ `ParseResult`

Gets a ParseResult from parsing a document.

Parameters:

data_source (String) —

file path or HTTP(s) URL
key_sort (Boolean) (defaults to: true) —

whether to sort the keys in the metadata hash, defaults to true
max_content_length (Integer) (defaults to: -1) —

maximum content length to return, defaults to -1 (all content)
detector (Detector) (defaults to: DefaultDetector.new) —

Tika detector, defaults to DefaultDetector

Returns:

(ParseResult)

# File 'lib/rika.rb', line 43

def self.parse(data_source, key_sort: true, max_content_length: -1, detector: DefaultDetector.new)
  init
  parser = Parser.new(data_source, key_sort: key_sort, max_content_length: max_content_length, detector: detector)
  parser.parse
end
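
A usage sketch for `.parse` follows. Because Rika runs only under JRuby, the hypothetical helper `document_summary` takes the parsing module as an argument: under JRuby with the gem loaded one would pass `Rika`, while here a hypothetical `FakeRika` stand-in with the same `parse` keyword is used so the example runs anywhere. `FakeResult` models only the `content` and `metadata` fields that the listings on this page read from a ParseResult.

```ruby
# Hypothetical helper: `rika` must respond to parse(data_source, max_content_length:).
def document_summary(data_source, rika:, preview_length: 200)
  result = rika.parse(data_source, max_content_length: preview_length)
  { preview: result.content, metadata_keys: result.metadata.keys }
end

# Stand-ins for illustration only; they are not Rika's real classes.
FakeResult = Struct.new(:content, :metadata, keyword_init: true)

module FakeRika
  TEXT = 'Hello world'

  # Mimics the keyword shown in the .parse signature; -1 means "all content".
  def self.parse(_data_source, max_content_length: -1)
    content = max_content_length.negative? ? TEXT : TEXT[0, max_content_length]
    FakeResult.new(content: content, metadata: { 'Content-Type' => 'text/plain' })
  end
end

summary = document_summary('example.txt', rika: FakeRika, preview_length: 5)
puts summary[:preview]              # prints Hello
puts summary[:metadata_keys].first  # prints Content-Type
```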

.parse_content(data_source, max_content_length: -1) ⇒ `String`

Deprecated.

Instead, get a ParseResult and access the content field.

Returns content of the resource at the specified location.

Parameters:

data_source (String) —

file path or HTTP URL

Returns:

(String) —

content of the resource at the specified location

# File 'lib/rika.rb', line 86

def self.parse_content(data_source, max_content_length: -1)
  init
  parse(data_source, max_content_length: max_content_length).content
end

.parse_content_and_metadata(data_source, max_content_length: -1) ⇒ `Array<String,Hash>`

Deprecated.

Instead, get a ParseResult and access the content and metadata fields.

Returns content and metadata of file at specified location.

Parameters:

data_source (String) —

file path or HTTP URL

Returns:

(Array<String,Hash>) —

content and metadata of file at specified location

# File 'lib/rika.rb', line 66

def self.parse_content_and_metadata(data_source, max_content_length: -1)
  init
  result = parse(data_source, max_content_length: max_content_length)
  [result.content, result.metadata]
end

.parse_content_and_metadata_as_hash(data_source, max_content_length: -1) ⇒ `Hash`

Deprecated.

Instead, use a ParseResult or its to_h method.

Returns content and metadata of file at specified location.

Parameters:

data_source (String) —

file path or HTTP URL

Returns:

(Hash) —

content and metadata of file at specified location

# File 'lib/rika.rb', line 76

def self.parse_content_and_metadata_as_hash(data_source, max_content_length: -1)
  init
  result = parse(data_source, max_content_length: max_content_length)
  { content: result.content, metadata: result.metadata }
end

.parse_metadata(data_source, max_content_length: -1) ⇒ `Object`

Deprecated.

Instead, get a ParseResult and access the metadata field.

Regarding max_content_length: the default shown in the signature is -1 (all content). Passing 0 instead saves unnecessary processing, since the content is being ignored. However, the PDF metadata “pdf:unmappedUnicodeCharsPerPage” and “pdf:charsPerPage” will be absent if max_content_length is 0, and otherwise may differ depending on the number of characters read.

# File 'lib/rika.rb', line 97

def self.parse_metadata(data_source, max_content_length: -1)
  init
  parse(data_source, max_content_length: max_content_length).metadata
end
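
The four deprecated methods above are thin wrappers over a single parse result. As a migration sketch, the hypothetical helper `deprecated_equivalents` below shows which expression on a result object reproduces each deprecated return value. `StandInResult` is an assumed stand-in with only `content` and `metadata`; the real Rika::ParseResult may expose more.

```ruby
# Hypothetical stand-in for Rika::ParseResult, limited to the two fields used above.
StandInResult = Struct.new(:content, :metadata, keyword_init: true)

# Maps each deprecated method name to the equivalent expression on a result object.
def deprecated_equivalents(result)
  {
    parse_content: result.content,
    parse_metadata: result.metadata,
    parse_content_and_metadata: [result.content, result.metadata],
    parse_content_and_metadata_as_hash: { content: result.content, metadata: result.metadata }
  }
end

result = StandInResult.new(content: 'Some text', metadata: { 'Content-Type' => 'text/plain' })
equivalents = deprecated_equivalents(result)
puts equivalents[:parse_content] # prints Some text
```

Parsing once and reading fields from the result avoids re-parsing the same document for content and metadata separately.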

.raise_unless_jruby ⇒ `Object`

Raise an error if not running under JRuby.

# File 'lib/rika.rb', line 109

def self.raise_unless_jruby
  unless RUBY_PLATFORM.match(/java/)
    raise "\n\n\nRika can only be run with JRuby! It needs access to the Java Virtual Machine.\n\n\n"
  end
end
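
Callers that want to test for JRuby without rescuing an exception could use a predicate built on the same `RUBY_PLATFORM` check. The helper name `jruby?` is hypothetical and not part of Rika; the platform string is a parameter only so the sketch can be exercised on any Ruby.

```ruby
# Hypothetical predicate form of the platform check; returns a boolean instead of raising.
def jruby?(platform = RUBY_PLATFORM)
  platform.match?(/java/)
end

puts jruby?('java')         # prints true ("java" is what JRuby reports as RUBY_PLATFORM)
puts jruby?('x86_64-linux') # prints false
```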

.tika_language_detector ⇒ `Detector`

Returns the Tika language detector (an OptimaizeLangDetector with its models loaded, created once and reused).

Returns:

(Detector) —

Tika language detector

# File 'lib/rika.rb', line 103

def self.tika_language_detector
  init
  @tika_language_detector ||= OptimaizeLangDetector.new.loadModels
end

.tika_version ⇒ `String`

Returns version of loaded Tika jar file.

Returns:

(String) —

version of loaded Tika jar file

# File 'lib/rika.rb', line 50

def self.tika_version
  init
  Tika.java_class.package.implementation_version
end
