Module: Rika

Defined in:
lib/rika.rb,
lib/rika/parser.rb,
lib/rika/version.rb,
lib/rika/formatters.rb,
lib/rika/tika_loader.rb,
lib/rika/parse_result.rb

Overview

Requires the Tika jar file, either from the default location (packaged with this gem) or from an override specified in the TIKA_JAR_FILESPEC environment variable.
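The lookup order described above can be sketched in plain Ruby. This is an illustration only, not the code in Rika's TikaLoader: the helper name `resolve_jar_filespec` and the example paths are hypothetical; only the TIKA_JAR_FILESPEC variable name comes from the text.

```ruby
# Hypothetical helper: choose the jar path from an environment override,
# falling back to a packaged default. Not Rika's actual implementation.
def resolve_jar_filespec(default_filespec, env = ENV)
  override = env['TIKA_JAR_FILESPEC']
  # Treat an unset or blank variable as "no override".
  return default_filespec if override.nil? || override.strip.empty?

  override
end

# Usage with plain hashes standing in for ENV (paths are made up):
puts resolve_jar_filespec('/gem/default/tika.jar', {})
puts resolve_jar_filespec('/gem/default/tika.jar',
                          { 'TIKA_JAR_FILESPEC' => '/opt/tika/custom.jar' })
```

Passing the environment as an argument keeps the helper easy to exercise without modifying the real process environment.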

Defined Under Namespace

Classes: Formatters, ParseResult, Parser, TikaLoadError, TikaLoader

Constant Summary

PROJECT_URL =
'https://github.com/keithrbennett/rika'
VERSION =
'2.0.0'

Class Method Details

.init ⇒ Module

Loads the Tika jar file and imports the needed Java classes.

Returns:

  • (Module)

    the Rika module, for chaining



# File 'lib/rika.rb', line 16

def self.init
  return if @initialized

  Rika.raise_unless_jruby

  Rika::TikaLoader.require_tika
  import java.io.FileInputStream
  import java.net.URL
  import org.apache.tika.Tika
  import org.apache.tika.detect.DefaultDetector
  import org.apache.tika.io.TikaInputStream
  import org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
  import org.apache.tika.language.detect.LanguageDetector
  import org.apache.tika.language.detect.LanguageResult
  import org.apache.tika.metadata.Metadata

  @initialized = true
  self
end
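
The guard at the top of init makes repeated calls cheap. Note that a bare `return` yields nil, so with the code as shown only the first call returns the module. The sketch below shows the same one-time-setup pattern returning self on every call; `LazySetup` and its counter are hypothetical stand-ins for the Tika loading and are not part of Rika.

```ruby
# Hypothetical stand-in for a module with one-time setup; not Rika itself.
module LazySetup
  @initialized = false
  @setup_runs = 0

  class << self
    attr_reader :setup_runs

    def init
      # Return self (not nil) so callers can chain on every call.
      return self if @initialized

      @setup_runs += 1 # placeholder for the expensive one-time work
      @initialized = true
      self
    end
  end
end

# Chaining works even though the second call takes the early-return path.
LazySetup.init.init
puts LazySetup.setup_runs
```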

.language(text) ⇒ String

Returns language of passed text, as 2-character ISO 639-1 code.

Parameters:

  • text (String)

    text to detect language of

Returns:

  • (String)

    language of passed text, as 2-character ISO 639-1 code



# File 'lib/rika.rb', line 57

def self.language(text)
  init
  tika_language_detector.detect(text.to_java_string).get_language
end
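
Because the return value is a bare ISO 639-1 code, callers often map it to a display name. A minimal sketch, assuming a small hand-picked table; `LANGUAGE_NAMES` and `language_name` are hypothetical and not provided by Rika.

```ruby
# Small, hand-picked subset of ISO 639-1 codes; extend as needed.
LANGUAGE_NAMES = {
  'en' => 'English',
  'de' => 'German',
  'fr' => 'French',
  'es' => 'Spanish'
}.freeze

# Hypothetical helper: falls back to the raw code when it is not in the table.
def language_name(code)
  LANGUAGE_NAMES.fetch(code, code)
end

puts language_name('de')
puts language_name('xx')
```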

.parse(data_source, key_sort: true, max_content_length: -1, detector: DefaultDetector.new) ⇒ ParseResult

Gets a ParseResult from parsing a document.

Parameters:

  • data_source (String)

    file path or HTTP(s) URL

  • key_sort (Boolean) (defaults to: true)

    whether to sort the keys in the metadata hash, defaults to true

  • max_content_length (Integer) (defaults to: -1)

    maximum content length to return, defaults to all

  • detector (Detector) (defaults to: DefaultDetector.new)

    Tika detector, defaults to DefaultDetector

Returns:

  • (ParseResult)

    the result of parsing the document



# File 'lib/rika.rb', line 43

def self.parse(data_source, key_sort: true, max_content_length: -1, detector: DefaultDetector.new)
  init
  parser = Parser.new(data_source, key_sort: key_sort, max_content_length: max_content_length, detector: detector)
  parser.parse
end

.parse_content(data_source, max_content_length: -1) ⇒ String

Deprecated.

Instead, get a ParseResult and access the content field.

Returns the text content of the resource at the specified location.

Parameters:

  • data_source (String)

    file path or HTTP URL

Returns:

  • (String)

    content of the resource at the specified location



# File 'lib/rika.rb', line 86

def self.parse_content(data_source, max_content_length: -1)
  init
  parse(data_source, max_content_length: max_content_length).content
end

.parse_content_and_metadata(data_source, max_content_length: -1) ⇒ Array<String,Hash>

Deprecated.

Instead, get a ParseResult and access the content and metadata fields.

Returns content and metadata of file at specified location.

Parameters:

  • data_source (String)

    file path or HTTP URL

Returns:

  • (Array<String,Hash>)

    content and metadata of file at specified location



# File 'lib/rika.rb', line 66

def self.parse_content_and_metadata(data_source, max_content_length: -1)
  init
  result = parse(data_source, max_content_length: max_content_length)
  [result.content, result.metadata]
end

.parse_content_and_metadata_as_hash(data_source, max_content_length: -1) ⇒ Hash

Deprecated.

Instead, use a ParseResult or its to_h method.

Returns content and metadata of file at specified location.

Parameters:

  • data_source (String)

    file path or HTTP URL

Returns:

  • (Hash)

    content and metadata of file at specified location



# File 'lib/rika.rb', line 76

def self.parse_content_and_metadata_as_hash(data_source, max_content_length: -1)
  init
  result = parse(data_source, max_content_length: max_content_length)
  { content: result.content, metadata: result.metadata }
end
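
Since the deprecated helpers above are thin wrappers around parse, migrating usually means calling parse once and reading the fields you need. The sketch below shows the equivalent shapes. `FakeResult` is a hypothetical stand-in for ParseResult so the snippet runs without JRuby; it is assumed only to respond to content and metadata, the two fields named in the deprecation notes.

```ruby
# Hypothetical stand-in for Rika::ParseResult. With Rika loaded under JRuby,
# `result = Rika.parse(path)` would take its place.
FakeResult = Struct.new(:content, :metadata)
result = FakeResult.new('Hello', { 'Content-Type' => 'text/plain' })

# Replaces parse_content:
content = result.content
# Replaces parse_content_and_metadata:
pair = [result.content, result.metadata]
# Replaces parse_content_and_metadata_as_hash:
as_hash = { content: result.content, metadata: result.metadata }

puts content
puts pair.inspect
puts as_hash.inspect
```

Parsing once and reusing the result also avoids re-reading the document when both content and metadata are needed.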

.parse_metadata(data_source, max_content_length: -1) ⇒ Object

Deprecated.

Instead, get a ParseResult and access the metadata field.

Regarding max_content_length: passing 0 saves unnecessary processing, since the content is being ignored; note that the signature above defaults to -1 (all content). However, the PDF metadata “pdf:unmappedUnicodeCharsPerPage” and “pdf:charsPerPage” will be absent if max_content_length is 0, and otherwise may differ depending on the number of characters read.



# File 'lib/rika.rb', line 97

def self.parse_metadata(data_source, max_content_length: -1)
  init
  parse(data_source, max_content_length: max_content_length).metadata
end

.raise_unless_jruby ⇒ Object

Raise an error if not running under JRuby.



# File 'lib/rika.rb', line 109

def self.raise_unless_jruby
  unless RUBY_PLATFORM.match(/java/)
    raise "\n\n\nRika can only be run with JRuby! It needs access to the Java Virtual Machine.\n\n\n"
  end
end
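
The same platform check can be written as a small predicate whose input is injectable, which makes it testable on any Ruby. This is a sketch, not Rika's code: `java_platform?`, `ensure_java_platform!`, and `UnsupportedPlatformError` are hypothetical names.

```ruby
# Hypothetical error class, so callers can rescue this case specifically.
class UnsupportedPlatformError < StandardError; end

# True when the platform string contains "java", as RUBY_PLATFORM does on JRuby.
def java_platform?(platform = RUBY_PLATFORM)
  platform.match?(/java/)
end

# Raises unless the given platform string looks like a JVM platform.
def ensure_java_platform!(platform = RUBY_PLATFORM)
  return if java_platform?(platform)

  raise UnsupportedPlatformError, "JRuby is required (platform: #{platform})"
end

puts java_platform?('java')
puts java_platform?('x86_64-linux')
```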

.tika_language_detector ⇒ Detector

Returns Tika detector.

Returns:

  • (Detector)

    Tika detector



# File 'lib/rika.rb', line 103

def self.tika_language_detector
  init
  @tika_language_detector ||= OptimaizeLangDetector.new.loadModels
end
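
The `||=` in this method memoizes the detector, so the costly model loading happens only on first use. A minimal sketch of that pattern, with a hypothetical `DetectorCache` module and a counter standing in for the real detector construction:

```ruby
# Hypothetical holder showing `||=` memoization; the counter replaces the
# costly model loading done when a real detector is built.
module DetectorCache
  @build_count = 0

  class << self
    attr_reader :build_count

    # Builds the object on first call, then returns the cached instance.
    def detector
      @detector ||= build_detector
    end

    private

    def build_detector
      @build_count += 1
      Object.new # placeholder for an expensive-to-create object
    end
  end
end

first = DetectorCache.detector
second = DetectorCache.detector
puts first.equal?(second)
puts DetectorCache.build_count
```

One caveat of `||=` is that it rebuilds whenever the cached value is nil or false, so it suits cases like this one where the built object is always truthy.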

.tika_version ⇒ String

Returns version of loaded Tika jar file.

Returns:

  • (String)

    version of loaded Tika jar file



# File 'lib/rika.rb', line 50

def self.tika_version
  init
  Tika.java_class.package.implementation_version
end