Module: Rika
- Defined in:
  - lib/rika.rb
  - lib/rika/parser.rb
  - lib/rika/version.rb
  - lib/rika/formatters.rb
  - lib/rika/tika_loader.rb
  - lib/rika/parse_result.rb
Overview
Requires the Tika jar file, either from the default location (packaged with this gem) or from an override specified in the TIKA_JAR_FILESPEC environment variable.
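The override described above amounts to an environment lookup with a fallback. The following is a minimal sketch of that idea, not the gem's loader code: the default path, the `tika_jar_filespec` helper, and the choice to treat an empty variable as unset are all assumptions made for illustration.

```ruby
# Hypothetical default; the real packaged jar location is decided by the gem.
DEFAULT_TIKA_JAR = File.join('java-lib', 'tika-app.jar')

# Returns the jar path to load: the TIKA_JAR_FILESPEC override when it is set
# and non-empty (assumption: empty counts as unset), otherwise the default.
def tika_jar_filespec(env = ENV)
  override = env['TIKA_JAR_FILESPEC']
  override.nil? || override.empty? ? DEFAULT_TIKA_JAR : override
end

puts tika_jar_filespec
```

Passing the environment as a parameter keeps the lookup easy to exercise without modifying the real `ENV`.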
Defined Under Namespace
Classes: Formatters, ParseResult, Parser, TikaLoadError, TikaLoader
Constant Summary
- PROJECT_URL = 'https://github.com/keithrbennett/rika'
- VERSION = '2.0.0'
Class Method Summary
- .init ⇒ Module
  Loads the Tika jar file and imports the needed Java classes.
- .language(text) ⇒ String
  Language of the passed text, as a 2-character ISO 639-1 code.
- .parse(data_source, key_sort: true, max_content_length: -1, detector: DefaultDetector.new) ⇒ ParseResult
  Gets a ParseResult from parsing a document.
- .parse_content(data_source, max_content_length: -1) ⇒ String
  Deprecated. Instead, get a ParseResult and access the content field.
- .parse_content_and_metadata(data_source, max_content_length: -1) ⇒ Array<String,Hash>
  Deprecated. Instead, get a ParseResult and access the content and metadata fields.
- .parse_content_and_metadata_as_hash(data_source, max_content_length: -1) ⇒ Hash
  Deprecated. Instead, use a ParseResult or its to_h method.
- .parse_metadata(data_source, max_content_length: -1) ⇒ Object
  Deprecated. Instead, get a ParseResult and access the metadata field.
- .raise_unless_jruby ⇒ Object
  Raises an error if not running under JRuby.
- .tika_language_detector ⇒ Detector
  Tika language detector.
- .tika_version ⇒ String
  Version of the loaded Tika jar file.
Class Method Details
.init ⇒ Module
Loads the Tika jar file and imports the needed Java classes.
    # File 'lib/rika.rb', line 16
    def self.init
      return if @initialized

      Rika.raise_unless_jruby
      Rika::TikaLoader.require_tika
      import java.io.FileInputStream
      import java.net.URL
      import org.apache.tika.Tika
      import org.apache.tika.detect.DefaultDetector
      import org.apache.tika.io.TikaInputStream
      import org.apache.tika.langdetect.optimaize.OptimaizeLangDetector
      import org.apache.tika.language.detect.LanguageDetector
      import org.apache.tika.language.detect.LanguageResult
      import org.apache.tika.metadata.Metadata
      @initialized = true
      self
    end
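The listing above guards its setup with an `@initialized` flag, so the expensive work runs only once. Going by that listing, only the first call returns the module; later calls hit the bare `return` and yield nil. The sketch below reproduces the pattern in a hypothetical stand-in module (`LoaderSketch` and its `load_count` counter are not part of Rika) so it can run without JRuby or Tika.

```ruby
# Hypothetical stand-in showing the run-once guard used by init.
module LoaderSketch
  @initialized = false
  @load_count = 0

  class << self
    attr_reader :load_count

    def init
      return if @initialized # later calls are no-ops and return nil

      @load_count += 1       # placeholder for loading the jar and importing classes
      @initialized = true
      self                   # the first call returns the module itself
    end
  end
end

first = LoaderSketch.init
second = LoaderSketch.init
puts first.inspect
puts second.inspect
puts LoaderSketch.load_count
```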
.language(text) ⇒ String
Returns the language of the passed text, as a 2-character ISO 639-1 code.

    # File 'lib/rika.rb', line 57
    def self.language(text)
      init
      tika_language_detector.detect(text.to_java_string).get_language
    end
.parse(data_source, key_sort: true, max_content_length: -1, detector: DefaultDetector.new) ⇒ ParseResult
Gets a ParseResult from parsing a document.

    # File 'lib/rika.rb', line 43
    def self.parse(data_source, key_sort: true, max_content_length: -1, detector: DefaultDetector.new)
      init
      parser = Parser.new(data_source, key_sort: key_sort, max_content_length: max_content_length,
                                       detector: detector)
      parser.parse
    end
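A usage sketch for the method above, assuming JRuby with the rika gem installed. The `parse_report` helper, the 1000-character limit, and the summary string are illustrative choices, not part of the library. The JRuby check lets the file load under other Ruby implementations without raising.

```ruby
# Hypothetical helper: summarizes a parsed document, or explains why it was skipped.
def parse_report(data_source)
  return 'skipped: JRuby required' unless RUBY_PLATFORM.match?(/java/)

  require 'rika' # assumes the gem is installed
  result = Rika.parse(data_source, max_content_length: 1000)
  "#{result.content.length} characters, #{result.metadata.size} metadata keys"
end

# Pass a document path as the first command-line argument to try it.
puts parse_report(ARGV.first) if ARGV.first
```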
.parse_content(data_source, max_content_length: -1) ⇒ String
Deprecated. Instead, get a ParseResult and access the content field.
Returns the content of the resource at the specified location.

    # File 'lib/rika.rb', line 86
    def self.parse_content(data_source, max_content_length: -1)
      init
      parse(data_source, max_content_length: max_content_length).content
    end
.parse_content_and_metadata(data_source, max_content_length: -1) ⇒ Array<String,Hash>
Deprecated. Instead, get a ParseResult and access the content and metadata fields.
Returns the content and metadata of the file at the specified location.

    # File 'lib/rika.rb', line 66
    def self.parse_content_and_metadata(data_source, max_content_length: -1)
      init
      result = parse(data_source, max_content_length: max_content_length)
      [result.content, result.metadata]
    end
.parse_content_and_metadata_as_hash(data_source, max_content_length: -1) ⇒ Hash
Deprecated. Instead, use a ParseResult or its to_h method.
Returns the content and metadata of the file at the specified location, as a hash.

    # File 'lib/rika.rb', line 76
    def self.parse_content_and_metadata_as_hash(data_source, max_content_length: -1)
      init
      result = parse(data_source, max_content_length: max_content_length)
      { content: result.content, metadata: result.metadata }
    end
.parse_metadata(data_source, max_content_length: -1) ⇒ Object
Deprecated. Instead, get a ParseResult and access the metadata field.
Regarding max_content_length: a value of 0 saves unnecessary processing, since the content is ignored here. Note, however, that the default in the signature above is -1, not 0. With max_content_length set to 0, the PDF metadata “pdf:unmappedUnicodeCharsPerPage” and “pdf:charsPerPage” will be absent; with other values they may differ depending on the number of characters read.

    # File 'lib/rika.rb', line 97
    def self.parse_metadata(data_source, max_content_length: -1)
      init
      parse(data_source, max_content_length: max_content_length).metadata
    end
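The four deprecated methods above all reduce to reading fields from a ParseResult. The sketch below shows that mapping with a hypothetical stand-in (`ParseResultSketch`, a plain Struct) so it runs without JRuby. It assumes only the content and metadata fields and the to_h method named in the deprecation notes; the real class is defined in lib/rika/parse_result.rb and may offer more.

```ruby
# Hypothetical stand-in for Rika::ParseResult with only the two documented fields.
ParseResultSketch = Struct.new(:content, :metadata, keyword_init: true)

# With Rika this would be: result = Rika.parse(data_source)
result = ParseResultSketch.new(content: 'Hello', metadata: { 'Content-Type' => 'text/plain' })

content = result.content                                  # replaces parse_content
metadata = result.metadata                                # replaces parse_metadata
content_and_metadata = [result.content, result.metadata]  # replaces parse_content_and_metadata
as_hash = result.to_h                                     # replaces parse_content_and_metadata_as_hash

puts content
puts metadata.inspect
puts content_and_metadata.inspect
puts as_hash.inspect
```

Parsing once and reading both fields from the same result avoids parsing the document again for each value.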
.raise_unless_jruby ⇒ Object
Raise an error if not running under JRuby.
    # File 'lib/rika.rb', line 109
    def self.raise_unless_jruby
      unless RUBY_PLATFORM.match(/java/)
        raise "\n\n\nRika can only be run with JRuby! It needs access to the Java Virtual Machine.\n\n\n"
      end
    end
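Calling code that may run under several Ruby implementations can make the same platform check before touching Rika, rather than rescuing the error afterwards. This is a minimal sketch; the `jruby?` helper is hypothetical and simply mirrors the `RUBY_PLATFORM` test in the listing above.

```ruby
# Hypothetical helper mirroring the check in raise_unless_jruby.
# The platform is a parameter so the check can be exercised on any Ruby.
def jruby?(platform = RUBY_PLATFORM)
  platform.match?(/java/)
end

if jruby?
  puts 'JRuby detected: Rika can be loaded.'
else
  puts 'Not running under JRuby: skipping Rika.'
end
```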
.tika_language_detector ⇒ Detector
Returns the Tika language detector: an OptimaizeLangDetector with its models loaded, created on first use and then reused.

    # File 'lib/rika.rb', line 103
    def self.tika_language_detector
      init
      @tika_language_detector ||= OptimaizeLangDetector.new.loadModels
    end
.tika_version ⇒ String
Returns the version of the loaded Tika jar file.

    # File 'lib/rika.rb', line 50
    def self.tika_version
      init
      Tika.java_class.package.implementation_version
    end