Class: Henkei
- Inherits:
Object
- Object
- Henkei
- Defined in:
- lib/henkei.rb,
lib/henkei/version.rb,
lib/henkei/configuration.rb
Overview
Henkei monkey patch for configuration support
Defined Under Namespace
Classes: Configuration
Constant Summary
- GEM_PATH =
File.dirname(File.dirname(__FILE__))
- JAR_PATH =
File.join(Henkei::GEM_PATH, 'jar', 'tika-app-2.9.1.jar')
- CONFIG_PATH =
File.join(Henkei::GEM_PATH, 'jar', 'tika-config.xml')
- CONFIG_WITHOUT_OCR_PATH =
File.join(Henkei::GEM_PATH, 'jar', 'tika-config-without-ocr.xml')
- VERSION =
'2.9.1.1'
Class Method Summary
- .configuration ⇒ Object
- .configure {|configuration| ... } ⇒ Object
- .mimetype(content_type) ⇒ Object
- .read(type, data, include_ocr: false) ⇒ Object
Read text or metadata from a data buffer.
Instance Method Summary
- #creation_date ⇒ Object
Returns the creation date of the Henkei document, parsed from its dcterms:created metadata entry, or nil when that entry is absent.
- #data ⇒ Object
Returns the raw/unparsed content of the Henkei document.
- #html(include_ocr: false) ⇒ Object
Returns the text content of the Henkei document in HTML.
- #initialize(input) ⇒ Henkei
constructor
Create a new instance of Henkei with a given document.
- #metadata ⇒ Object
Returns the metadata hash of the Henkei document.
- #mimetype ⇒ Object
Returns the mimetype object of the Henkei document.
- #path? ⇒ Boolean
Returns true if the Henkei document was specified using a file path.
- #stream? ⇒ Boolean
Returns true if the Henkei document was specified from a stream or an object which responds to read.
- #text(include_ocr: false) ⇒ Object
Returns the text content of the Henkei document.
- #uri? ⇒ Boolean
Returns true if the Henkei document was specified using a URI.
Constructor Details
#initialize(input) ⇒ Henkei
# File 'lib/henkei.rb', line 74
def initialize(input)
  if input.is_a? String
    if File.exist? input
      @path = input
    elsif input =~ URI::DEFAULT_PARSER.make_regexp
      @uri = URI.parse input
    else
      raise Errno::ENOENT, "missing file or invalid URI - #{input}"
    end
  elsif input.respond_to? :read
    @stream = input
  else
    raise TypeError, "can't read from #{input.class.name}"
  end
end
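The constructor's input dispatch can be tried in isolation. The sketch below mirrors the same checks in a standalone helper; `classify_input` is a hypothetical name that is not part of Henkei, and it returns a symbol instead of setting an instance variable.

```ruby
require 'uri'
require 'stringio'
require 'tempfile'

# Hypothetical helper (not part of Henkei): reports which branch of
# Henkei#initialize a given input would take.
def classify_input(input)
  if input.is_a?(String)
    # An existing file wins over a URI match, as in the constructor.
    return :path if File.exist?(input)
    return :uri if input =~ URI::DEFAULT_PARSER.make_regexp

    raise Errno::ENOENT, "missing file or invalid URI - #{input}"
  elsif input.respond_to?(:read)
    :stream
  else
    raise TypeError, "can't read from #{input.class.name}"
  end
end

# A temporary file stands in for a real document on disk.
Tempfile.create('sample') do |file|
  puts classify_input(file.path)
end
puts classify_input('https://example.com/sample.pdf')
puts classify_input(StringIO.new('raw bytes'))
```

Because the file check runs first, a string that names an existing file is treated as a path even if it also looks like a URI.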
Class Method Details
.configuration ⇒ Object
# File 'lib/henkei/configuration.rb', line 5
def self.configuration
  @configuration ||= Configuration.new
end
.configure {|configuration| ... } ⇒ Object
# File 'lib/henkei/configuration.rb', line 9
def self.configure
  yield(configuration)
end
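Together, `.configuration` and `.configure` form a memoized-settings pattern: one shared `Configuration` object, mutated inside a block. The sketch below reproduces that pattern on a stand-in module so it runs without the gem. `DemoLib` is hypothetical, and its `Configuration` exposes only `mime_library`, the one setting referenced elsewhere on this page; no default value is assumed.

```ruby
# Hypothetical stand-in module illustrating the pattern; not Henkei itself.
module DemoLib
  class Configuration
    # The only setting referenced in this reference page.
    attr_accessor :mime_library
  end

  # Lazily builds the configuration once and reuses it afterwards.
  def self.configuration
    @configuration ||= Configuration.new
  end

  # Yields the shared configuration object so callers can mutate it.
  def self.configure
    yield(configuration)
  end
end

DemoLib.configure do |config|
  config.mime_library = 'mini_mime'
end

puts DemoLib.configuration.mime_library
```

With the real gem, the equivalent call would presumably be `Henkei.configure { |config| config.mime_library = 'mini_mime' }`, as the deprecation message in `.mimetype` suggests.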
.mimetype(content_type) ⇒ Object
# File 'lib/henkei.rb', line 32
def self.mimetype(content_type)
  if Henkei.configuration.mime_library == 'mime/types' && defined?(MIME::Types)
    warn '[DEPRECATION] `mime/types` is deprecated. Please use `mini_mime` instead. ' \
         'Use Henkei.configure and assign "mini_mime" to `mime_library`.'
    MIME::Types[content_type].first
  else
    MiniMime.lookup_by_content_type(content_type).tap do |object|
      object.define_singleton_method(:extensions) { [extension] }
    end
  end
end
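The `else` branch adds an `extensions` method to the lookup result, apparently so that callers written against the `mime/types` interface keep working. A minimal sketch of that per-object patch follows, using a hypothetical `LookupResult` struct in place of the object MiniMime returns.

```ruby
# Hypothetical stand-in for a MiniMime lookup result; only the singular
# `extension` reader matters for this illustration.
LookupResult = Struct.new(:extension, :content_type)

# Adds a plural `extensions` method to this one object only.
def with_extensions(result)
  result.tap do |object|
    # Inside the block, self is the object, so `extension` is its reader.
    object.define_singleton_method(:extensions) { [extension] }
  end
end

info = with_extensions(LookupResult.new('pdf', 'application/pdf'))
puts info.extensions.inspect
```

Only the patched instance gains the method; other `LookupResult` objects are unaffected.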
.read(type, data, include_ocr: false) ⇒ Object
# File 'lib/henkei.rb', line 50
def self.read(type, data, include_ocr: false)
  result = client_read(type, data, include_ocr: include_ocr)

  case type
  when :text, :html then result
  when :metadata then JSON.parse(result)
  when :mimetype then Henkei.mimetype(JSON.parse(result)['Content-Type'])
  end
end
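The `case` statement decides how the raw client output is interpreted for each `type`. The sketch below isolates that step. `interpret` is a hypothetical helper and the JSON string is made-up sample data, and the `:mimetype` branch stops at the content-type string rather than calling `Henkei.mimetype`.

```ruby
require 'json'

# Hypothetical stand-in for the post-processing step of Henkei.read;
# `raw` plays the role of the string returned by the client.
def interpret(type, raw)
  case type
  when :text, :html then raw                        # passed through unchanged
  when :metadata then JSON.parse(raw)               # parsed into a Hash
  when :mimetype then JSON.parse(raw)['Content-Type']
  end
end

# Made-up sample payload for illustration.
raw_json = '{"Content-Type":"application/pdf","dcterms:created":"2023-01-15T10:30:00Z"}'

p interpret(:text, 'plain text')
p interpret(:metadata, raw_json)
p interpret(:mimetype, raw_json)
p interpret(:unknown, raw_json)
```

Since the `case` has no `else` branch, an unrecognized `type` yields `nil` in this sketch.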
Instance Method Details
#creation_date ⇒ Object
# File 'lib/henkei.rb', line 149
def creation_date
  return @creation_date if defined? @creation_date
  return unless metadata['dcterms:created']

  @creation_date = Time.parse(metadata['dcterms:created'])
end
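To show what `#creation_date` yields, the sketch below runs the same logic against a plain hash. `DocumentStub` is a hypothetical class, and the timestamp is sample data, not output from a real document.

```ruby
require 'time' # provides Time.parse

# Hypothetical stub: holds a metadata hash instead of parsing a document.
class DocumentStub
  attr_reader :metadata

  def initialize(metadata)
    @metadata = metadata
  end

  def creation_date
    return @creation_date if defined? @creation_date
    return unless metadata['dcterms:created']

    @creation_date = Time.parse(metadata['dcterms:created'])
  end
end

dated = DocumentStub.new('dcterms:created' => '2023-01-15T10:30:00Z')
undated = DocumentStub.new({})

p dated.creation_date.class
p undated.creation_date
```

The result is a `Time` object when the entry is present and `nil` otherwise, so callers should be prepared for both.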
#data ⇒ Object
# File 'lib/henkei.rb', line 189
def data
  return @data if defined? @data

  if path?
    @data = File.read @path
  elsif uri?
    @data = Net::HTTP.get @uri
  elsif stream?
    @data = @stream.read
  end

  @data
end
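Caching in `@data` matters most for stream input, because a stream can usually be read only once. A minimal sketch, assuming a `StringIO` source and a hypothetical `StreamBackedStub` class that keeps only the stream branch:

```ruby
require 'stringio'

# Hypothetical stub that keeps only the stream branch of #data.
class StreamBackedStub
  def initialize(stream)
    @stream = stream
  end

  def data
    return @data if defined? @data

    @data = @stream.read
  end
end

stream = StringIO.new('raw bytes')
doc = StreamBackedStub.new(stream)

first = doc.data
second = doc.data           # served from @data, not from the stream
leftover = stream.read      # the stream itself is already exhausted

p first == second
p leftover
```

Without the cached value, the second call would see an exhausted stream and return an empty string.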
#html(include_ocr: false) ⇒ Object
# File 'lib/henkei.rb', line 114
def html(include_ocr: false)
  return @html if defined? @html

  @html = Henkei.read :html, data, include_ocr: include_ocr
end
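One consequence of the early return is worth noting: as the code reads, the first call fixes the result, and a later call with a different `include_ocr` value gets the cached HTML back. The sketch below demonstrates this on a hypothetical `CachedRenderStub` that records each real render instead of calling `Henkei.read`.

```ruby
# Hypothetical stub: builds a string where Henkei would invoke the reader.
class CachedRenderStub
  attr_reader :calls

  def initialize
    @calls = [] # include_ocr values that triggered an actual render
  end

  def html(include_ocr: false)
    return @html if defined? @html

    @calls << include_ocr
    @html = "<p>ocr=#{include_ocr}</p>"
  end
end

stub = CachedRenderStub.new
without_ocr = stub.html
with_ocr = stub.html(include_ocr: true) # cached result, keyword has no effect

p without_ocr == with_ocr
p stub.calls
```

If both variants are needed, creating a separate instance per variant is one way to sidestep the cache.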
#metadata ⇒ Object
# File 'lib/henkei.rb', line 125
def metadata
  return @metadata if defined? @metadata

  @metadata = Henkei.read :metadata, data
end
#mimetype ⇒ Object
# File 'lib/henkei.rb', line 137
def mimetype
  return @mimetype if defined? @mimetype

  content_type = metadata['Content-Type'].is_a?(Array) ? metadata['Content-Type'].first : metadata['Content-Type']
  @mimetype = Henkei.mimetype(content_type)
end
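The ternary handles metadata in which `Content-Type` is either a single string or an array of strings. A small sketch of that normalization, with a hypothetical `primary_content_type` helper and made-up metadata hashes:

```ruby
# Hypothetical helper: returns the first content type whether the
# metadata entry is a single string or an array of strings.
def primary_content_type(metadata)
  value = metadata['Content-Type']
  value.is_a?(Array) ? value.first : value
end

single = { 'Content-Type' => 'application/pdf' }
multiple = { 'Content-Type' => ['application/pdf', 'text/plain'] }

p primary_content_type(single)
p primary_content_type(multiple)
```

Either shape produces one string, which is what `Henkei.mimetype` expects as its argument.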
#path? ⇒ Boolean
# File 'lib/henkei.rb', line 161
def path?
  !!@path
end
#stream? ⇒ Boolean
# File 'lib/henkei.rb', line 180
def stream?
  !!@stream
end