Class: Mindee::Input::Source::LocalInputSource

Inherits:
Object
Defined in:
lib/mindee/input/sources/local_input_source.rb

Overview

Base class for loading documents.

Constructor Details

#initialize(io_stream, filename, repair_pdf: false) ⇒ LocalInputSource

Returns a new instance of LocalInputSource.

Parameters:

  • io_stream (StringIO, File)
  • filename (String)
  • repair_pdf (bool) (defaults to: false)

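Raises:

  • (Errors::MindeeMimeTypeError)

    If the detected MIME type is not in ALLOWED_MIME_TYPES, even after an attempted PDF repair.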



# File 'lib/mindee/input/sources/local_input_source.rb', line 36

def initialize(io_stream, filename, repair_pdf: false)
  @io_stream = io_stream
  @filename = filename
  @file_mimetype = if repair_pdf
                     Marcel::MimeType.for @io_stream
                   else
                     Marcel::MimeType.for @io_stream, name: @filename
                   end
  if ALLOWED_MIME_TYPES.include? @file_mimetype
    logger.debug("Loaded new input #{@filename} from #{self.class}")
    return
  end

  if filename.end_with?('.pdf') && repair_pdf
    fix_pdf!

    logger.debug("Loaded new input #{@filename} from #{self.class}")
    return if ALLOWED_MIME_TYPES.include? @file_mimetype
  end

  raise Errors::MindeeMimeTypeError, @file_mimetype.to_s
end

Instance Attribute Details

#file_mimetype ⇒ String (readonly)

Returns:

  • (String)


# File 'lib/mindee/input/sources/local_input_source.rb', line 29

def file_mimetype
  @file_mimetype
end

#filename ⇒ String (readonly)

Returns:

  • (String)


# File 'lib/mindee/input/sources/local_input_source.rb', line 27

def filename
  @filename
end

#io_stream ⇒ StringIO | File (readonly)

Returns:

  • (StringIO | File)


# File 'lib/mindee/input/sources/local_input_source.rb', line 31

def io_stream
  @io_stream
end

Class Method Details

.fix_pdf(stream, maximum_offset: 500) ⇒ StringIO

Attempt to fix the PDF data in the given stream.

Parameters:

  • stream (StringIO)

    The stream to fix.

  • maximum_offset (Integer) (defaults to: 500)

    Maximum offset to look for the PDF header.

Returns:

  • (StringIO)

    The fixed stream.

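Raises:

  • (Errors::MindeePDFError)

    If no PDF header is found in the stream, or if it lies beyond maximum_offset.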



# File 'lib/mindee/input/sources/local_input_source.rb', line 84

def self.fix_pdf(stream, maximum_offset: 500)
  out_stream = StringIO.new
  stream.gets('%PDF-')
  raise Errors::MindeePDFError if stream.eof? || stream.pos > maximum_offset

  stream.pos = stream.pos - 5
  out_stream << stream.read
end
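
The header search above can be tried in isolation. The following is a minimal sketch using only the standard library. The name recover_pdf_stream is a hypothetical stand-in, not part of Mindee, and it raises ArgumentError where the library raises Errors::MindeePDFError. Unlike the listing above, it rewinds the returned stream itself.

```ruby
require 'stringio'

# Hypothetical stand-in for the library method; not part of Mindee.
def recover_pdf_stream(stream, maximum_offset: 500)
  # Read up to and including the first '%PDF-' marker.
  stream.gets('%PDF-')
  # Reaching EOF means the marker was never found (or nothing follows it).
  if stream.eof? || stream.pos > maximum_offset
    raise ArgumentError, 'PDF header not found within the allowed offset'
  end

  # Step back over the 5-byte marker so the output starts at '%PDF-'.
  stream.pos -= 5
  out = StringIO.new
  out << stream.read
  out.rewind
  out
end

broken = StringIO.new("stray bytes\n%PDF-1.7\n%%EOF\n")
fixed = recover_pdf_stream(broken)
puts fixed.read.start_with?('%PDF-') # prints "true"
```

Note that the offset check compares the position after the marker, so a header that starts close to maximum_offset can still be rejected.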

Instance Method Details

#apply_page_options(options) ⇒ Object

Cuts a PDF file according to provided options.

Parameters:

  • options (PageOptions, nil)

    Page cutting/merge options:

    • :page_indexes Zero-based list of page indexes.
    • :operation Operation to apply on the document, given the specified page_indexes:
      • :KEEP_ONLY - keep only the specified pages, and remove all others.
      • :REMOVE - remove the specified pages, and keep all others.
    • :on_min_pages Apply the operation only if document has at least this many pages.


# File 'lib/mindee/input/sources/local_input_source.rb', line 101

def apply_page_options(options)
  @io_stream.seek(0)
  @io_stream = PDF::PDFProcessor.parse(@io_stream, options)
end
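
To make the option semantics concrete, the sketch below models them on plain page indexes. It is an illustration of the documented behaviour only, not the PDF::PDFProcessor implementation: surviving_page_indexes is a hypothetical helper, the default of 0 for on_min_pages is an assumption, and any handling of negative or out-of-range indexes by the library is not modelled.

```ruby
# Hypothetical helper modelling the documented option semantics on page
# indexes only; it does not touch PDF data and is not part of Mindee.
def surviving_page_indexes(total_pages, page_indexes:, operation:, on_min_pages: 0)
  all_pages = (0...total_pages).to_a
  # Too few pages: the operation is skipped and every page is kept.
  return all_pages if total_pages < on_min_pages

  case operation
  when :KEEP_ONLY then all_pages & page_indexes
  when :REMOVE then all_pages - page_indexes
  else raise ArgumentError, "unknown operation: #{operation.inspect}"
  end
end

p surviving_page_indexes(5, page_indexes: [0, 1], operation: :KEEP_ONLY) # prints [0, 1]
p surviving_page_indexes(5, page_indexes: [0, 1], operation: :REMOVE)    # prints [2, 3, 4]
# Only 2 pages but on_min_pages is 3, so nothing is removed.
p surviving_page_indexes(2, page_indexes: [0], operation: :REMOVE, on_min_pages: 3) # prints [0, 1]
```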

#compress!(quality: 85, max_width: nil, max_height: nil, force_source_text: false, disable_source_text: true) ⇒ Object

Compresses the file, according to the provided info.

Parameters:

  • quality (Integer) (defaults to: 85)

    Quality of the output file.

  • max_width (Integer, nil) (defaults to: nil)

    Maximum width (Ignored for PDFs).

  • max_height (Integer, nil) (defaults to: nil)

    Maximum height (Ignored for PDFs).

  • force_source_text (bool) (defaults to: false)

    Whether to force the operation on PDFs with source text. This will attempt to re-render PDF text over the rasterized original. If disabled, the operation is ignored for such PDFs. WARNING: this operation is strongly discouraged.

  • disable_source_text (bool) (defaults to: true)

    If the PDF has source text, whether to re-apply it to the original or not. Needs force_source_text to work.



# File 'lib/mindee/input/sources/local_input_source.rb', line 169

def compress!(quality: 85, max_width: nil, max_height: nil, force_source_text: false, disable_source_text: true)
  buffer = if pdf?
             Mindee::PDF::PDFCompressor.compress_pdf(
               @io_stream,
               quality: quality,
               force_source_text_compression: force_source_text,
               disable_source_text: disable_source_text
             )
           else
             Mindee::Image::ImageCompressor.compress_image(
               @io_stream,
               quality: quality,
               max_width: max_width,
               max_height: max_height
             )
           end
  @io_stream = buffer
  @io_stream.rewind
end

#count_pages ⇒ Integer

Deprecated.

Use #page_count instead.

Returns the page count for a document. Defaults to one for images.

Returns:

  • (Integer)


# File 'lib/mindee/input/sources/local_input_source.rb', line 156

def count_pages
  page_count
end

#fix_pdf!(maximum_offset: 500) ⇒ void

This method returns an undefined value.

Attempts to fix the PDF data in the file.

Parameters:

  • maximum_offset (Integer) (defaults to: 500)

    Maximum offset to look for the PDF header.

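Raises:

  • (Errors::MindeePDFError)

    If no PDF header is found in the file, or if it lies beyond maximum_offset.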



# File 'lib/mindee/input/sources/local_input_source.rb', line 73

def fix_pdf!(maximum_offset: 500)
  @io_stream = LocalInputSource.fix_pdf(@io_stream, maximum_offset: maximum_offset)
  @io_stream.rewind
  @file_mimetype = Marcel::MimeType.for @io_stream
end

#page_count ⇒ Integer

Returns the page count for a document. Defaults to one for images.

Returns:

  • (Integer)


# File 'lib/mindee/input/sources/local_input_source.rb', line 144

def page_count
  return 1 unless pdf?

  @io_stream.seek(0)
  pdf_processor = Mindee::PDF::PDFProcessor.open_pdf(@io_stream)
  pdf_processor.pages.size
end

#pdf? ⇒ Boolean

Shorthand for PDF mimetype validation.

Returns:

  • (Boolean)


# File 'lib/mindee/input/sources/local_input_source.rb', line 65

def pdf?
  @file_mimetype.to_s == 'application/pdf'
end

#process_pdf(options) ⇒ Object

Deprecated.

Use #apply_page_options instead.



# File 'lib/mindee/input/sources/local_input_source.rb', line 108

def process_pdf(options)
  apply_page_options(options)
end

#read_contents(close: true) ⇒ Array<>

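Reads the document's contents and returns them together with a metadata hash holding the file name. The stream is closed afterwards unless close is false.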

Parameters:

  • close (bool) (defaults to: true)

Returns:

  • (Array<>)


# File 'lib/mindee/input/sources/local_input_source.rb', line 115

def read_contents(close: true)
  logger.debug("Reading data from: #{@filename}")
  @io_stream.seek(0)
  # Avoids needlessly re-packing some files
  data = @io_stream.read
  @io_stream.rewind
  @io_stream.close if close
  [data, { filename: Mindee::Input::Source.convert_to_unicode_escape(@filename) }]
end
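
The read, rewind and optional close sequence can be sketched with a plain StringIO. This is a minimal illustration, assuming a hypothetical read_stream_contents helper that is not part of Mindee; it passes the file name through unchanged, whereas the library escapes it with convert_to_unicode_escape.

```ruby
require 'stringio'

# Hypothetical stand-in; the filename is passed through unchanged here,
# whereas the library escapes it via convert_to_unicode_escape.
def read_stream_contents(stream, filename, close: true)
  stream.seek(0)
  data = stream.read
  # Leave the stream at the start so it can be read again if kept open.
  stream.rewind
  stream.close if close
  [data, { filename: filename }]
end

stream = StringIO.new('hello')
data, metadata = read_stream_contents(stream, 'hello.txt', close: false)
puts data                # prints "hello"
puts metadata[:filename] # prints "hello.txt"
puts stream.closed?      # prints "false"
```

With the default close: true the stream is closed on return, so a later read on the same stream raises IOError. Pass close: false when the stream is needed again.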

#rescue_broken_pdf(_) ⇒ Object

Deprecated.

See #fix_pdf! or Mindee::Input::Source::LocalInputSource.fix_pdf instead.



# File 'lib/mindee/input/sources/local_input_source.rb', line 60

def rescue_broken_pdf(_)
  fix_pdf!
end

#source_text? ⇒ bool

Checks whether the file has source text if it is a PDF. Returns false otherwise.

Returns:

  • (bool)

    true if the file is a PDF and has source text.



# File 'lib/mindee/input/sources/local_input_source.rb', line 191

def source_text?
  Mindee::PDF::PDFTools.source_text?(@io_stream)
end

#write_to_file(path) ⇒ Object

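Writes the file to a given path. If the path is an existing directory or ends with '/', the file is written there under its initial file name. Missing parent directories are created.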

Parameters:

  • path (String)

    Path to write the file to.



# File 'lib/mindee/input/sources/local_input_source.rb', line 127

def write_to_file(path)
  t_path = if File.directory?(path || '') || path.to_s.end_with?('/')
             File.join(path || '', @filename)
           else
             path
           end
  full_path = File.expand_path(t_path || '')
  FileUtils.mkdir_p(File.dirname(full_path))
  @io_stream.rewind
  File.binwrite(full_path, @io_stream.read || '')
  logger.debug("Wrote file successfully to #{full_path}")
  @io_stream.rewind
end
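
The path resolution can be exercised without the library. The sketch below mirrors the logic of the listing above using only the standard library; write_stream_to is a hypothetical helper, not part of Mindee, and it returns the resolved path for convenience, which the library method does not document.

```ruby
require 'fileutils'
require 'stringio'
require 'tmpdir'

# Hypothetical stand-in mirroring the path handling shown above.
def write_stream_to(stream, path, filename)
  # A directory, or a path ending in '/', keeps the original file name.
  target = if File.directory?(path.to_s) || path.to_s.end_with?('/')
             File.join(path.to_s, filename)
           else
             path.to_s
           end
  full_path = File.expand_path(target)
  FileUtils.mkdir_p(File.dirname(full_path))
  stream.rewind
  File.binwrite(full_path, stream.read || '')
  stream.rewind
  full_path
end

Dir.mktmpdir do |dir|
  stream = StringIO.new('%PDF-1.7 sample')
  in_dir = write_stream_to(stream, File.join(dir, 'exports/'), 'invoice.pdf')
  renamed = write_stream_to(stream, File.join(dir, 'copy.pdf'), 'invoice.pdf')
  puts File.basename(in_dir)  # prints "invoice.pdf"
  puts File.basename(renamed) # prints "copy.pdf"
end
```

Because the stream is rewound before and after writing, the same source can be written to several locations in a row.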