Class: ZipTricks::FileReader

Inherits:
Object
Defined in:
lib/zip_tricks/file_reader.rb

Overview

A very barebones ZIP file reader. It is made for maximum interoperability, while we attempt to keep it somewhat concise.

REALLY CRAZY IMPORTANT STUFF: SECURITY IMPLICATIONS

Please BEWARE: using this is a security risk if you are reading files that have been supplied by users. This implementation has not been formally verified for correctness. As ZIP files contain relative offsets in lots of places, it might be possible for a maliciously crafted ZIP file to put the decode procedure in an endless loop, make it attempt huge reads from the input file, and so on. Additionally, the reader module for deflated data has no support for ZIP bomb protection. So either limit FileReader usage to files you trust, or triple-check all the inputs upfront. Patches to make this reader more secure are welcome, of course.
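Since the reader itself offers no ZIP bomb protection, a caller may want to cap the amount of inflated output on their side. The following is a minimal sketch of that idea using only Ruby's standard Zlib bindings. The names `OutputLimitExceeded` and `inflate_with_limit` are hypothetical and are not part of ZipTricks; the sketch assumes the entry data is raw deflate, which is how deflated ZIP entries are stored.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical error class and helper, neither is part of ZipTricks.
OutputLimitExceeded = Class.new(StandardError)

# Inflates raw deflate data read from `io` in small chunks and aborts
# as soon as the output grows past `max_output_bytes`.
def inflate_with_limit(io, max_output_bytes:, chunk_size: 4096)
  inflater = Zlib::Inflate.new(-Zlib::MAX_WBITS) # negative window bits = raw deflate, no zlib header
  output = String.new # String.new without arguments yields a binary string
  while (chunk = io.read(chunk_size))
    output << inflater.inflate(chunk)
    # A single chunk can still expand considerably, so check after every chunk
    if output.bytesize > max_output_bytes
      raise OutputLimitExceeded, "inflated data exceeds #{max_output_bytes} bytes"
    end
  end
  output << inflater.finish
  output
ensure
  inflater&.close
end
```

Keeping `chunk_size` small bounds how far a single call to `inflate` can overshoot the limit before the check runs. This only addresses output size; it does nothing about endless loops or oversized reads caused by bogus offsets.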

Usage

File.open('zipfile.zip', 'rb') do |f|
  entries = ZipTricks::FileReader.read_zip_structure(io: f)
  entries.each do |e|
    File.open(e.filename, 'wb') do |extracted_file|
      ex = e.extractor_from(f)
      extracted_file << ex.extract(1024 * 1024) until ex.eof?
    end
  end
end

Supported features

  • Deflate and stored storage modes
  • Zip64 (extra fields and offsets)
  • Data descriptors

Unsupported features

  • Archives split over multiple disks/files
  • Any ZIP encryption
  • EFS language flag and InfoZIP filename extra field
  • CRC32 checksums are not verified

Mode of operation

By default, FileReader ignores the data in local file headers (as it is often unreliable). It reads the ZIP file "from the tail", finds the end-of-central-directory signatures, then reads the central directory entries, reconstitutes the entries with their filenames, attributes and so on, and sets these entries up with the absolute offsets into the source file/IO object. These offsets can then be used to extract the actual compressed data of the files and to expand it.
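To illustrate the "from the tail" lookup, here is a minimal sketch that searches the end of an IO for the end-of-central-directory signature. `find_eocd_offset` and the two constants are hypothetical names for this example and are not the library's internal implementation. The sketch assumes the IO responds to `size`, `seek` and `read`, and it does not validate the comment length stored in the record.

```ruby
require 'stringio'

# "PK\x05\x06", the EOCD signature 0x06054b50 in little-endian byte order
EOCD_SIGNATURE_BYTES = [0x06054b50].pack('V')
# The EOCD record is 22 bytes, followed by a comment of at most 0xFFFF bytes
MAX_EOCD_RECORD_SIZE = 22 + 0xFFFF

# Hypothetical helper: returns the absolute offset of the last EOCD
# signature found in the tail of the IO, or nil if there is none.
def find_eocd_offset(io)
  file_size = io.size
  tail_size = [file_size, MAX_EOCD_RECORD_SIZE].min
  tail_start = file_size - tail_size
  io.seek(tail_start)
  tail = io.read(tail_size).to_s
  # Search backwards, since the record sits at the very end of the archive
  index_in_tail = tail.rindex(EOCD_SIGNATURE_BYTES)
  index_in_tail && (tail_start + index_in_tail)
end
```

Only the tail of the file is read, which is what makes this approach cheap for large archives and for remote reads.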

Recovering damaged or incomplete ZIP files

If the ZIP file you are trying to read does not contain the central directory records, read_zip_structure will not work, since it starts the read process from the EOCD marker at the end of the central directory and then crawls "back" in the IO to figure out the rest. You can explicitly apply a fallback for reading the archive "straight ahead" instead using read_zip_straight_ahead: the method will instead scan your IO from the very start, skipping over the actual entry data. This is less efficient than central directory parsing since it involves a much larger number of reads (1 read from the IO per entry in the ZIP).
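The "straight ahead" approach can be sketched without the library. The example below builds a tiny synthetic archive of stored entries and walks its local file headers one by one, skipping over the entry data. `build_stored_entry` and `scan_local_headers` are hypothetical helpers for illustration only; the sketch assumes no data descriptors and no Zip64 sizes, and it leaves the CRC32 at zero since nothing verifies it here.

```ruby
require 'stringio'

LOCAL_HEADER_SIGNATURE = 0x04034b50
LOCAL_HEADER_FIXED_SIZE = 30
LOCAL_HEADER_FORMAT = 'VvvvvvVVVvv'

# Hypothetical demo helper: a local file header followed by stored (uncompressed) data
def build_stored_entry(filename, data)
  header = [LOCAL_HEADER_SIGNATURE, 20, 0, 0, 0, 0, 0,
            data.bytesize, data.bytesize, filename.bytesize, 0].pack(LOCAL_HEADER_FORMAT)
  header + filename.b + data.b
end

# Hypothetical helper: reads local headers in succession until EOF
# or until something that is not a local header is encountered.
def scan_local_headers(io)
  entries = []
  loop do
    header_offset = io.pos
    fixed = io.read(LOCAL_HEADER_FIXED_SIZE)
    break if fixed.nil? || fixed.bytesize < LOCAL_HEADER_FIXED_SIZE

    signature, _version, _flags, _mode, _time, _date, _crc,
      compressed_size, _uncompressed_size, filename_size, extra_size =
      fixed.unpack(LOCAL_HEADER_FORMAT)
    break unless signature == LOCAL_HEADER_SIGNATURE

    filename = io.read(filename_size)
    data_offset = io.pos + extra_size # entry data starts right after the extra fields
    entries << {filename: filename, header_offset: header_offset,
                data_offset: data_offset, compressed_size: compressed_size}
    io.seek(data_offset + compressed_size) # skip over the entry data
  end
  entries
end

archive = build_stored_entry('a.txt', 'hello') + build_stored_entry('b.txt', 'world!')
truncated = archive + build_stored_entry('c.txt', 'lost')[0, 10] # cut off mid-header
recovered = scan_local_headers(StringIO.new(truncated))
```

Each iteration needs one read for the header plus a seek, which is why the cost grows with the number of entries rather than staying constant as with central directory parsing.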

Defined Under Namespace

Classes: ZipEntry

Constant Summary

ReadError =
Class.new(StandardError)
UnsupportedFeature =
Class.new(StandardError)
InvalidStructure =
Class.new(ReadError)
LocalHeaderPending =
Class.new(StandardError) do
  def message
    'The compressed data offset is not available (local header has not been read)'
  end
end
MissingEOCD =
Class.new(StandardError) do
  def message
    'Could not find the EOCD signature in the buffer - maybe a malformed ZIP file'
  end
end

Class Method Details

.read_zip_straight_ahead(**options) ⇒ Array<ZipEntry>

Parse an IO handle to a ZIP archive into an array of ZipEntry objects, reading from the start of the file and parsing local file headers one by one.

Parameters:

  • options (Hash)

    any options the instance method of the same name accepts

Returns:

  • (Array<ZipEntry>)

    an array of entries within the ZIP being parsed

See Also:

  • #read_zip_straight_ahead


# File 'lib/zip_tricks/file_reader.rb', line 391

def self.read_zip_straight_ahead(**options)
  new.read_zip_straight_ahead(**options)
end

.read_zip_structure(**options) ⇒ Array<ZipEntry>

Parse an IO handle to a ZIP archive into an array of ZipEntry objects, reading from the end of the IO object.

Parameters:

  • options (Hash)

    any options the instance method of the same name accepts

Returns:

  • (Array<ZipEntry>)

    an array of entries within the ZIP being parsed

See Also:

  • #read_zip_structure


# File 'lib/zip_tricks/file_reader.rb', line 381

def self.read_zip_structure(**options)
  new.read_zip_structure(**options)
end

Instance Method Details

#get_compressed_data_offset(io:, local_file_header_offset:) ⇒ Fixnum

Get the offset in the IO at which the actual compressed data of the file starts within the ZIP. The method will eager-read the entire local header for the file (up to the maximum size a local header may use), starting at the given offset, and will then compute its size. That size plus the given local header offset is the compressed data offset of the entry (start reading at this offset to get the data).

Parameters:

  • io (#seek, #read)

    an IO-ish object the ZIP file can be read from

  • local_file_header_offset (Fixnum)

    absolute offset (0-based) where the local file header is supposed to begin

Returns:

  • (Fixnum)

    absolute offset (0-based) of where the compressed data begins for this file within the ZIP



# File 'lib/zip_tricks/file_reader.rb', line 369

def get_compressed_data_offset(io:, local_file_header_offset:)
  seek(io, local_file_header_offset)
  entry_recovered_from_local_file_header = read_local_file_header(io: io)
  entry_recovered_from_local_file_header.compressed_data_offset
end

#read_local_file_header(io:) ⇒ ZipEntry

Parse the local header entry and get the offset in the IO at which the actual compressed data of the file starts within the ZIP. The method will eager-read the entire local header for the file (up to the maximum size a local header may use), starting at the current position of the IO, and will then compute its size. That size plus the offset at which the header began is the compressed data offset of the entry (start reading at this offset to get the data).

Parameters:

  • io (#tell, #read)

    an IO-ish object the ZIP file can be read from

Returns:

  • (ZipEntry)

    the parsed local header entry, with its compressed data offset set

Raises:

  • (ReadError)

    if the IO is already at EOF when the header read is attempted



# File 'lib/zip_tricks/file_reader.rb', line 301

def read_local_file_header(io:)
  local_file_header_offset = io.tell

  # Reading in bulk is cheaper - grab the maximum length of the local header,
  # including any headroom for extra fields etc.
  local_file_header_str_plus_headroom = io.read(MAX_LOCAL_HEADER_SIZE)
  raise ReadError if local_file_header_str_plus_headroom.nil? # reached EOF

  io_starting_at_local_header = StringIO.new(local_file_header_str_plus_headroom)

  assert_signature(io_starting_at_local_header, 0x04034b50)
  e = ZipEntry.new
  e.version_needed_to_extract = read_2b(io_starting_at_local_header) # Version needed to extract
  e.gp_flags = read_2b(io_starting_at_local_header) # gp flags
  e.storage_mode = read_2b(io_starting_at_local_header) # storage mode
  e.dos_time = read_2b(io_starting_at_local_header) # dos time
  e.dos_date = read_2b(io_starting_at_local_header) # dos date
  e.crc32 = read_4b(io_starting_at_local_header) # CRC32
  e.compressed_size = read_4b(io_starting_at_local_header) # Comp size
  e.uncompressed_size = read_4b(io_starting_at_local_header) # Uncomp size

  filename_size = read_2b(io_starting_at_local_header)
  extra_size = read_2b(io_starting_at_local_header)
  e.filename = read_n(io_starting_at_local_header, filename_size)
  extra_fields_str = read_n(io_starting_at_local_header, extra_size)

  # Parse out the extra fields
  extra_table = parse_out_extra_fields(extra_fields_str)

  # ...of which we really only need the Zip64 extra
  if zip64_extra_contents = extra_table[1]
    # If the Zip64 extra is present, we let it override all
    # the values fetched from the conventional header
    zip64_extra = StringIO.new(zip64_extra_contents)
    log do
      'Will read Zip64 extra data from local header field for %<filename>s, %<size>d bytes' %
        {filename: e.filename, size: zip64_extra.size}
    end
    # Now here be dragons. The APPNOTE specifies that
    #
    # > The order of the fields in the ZIP64 extended
    # > information record is fixed, but the fields will
    # > only appear if the corresponding Local or Central
    # > directory record field is set to 0xFFFF or 0xFFFFFFFF.
    #
    # It means that before we read this stuff we need to check if the previously-read
    # values are at overflow, and only _then_ proceed to read them. Bah.
    e.uncompressed_size = read_8b(zip64_extra) if e.uncompressed_size == 0xFFFFFFFF
    e.compressed_size = read_8b(zip64_extra) if e.compressed_size == 0xFFFFFFFF
  end

  offset = local_file_header_offset + io_starting_at_local_header.tell
  e.compressed_data_offset = offset

  e
end
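The conditional Zip64 read described in the comments above can be tried in isolation. The sketch below is an assumption-laden illustration and not the library's code: `split_extra_fields`, `read_u64` and `apply_zip64_sizes` are hypothetical helpers. It assumes the extra field blob is a sequence of (2-byte id, 2-byte size, payload) records, with id 1 being the Zip64 extended information record.

```ruby
require 'stringio'

# Hypothetical helper: splits an extra field blob into {header_id => payload}
def split_extra_fields(extra_bytes)
  table = {}
  io = StringIO.new(extra_bytes)
  until io.eof?
    header = io.read(4)
    break if header.nil? || header.bytesize < 4

    id, size = header.unpack('vv')
    table[id] = io.read(size).to_s
  end
  table
end

# Hypothetical helper: reads one little-endian unsigned 64-bit integer
def read_u64(io)
  bytes = io.read(8)
  raise ArgumentError, 'Zip64 extra field is truncated' if bytes.nil? || bytes.bytesize < 8

  bytes.unpack1('Q<')
end

# Replaces only those sizes that are at the 32-bit overflow marker,
# consuming Zip64 fields in their fixed order (uncompressed, then compressed).
def apply_zip64_sizes(uncompressed_size, compressed_size, extra_bytes)
  zip64_payload = split_extra_fields(extra_bytes)[1]
  return [uncompressed_size, compressed_size] unless zip64_payload

  zip64 = StringIO.new(zip64_payload)
  uncompressed_size = read_u64(zip64) if uncompressed_size == 0xFFFFFFFF
  compressed_size = read_u64(zip64) if compressed_size == 0xFFFFFFFF
  [uncompressed_size, compressed_size]
end
```

Because a field is present only when its 32-bit counterpart overflowed, reading both fields unconditionally would misinterpret the payload whenever just one of the sizes exceeds 4 GiB.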

#read_zip_straight_ahead(io:) ⇒ Array<ZipEntry>

Sometimes you might encounter truncated ZIP files which do not contain any central directory whatsoever, or where the central directory is truncated. In that case reading the ZIP "from the end" is impossible, and the only recourse is reading each local file header in succession. If the entries in such a ZIP use data descriptors, you would need to scan after the entry until you encounter the data descriptor signature, and that is unreliable at best. Therefore, this reading technique does not support data descriptors. It can, however, recover the entries you can still read if these entries contain all the necessary information about the contained file.

Parameters:

  • io (#tell, #read, #seek)

    the IO-ish object to read the local file headers from

Returns:

  • (Array<ZipEntry>)

    an array of entries that could be recovered before hitting EOF



# File 'lib/zip_tricks/file_reader.rb', line 263

def read_zip_straight_ahead(io:)
  entries = []
  # Declared outside of the loop block so that the rescue clause below can see it
  cur_offset = io.tell
  loop do
    cur_offset = io.tell
    entry = read_local_file_header(io: io)
    if entry.uses_data_descriptor?
      raise UnsupportedFeature, "The local file header at #{cur_offset} uses " \
                                'a data descriptor and the start of next entry ' \
                                'cannot be found'
    end
    entries << entry
    next_local_header_offset = entry.compressed_data_offset + entry.compressed_size
    log do
      'Recovered a local file header at offset %<cur_offset>d, seeking to the next at %<header_offset>d' %
        {cur_offset: cur_offset, header_offset: next_local_header_offset}
    end
    seek(io, next_local_header_offset) # Seek to the next entry, and raise if seek is impossible
  end
  entries
rescue ReadError
  log do
    'Got a read/seek error after reaching %<cur_offset>d, no more entries can be recovered' %
      {cur_offset: cur_offset}
  end
  entries
end
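The implementation of `uses_data_descriptor?` is not shown on this page. In the ZIP format, bit 3 of the general purpose flags signals that a data descriptor follows the entry data, and in that case the CRC32 and size fields of the local header are written as zero. The sketch below shows such a bit test; `data_descriptor_flag_set?` is a hypothetical name, and it is an assumption that the library performs an equivalent check.

```ruby
# Bit 3 of the general purpose bit flag: sizes and CRC32 live in a trailing data descriptor
DATA_DESCRIPTOR_FLAG = 0x0008

# Hypothetical helper: true when the gp flags announce a data descriptor
def data_descriptor_flag_set?(gp_flags)
  (gp_flags & DATA_DESCRIPTOR_FLAG) != 0
end
```

With the sizes zeroed in the local header, the position of the next local header cannot be computed from the header alone, which is why straight-ahead reading has to give up on such entries.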

#read_zip_structure(io:, read_local_headers: true) ⇒ Array<ZipEntry>

Parse an IO handle to a ZIP archive into an array of ZipEntry objects.

Parameters:

  • io (#tell, #seek, #read, #size)

    an IO-ish object

  • read_local_headers (Boolean) (defaults to: true)

    whether the local headers must be read upfront. When reading a locally available ZIP file this option has little effect, since small reads from a file handle are cheap. However, if you are using remote reads to decipher a ZIP file located on an HTTP server, the operation must perform an HTTP request for each entry in the ZIP file to determine where the actual file data starts. For a ZIP archive of 1000 files this incurs 1000 extra HTTP requests, which you might not want to perform upfront or, at least, not all at once. When the option is set to false, you will be getting instances of LazyEntry instead of Entry. Those objects will raise an exception when you attempt to access their compressed data offset in the ZIP (since the reads have not been performed yet). As a rule, this option can be left at its default setting (true) unless you only want to read the central directory, or you need to limit the number of HTTP requests.

Returns:

  • (Array<ZipEntry>)

    an array of entries within the ZIP being parsed



# File 'lib/zip_tricks/file_reader.rb', line 197

def read_zip_structure(io:, read_local_headers: true)
  zip_file_size = io.size
  eocd_offset = get_eocd_offset(io, zip_file_size)

  zip64_end_of_cdir_location = get_zip64_eocd_location(io, eocd_offset)
  num_files, cdir_location, _cdir_size =
    if zip64_end_of_cdir_location
      num_files_and_central_directory_offset_zip64(io, zip64_end_of_cdir_location)
    else
      num_files_and_central_directory_offset(io, eocd_offset)
    end

  log do
    'Located the central directory start at %<location>d' %
      {location: cdir_location}
  end
  seek(io, cdir_location)

  # Read the entire central directory AND anything behind it, in one fell swoop.
  # Strictly speaking, we should be able to read `cdir_size` bytes and not a byte more.
  # However, we know for a fact that in some of our files the central directory size
  # is in fact misreported. `zipinfo` then says:
  #
  #    warning [ktsglobal-2b03bc.zip]:  1 extra byte at beginning or within zipfile
  #      (attempting to process anyway)
  #    error [ktsglobal-2b03bc.zip]:  reported length of central directory is
  #      -1 bytes too long (Atari STZip zipfile?  J.H.Holm ZIPSPLIT 1.1
  #      zipfile?).  Compensating...
  #
  # Since the EOCD is not that big anyway, we just read the entire "tail" of the ZIP ignoring
  # the central directory size altogether.
  central_directory_str = io.read # and not read_n(io, cdir_size), see above
  central_directory_io = StringIO.new(central_directory_str)
  log do
    'Read %<byte_size>d bytes with central directory + EOCD record and locator' %
      {byte_size: central_directory_str.bytesize}
  end

  entries = (0...num_files).map do |entry_n|
    offset_location = cdir_location + central_directory_io.tell
    log do
      'Reading the central directory entry %<entry_n>d starting at offset %<offset>d' %
        {entry_n: entry_n, offset: offset_location}
    end
    read_cdir_entry(central_directory_io)
  end

  read_local_headers(entries, io) if read_local_headers

  entries
end
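The helpers that extract the entry count and the central directory location are not shown on this page. As a rough sketch of what reading those values from a conventional (non-Zip64) EOCD record involves, the example below unpacks the fixed 22-byte part of the record. `parse_eocd_record` and its constants are hypothetical names. Note that the method above looks for a Zip64 locator, whereas this sketch only reports whether any field sits at its overflow marker.

```ruby
EOCD_SIGNATURE_VALUE = 0x06054b50
EOCD_FIXED_SIZE = 22

# Hypothetical helper: unpacks the fixed part of an end-of-central-directory record.
# Layout: signature (4), disk number (2), disk with central directory (2),
# entries on this disk (2), total entries (2), central directory size (4),
# central directory offset (4), comment length (2). All values are little-endian.
def parse_eocd_record(eocd_bytes)
  raise ArgumentError, 'EOCD record is too short' if eocd_bytes.bytesize < EOCD_FIXED_SIZE

  signature, _disk_number, _cdir_disk, _entries_on_disk, num_files,
    cdir_size, cdir_offset, _comment_size = eocd_bytes.unpack('VvvvvVVv')
  raise ArgumentError, 'Not an EOCD record' unless signature == EOCD_SIGNATURE_VALUE

  {
    num_files: num_files,
    cdir_size: cdir_size,
    cdir_offset: cdir_offset,
    # Overflow markers mean the real values are stored in the Zip64 EOCD record
    needs_zip64: num_files == 0xFFFF || cdir_size == 0xFFFFFFFF || cdir_offset == 0xFFFFFFFF
  }
end
```

As the comments in the method above point out, the reported central directory size can be wrong in real-world files, so it is safer to treat `cdir_size` as a hint than as a hard read limit.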