Class: ZipTricks::FileReader

Inherits:
Object
Defined in:
lib/zip_tricks/file_reader.rb

Overview

A very barebones ZIP file reader. It is made for maximum interoperability, while attempting to stay somewhat concise.

REALLY CRAZY IMPORTANT STUFF: SECURITY IMPLICATIONS

Please BEWARE: using this is a security risk if you are reading files that have been supplied by users. This implementation has not been formally verified for correctness. Because ZIP files contain relative offsets in many places, a maliciously crafted ZIP file might put the decode procedure into an endless loop, make it attempt huge reads from the input file, and so on. Additionally, the reader module for deflated data has no ZIP bomb protection. So either limit FileReader usage to files you trust, or triple-check all inputs upfront. Patches to make this reader more secure are welcome, of course.
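
One way to add the missing ZIP bomb guard on the caller's side is to cap the number of bytes accepted from the inflater. The following is a minimal sketch using only Ruby's standard Zlib; `inflate_with_limit` and `OutputLimitExceeded` are hypothetical names and are not part of ZipTricks.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical error class for this sketch; not defined by ZipTricks.
class OutputLimitExceeded < StandardError; end

# Hypothetical helper: inflates raw deflate data (the form stored inside ZIP
# entries) read from +io+, and aborts once more than +max_output_bytes+ of
# output have been produced.
def inflate_with_limit(io, max_output_bytes:, chunk_size: 16 * 1024)
  inflater = Zlib::Inflate.new(-Zlib::MAX_WBITS) # negative window bits = raw deflate
  output = String.new(encoding: Encoding::BINARY)
  begin
    while (chunk = io.read(chunk_size))
      # With a block, Zlib yields the inflated data piece by piece, so the
      # limit is checked before a huge buffer can accumulate.
      inflater.inflate(chunk) do |piece|
        output << piece
        if output.bytesize > max_output_bytes
          raise OutputLimitExceeded, "more than #{max_output_bytes} bytes of output"
        end
      end
      break if inflater.finished?
    end
  ensure
    inflater.close
  end
  output
end
```

A caller would pick `max_output_bytes` from its own policy (for example, a multiple of the compressed size) and treat `OutputLimitExceeded` as a rejected upload.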

Usage

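File.open('zipfile.zip', 'rb') do |f|
  # read_zip_structure takes the IO as the io: keyword argument
  entries = ZipTricks::FileReader.read_zip_structure(io: f)
  entries.each do |e|
    # e.filename comes from the archive; sanitize it before use with untrusted input
    File.open(e.filename, 'wb') do |extracted_file|
      ex = e.extractor_from(f)
      extracted_file << ex.extract(1024 * 1024) until ex.eof?
    end
  end
end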

Supported features

  • Deflate and stored storage modes
  • Zip64 (extra fields and offsets)
  • Data descriptors

Unsupported features

  • Archives split over multiple disks/files
  • Any ZIP encryption
  • EFS language flag and InfoZIP filename extra field
  • CRC32 checksums are not verified

Mode of operation

Basically, FileReader ignores the data in local file headers (as it is often unreliable). It reads the ZIP file "from the tail", finds the end-of-central-directory signatures, then reads the central directory entries, reconstitutes the entries with their filenames, attributes and so on, and sets these entries up with absolute offsets into the source file/IO object. These offsets can then be used to locate the actual compressed data of each file and to expand it.
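
The "from the tail" step can be sketched as follows. This is a simplified illustration and not the FileReader implementation: it ignores Zip64, does not validate the comment length, and `locate_central_directory` and the constants are hypothetical names. The field layout follows the standard 22-byte end-of-central-directory record.

```ruby
require 'stringio'

EOCD_SIGNATURE   = [0x06054b50].pack('V') # the bytes "PK\x05\x06"
EOCD_MIN_SIZE    = 22                      # fixed part of the record, without the comment
MAX_COMMENT_SIZE = 0xFFFF                  # the comment length is a 2-byte field

# Hypothetical helper: returns [num_files, cdir_size, cdir_offset],
# or nil if no end-of-central-directory record is found.
def locate_central_directory(io)
  file_size = io.size
  return nil if file_size < EOCD_MIN_SIZE

  # Only the tail can contain the record: its fixed part plus the longest comment
  tail_size = [file_size, EOCD_MIN_SIZE + MAX_COMMENT_SIZE].min
  io.seek(file_size - tail_size)
  tail = io.read(tail_size)

  # Search backwards, starting at the last position where a full record still fits
  sig_at = tail.rindex(EOCD_SIGNATURE, tail.bytesize - EOCD_MIN_SIZE)
  return nil unless sig_at

  _sig, _disk, _cdir_disk, _files_on_disk, num_files, cdir_size, cdir_offset, _comment_len =
    tail.byteslice(sig_at, EOCD_MIN_SIZE).unpack('VvvvvVVv')
  [num_files, cdir_size, cdir_offset]
end
```

The returned offset and size are what allow the central directory to be read in a single request, which matters most when the IO is backed by remote reads.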

Defined Under Namespace

Classes: ZipEntry

Constant Summary

ReadError =
Class.new(StandardError)
UnsupportedFeature =
Class.new(StandardError)
InvalidStructure =
Class.new(ReadError)
LocalHeaderPending =
Class.new(StandardError) do
  def message
    "The compressed data offset is not available (local header has not been read)" 
  end
end

Class Method Details

.read_zip_structure(**options) ⇒ Array<Entry>

Parse an IO handle to a ZIP archive into an array of Entry objects.

Parameters:

  • options (Hash)

    any options the instance method of the same name accepts

Returns:

  • (Array<Entry>)

    an array of entries within the ZIP being parsed



# File 'lib/zip_tricks/file_reader.rb', line 240

def self.read_zip_structure(**options)
  new.read_zip_structure(**options)
end

Instance Method Details

#get_compressed_data_offset(io:, local_file_header_offset:) ⇒ Fixnum

Get the offset in the IO at which the actual compressed data of the file starts within the ZIP. The method eagerly reads the entire local header for the file (up to the maximum size a local header may occupy), starting at the given offset, and then computes the header's actual size. That size plus the given local header offset is the compressed data offset of the entry (start reading at this offset to get the data).

Parameters:

  • io (#seek, #read)

    an IO-ish object the ZIP file can be read from

  • local_file_header_offset (Fixnum)

    absolute offset (0-based) where the local file header is supposed to begin

Returns:

  • (Fixnum)

    absolute offset (0-based) of where the compressed data begins for this file within the ZIP



# File 'lib/zip_tricks/file_reader.rb', line 205

def get_compressed_data_offset(io:, local_file_header_offset:)
  seek(io, local_file_header_offset)
  
  # Reading in bulk is cheaper - grab the maximum length of the local header,
  # including any headroom
  local_file_header_str_plus_headroom = io.read(MAX_LOCAL_HEADER_SIZE)
  io_starting_at_local_header = StringIO.new(local_file_header_str_plus_headroom)

  assert_signature(io_starting_at_local_header, 0x04034b50)

  # The rest is unreliable, and we have that information from the central directory already.
  # So just skip over it to get at the offset where the compressed data begins
  skip_ahead_2(io_starting_at_local_header) # Version needed to extract
  skip_ahead_2(io_starting_at_local_header) # gp flags
  skip_ahead_2(io_starting_at_local_header) # storage mode
  skip_ahead_2(io_starting_at_local_header) # dos time
  skip_ahead_2(io_starting_at_local_header) # dos date
  skip_ahead_4(io_starting_at_local_header) # CRC32

  skip_ahead_4(io_starting_at_local_header) # Comp size
  skip_ahead_4(io_starting_at_local_header) # Uncomp size

  filename_size = read_2b(io_starting_at_local_header)
  extra_size = read_2b(io_starting_at_local_header)

  skip_ahead_n(io_starting_at_local_header, filename_size)
  skip_ahead_n(io_starting_at_local_header, extra_size)

  local_file_header_offset + io_starting_at_local_header.tell
end
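
The arithmetic behind the method above is that a local file header has a 30-byte fixed part followed by the filename and the extra field, so the data starts at header offset + 30 + filename size + extra size. A self-contained sketch of just that computation, with hypothetical names (`compressed_data_offset_for`, the two constants) and without the bulk read used by FileReader:

```ruby
require 'stringio'

LOCAL_HEADER_SIGNATURE  = 0x04034b50
LOCAL_HEADER_FIXED_SIZE = 30 # signature through the extra field length

# Hypothetical helper: returns the absolute offset of the entry's compressed data.
def compressed_data_offset_for(io, local_file_header_offset)
  io.seek(local_file_header_offset)
  fixed = io.read(LOCAL_HEADER_FIXED_SIZE)
  if fixed.nil? || fixed.bytesize < LOCAL_HEADER_FIXED_SIZE
    raise ArgumentError, 'Truncated local file header'
  end
  unless fixed.unpack1('V') == LOCAL_HEADER_SIGNATURE
    raise ArgumentError, 'Local file header signature not found'
  end

  # The two variable lengths are the last two 16-bit little-endian fields (bytes 26..29)
  filename_size, extra_size = fixed.byteslice(26, 4).unpack('vv')
  local_file_header_offset + LOCAL_HEADER_FIXED_SIZE + filename_size + extra_size
end

# Build a header at offset 7 with a 5-byte filename and a 4-byte extra field
header = [0x04034b50, 20, 0, 8, 0, 0, 0, 0, 0, 5, 4].pack('VvvvvvVVVvv')
archive = StringIO.new('PREFIX!'.b + header + 'a.txt'.b + 'XTRA'.b + 'DATA'.b)
offset = compressed_data_offset_for(archive, 7) # 7 + 30 + 5 + 4 = 46
```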

#read_zip_structure(io:, read_local_headers: true) ⇒ Array<Entry>

Parse an IO handle to a ZIP archive into an array of Entry objects.

Parameters:

  • io (#tell, #seek, #read, #size)

    an IO-ish object

  • read_local_headers (Boolean) (defaults to: true)

    whether the local headers must be read upfront. When reading a locally available ZIP file this option is of little use, since the small reads from the file handle are cheap. However, if you are using remote reads to decipher a ZIP file located on an HTTP server, the operation must perform an HTTP request for each entry in the ZIP file to determine where the actual file data starts. For a ZIP archive of 1000 files this incurs 1000 extra HTTP requests, which you might not want to perform upfront, or at least not all at once. When the option is set to false, you will get instances of LazyEntry instead of Entry. Those objects raise an exception when you attempt to access their compressed data offset in the ZIP (since the reads have not been performed yet). As a rule, this option can be left at its default setting (true) unless you only want to read the central directory, or you need to limit the number of HTTP requests.

Returns:

  • (Array<Entry>)

    an array of entries within the ZIP being parsed



# File 'lib/zip_tricks/file_reader.rb', line 169

def read_zip_structure(io:, read_local_headers: true)
  zip_file_size = io.size
  eocd_offset = get_eocd_offset(io, zip_file_size)

  zip64_end_of_cdir_location = get_zip64_eocd_location(io, eocd_offset)
  num_files, cdir_location, cdir_size = if zip64_end_of_cdir_location
    num_files_and_central_directory_offset_zip64(io, zip64_end_of_cdir_location)
  else
    num_files_and_central_directory_offset(io, eocd_offset)
  end
  log { 'Located the central directory start at %d' % cdir_location }
  seek(io, cdir_location)

  # Read the entire central directory in one fell swoop
  central_directory_str = read_n(io, cdir_size)
  central_directory_io = StringIO.new(central_directory_str)
  log { 'Read %d bytes with central directory entries' % cdir_size }

  entries = (0...num_files).map do |entry_n|
    log { 'Reading the central directory entry %d starting at offset %d' % [entry_n, cdir_location + central_directory_io.tell] }
    read_cdir_entry(central_directory_io)
  end
  
  read_local_headers(entries, io) if read_local_headers
  
  entries
end
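
The deferred-offset behavior described for `read_local_headers: false` can be pictured with the following sketch. It uses hypothetical stand-ins (`SketchEntry`, `OffsetPending`, `resolve_offset`) rather than the library's own classes, and only illustrates the pattern: an entry raises until its offset has been read, and one extra read per entry is made on demand instead of upfront.

```ruby
# Hypothetical stand-ins for illustration; these are not ZipTricks classes.
OffsetPending = Class.new(StandardError)

class SketchEntry
  attr_reader :filename, :local_file_header_offset
  attr_writer :compressed_data_offset

  def initialize(filename, local_file_header_offset)
    @filename = filename
    @local_file_header_offset = local_file_header_offset
    @compressed_data_offset = nil
  end

  # Raises until some reader has filled the offset in.
  def compressed_data_offset
    @compressed_data_offset ||
      raise(OffsetPending, "offset for #{filename} has not been read yet")
  end
end

# Resolves the offset for one entry on demand. The block stands in for whatever
# performs the (possibly remote) read of the local header; it is only called
# the first time the offset is needed.
def resolve_offset(entry)
  entry.compressed_data_offset
rescue OffsetPending
  entry.compressed_data_offset = yield(entry.local_file_header_offset)
end
```

With this shape, a client extracting 3 files out of a 1000-entry archive pays for 3 local header reads rather than 1000.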