Class: ZipTricks::FileReader
- Inherits:
- Object
- Defined in:
- lib/zip_tricks/file_reader.rb
Overview
A very barebones ZIP file reader. It is made for maximum interoperability, while we also try to keep it reasonably concise.
REALLY CRAZY IMPORTANT STUFF: SECURITY IMPLICATIONS
Please BEWARE - using this is a security risk if you are reading files that have been
supplied by users. This implementation has not been formally verified for correctness. As
ZIP files contain relative offsets in lots of places it might be possible for a maliciously
crafted ZIP file to put the decode procedure in an endless loop, make it attempt huge reads
from the input file and so on. Additionally, the reader module for deflated data has
no support for ZIP bomb protection. So either limit the FileReader usage to the files you
trust, or triple-check all the inputs upfront. Patches to make this reader more secure
are welcome of course.
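One way to blunt the ZIP bomb risk described above is to cap how many bytes you are willing to inflate per entry. The following is a minimal sketch and not part of ZipTricks: `copy_with_limit` and `ExtractionLimitExceeded` are hypothetical names, and the helper only assumes an extractor object that responds to `extract(n)` and `eof?`, like the one used in the Usage section.

```ruby
# Hypothetical helper, not part of ZipTricks: copies extracted data to
# `destination` and aborts once more than `limit` bytes have been produced.
class ExtractionLimitExceeded < StandardError; end

def copy_with_limit(extractor, destination, limit:, chunk_size: 64 * 1024)
  written = 0
  until extractor.eof?
    # to_s guards against an extractor that returns nil for an empty read
    chunk = extractor.extract(chunk_size).to_s
    written += chunk.bytesize
    if written > limit
      raise ExtractionLimitExceeded, "Entry expands to more than #{limit} bytes"
    end
    destination << chunk
  end
  written
end
```

The check runs before the chunk is written, so at most `limit` bytes ever reach the destination. This does nothing about the offset-related risks mentioned above; it only bounds the output size.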
Usage
require 'zip_tricks'

File.open('zipfile.zip', 'rb') do |f|
  # read_zip_structure takes the IO as the io: keyword argument
  entries = ZipTricks::FileReader.read_zip_structure(io: f)
  entries.each do |e|
    File.open(e.filename, 'wb') do |extracted_file|
      ex = e.extractor_from(f)
      extracted_file << ex.extract(1024 * 1024) until ex.eof?
    end
  end
end
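The usage example writes each entry to `e.filename` exactly as it is stored in the archive. With untrusted archives that name may be absolute or contain `..` segments, so you may want to resolve it against a fixed output directory first. A minimal sketch, assuming POSIX-style paths; `safe_destination` is a hypothetical helper, not a ZipTricks method:

```ruby
# Hypothetical helper, not part of ZipTricks: resolves an entry filename
# inside `root` and refuses names that would land outside of it.
def safe_destination(root, entry_filename)
  root = File.expand_path(root)
  candidate = File.expand_path(entry_filename, root)
  unless candidate.start_with?(root + File::SEPARATOR)
    raise ArgumentError, "Entry #{entry_filename.inspect} escapes #{root}"
  end
  candidate
end
```

In the usage example you would then open `safe_destination('/some/output/dir', e.filename)` instead of `e.filename`. The sketch does not resolve symlinks that already exist under the output directory.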
Supported features
- Deflate and stored storage modes
- Zip64 (extra fields and offsets)
- Data descriptors
Unsupported features
- Archives split over multiple disks/files
- Any ZIP encryption
- EFS language flag and InfoZIP filename extra field
- CRC32 checksums are not verified
Mode of operation
By default, FileReader ignores the data in local file headers (as it is
often unreliable). It reads the ZIP file "from the tail", finds the
end-of-central-directory signatures, then reads the central directory entries,
reconstitutes the entries with their filenames, attributes and so on, and
sets these entries up with the absolute offsets into the source file/IO object.
These offsets can then be used to extract the actual compressed data of
the files and to expand it.
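To make the "from the tail" approach concrete, here is a minimal standalone sketch of locating the end-of-central-directory (EOCD) record and reading the central directory location from it. It is not the FileReader implementation: `locate_central_directory` is a hypothetical helper, it ignores Zip64, and it naively trusts the last occurrence of the signature even if those bytes happen to sit inside the archive comment.

```ruby
require 'stringio'

EOCD_SIGNATURE = [0x06054b50].pack('V') # the bytes "PK\x05\x06"
EOCD_FIXED_SIZE = 22                     # EOCD record without its trailing comment
MAX_COMMENT_SIZE = 0xFFFF                # the comment length is a 16-bit field

# Hypothetical helper: reads only the tail of the IO and returns where the
# central directory starts, how long it claims to be and how many entries it has.
def locate_central_directory(io)
  file_size = io.size
  tail_size = [file_size, EOCD_FIXED_SIZE + MAX_COMMENT_SIZE].min
  io.seek(file_size - tail_size)
  tail = io.read(tail_size).to_s

  eocd_at = tail.rindex(EOCD_SIGNATURE)
  raise ArgumentError, 'EOCD signature not found' if eocd_at.nil?
  record = tail.byteslice(eocd_at, EOCD_FIXED_SIZE)
  raise ArgumentError, 'EOCD record is truncated' if record.bytesize < EOCD_FIXED_SIZE

  # signature, disk no., disk with cdir, entries on this disk,
  # total entries, cdir size, cdir offset, comment length
  _sig, _disk, _cdir_disk, _entries_here, num_entries, cdir_size, cdir_offset, _comment_len =
    record.unpack('VvvvvVVv')
  { num_entries: num_entries, cdir_size: cdir_size, cdir_offset: cdir_offset }
end
```

With the returned offset you can seek to the central directory and read the entries one by one, which is the step the paragraph above describes next.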
Recovering damaged or incomplete ZIP files
If the ZIP file you are trying to read does not contain the central directory
records, read_zip_structure will not work, since it starts the read process
from the EOCD marker at the end of the central directory and then crawls
"back" in the IO to figure out the rest. You can explicitly apply a fallback
and read the archive "straight ahead" using read_zip_straight_ahead.
That method instead scans your IO from the very start, skipping over the actual entry data. This is less efficient than central directory parsing since it involves a much larger number of reads (1 read from the IO per entry in the ZIP).
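The fallback described above can be wrapped in a small helper. This is a sketch under assumptions: `read_entries_with_fallback` is a hypothetical name, and which error classes signal a missing or truncated central directory is left to the caller (the constants listed below, such as MissingEOCD and ReadError, look like plausible candidates, but check what your version actually raises).

```ruby
# Hypothetical helper: tries the central directory first and falls back to
# scanning local headers from the start of the IO.
#
# With ZipTricks this might be called as (assumed error list):
#   read_entries_with_fallback(
#     ZipTricks::FileReader, file,
#     recoverable_errors: [ZipTricks::FileReader::MissingEOCD,
#                          ZipTricks::FileReader::ReadError]
#   )
def read_entries_with_fallback(reader, io, recoverable_errors:)
  reader.read_zip_structure(io: io)
rescue *recoverable_errors
  # The straight-ahead scan starts at the current IO position, so go back
  # to the beginning before retrying.
  io.rewind
  reader.read_zip_straight_ahead(io: io)
end
```

Keeping the error list explicit avoids swallowing unrelated failures such as UnsupportedFeature.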
Defined Under Namespace
Classes: ZipEntry
Constant Summary
- ReadError =
- Class.new(StandardError) 
- UnsupportedFeature =
- Class.new(StandardError) 
- InvalidStructure =
- Class.new(ReadError) 
- LocalHeaderPending =
- Class.new(StandardError) do
    def message
      'The compressed data offset is not available (local header has not been read)'
    end
  end
- MissingEOCD =
- Class.new(StandardError) do
    def message
      'Could not find the EOCD signature in the buffer - maybe a malformed ZIP file'
    end
  end
Class Method Summary
- .read_zip_straight_ahead(**options) ⇒ Array<ZipEntry>
    Parse an IO handle to a ZIP archive into an array of Entry objects, reading from the start of the file and parsing local file headers one-by-one.
- .read_zip_structure(**options) ⇒ Array<ZipEntry>
    Parse an IO handle to a ZIP archive into an array of Entry objects, reading from the end of the IO object.
Instance Method Summary
- #get_compressed_data_offset(io:, local_file_header_offset:) ⇒ Object
    Get the offset in the IO at which the actual compressed data of the file starts within the ZIP.
- #read_local_file_header(io:) ⇒ Array<ZipEntry, Fixnum>
    Parse the local header entry and get the offset in the IO at which the actual compressed data of the file starts within the ZIP.
- #read_zip_straight_ahead(io:) ⇒ Object
    Sometimes you might encounter truncated ZIP files, which do not contain any central directory whatsoever - or where the central directory is truncated.
- #read_zip_structure(io:, read_local_headers: true) ⇒ Array<ZipEntry>
    Parse an IO handle to a ZIP archive into an array of Entry objects.
Class Method Details
.read_zip_straight_ahead(**options) ⇒ Array<ZipEntry>
Parse an IO handle to a ZIP archive into an array of Entry objects, reading from the start of the file and parsing local file headers one-by-one.
# File 'lib/zip_tricks/file_reader.rb', line 391
def self.read_zip_straight_ahead(**)
  new.read_zip_straight_ahead(**)
end
.read_zip_structure(**options) ⇒ Array<ZipEntry>
Parse an IO handle to a ZIP archive into an array of Entry objects, reading from the end of the IO object.
# File 'lib/zip_tricks/file_reader.rb', line 381
def self.read_zip_structure(**)
  new.read_zip_structure(**)
end
Instance Method Details
#get_compressed_data_offset(io:, local_file_header_offset:) ⇒ Object
Get the offset in the IO at which the actual compressed data of the file starts within the ZIP. The method will eager-read the entire local header for the file (the maximum size the local header may use), starting at the given offset, and will then compute its size. That size plus the local header offset given will be the compressed data offset of the entry (read starting at this offset to get the data).
Parameters:
- local_file_header_offset: the offset in the IO at which the local file header is supposed to begin
Returns:
- (Fixnum) absolute offset (0-based) of where the compressed data begins for this file within the ZIP
# File 'lib/zip_tricks/file_reader.rb', line 369
def get_compressed_data_offset(io:, local_file_header_offset:)
  seek(io, local_file_header_offset)
  entry_recovered_from_local_file_header = read_local_file_header(io: io)
  entry_recovered_from_local_file_header.compressed_data_offset
end
#read_local_file_header(io:) ⇒ Array<ZipEntry, Fixnum>
Parse the local header entry and get the offset in the IO at which the actual compressed data of the file starts within the ZIP. The method eagerly reads the maximum size a local header may occupy, starting at the current position of the IO, and then computes the actual size of the header. That size plus the offset at which the local header starts is the compressed data offset of the entry (read starting at this offset to get the data).
Returns:
- the parsed ZipEntry, with the compressed data offset set on it. The signature lists Array<ZipEntry, Fixnum>, but the source shown here returns a single ZipEntry whose compressed_data_offset carries the offset.
# File 'lib/zip_tricks/file_reader.rb', line 301
def read_local_file_header(io:)
  local_file_header_offset = io.tell

  # Reading in bulk is cheaper - grab the maximum length of the local header,
  # including any headroom for extra fields etc.
  local_file_header_str_plus_headroom = io.read(MAX_LOCAL_HEADER_SIZE)
  raise ReadError if local_file_header_str_plus_headroom.nil? # reached EOF

  io_starting_at_local_header = StringIO.new(local_file_header_str_plus_headroom)

  assert_signature(io_starting_at_local_header, 0x04034b50)
  e = ZipEntry.new
  e.version_needed_to_extract = read_2b(io_starting_at_local_header) # Version needed to extract
  e.gp_flags = read_2b(io_starting_at_local_header) # gp flags
  e.storage_mode = read_2b(io_starting_at_local_header) # storage mode
  e.dos_time = read_2b(io_starting_at_local_header) # dos time
  e.dos_date = read_2b(io_starting_at_local_header) # dos date
  e.crc32 = read_4b(io_starting_at_local_header) # CRC32
  e.compressed_size = read_4b(io_starting_at_local_header) # Comp size
  e.uncompressed_size = read_4b(io_starting_at_local_header) # Uncomp size

  filename_size = read_2b(io_starting_at_local_header)
  extra_size = read_2b(io_starting_at_local_header)
  e.filename = read_n(io_starting_at_local_header, filename_size)
  extra_fields_str = read_n(io_starting_at_local_header, extra_size)

  # Parse out the extra fields
  extra_table = parse_out_extra_fields(extra_fields_str)

  # ...of which we really only need the Zip64 extra
  if zip64_extra_contents = extra_table[1]
    # If the Zip64 extra is present, we let it override all
    # the values fetched from the conventional header
    zip64_extra = StringIO.new(zip64_extra_contents)
    log do
      'Will read Zip64 extra data from local header field for %<filename>s, %<size>d bytes' %
        {filename: e.filename, size: zip64_extra.size}
    end
    # Now here be dragons. The APPNOTE specifies that
    #
    # > The order of the fields in the ZIP64 extended
    # > information record is fixed, but the fields will
    # > only appear if the corresponding Local or Central
    # > directory record field is set to 0xFFFF or 0xFFFFFFFF.
    #
    # It means that before we read this stuff we need to check if the previously-read
    # values are at overflow, and only _then_ proceed to read them. Bah.
    e.uncompressed_size = read_8b(zip64_extra) if e.uncompressed_size == 0xFFFFFFFF
    e.compressed_size = read_8b(zip64_extra) if e.compressed_size == 0xFFFFFFFF
  end

  offset = local_file_header_offset + io_starting_at_local_header.tell
  e.compressed_data_offset = offset

  e
end
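The Zip64 handling in the source above is easier to follow in isolation. The sketch below is a standalone illustration, not the library code: `parse_extra_fields` and `apply_zip64_sizes` are hypothetical names. It splits the extra field blob into a table keyed by header ID, then reads the 8-byte sizes from the Zip64 record (ID 1) only for the values that were at the 32-bit overflow marker, in the fixed order the APPNOTE prescribes.

```ruby
ZIP64_EXTRA_ID = 0x0001
OVERFLOW_32 = 0xFFFFFFFF

# Hypothetical helper: splits the extra field blob into { header_id => payload }.
# Each extra field is a 2-byte ID, a 2-byte payload size and the payload itself.
def parse_extra_fields(extra_str)
  table = {}
  pos = 0
  while pos + 4 <= extra_str.bytesize
    id, size = extra_str.byteslice(pos, 4).unpack('vv')
    table[id] = extra_str.byteslice(pos + 4, size)
    pos += 4 + size
  end
  table
end

# Hypothetical helper: returns [uncompressed_size, compressed_size], replacing
# only the values that overflowed their 32-bit header fields.
def apply_zip64_sizes(uncompressed_size, compressed_size, extra_table)
  zip64 = extra_table[ZIP64_EXTRA_ID]
  return [uncompressed_size, compressed_size] if zip64.nil?

  pos = 0
  read_u64 = lambda do
    bytes = zip64.byteslice(pos, 8)
    raise ArgumentError, 'Zip64 extra field is truncated' if bytes.nil? || bytes.bytesize < 8
    pos += 8
    bytes.unpack1('Q<') # 64-bit unsigned, little-endian
  end

  # Order matters: uncompressed size first, then compressed size.
  uncompressed_size = read_u64.call if uncompressed_size == OVERFLOW_32
  compressed_size = read_u64.call if compressed_size == OVERFLOW_32
  [uncompressed_size, compressed_size]
end
```

If only the compressed size overflowed, the Zip64 record holds a single 8-byte value and it is read as the compressed size, which is why the overflow checks have to come before each read.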
#read_zip_straight_ahead(io:) ⇒ Object
Sometimes you might encounter truncated ZIP files, which do not contain any central directory whatsoever - or where the central directory is truncated. In that case, employing the technique of reading the ZIP "from the end" is impossible, and the only recourse is reading each local file header in succession. If the entries in such a ZIP use data descriptors, you would need to scan after the entry until you encounter the data descriptor signature - and that might be unreliable at best. Therefore, this reading technique does not support data descriptors. It can however recover the entries you still can read if these entries contain all the necessary information about the contained file.
Parameters:
- io: the IO object to read the local file headers from
Returns:
- (Array<ZipEntry>) the entries that could be recovered
# File 'lib/zip_tricks/file_reader.rb', line 263
def read_zip_straight_ahead(io:)
  entries = []
  loop do
    cur_offset = io.tell
    entry = read_local_file_header(io: io)
    if entry.uses_data_descriptor?
      raise UnsupportedFeature, "The local file header at #{cur_offset} uses \
                                a data descriptor and the start of next entry \
                                cannot be found"
    end
    entries << entry
    next_local_header_offset = entry.compressed_data_offset + entry.compressed_size
    log do
      'Recovered a local file file header at offset %<cur_offset>d, seeking to the next at %<header_offset>d' %
        {cur_offset: cur_offset, header_offset: next_local_header_offset}
    end
    seek(io, next_local_header_offset) # Seek to the next entry, and raise if seek is impossible
  end
  entries
rescue ReadError
  log do
    'Got a read/seek error after reaching %<cur_offset>d, no more entries can be recovered' %
      {cur_offset: cur_offset}
  end
  entries
end
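The offset arithmetic of the straight-ahead scan can be tried out without the library. The sketch below is a simplified standalone version, not the FileReader code: `scan_local_headers` is a hypothetical name, it returns plain hashes, ignores Zip64 extras, and simply stops at the first header it cannot use (including one that uses a data descriptor) instead of raising.

```ruby
require 'stringio'

LOCAL_HEADER_SIGNATURE = 0x04034b50
LOCAL_HEADER_FIXED_SIZE = 30
DATA_DESCRIPTOR_FLAG = 0x0008 # bit 3 of the general purpose flags

# Hypothetical helper: walks local file headers from the current IO position
# and returns what could be recovered before the first unusable header.
def scan_local_headers(io)
  entries = []
  loop do
    header_offset = io.pos
    fixed = io.read(LOCAL_HEADER_FIXED_SIZE)
    break if fixed.nil? || fixed.bytesize < LOCAL_HEADER_FIXED_SIZE

    sig, _version, flags, _method, _time, _date, _crc, comp_size, _uncomp_size, name_len, extra_len =
      fixed.unpack('VvvvvvVVVvv')
    break unless sig == LOCAL_HEADER_SIGNATURE
    # With a data descriptor the sizes in the header are not reliable,
    # so the start of the next entry cannot be computed.
    break unless (flags & DATA_DESCRIPTOR_FLAG).zero?

    filename = io.read(name_len)
    break if filename.nil? || filename.bytesize < name_len

    data_offset = header_offset + LOCAL_HEADER_FIXED_SIZE + name_len + extra_len
    entries << { filename: filename, compressed_data_offset: data_offset, compressed_size: comp_size }
    io.seek(data_offset + comp_size) # jump over the entry body to the next header
  end
  entries
end
```

Only the headers and filenames are read; the entry bodies are skipped with a seek, which is why the cost is roughly one read per entry.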
#read_zip_structure(io:, read_local_headers: true) ⇒ Array<ZipEntry>
Parse an IO handle to a ZIP archive into an array of Entry objects.
# File 'lib/zip_tricks/file_reader.rb', line 197
def read_zip_structure(io:, read_local_headers: true)
  zip_file_size = io.size
  eocd_offset = get_eocd_offset(io, zip_file_size)

  zip64_end_of_cdir_location = get_zip64_eocd_location(io, eocd_offset)
  num_files, cdir_location, _cdir_size =
    if zip64_end_of_cdir_location
      num_files_and_central_directory_offset_zip64(io, zip64_end_of_cdir_location)
    else
      num_files_and_central_directory_offset(io, eocd_offset)
    end

  log do
    'Located the central directory start at %<location>d' %
      {location: cdir_location}
  end
  seek(io, cdir_location)

  # Read the entire central directory AND anything behind it, in one fell swoop.
  # Strictly speaking, we should be able to read `cdir_size` bytes and not a byte more.
  # However, we know for a fact that in some of our files the central directory size
  # is in fact misreported. `zipinfo` then says:
  #
  #   warning [ktsglobal-2b03bc.zip]: 1 extra byte at beginning or within zipfile
  #     (attempting to process anyway)
  #   error [ktsglobal-2b03bc.zip]: reported length of central directory is
  #     -1 bytes too long (Atari STZip zipfile? J.H.Holm ZIPSPLIT 1.1
  #     zipfile?). Compensating...
  #
  # Since the EOCD is not that big anyway, we just read the entire "tail" of the ZIP ignoring
  # the central directory size alltogether.
  central_directory_str = io.read # and not read_n(io, cdir_size), see above
  central_directory_io = StringIO.new(central_directory_str)
  log do
    'Read %<byte_size>d bytes with central directory + EOCD record and locator' %
      {byte_size: central_directory_str.bytesize}
  end

  entries = (0...num_files).map do |entry_n|
    offset_location = cdir_location + central_directory_io.tell
    log do
      'Reading the central directory entry %<entry_n>d starting at offset %<offset>d' %
        {entry_n: entry_n, offset: offset_location}
    end
    read_cdir_entry(central_directory_io)
  end

  read_local_headers(entries, io) if read_local_headers

  entries
end
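The source above delegates the per-entry work to read_cdir_entry, whose body is not shown on this page. As an illustration of what reading one central directory entry involves, here is a standalone sketch based on the fixed 46-byte layout from the APPNOTE. `read_central_directory_entry` is a hypothetical name, the result is a plain hash rather than a ZipEntry, and the extra fields and comment are skipped rather than parsed, so Zip64 overrides are not applied.

```ruby
require 'stringio'

CDIR_ENTRY_SIGNATURE = 0x02014b50
CDIR_ENTRY_FIXED_SIZE = 46

# Hypothetical helper: reads one central directory entry from the current
# IO position and leaves the IO positioned at the start of the next entry.
def read_central_directory_entry(io)
  fixed = io.read(CDIR_ENTRY_FIXED_SIZE)
  if fixed.nil? || fixed.bytesize < CDIR_ENTRY_FIXED_SIZE
    raise ArgumentError, 'Truncated central directory entry'
  end

  sig, _made_by, _needed, gp_flags, storage_mode, _dos_time, _dos_date,
    crc32, compressed_size, uncompressed_size,
    name_len, extra_len, comment_len, _disk_start, _internal_attrs,
    _external_attrs, local_header_offset = fixed.unpack('VvvvvvvVVVvvvvvVV')
  raise ArgumentError, 'Bad central directory entry signature' unless sig == CDIR_ENTRY_SIGNATURE

  filename = io.read(name_len)
  io.seek(extra_len + comment_len, IO::SEEK_CUR) # skip extras and comment

  {
    filename: filename,
    gp_flags: gp_flags,
    storage_mode: storage_mode,
    crc32: crc32,
    compressed_size: compressed_size,
    uncompressed_size: uncompressed_size,
    local_header_offset: local_header_offset
  }
end
```

Calling it `num_files` times on a StringIO holding the tail of the archive mirrors the `(0...num_files).map` loop in the source above.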