Digestif

An aid for creating hash digests of large files

Synopsis

Digestif lets you create fast checksums of large files by skipping sections of the file. It was created with compressed media files in mind, which generally have such high information density that we can get away with a checksum that doesn’t actually consider all the bits. Someday I’d like to understand the likelihood-of-collision implications for specific compression formats (mp3, h.264, xvid, et al.), but right now I’m going to settle for guessing at where “good enough for me” might lie.

One side effect of this approach is that the corruption-detecting property of digests is, of course, lost. This is really an inescapable artifact of the problem we’re trying to solve: when hashing a really large file, the biggest bottleneck on modern computers is streaming 5-10 GB off the disk; the actual checksumming is not the hard part. By looking at less data we speed up the hash process immensely, at the cost of being blind to corruption in the sections we skip. Because the purpose I have in mind for this tool is identity checking, not corruption detection, that trade-off is fine for me.
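
To make the idea concrete, here is a minimal Ruby sketch of the sampling approach. This is not digestif’s actual implementation; the block size, stride, choice of MD5, and mixing in the file size are all illustrative assumptions.

require 'digest'

# Sketch only: hash a fixed-size block every `stride` bytes instead of
# streaming the whole file. Block size and stride are made-up defaults,
# not the gem's.
def sparse_digest(path, block_size = 64 * 1024, stride = 16 * 1024 * 1024)
  md5 = Digest::MD5.new
  File.open(path, 'rb') do |f|
    size = f.size
    # Mix the file size into the digest so files that differ only in
    # length past the last sampled block still hash differently.
    md5.update([size].pack('Q>'))
    offset = 0
    while offset < size
      f.seek(offset)
      chunk = f.read(block_size)  # may be short (or nil) at EOF
      md5.update(chunk) if chunk
      offset += stride
    end
  end
  md5.hexdigest
end

puts sparse_digest(ARGV[0])

Because only one block per stride is read, the I/O cost grows with size / stride rather than with the full file size, which is where the speedup comes from.
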

Installation


gem install digestif

Usage

Just like md5 on the command line, but it only works on files, not on streaming data (you can’t seek a stream).


digestif some_large_file

Since this program exists specifically to cope with large on-disk files, it didn’t make sense for me to invest in making streams work.

For a detailed look at the options, see


digestif --help

Motivation

I wrote digestif to solve a problem for a media catalogue I was working on. I wanted a filename-independent way to evaluate whether a file was already in the catalogue, but the files were so large that streaming the whole file off the hard drive was too slow for the response time I was hoping for. (For the curious: hashing 5 GB with plain md5 was taking me about 2.4 minutes.)

Author

Copyright 2011 Andrew Roberts