Request for Discussion: 8253952: Work around wrong usage of ZipOutputStream.putNextEntry() in user code
Volker Simonis
volker.simonis at gmail.com
Fri Oct 2 19:44:49 UTC 2020
Hi,
I'd like to get your opinion on a workaround for a common but wrong
usage pattern of ZipOutputStream.putNextEntry() in user code. I've
created a Draft PR with all the gory details (which I'll include here
for your convenience). Please let me know what you think.
https://github.com/openjdk/jdk/pull/489
https://bugs.openjdk.java.net/browse/JDK-8253952
### Summary
Work around wrong usage of `ZipOutputStream.putNextEntry()` in user
code which can lead to the `ZipException "invalid entry compressed
size"`.
### Motivation
In general it is not safe to directly write a ZipEntry obtained from
`ZipInputStream.getNextEntry()`, `ZipFile.entries()`,
`ZipFile.getEntry()` or `ZipFile.stream()` with
`ZipOutputStream.putNextEntry()` to a `ZipOutputStream` and then read
the entry's data from the `ZipInputStream` and write it to the
`ZipOutputStream` as follows:
```
ZipEntry entry;
ZipInputStream zis = new ZipInputStream(...);
ZipOutputStream zos = new ZipOutputStream(...);
while ((entry = zis.getNextEntry()) != null) {
    zos.putNextEntry(entry);
    zis.transferTo(zos);
}
```
The problem with this code is that the zip file format does not record
the compression level used for deflation in its entries. In general,
it doesn't even mandate a predefined compression ratio per compression
level. Therefore the compressed size recorded in a `ZipEntry` read
from a zip file might differ from the new compressed size produced by
the receiving `ZipOutputStream`. Such a difference will result in a
`ZipException` with the following message:
```
java.util.zip.ZipException: invalid entry compressed size (expected 12 but got 7 bytes)
```
The correct way of copying all entries from one zip file into another
requires creating a new `ZipEntry` or at least resetting its compressed
size field, e.g.:
```
while ((entry = zis.getNextEntry()) != null) {
    ZipEntry newEntry = new ZipEntry(entry.getName());
    zos.putNextEntry(newEntry);
    zis.transferTo(zos);
}
```
or:
```
while ((entry = zis.getNextEntry()) != null) {
    entry.setCompressedSize(-1);
    zos.putNextEntry(entry);
    zis.transferTo(zos);
}
```
Unfortunately, there's a lot of user code out there which gets this
wrong and uses the bad coding pattern described before. Searching for
`"java.util.zip.ZipException: invalid entry compressed size (expected
12 but got 7 bytes)"` gives ~2500 hits (~100 on StackOverflow). It's
also not hard to find plenty of instances of this anti-pattern on
GitHub when doing a code search for `ZipEntry` and `putNextEntry()`.
E.g. [Gradle 4.x wrapper task][1] is affected as well as the latest
version of the [mockableAndroidJar task][2]. I've recently fixed two
occurrences of this pattern in OpenJDK (see [JDK-8240333][3] and
[JDK-8240235][4]) but more of them still exist (e.g.
[`test/jdk/java/util/zip/ZipFile/CopyJar.java`][5], which has been
there since 1999 :).
### Description
So while this has clearly been a problem before, it apparently wasn't
painful enough to trigger any action from the side of the JDK.
However, recently quite a few zlib forks with [superior deflate/inflate
performance have evolved][6]. Using them with OpenJDK is quite
straightforward: one just has to configure the alternative
implementations by setting `LD_LIBRARY_PATH` or `LD_PRELOAD`
correspondingly. We've seen big savings by using these new zlib
implementations for selected services in production, and the only
reason why we haven't enabled them by default until now is the problem
I've just described. The reason why these new libraries uncover the
described anti-pattern much more often is that their compression
ratio is slightly different from that of the default zlib library.
This can easily trigger a `ZipException` even if an application is not
using different compression levels but merely reads a zip file created
with another zlib version.
I'd therefore like to propose and discuss the following workaround for
the wrong `ZipOutputStream.putNextEntry()` usage in user code:
- add an `@apiNote` to `ZipOutputStream.putNextEntry()` which explains
the problem in order to prevent future authors from misusing it (a
sketch of the defensive copying such a note would recommend follows
below). I've drafted a preliminary version of such a note in the [PR][8].
- add a system property to control the usage of `ZipEntry`'s
compressed size field in `ZipOutputStream.putNextEntry()`. The
property can be used to ignore the compressed size if it was
implicitly determined from the zip file and not explicitly set by
calling `ZipEntry.setCompressedSize()`. I've implemented a version of
this idea together with a regression test which demonstrates both the
original bad behaviour and the mitigation with the help of the new
system property in the [PR][8].
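Purely for illustration, the kind of defensive copy such an `@apiNote`
would recommend could be wrapped in a small helper. The class and the
helper name `copyWithoutCompressedSize` are made up for this sketch and
are not part of the PR; the point is simply to carry over the metadata
that is safe to reuse while leaving the compressed size unset:
```
import java.util.zip.ZipEntry;

public class ZipEntries {
    // Illustrative sketch only - neither this class nor the helper name is
    // part of the PR. It copies the metadata that is safe to reuse and
    // deliberately leaves the compressed size unset, so the receiving
    // ZipOutputStream computes it with whatever deflater it actually uses.
    static ZipEntry copyWithoutCompressedSize(ZipEntry original) {
        ZipEntry copy = new ZipEntry(original.getName());
        if (original.getTime() != -1) {
            copy.setTime(original.getTime());
        }
        if (original.getMethod() != -1) {
            copy.setMethod(original.getMethod());
        }
        if (original.getSize() != -1) {
            copy.setSize(original.getSize());   // required for STORED entries
        }
        if (original.getCrc() != -1) {
            copy.setCrc(original.getCrc());     // required for STORED entries
        }
        if (original.getComment() != null) {
            copy.setComment(original.getComment());
        }
        if (original.getExtra() != null) {
            copy.setExtra(original.getExtra());
        }
        // copy.setCompressedSize(...) is intentionally NOT called.
        return copy;
    }
}
```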
### Technical Details
A zip file consists of a stream of File Entries followed by a Central
Directory (see [here for a more detailed specification][7]). Each File
Entry is composed of a Local File Header (LFH) followed by the
compressed Data and an optional Data Descriptor. The LFH contains the
File Name and among other attributes the Compressed and Uncompressed
size and CRC of the Data. In the case where the latter three
attributes are not available at the time when the LFH is created, this
fact will be recorded in a flag of the LFH and will trigger the
creation of a Data Descriptor with the corresponding information right
after the Data section. Finally, the Central Directory contains one
Central Directory File Header (CDFH) for each entry of the zip
archive. The CDFH is an extended version of the LFH and the ultimate
reference for the contents of the zip archive. The redundancy between
LFH and CDFH is a tribute to zip's long history, when it was used to
store archives on multiple floppy discs and the CDFH made it possible
to update the archive by writing only to the last disc, which contained
the Central Directory.
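To make that layout concrete, here is a small stand-alone sketch (not
related to the PR) which reads the fixed part of the first LFH of a zip
file and reports whether the sizes and CRC are recorded in the LFH or
deferred to a Data Descriptor; the offsets follow the zip specification
referenced above:
```
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class LfhDump {
    // Illustrative sketch: dumps the fixed part of the first LFH of the zip
    // file given as argument (offsets as per the zip specification).
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            byte[] lfh = new byte[30];                // fixed part of the LFH
            in.readFully(lfh);
            if (u32(lfh, 0) != 0x04034b50) {          // LFH signature "PK\3\4"
                System.out.println("file does not start with a Local File Header");
                return;
            }
            int flag   = u16(lfh, 6);                 // general purpose bit flag
            long crc   = u32(lfh, 14) & 0xffffffffL;
            long csize = u32(lfh, 18) & 0xffffffffL;
            long size  = u32(lfh, 22) & 0xffffffffL;
            if ((flag & 0x8) != 0) {                  // bit 3: sizes/CRC follow the data
                System.out.println("sizes and CRC deferred to a Data Descriptor");
            } else {
                System.out.printf("crc=%08x csize=%d size=%d (recorded in the LFH)%n",
                                  crc, csize, size);
            }
        }
    }

    // zip header fields are stored little-endian
    private static int u16(byte[] b, int off) {
        return (b[off] & 0xff) | ((b[off + 1] & 0xff) << 8);
    }
    private static int u32(byte[] b, int off) {
        return u16(b, off) | (u16(b, off + 2) << 16);
    }
}
```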
`ZipEntries` read with `ZipInputStream.getNextEntry()` will initially
only contain the information from the LFH. Only after the next entry
has been read (or after `ZipInputStream.closeEntry()` has been called
explicitly) will the previously read entry be updated with the data
from the Data Descriptor. `ZipInputStream` doesn't inspect the Central
Directory at all.
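This behaviour is easy to observe. The following sketch (again just for
illustration) creates a small archive in memory, where `ZipOutputStream`
defers the sizes to a Data Descriptor by default, and prints the
compressed size before and after the entry has been fully consumed:
```
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DataDescriptorDemo {
    public static void main(String[] args) throws Exception {
        // Create a small archive in memory. By default ZipOutputStream defers
        // sizes and CRC to a Data Descriptor, so the LFH carries none of them.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(baos)) {
            zos.putNextEntry(new ZipEntry("hello.txt"));
            zos.write("hello, hello, hello".getBytes(StandardCharsets.US_ASCII));
        }

        try (ZipInputStream zis = new ZipInputStream(
                new ByteArrayInputStream(baos.toByteArray()))) {
            ZipEntry e = zis.getNextEntry();
            // Only the LFH has been parsed so far: the compressed size is unknown.
            System.out.println("after getNextEntry(): csize=" + e.getCompressedSize()); // -1
            zis.closeEntry();
            // closeEntry() consumed the data and the Data Descriptor behind it.
            System.out.println("after closeEntry():   csize=" + e.getCompressedSize());
        }
    }
}
```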
On the other hand, `ZipFile` only queries the Central Directory for
`ZipEntry` information so all `ZipEntries` returned by
`ZipFile.entries()`, `ZipFile.getEntry()` and `ZipFile.stream()` will
always instantly contain the full Compressed and Uncompressed Size and
CRC information for each entry independently of the LFH contents.
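The difference is just as easy to demonstrate with `ZipFile`: in the
sketch below the same kind of entry carries CRC and both sizes right
away because they come from the Central Directory. This is also exactly
why feeding such an entry unchanged into `ZipOutputStream.putNextEntry()`
turns the recorded compressed size into a hard expectation:
```
import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class CentralDirectoryDemo {
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("cdfh-demo", ".zip");
        f.deleteOnExit();
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(f))) {
            zos.putNextEntry(new ZipEntry("hello.txt"));
            zos.write("hello, hello, hello".getBytes(StandardCharsets.US_ASCII));
        }
        // ZipFile reads the Central Directory, so CRC and both sizes are
        // available immediately, even though the LFH of this file has none.
        try (ZipFile zf = new ZipFile(f)) {
            ZipEntry e = zf.getEntry("hello.txt");
            System.out.println("csize=" + e.getCompressedSize()
                    + " size=" + e.getSize()
                    + " crc=" + Long.toHexString(e.getCrc()));
        }
    }
}
```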
### Risks and Assumptions
If we choose to ignore the implicitly recorded compressed size in a
`ZipEntry` read from a zip file when writing it to a
`ZipOutputStream`, this will lead to zip files with incomplete
information in the LFH and an additional Data Descriptor as described
before. However, the result is still fully compatible with the zip file
specification. It's also not unusual, because by default all new zip
files created with `ZipOutputStream` will contain LFHs without
Compressed and Uncompressed Size and CRC information and an additional
Data Descriptor. Theoretically it is possible to create new zip files
with the `ZipOutputStream` class and with Compressed and Uncompressed
Size and CRC information in the LFH, but that's complex and inefficient
because it requires two steps: a first step to determine the CRC and
compressed size of the data, and a second step to actually write the
data to the `ZipOutputStream` (which will compress it a second time).
This is because the current API offers no possibility to write already
compressed data to a `ZipOutputStream`.
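For completeness, here is a sketch of what such a two-step approach
could look like. It assumes both passes run against the same zlib with
`ZipOutputStream`'s default deflater settings; with a different zlib or
level the pre-computed compressed size won't match and we are back to
the `ZipException` from above:
```
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class FullLfhDemo {
    public static void main(String[] args) throws Exception {
        byte[] data = "some data worth compressing, some data worth compressing"
                .getBytes(StandardCharsets.US_ASCII);

        // Step 1: compute CRC-32 and the compressed size up front, using the
        // same settings ZipOutputStream uses internally (default level, raw
        // deflate). A different zlib or level in step 2 would break this.
        CRC32 crc = new CRC32();
        crc.update(data);
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        def.setInput(data);
        def.finish();
        long csize = 0;
        byte[] buf = new byte[8192];
        while (!def.finished()) {
            csize += def.deflate(buf);
        }
        def.end();

        // Step 2: with size, compressed size and CRC pre-set, putNextEntry()
        // records them in the LFH and omits the Data Descriptor; the data is
        // compressed a second time and verified against the pre-set values.
        ZipEntry e = new ZipEntry("data.bin");
        e.setSize(data.length);
        e.setCompressedSize(csize);
        e.setCrc(crc.getValue());
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("full-lfh.zip"))) {
            zos.putNextEntry(e);
            zos.write(data);
            zos.closeEntry();
        }
    }
}
```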
Consequently, the only straight-forward way of creating zip files from
Java which have all the data in the LFH and no Data Descriptor is by
copying `ZipEntries` from an existing zip file with the buggy method
described before. This incidentally worked more or less reliably for a
long time but breaks miserably when using different zlib
implementations. Ignoring the implicitly set compressed size of
`ZipEntries` can easily fix this problem.
I'm not aware of any tool which cannot handle such files, and if one
exists, it would have problems with the majority of Java-created zip
files anyway (e.g. all jar files created with the jar tool have zip
entries with incomplete LFH data and an additional Data Descriptor).
Ignoring the implicitly set compressed size of `ZipEntries` has no
measurable performance impact, and it will increase, by 16 bytes per
entry, the size of zip archives which previously had the complete file
information in the LFH. On the other hand, it will give us the freedom
to use whatever zip implementation we like :)
[1]: https://github.com/gradle/gradle/blob/e76905e3a/subprojects/build-init/src/main/java/org/gradle/api/tasks/wrapper/Wrapper.java#L152-L158
[2]: https://android.googlesource.com/platform/tools/base/+/refs/heads/master/build-system/builder/src/main/java/com/android/builder/testing/MockableJarGenerator.java#86
[3]: https://bugs.openjdk.java.net/browse/JDK-8240333
[4]: https://bugs.openjdk.java.net/browse/JDK-8240235
[5]: https://github.com/openjdk/jdk/blob/master/test/jdk/java/util/zip/ZipFile/CopyJar.java
[6]: https://github.com/simonis/zlib-bench/blob/master/Results.md
[7]: https://en.wikipedia.org/wiki/Zip_(file_format)
[8]: https://github.com/openjdk/jdk/pull/489