RFR: 8253952: Work around wrong usage of ZipOutputStream.putNextEntry() in user code

Tue Oct 6 13:51:21 UTC 2020

### Summary

Work around wrong usage of `ZipOutputStream.putNextEntry()` in user code which can lead to the `ZipException "invalid
entry compressed size"`.

### Motivation

In general it is not safe to directly write a ZipEntry obtained from `ZipInputStream.getNextEntry()`,
`ZipFile.entries()`, `ZipFile.getEntry()` or `ZipFile.stream()` with `ZipOutputStream.putNextEntry()` to a
`ZipOutputStream` and then read the entries data from the `ZipInputStream` and write it to the `ZipOutputStream` as
follows:
 ZipEntry entry;
 ZipInputStream zis = new ZipInputStream(...);
 ZipOutputStream zos = new ZipOutputStream(...);
 while((entry = zis.getNextEntry()) != null) {
     zos.putNextEntry(entry);
     zis.transferTo(zos);
 }
The problem with this code is that the zip file format does not record the compression level used for deflation in its
entries. In general, it doesn't even mandate a predefined compression ratio per compression level. Therefore the
compressed size recorded in a `ZipEntry` read from a zip file might differ from the new compressed size produced by the
receiving `ZipOutputStream`. Such a difference will result in a `ZipException` with the following message:
 java.util.zip.ZipException: invalid entry compressed size (expected 12 but got 7 bytes)

The correct way of copying all entries from one zip file into another requires the creation of a new `ZipEntry` or at
least resetting of the compressed size field. E.g.:
 while((entry = zis.getNextEntry()) != null) {
     ZipEntry newEntry = new ZipEntry(entry.getName());
     zos.putNextEntry(newEntry);
     zis.transferTo(zos);
 }
or:
 while((entry = zis.getNextEntry()) != null) {
     entry.setCompressedSize(-1);
     zos.putNextEntry(entry);
     zis.transferTo(zos);
 }
Unfortunately, there's a lot of user code out there which gets this wrong and uses the bad coding pattern described
before. Searching for `"java.util.zip.ZipException: invalid entry compressed size (expected 12 but got 7 bytes)"` gives
~2500 hits (~100 on StackOverflow). It's also no hard to find plenty of instances of this anti-pattern on GitHub when
doing a code search for `ZipEntry` and `putNextEntry()`. E.g. [Gradle 4.x wrapper task][1] is affected as well as the
latest version of the [mockableAndroidJar task][2]. I've recently fixed two occurrences of this pattern in OpenJDK (see
[JDK-8240333][3] and [JDK-8240235][4]) but there still exist more of them (e.g.
[`test/jdk/java/util/zip/ZipFile/CopyJar.java`][5] which is there since 1999 :).

### Description

So while this has clearly been a problem before, it apparently wasn't painful enough to trigger any action from the
side of the JDK. However, recently quite some zlib forks with [superior deflate/inflate performance have evolved][6].
Using them with OpenJDK is quite straight-forward: one just has to configure the alternative implementations by setting
`LD_LIBRARY_PATH` or `LD_PRELOAD` correspondingly. We've seen big saving by using these new zlib implementations for
selected services in production and the only reason why we haven't enabled them by default until now is the problem
I've just described. The reason why these new libraries uncover the described anti-pattern much more often is because
their compression ratio is slightly different from that of the default zlib library. This can easily trigger a
`ZipException` even if an application is not using a different compression levels but just a zip file created with
another zlib version.

I'd therefore like to propose the following workaround for the wrong `ZipOutputStream.putNextEntry()` usage in user
code:

-  ignore the compressed size if it was implicitly determined from the zip file and not explicitly set by calling
   `ZipEntry.setCompressedSize()`.

- Change the API-documentation of `ZipOutputStream.putNextEntry()` and `JarOutputStream.putNextEntry()` to explain the
  problem and why `putNextEntry()` will ignore the compressed size of a `ZipEntry` if that was set implicitely when
  reading that entry from a `ZipFile` or `ZipInputStream`.

### Technical Details

A zip file consists of a stream of File Entries followed by a Central Directory (see [here for a more detailed
specification][7]). Each File Entry is composed of a Local File Header (LFH) followed by the compressed Data and an
optional Data Descriptor. The LFH contains the File Name and among other attributes the Compressed and Uncompressed
size and CRC of the Data. In the case where the latter three attributes are not available at the time when the LFH is
created, this fact will be recorded in a flag of the LFH and will trigger the creation of a Data Descriptor with the
corresponding information right after the Data section. Finally, the Central Directory contains one Central Directory
File Header (CDFH) for each entry of the  zip archive. The CDFH is an extended version of the LFH and the ultimate
reference for the contents of the zip archive. The redundancy between LFH and CDFH is a tribute to zip's long history
when it was used to store archives on multiple floppy discs and the CDFH allowed to update the archive by only writing
to the last disc which contained the Central Directory.

`ZipEntries` read with `ZipInputStream.getNextEntry()` will initially only contain the information from the LFH. Only
after the next entry was read (or after `ZipInputStream.closeEntry()` was called explicitly), will the previously read
entry be updated with the data from the Data Descriptor. `ZipInputStream` doesn't inspect the Central Directory at all.

On the other hand, `ZipFile` only queries the Central Directory for `ZipEntry` information so all `ZipEntries` returned
by `ZipFile.entries()`, `ZipFile.getEntry()` and `ZipFile.stream()` will always instantly contain the full Compressed
and Uncompressed Size and CRC information for each entry independently of the LFH contents.

### Risks and Assumptions

If we choose to ignore the implicitly recorded compressed size in a `ZipEntry` read from a zip file when writing it to
a `ZipOutputStream`, this will lead to zip files with incomplete information in the LFH and an additional Data
Descriptor as described before. However, the result is still fully compatible to the zip file specification. It's also
not unusual, because by default all new zip files created with `ZipOutputStream` will contain LFHs without Compressed
and Uncompressed Size and CRC information and an additional Data Descriptor. Theoretically it is possible to create new
zip files with `ZipOutputStream` class and Compressed and Uncompressed Size and CRC information in the LFH but that's
complex and inefficient because it requires two steps. A first step to determine the crc and compressed size of the
data and a second step to actually write the data to the `ZipOutputStream` (which will compress it a second time). This
is because the current API offers no possibility to write already compressed data to a `ZipOutputStream`.

Consequently, the only straight-forward way of creating zip files from Java which have all the data in the LFH and no
Data Descriptor is by copying `ZipEntries` from an existing zip file with the buggy method described before. This
incidentally worked more or less reliable for a long time but breaks miserably when using different zlib
implementations. Ignoring the implicitly set compressed size of `ZipEntries` can easily fix this problem.

I'm not aware of any tool which can not handle such files and if it exists it would have problems with the majority of
Java created zip files anyway (e.g. all jar-files created with the jar tool have zip entries with incomplete LFH data
and additional Data Descriptor).

Ignoring the implicitly set compressed size of `ZipEntries` has no measurable performance impact and will increase the
size of zip archives which used to have the complete file information in the LFH before by 16 bytes per entry. On the
other hand it will give us the freedom to use whatever zip implementation we like :)

[1]:
https://github.com/gradle/gradle/blob/e76905e3a/subprojects/build-init/src/main/java/org/gradle/api/tasks/wrapper/Wrapper.java#L152-L158
[2]:
https://android.googlesource.com/platform/tools/base/+/refs/heads/master/build-system/builder/src/main/java/com/android/builder/testing/MockableJarGenerator.java#86
[3]: https://bugs.openjdk.java.net/browse/JDK-8240333 [4]: https://bugs.openjdk.java.net/browse/JDK-8240235 [5]:
https://github.com/openjdk/jdk/blob/master/test/jdk/java/util/zip/ZipFile/CopyJar.java [6]:
https://github.com/simonis/zlib-bench/blob/master/Results.md [7]: https://en.wikipedia.org/wiki/Zip_(file_format)

-------------

Commit messages:
 - 8253952: Work around wrong usage of ZipOutputStream.putNextEntry() in user code

Changes: https://git.openjdk.java.net/jdk/pull/520/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=520&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8253952
  Stats: 244 lines in 4 files changed: 242 ins; 0 del; 2 mod
  Patch: https://git.openjdk.java.net/jdk/pull/520.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/520/head:pull/520

PR: https://git.openjdk.java.net/jdk/pull/520