RFR: 8253952: Refine ZipOutputStream.putNextEntry() to recalculate ZipEntry's compressed size [v3]

Mon Oct 12 15:40:19 UTC 2020

On Mon, 12 Oct 2020 11:44:28 GMT, Volker Simonis <simonis at openjdk.org> wrote:

>> ### Summary
>> 
>> Work around wrong usage of `ZipOutputStream.putNextEntry()` in user code which can lead to the `ZipException "invalid
>> entry compressed size"`.
>> ### Motivation
>> 
>> In general it is not safe to directly write a ZipEntry obtained from `ZipInputStream.getNextEntry()`,
>> `ZipFile.entries()`, `ZipFile.getEntry()` or `ZipFile.stream()` with `ZipOutputStream.putNextEntry()` to a
>> `ZipOutputStream` and then read the entries data from the `ZipInputStream` and write it to the `ZipOutputStream` as
>> follows:
>>  ZipEntry entry;
>>  ZipInputStream zis = new ZipInputStream(...);
>>  ZipOutputStream zos = new ZipOutputStream(...);
>>  while((entry = zis.getNextEntry()) != null) {
>>      zos.putNextEntry(entry);
>>      zis.transferTo(zos);
>>  }
>> The problem with this code is that the zip file format does not record the compression level used for deflation in its
>> entries. In general, it doesn't even mandate a predefined compression ratio per compression level. Therefore the
>> compressed size recorded in a `ZipEntry` read from a zip file might differ from the new compressed size produced by the
>> receiving `ZipOutputStream`. Such a difference will result in a `ZipException` with the following message:
>>  java.util.zip.ZipException: invalid entry compressed size (expected 12 but got 7 bytes)
>>  
>> The correct way of copying all entries from one zip file into another requires the creation of a new `ZipEntry` or at
>> least resetting of the compressed size field. E.g.:
>>  while((entry = zis.getNextEntry()) != null) {
>>      ZipEntry newEntry = new ZipEntry(entry.getName());
>>      zos.putNextEntry(newEntry);
>>      zis.transferTo(zos);
>>  }
>> or:
>>  while((entry = zis.getNextEntry()) != null) {
>>      entry.setCompressedSize(-1);
>>      zos.putNextEntry(entry);
>>      zis.transferTo(zos);
>>  }
>> Unfortunately, there's a lot of user code out there which gets this wrong and uses the bad coding pattern described
>> before. Searching for `"java.util.zip.ZipException: invalid entry compressed size (expected 12 but got 7 bytes)"` gives
>> ~2500 hits (~100 on StackOverflow). It's also no hard to find plenty of instances of this anti-pattern on GitHub when
>> doing a code search for `ZipEntry` and `putNextEntry()`. E.g. [Gradle 4.x wrapper task][1] is affected as well as the
>> latest version of the [mockableAndroidJar task][2]. I've recently fixed two occurrences of this pattern in OpenJDK (see
>> [JDK-8240333][3] and [JDK-8240235][4]) but there still exist more of them (e.g.
>> [`test/jdk/java/util/zip/ZipFile/CopyJar.java`][5] which is there since 1999 :).  ### Description  So while this has
>> clearly been a problem before, it apparently wasn't painful enough to trigger any action from the side of the JDK.
>> However, recently quite some zlib forks with [superior deflate/inflate performance have evolved][6]. Using them with
>> OpenJDK is quite straight-forward: one just has to configure the alternative implementations by setting
>> `LD_LIBRARY_PATH` or `LD_PRELOAD` correspondingly. We've seen big saving by using these new zlib implementations for
>> selected services in production and the only reason why we haven't enabled them by default until now is the problem
>> I've just described. The reason why these new libraries uncover the described anti-pattern much more often is because
>> their compression ratio is slightly different from that of the default zlib library. This can easily trigger a
>> `ZipException` even if an application is not using a different compression levels but just a zip file created with
>> another zlib version.  I'd therefore like to propose the following workaround for the wrong
>> `ZipOutputStream.putNextEntry()` usage in user code:
>> -  ignore the compressed size if it was implicitly determined from the zip file and not explicitly set by calling
>>    `ZipEntry.setCompressedSize()`.
>> 
>> - Change the API-documentation of `ZipOutputStream.putNextEntry()` and `JarOutputStream.putNextEntry()` to explain the
>>   problem and why `putNextEntry()` will ignore the compressed size of a `ZipEntry` if that was set implicitely when
>>   reading that entry from a `ZipFile` or `ZipInputStream`.
>> 
>> 
>> ### Technical Details
>> 
>> A zip file consists of a stream of File Entries followed by a Central Directory (see [here for a more detailed
>> specification][7]). Each File Entry is composed of a Local File Header (LFH) followed by the compressed Data and an
>> optional Data Descriptor. The LFH contains the File Name and among other attributes the Compressed and Uncompressed
>> size and CRC of the Data. In the case where the latter three attributes are not available at the time when the LFH is
>> created, this fact will be recorded in a flag of the LFH and will trigger the creation of a Data Descriptor with the
>> corresponding information right after the Data section. Finally, the Central Directory contains one Central Directory
>> File Header (CDFH) for each entry of the  zip archive. The CDFH is an extended version of the LFH and the ultimate
>> reference for the contents of the zip archive. The redundancy between LFH and CDFH is a tribute to zip's long history
>> when it was used to store archives on multiple floppy discs and the CDFH allowed to update the archive by only writing
>> to the last disc which contained the Central Directory.  `ZipEntries` read with `ZipInputStream.getNextEntry()` will
>> initially only contain the information from the LFH. Only after the next entry was read (or after
>> `ZipInputStream.closeEntry()` was called explicitly), will the previously read entry be updated with the data from the
>> Data Descriptor. `ZipInputStream` doesn't inspect the Central Directory at all.  On the other hand, `ZipFile` only
>> queries the Central Directory for `ZipEntry` information so all `ZipEntries` returned by `ZipFile.entries()`,
>> `ZipFile.getEntry()` and `ZipFile.stream()` will always instantly contain the full Compressed and Uncompressed Size and
>> CRC information for each entry independently of the LFH contents.  ### Risks and Assumptions  If we choose to ignore
>> the implicitly recorded compressed size in a `ZipEntry` read from a zip file when writing it to a `ZipOutputStream`,
>> this will lead to zip files with incomplete information in the LFH and an additional Data Descriptor as described
>> before. However, the result is still fully compatible to the zip file specification. It's also not unusual, because by
>> default all new zip files created with `ZipOutputStream` will contain LFHs without Compressed and Uncompressed Size and
>> CRC information and an additional Data Descriptor. Theoretically it is possible to create new zip files with
>> `ZipOutputStream` class and Compressed and Uncompressed Size and CRC information in the LFH but that's complex and
>> inefficient because it requires two steps. A first step to determine the crc and compressed size of the data and a
>> second step to actually write the data to the `ZipOutputStream` (which will compress it a second time). This is because
>> the current API offers no possibility to write already compressed data to a `ZipOutputStream`.  Consequently, the only
>> straight-forward way of creating zip files from Java which have all the data in the LFH and no Data Descriptor is by
>> copying `ZipEntries` from an existing zip file with the buggy method described before. This incidentally worked more or
>> less reliable for a long time but breaks miserably when using different zlib implementations. Ignoring the implicitly
>> set compressed size of `ZipEntries` can easily fix this problem.  I'm not aware of any tool which can not handle such
>> files and if it exists it would have problems with the majority of Java created zip files anyway (e.g. all jar-files
>> created with the jar tool have zip entries with incomplete LFH data and additional Data Descriptor).  Ignoring the
>> implicitly set compressed size of `ZipEntries` has no measurable performance impact and will increase the size of zip
>> archives which used to have the complete file information in the LFH before by 16 bytes per entry. On the other hand it
>> will give us the freedom to use whatever zip implementation we like :)   [1]:
>> https://github.com/gradle/gradle/blob/e76905e3a/subprojects/build-init/src/main/java/org/gradle/api/tasks/wrapper/Wrapper.java#L152-L158
>> [2]:
>> https://android.googlesource.com/platform/tools/base/+/refs/heads/master/build-system/builder/src/main/java/com/android/builder/testing/MockableJarGenerator.java#86
>> [3]: https://bugs.openjdk.java.net/browse/JDK-8240333 [4]: https://bugs.openjdk.java.net/browse/JDK-8240235 [5]:
>> https://github.com/openjdk/jdk/blob/master/test/jdk/java/util/zip/ZipFile/CopyJar.java [6]:
>> https://github.com/simonis/zlib-bench/blob/master/Results.md [7]: https://en.wikipedia.org/wiki/Zip_(file_format)
>
> Volker Simonis has refreshed the contents of this pull request, and previous commits have been removed. The incremental
> views will show differences compared to the previous content of the PR. The pull request contains one new commit since
> the last revision:
>   8253952: Refine ZipOutputStream.putNextEntry() to recalculate ZipEntry's compressed size

src/java.base/share/classes/java/util/jar/JarOutputStream.java line 81:

> 79:      * any previous entry. The default compression method will be
> 80:      * used if no compression method was specified for the entry.
> 81:      * When writing a compressed (i.e. {@code ZipEntry.DEFLATED}))

I would leave as DEFLATED given it is used this way in other ZipOutputStream methods and I would also remove the  "i.e."

-------------

PR: https://git.openjdk.java.net/jdk/pull/520