RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v12]

Tue Jan 16 13:45:27 UTC 2024

On Wed, 10 Jan 2024 13:39:52 GMT, Eirik Bjørsnøs <eirbjo at openjdk.org> wrote:

>> ZipInputStream.readEnd currently assumes a Zip64 data descriptor if the number of compressed or uncompressed bytes read from the inflater is larger than the Zip64 magic value.
>> 
>> While the ZIP format mandates that the data descriptor `SHOULD be stored in ZIP64 format (as 8 byte values) when a file's size exceeds 0xFFFFFFFF`, it also states that `ZIP64 format MAY be used regardless of the size of a file`. For such small entries, the above assumption does not hold.
>> 
>> This PR augments ZipInputStream.readEnd to also assume 8-byte sizes if the ZipEntry includes a Zip64 extra information field AND the 'compressed size' and 'uncompressed size' have the expected Zip64 "magic" value 0xFFFFFFFF. This brings ZipInputStream into alignment with the APPNOTE format spec:
>> 
>> 
>> When extracting, if the zip64 extended information extra 
>> field is present for the file the compressed and 
>> uncompressed sizes will be 8 byte values.
>> 
>> 
>> While small Zip64 files with 8-byte data descriptors are not commonly found in the wild, it is possible to create one using the Info-ZIP command line `-fd` flag:
>> 
>> `echo hello | zip -fd > hello.zip`
>> 
>> The PR also adds a test verifying that such a small Zip64 file can be parsed by ZipInputStream.
>
> Eirik Bjørsnøs has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Remove trailing whitespace
>  - Remove trailing whitespace

src/java.base/share/classes/java/util/zip/ZipInputStream.java line 534:

> 532: 
> 533:         long csize = get32(tmpbuf, LOCSIZ);
> 534:         long size = get32(tmpbuf, LOCLEN);

Hello Eirik, I suspect this part of the change has an issue. The 32-bit CRC precedes the compressed and uncompressed sizes, so it should be read from `tmpbuf` before them. The change now reads the sizes first and only reads the CRC afterwards (in the else block), which could produce incorrect LOC data.
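
For reference, here is a minimal standalone sketch of the field order in question. It is not the `ZipInputStream` code: the class name, the `sampleHeader` helper and the `LOC_*` constants are made up for illustration, with offsets taken from the local file header layout in APPNOTE (CRC-32 at byte 14, compressed size at 18, uncompressed size at 22). It only shows where each field sits in a buffered header and how the 0xFFFFFFFF "magic" sizes would appear.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LocFieldOrder {
    // Hypothetical names; offsets follow the APPNOTE local file header layout.
    static final int LOC_HEADER_LEN = 30;
    static final int LOC_CRC = 14;   // CRC-32, 4 bytes
    static final int LOC_SIZ = 18;   // compressed size, 4 bytes
    static final int LOC_LEN = 22;   // uncompressed size, 4 bytes
    static final long ZIP64_MAGIC = 0xFFFFFFFFL;

    // Reads an unsigned 32-bit little-endian value at the given offset.
    static long get32(byte[] b, int off) {
        return (b[off] & 0xFFL)
             | (b[off + 1] & 0xFFL) << 8
             | (b[off + 2] & 0xFFL) << 16
             | (b[off + 3] & 0xFFL) << 24;
    }

    // Builds a header with only the signature, CRC and size fields populated.
    static byte[] sampleHeader(long crc, long csize, long size) {
        ByteBuffer buf = ByteBuffer.allocate(LOC_HEADER_LEN)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(0, 0x04034b50);          // local file header signature
        buf.putInt(LOC_CRC, (int) crc);
        buf.putInt(LOC_SIZ, (int) csize);
        buf.putInt(LOC_LEN, (int) size);
        return buf.array();
    }

    public static void main(String[] args) {
        // Arbitrary CRC value; both sizes set to the Zip64 magic value.
        byte[] hdr = sampleHeader(0x12345678L, ZIP64_MAGIC, ZIP64_MAGIC);

        long crc = get32(hdr, LOC_CRC);
        long csize = get32(hdr, LOC_SIZ);
        long size = get32(hdr, LOC_LEN);

        boolean magicSizes = csize == ZIP64_MAGIC && size == ZIP64_MAGIC;
        System.out.printf("crc=%08x csize=%08x size=%08x magicSizes=%b%n",
                crc, csize, size, magicSizes);
    }
}
```

When the whole header is already buffered and read by offset like this, each field is addressed independently. The ordering concern applies when the fields are consumed sequentially from the stream.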

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12524#discussion_r1453442814