RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v9]

Mon Jan 8 14:50:40 UTC 2024

On Fri, 22 Dec 2023 07:55:24 GMT, Eirik Bjørsnøs <eirbjo at openjdk.org> wrote:

>> ZipInputStream.readEnd currently assumes a Zip64 data descriptor if the number of compressed or uncompressed bytes read from the inflater is larger than the Zip64 magic value.
>> 
>> While the ZIP format mandates that the data descriptor `SHOULD be stored in ZIP64 format (as 8 byte values) when a file's size exceeds 0xFFFFFFFF`, it also states that `ZIP64 format MAY be used regardless of the size of a file`. For such small entries, the above assumption does not hold.
>> 
>> This PR augments ZipInputStream.readEnd to also assume 8-byte sizes if the ZipEntry includes a Zip64 extra information field. This brings ZipInputStream into alignment with the APPNOTE format spec:
>> 
>> 
>> When extracting, if the zip64 extended information extra 
>> field is present for the file the compressed and 
>> uncompressed sizes will be 8 byte values.
>> 
>> 
>> While small Zip64 files with 8-byte data descriptors are not commonly found in the wild, it is possible to create one using the Info-ZIP command line `-fd` flag:
>> 
>> `echo hello | zip -fd > hello.zip`
>> 
>> The PR also adds a test verifying that such a small Zip64 file can be parsed by ZipInputStream.
>
> Eirik Bjørsnøs has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 33 commits:
> 
>  - Merge branch 'master' into data-descriptor
>  - Extract ZIP64_BLOCK_SIZE_OFFSET as a constant
>  - A Zip64 extra field used in a LOC header must include both the uncompressed and compressed size fields, and does not include local header offset or disk start number fields. Consequently, a valid LOC Zip64 block must always be 16 bytes long.
>  - Document better the zip command and options used to generate the test vector ZIP
>  - Fix spelling of "presence"
>  - Add a @bug reference in the test
>  - Use the term "block size" when referring to the size of a Zip64 extra field data block
>  - Update comment to reflect that a Zip64 extended field in a LOC header has only two valid block sizes
>  - Convert test from testNG to JUnit
>  - Fix the check that the size of an extra field block size must not grow past the total extra field length
>  - ... and 23 more: https://git.openjdk.org/jdk/compare/e2042421...ddff130f

Hello Eirik, I had a look at this one today. The motivation behind this change appears to be the statement in the specification which says:

4.3.9  Data descriptor:

        crc-32                          4 bytes
        compressed size                 4 bytes
        uncompressed size               4 bytes

      4.3.9.1 This descriptor MUST exist if bit 3 of the general
      purpose bit flag is set.
      ...
      For ZIP64(tm) format archives, the compressed and uncompressed 
      sizes are 8 bytes each.

      4.3.9.2 When compressing files, compressed and uncompressed sizes 
      SHOULD be stored in ZIP64 format (as 8 byte values) when a 
      file's size exceeds 0xFFFFFFFF.   However ZIP64 format MAY be 
      used regardless of the size of a file.  When extracting, if 
      the zip64 extended information extra field is present for 
      the file the compressed and uncompressed sizes will be 8
      byte values.  
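
To make the two layouts described in 4.3.9 concrete, here is a minimal sketch of parsing a data descriptor with either 4-byte or 8-byte size fields. It is not the `ZipInputStream` code; the class name `DataDescriptorSketch` and the method `parse` are hypothetical, and the caller is assumed to already know which layout to expect. The optional leading signature `0x08074b50` is detected heuristically.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not the JDK implementation.
final class DataDescriptorSketch {
    // Optional data descriptor signature ("PK\007\008").
    static final long EXTSIG = 0x08074b50L;

    // Returns {crc, compressedSize, uncompressedSize}.
    // 'zip64' selects 8-byte size fields; otherwise 4-byte fields are read.
    static long[] parse(byte[] buf, boolean zip64) {
        ByteBuffer bb = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
        long first = Integer.toUnsignedLong(bb.getInt());
        // The signature is optional; if the first word matches it, the CRC follows.
        long crc = (first == EXTSIG) ? Integer.toUnsignedLong(bb.getInt()) : first;
        long csize;
        long size;
        if (zip64) {
            csize = bb.getLong();
            size = bb.getLong();
        } else {
            csize = Integer.toUnsignedLong(bb.getInt());
            size = Integer.toUnsignedLong(bb.getInt());
        }
        return new long[] {crc, csize, size};
    }
}
```

Reading a 24-byte Zip64 descriptor with `zip64 == false` would misinterpret the sizes and leave 8 unread bytes in the stream, which is why choosing the right layout matters for every entry that follows.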

Effectively, this change proposes to enhance `java.util.zip.ZipInputStream` so that it can parse some zip/jar files which it currently fails to parse, even though the spec allows such files.

Looking through the proposed change, it affects only `DEFLATED` entries and doesn't impact `STORED` entries. Furthermore, it affects only those `DEFLATED` entries which have a data descriptor (understandably).

What's being done in this change is that, in a zip/jar file, for each `DEFLATED` entry which has a data descriptor, we now have additional logic which decides whether we read the compressed/uncompressed sizes in the data descriptor as 4 bytes or as 8 bytes each. Before this change, we read them as 8 bytes only when the `Inflater` (for that entry) told us that it had dealt with more than `0xFFFFFFFFL` bytes of data for that entry (indicating that this is a zip64 entry). With the current proposed change, we rely not only on the `Inflater` for this decision but also on the `ZipEntry` itself to tell us whether to read 8 bytes or 4 bytes each.
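
As a rough sketch of the decision described above (not the code in the PR), the check could look like the following. The class `Zip64DescriptorDecision` and its method names are hypothetical. The 16-byte block size requirement is an assumption taken from the commit note quoted earlier, and `0x0001` is the Zip64 extended information header ID from the APPNOTE.

```java
// Hypothetical illustration only; not the JDK implementation.
final class Zip64DescriptorDecision {
    static final int ZIP64_EXTID = 0x0001;          // Zip64 extra field header ID
    static final long ZIP64_MAGICVAL = 0xFFFFFFFFL; // largest value a 4-byte size can hold
    static final int ZIP64_LOC_BLOCK_SIZE = 16;     // assumed: two 8-byte sizes in a LOC header

    // Reads an unsigned little-endian 16-bit value.
    static int get16(byte[] b, int off) {
        return (b[off] & 0xff) | ((b[off + 1] & 0xff) << 8);
    }

    // Walks the extra field blocks (2-byte ID, 2-byte size, data) looking for
    // a Zip64 block of the assumed valid size.
    static boolean hasZip64Extra(byte[] extra) {
        if (extra == null) {
            return false;
        }
        int off = 0;
        while (off + 4 <= extra.length) {
            int id = get16(extra, off);
            int size = get16(extra, off + 2);
            if (off + 4 + size > extra.length) {
                return false; // block would run past the extra field; treat as invalid
            }
            if (id == ZIP64_EXTID) {
                return size == ZIP64_LOC_BLOCK_SIZE;
            }
            off += 4 + size;
        }
        return false;
    }

    // Old rule: byte counters exceed the 4-byte range. New rule: also the extra field.
    static boolean expect8ByteSizes(long bytesWritten, long bytesRead, byte[] extra) {
        return bytesWritten > ZIP64_MAGICVAL
            || bytesRead > ZIP64_MAGICVAL
            || hasZip64Extra(extra);
    }
}
```

Here `bytesWritten` and `bytesRead` stand in for the values reported by the `Inflater`, and `extra` for the bytes returned by `ZipEntry.getExtra()`.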

Given this context, I've looked through the changes and I think some additional changes are needed to prevent some potential issues with this proposal. I've added those comments inline.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/12524#issuecomment-1881152933