RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v9]

Tue Jan 9 13:04:34 UTC 2024

On Mon, 8 Jan 2024 14:59:40 GMT, Jaikiran Pai <jpai at openjdk.org> wrote:

>> Eirik Bjørsnøs has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 33 commits:
>> 
>>  - Merge branch 'master' into data-descriptor
>>  - Extract ZIP64_BLOCK_SIZE_OFFSET as a constant
>>  - A Zip64 extra field used in a LOC header must include both the uncompressed and compressed size fields, and does not include local header offset or disk start number fields. Conequently, a valid LOC Zip64 block must always be 16 bytes long.
>>  - Document better the zip command and options used to generate the test vector ZIP
>>  - Fix spelling of "presence"
>>  - Add a @bug reference in the test
>>  - Use the term "block size" when referring to the size of a Zip64 extra field data block
>>  - Update comment reflect that a Zip64 extended field in a LOC header has only two valid block sizes
>>  - Convert test from testNG to JUnit
>>  - Fix the check that the size of an extra field block size must not grow past the total extra field length
>>  - ... and 23 more: https://git.openjdk.org/jdk/compare/e2042421...ddff130f
>
> src/java.base/share/classes/java/util/zip/ZipInputStream.java line 692:
> 
>> 690:     private static boolean isZip64ExtBlockSizeValid(int blockSize) {
>> 691:         // Uncompressed and compressed size fields are 8 bytes each
>> 692:         return blockSize == 16;
> 
> I'm not following this check. As far as I can see, the `blockSize` being passed to this method is the size of the zip64 extra entry and as per the spec:
> 
> 
> 4.5.3 -Zip64 Extended Information Extra Field (0x0001):
> 
>       The following is the layout of the zip64 extended 
>       information "extra" block. 
> 
>       ....
> 
>         Value      Size       Description
>         -----      ----       -----------
> (ZIP64) 0x0001     2 bytes    Tag for this "extra" block type
>         Size       2 bytes    Size of this "extra" block
>         Original 
>         Size       8 bytes    Original uncompressed file size
>         Compressed
>         Size       8 bytes    Size of compressed data
>         Relative Header
>         Offset     8 bytes    Offset of local header record
>         Disk Start
>         Number     4 bytes    Number of the disk on which
>                               this file starts 
> 
> So shouldn't it be 8 + 8 + 8 + 4 = 28 bytes and not 16 bytes? Did I misunderstand the code or the spec?

The size of the "Zip64 Extended Information Extra Field" is determined by the following factors:

- Whether the corresponding field in the LOC or CEN record is larger than the "magic" limit (0xFFFF for "Disk Start Number", 0xFFFFFFFF for "Original Size", "Compressed Size" and "Relative Header Offset"). A component should only appear if the corresponding LOC record field is set to 0xFFFF or 0xFFFFFFFF).
- Whether the Zip64 extra field is contained in a LOC or a CEN header: While Zip64 fields in both headers may have the "Original Size" and "Compressed Size" fields, only the CEN has the corresponding "Relative Header Offset" and "Disk Start Number" fields. So these two components are not relevant in the LOC. (And indeed they are not present in the Info-ZIP test vector used in this PR. See `Zip64DataDescriptor` for the structure)

In this case, we have a Zip64 extra header in a LOC record using "data descriptors". While the spec mandates that LOC headers with the "data descriptor" bit set must have their "size" and "compressed size" fields set to zero, in this case we also have Zip64, so these fields are actually set to 0xFFFFFFFF, and zero is the value used in the Zip64 extra fields. (This means the data size will in this case will always be 16. And we could also add another check verifying that both components are zero as expected when used with a "data descriptor") 

The full text of 4.5.3:

4.5.3 -Zip64 Extended Information Extra Field (0x0001):

      The following is the layout of the zip64 extended 
      information "extra" block. If one of the size or
      offset fields in the Local or Central directory
      record is too small to hold the required data,
      a Zip64 extended information record is created.
      The order of the fields in the zip64 extended 
      information record is fixed, but the fields MUST
      only appear if the corresponding Local or Central
      directory record field is set to 0xFFFF or 0xFFFFFFFF.
``` 

It would probably be better to compute the expected size of the Zip64 data by looking at the "size" and "compressed size" in the LOC and add 8 bytes for each field being 0xFFFFFFFF. Then the expected size (8 or 16) can be compared to the actual size. 

I think some recent validation of the Zip64 extra fields in the CEN header was implemented this way. Unfortunately, these values are actually ignored by `readLOC` when it determines the "data descriptor" bit being set, so they are not available to `readEND`.

Perhaps it would be better if we moved the testing for presence of a valid Zip64 extra field into `readLOC` method and have that method set some boolean flag to indicate to readEND whether it should expect 4 or 8 byte fields in the data descriptor? 

This would allow us to check for "size" and "compressed size" being  0xFFFFFFFF AND it would alleviate concerns about trustig the ZipEntry.getExtra value.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12524#discussion_r1446064783