RFR: 8303866: Allow ZipInputStream.readEnd to parse small Zip64 ZIP files [v9]

Wed Jan 10 17:04:29 UTC 2024

On Mon, 8 Jan 2024 18:21:06 GMT, Joe Darcy <darcy at openjdk.org> wrote:

> Sounds like a CSR is needed for the behavioral change proposed here.

Thanks for flagging this @jddarcy 

I'm personally not 100% convinced a CSR is warranted in this case, but I'm of course happy to extend the following into a CSR if you or other reviewer think that's the best way forward:  

Let's review the intended and unintended behavioral changes in this PR:  

The _intended_ behavioral change is to extend the set of ZIP entries parsable by `ZipInputStream` to include "small Zip64 entries". Such entries meet the following criteria:

1. They are clearly marked as using the Zip64 format, meaning that the LOC's 'compressed size' and 'uncompressed size' fields are set to the "Zip64 magic value" 0xFFFFFFFF and that the LOC's extra field includes a valid _Zip64 Extended Information Field_.
2. They use _streaming mode_, meaning that the 'general purpose bit flag' 3 is set, that the Zip64 'Original Size' and 'Compressed Size' are both set to zero and that file data is followed by a 'Data Descriptor' containing the actual size values.
3. The Data Descriptor represents the 'compressed size' and 'uncompressed' size fields using 8 byte fields, instead of the regular 4 byte fields.

These entires are currently not parsable by `ZipInputStream`. Trying to parse them fails with an exception similar to the following:

java.util.zip.ZipException: invalid entry size (expected 0 but got 6 bytes)
	at java.base/java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:616)

The _unintended_ behavioral change in this PR is that ZIP entries marked as Zip64, but still having a Data Descriptor using 4 bytes to represent the size fields will now become unparsable. Such entries meet critera 1 and 2 above, but not 3.

While I have not performed a corpus search, I believe such entries are rare, if they exist at all:

- They would be in clear violation of the `APPNOTE.txt` spec, which in 4.3.9.2 says  _ZIP64 format MAY be used regardless of the size of a file.  When extracting, if the zip64 extended information extra field is present for the file the compressed and uncompressed sizes will be 8 byte values_
- The `zipdetails` tool fails to parse such a file, instead reading the two 4-byte numbers into one 8-byte 'compressed size'
- The file cannot be parsed by external APIs similar to `ZipInputStream`, such as the Python `stream-unzip` module.

To avoid unintended behavioral changes in the implementation, care has been taken to make the Zip64 extra field validation code robust to invalid input, including out of bounds array access. 

To summarize, this PR aims to allow a set of useful, valid ZIP files to be parsed, at the cost of potentially making a small set of invalid ZIP files unparsable.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/12524#issuecomment-1885245195