RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v8]
Archie Cobbs
acobbs at openjdk.org
Mon Jul 29 17:09:36 UTC 2024
On Mon, 29 Jul 2024 14:27:28 GMT, Lance Andersen <lancea at openjdk.org> wrote:
> ... we really need more datapoint to better understand the risks/benefits in order to make an informed decision.
Agreed.
Here's what I've come up with after a bit of research:
First, we shouldn't confuse GZIP with ZIP files. They are only indirectly related:
* The [DEFLATE format](https://en.wikipedia.org/wiki/Deflate) is a single file compression format
* The [GZIP format](https://en.wikipedia.org/wiki/Gzip) is a wrapper around the DEFLATE format.
* Individual entries in a [ZIP archive](https://en.wikipedia.org/wiki/ZIP_(file_format)) are typically DEFLATE'd, but never GZIP'd
The "overlap" that is relevant here is simply when the file you are storing in a ZIP file happens to itself be a GZIP file e.g. `foo.gz` (so yes, you are likely compressing `foo` twice).
Specifically, the failing test `GZIPInZip.java` is doing this:
1. Create a normal GZIP data file `b.gz`
1. Create a normal ZIP file
1. Write `b.gz` _plus one extra byte_ into the ZIP file
1. Close the ZIP file
1. Open the ZIP file
1. Find the `InputStream` corresponding to the entry for `b.gz`
1. Read and decompress it with a `GZIPInputStream`
1. Verify the extra byte was ignored
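The ZIP wrapper isn't essential to the behavior under test; the same leniency can be reproduced with a plain byte array. Here's a minimal sketch (class and method names are mine, not from the test) showing that the JDK's `GZIPInputStream` silently ignores a single garbage byte appended after the GZIP trailer:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ExtraByteDemo {

    // GZIP-compress a string into a byte array.
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(baos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return baos.toByteArray();
    }

    // Decompress with GZIPInputStream and return the result as a string.
    static String gunzip(byte[] data) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("hello");
        // Append one garbage byte after the GZIP trailer, as GZIPInZip.java does.
        byte[] withExtra = java.util.Arrays.copyOf(compressed, compressed.length + 1);
        withExtra[withExtra.length - 1] = 0x2a;
        // The decoder tries to read a second GZIP header, hits EOF,
        // and quietly treats that as end of stream.
        System.out.println(gunzip(withExtra));   // prints "hello"
    }
}
```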
So this has nothing to do with the ZIP file format itself; rather, it relates to how other software might build ZIP files.
Nor does it directly relate to the GZIP format or how any GZIP decoder (taken in isolation) should behave.
So what I don't understand is what is the motivation for the behavior this test is verifying? It looks like, at best, a bug workaround for some unknown other broken software.
Here's the commit in which it was added:
```
commit 71f33254816a17ff741e0119e16db28181d7b43b
Author: Ivan Gerasimov <igerasim at openjdk.org>
Date:   Tue Oct 15 21:15:17 2013 +0400

    8023431: Test java/util/zip/GZIP/GZIPInZip.java failed
    Properly close PipedStreams. Additional testing for malformed input
    Reviewed-by: darcy, sherman
```
Did you double check the unviewable issues? What does JDK-8023431 say?
I also took a look at how Apache's commons-compress `GzipCompressorInputStream` behaves.
`GzipCompressorInputStream` is the commons-compress version of `GZIPInputStream`. From what I can infer, part of its original motivation was that pre-JDK7 `GZIPInputStream` didn't support concatenated streams. This class provides an optional `decompressConcatenated` boolean constructor parameter (default `false`).
So how does `GzipCompressorInputStream` behave...?
* `IOException`'s thrown by the underlying input stream are never suppressed; this includes `IOException`'s thrown while reading "extra garbage" that may follow the first (if `decompressConcatenated = false`) or any (if `decompressConcatenated = true`) GZIP trailer frame.
* When `decompressConcatenated = false`, after reading the first GZIP trailer frame, the underlying input is `reset()`, if possible, to the stream's boundary. If `reset()` is not supported, some indeterminate number of extra bytes will have been read (note: this caveat is missing from the Javadoc).
* When `decompressConcatenated = true`, the entire input stream is read, and it must contain only whole, valid GZIP streams with no trailing garbage.
Summary: `GzipCompressorInputStream` is never "lenient"; it either reads the whole stream (if `decompressConcatenated = true`) or else just the first GZIP stream and stops as soon as it can thereafter (exactly if `markSupported()`, otherwise inexactly).
Side note: the reliance on `mark()`/`reset()` is a nice trick: if you can provide an input stream that supports them, you get precision in return, and this is accomplished without requiring any new API surface.
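For anyone unfamiliar with the trick, here's a sketch of the repositioning pattern in isolation (the names are mine, not the commons-compress internals): mark before a buffered read that may overshoot, then reset and skip forward to the exact boundary.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetDemo {

    // Read up to buf.length bytes, then reposition the stream so that only
    // `consumed` of them count as read; the overshoot is "handed back".
    // Assumes in.markSupported() returns true.
    static int readExactly(InputStream in, byte[] buf, int consumed) throws IOException {
        in.mark(buf.length);          // remember the current position
        int n = in.read(buf);         // may read past the logical boundary
        in.reset();                   // rewind to the mark...
        in.skip(consumed);            // ...then advance to the exact boundary
        return n;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "GZIPDATAtrailing".getBytes();
        InputStream in = new ByteArrayInputStream(data);   // supports mark/reset
        byte[] buf = new byte[16];
        readExactly(in, buf, 8);      // pretend the GZIP stream ended after 8 bytes
        // The next read picks up precisely at the boundary:
        System.out.println((char) in.read());   // prints 't', the first trailing byte
    }
}
```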
This brings up the question, how would commons-compress behave in the scenario being tested by the `GZIPInZip.java` test?
Answer: Let's assume you replace `new GZIPInputStream(zis)` with `new GzipCompressorInputStream(zis)` in that test. Then the test would succeed, because:
* By default, `GzipCompressorInputStream` sets `decompressConcatenated = false`.
* The extra byte might get read, but because `decompressConcatenated = false`, decompression stops after the last byte of `b.gz`.
Overall the API and behavior design of `GzipCompressorInputStream` seems like a good one. I would even recommend that we adopt it, except that it's incompatible with the current behavior (we assume `decompressConcatenated = true` by default).
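For reference, the current default behavior (`decompressConcatenated = true`, in commons-compress terms) is easy to demonstrate: the JDK's `GZIPInputStream` transparently joins concatenated GZIP members. A small sketch (class name is mine):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ConcatDemo {

    // GZIP-compress a string into a byte array.
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(baos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Two complete GZIP members back to back, like `cat a.gz b.gz`
        ByteArrayOutputStream both = new ByteArrayOutputStream();
        both.write(gzip("foo"));
        both.write(gzip("bar"));
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(both.toByteArray()))) {
            // The JDK decoder continues past the first trailer and decodes both members.
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));   // prints "foobar"
        }
    }
}
```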
So here's a stab at what we might do (opening debate): Basically, we replicate the `GzipCompressorInputStream` behavior with the exception that our `allowConcatenated` defaults to `true` to preserve existing behavior. In summary:
* We utilize `mark()`/`reset()` to guarantee precise stopping behavior when `allowConcatenated = false`, but only if `markSupported()` returns true; otherwise we make a best-effort, imprecise stop that may read past the end of the GZIP stream.
* We eliminate all "lenient" behavior: underlying `IOException`s are never suppressed, and if concatenation is enabled, non-empty GZIP header frames must always be valid and the entire stream is consumed.
* We update `GZIPInZip.java` to replace `new GZIPInputStream(zis)` with `new GZIPInputStream(zis, false)` so that it passes again.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2256461082