RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v8]

Wed Jul 31 22:52:32 UTC 2024

On Sat, 27 Jul 2024 15:00:51 GMT, Archie Cobbs <acobbs at openjdk.org> wrote:

>> `GZIPInputStream` supports reading data from multiple concatenated GZIP data streams since [JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425). In order to do this, after a GZIP trailer frame is read, it attempts to read a GZIP header frame and, if successful, proceeds onward to decompress the new stream. If the attempt to decode a GZIP header frame fails, or happens to trigger an `IOException`, it just ignores the trailing garbage and/or the `IOException` and returns EOF.
>> 
>> There are several issues with this:
>> 
>> 1. The behaviors of (a) supporting concatenated streams and (b) ignoring trailing garbage are not documented, much less precisely specified.
>> 2. Ignoring trailing garbage is dubious because it could easily hide errors or other data corruption that an application would rather be notified about. Moreover, the API claims that a `ZipException` will be thrown when corrupt data is read, but obviously that doesn't happen in the trailing garbage scenario (so N concatenated streams where the last one has a corrupted header frame is indistinguishable from N-1 valid streams).
>> 3. There's no way to create a `GZIPInputStream` that does _not_ support stream concatenation.
>> 
>> On the other hand, `GZIPInputStream` is an old class with lots of existing usage, so it's important to preserve the existing behavior, warts and all (note: my the definition of "existing behavior" here includes the bug fix in [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)).
>> 
>> So this patch adds a new constructor that takes two new parameters `allowConcatenation` and `allowTrailingGarbage`. The legacy behavior is enabled by setting both to true; otherwise, they do what they sound like. In particular, when `allowTrailingGarbage` is false, then the underlying input must contain exactly one (if `allowConcatenation` is false) or exactly N (if `allowConcatenation` is true) concatenated GZIP data streams, otherwise an exception is guaranteed.
>
> Archie Cobbs has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Refactor to eliminate "lenient" mode (still failng test: GZIPInZip.java).
>  - Add GZIP input test for various gzip(1) command outputs.

It seems like we're going around in circles a bit here... but maybe the circles are getting smaller? But I'm also running out of steam...

> We don't want to change this long standing behavior as it has the potential of breaking existing applications and it is consistent with gzip and also winzip.

Sounds great. In fact my original plan was to not change the existing behavior at all (by putting all of the new/improved behavior behind a new constructor)...

> So through this PR, we should clarify the javadoc as to what current GZIPInputStream implementation does and add additional tests to expand the coverage

Event just doing this has some subtlety to it... Note, none of the current behavior is actually specified (the entirety of the Javadoc says: "This class implements a stream filter for reading compressed data in the GZIP file format"), so any description we add creates a new actual specification for the future.

Are you sure we want to do that for all of the current behavior?

I would think maybe for some we do and for some we don't:
* Swallowing `IOException`'s - this we should not specify, so that we may someday eliminate it to make this class reliable
* Concatenated GZIP's - this would be good to document
* Reading extra bytes - the only thing to document would be that "there is a chance extra bytes will be read" which is not very useful, so what's the point?

What of that would you want to specify?

My original idea was to only add specification for the new/improved functionality, and leave the existing functionality. If we go this route, then the only thing to decide would be what new/improved functionality we want to add.

> A separate discussion can take place to discuss the merits of whether there is perceived value in throwing an IOException when trailing garbage is encountered as well as any benefit of not supporting concatenated gzip files. It will also allow time for further review of other tools/apis that support gzip to see what they may or may not do.

Guess I thought we were sort-of having those discussions now...

What do we agree on?
* Supporting concatenation is a good thing
* Supporting ignoring trailing garbage is a good thing

What is there debate about?
* Swallowing trailing garbage `IOException`'s - I think it is a bad thing, but it's required for backward compatibility
* Precise stops when `markSupported() = true` - I think it is a good thing (and it is backward compatible)
* Support for reading only one GZIP stream - I think it is a good thing, especially useful when combined with previous

IMO doing nothing is a bad option, because as it stands today the class is fundamentally unreliable, and we should at minimum provide some new way to get reliable behavior (doing so doesn't require changing the existing behavior).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2261597142