RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v8]

Fri Aug 2 13:14:33 UTC 2024

On Sat, 27 Jul 2024 15:00:51 GMT, Archie Cobbs <acobbs at openjdk.org> wrote:

>> `GZIPInputStream` supports reading data from multiple concatenated GZIP data streams since [JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425). In order to do this, after a GZIP trailer frame is read, it attempts to read a GZIP header frame and, if successful, proceeds onward to decompress the new stream. If the attempt to decode a GZIP header frame fails, or happens to trigger an `IOException`, it just ignores the trailing garbage and/or the `IOException` and returns EOF.
>> 
>> There are several issues with this:
>> 
>> 1. The behaviors of (a) supporting concatenated streams and (b) ignoring trailing garbage are not documented, much less precisely specified.
>> 2. Ignoring trailing garbage is dubious because it could easily hide errors or other data corruption that an application would rather be notified about. Moreover, the API claims that a `ZipException` will be thrown when corrupt data is read, but obviously that doesn't happen in the trailing garbage scenario (so N concatenated streams where the last one has a corrupted header frame is indistinguishable from N-1 valid streams).
>> 3. There's no way to create a `GZIPInputStream` that does _not_ support stream concatenation.
>> 
>> On the other hand, `GZIPInputStream` is an old class with lots of existing usage, so it's important to preserve the existing behavior, warts and all (note: my the definition of "existing behavior" here includes the bug fix in [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)).
>> 
>> So this patch adds a new constructor that takes two new parameters `allowConcatenation` and `allowTrailingGarbage`. The legacy behavior is enabled by setting both to true; otherwise, they do what they sound like. In particular, when `allowTrailingGarbage` is false, then the underlying input must contain exactly one (if `allowConcatenation` is false) or exactly N (if `allowConcatenation` is true) concatenated GZIP data streams, otherwise an exception is guaranteed.
>
> Archie Cobbs has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Refactor to eliminate "lenient" mode (still failng test: GZIPInZip.java).
>  - Add GZIP input test for various gzip(1) command outputs.

Hello Archie,

> It seems like we're going around in circles a bit here... but maybe the circles are getting smaller?

We try to avoid/reduce changes in zip, gzip and other similar areas because of several open to interpretation parts of the specification and the fact that there are tools outside of the JDK (like the one we found out from Lance's example) which behave in a certain manner. I agree we sometimes do end up having to re-evaluate the original proposal. It's not a bad thing though. In fact the current discussion in this PR and the experiments have been useful to go back and check why the code behaves the way it currently does.

> IMO doing nothing is a bad option, because as it stands today the class is fundamentally unreliable

Lance, me and others have been discussing the changes in this PR for the past several days. We started this off with an intention to introduce a new constructor to allow more tighter control over not processing the "corrupt" bytes after a GZIP trailer. Experiments with the other tools (gunzip, winzip and even 7zip) has shown us that any garbage/corrupt data after a GZIP trailer is considered equivalent to end of stream and any deflated content thus far is returned from that stream - just like what the `GZIPInputStream` currently does in mainline. So its behaviour is at least consistent with these other commonly used tools. 

> Are you sure we want to do that for all of the current behavior?
>
> I would think maybe for some we do and for some we don't:
>
>    Swallowing IOException's - this we should not specify, so that we may someday eliminate it to make this class reliable
>    Concatenated GZIP's - this would be good to document
>    Reading extra bytes - the only thing to document would be that "there is a chance extra bytes will be read" which is not very useful, so what's the point?
>
> What of that would you want to specify?

The `GZIPInputStream` currently states:

> This class implements a stream filter for reading compressed data in the GZIP file format.

and that's it, nothing more. What we want to do is specify that the InputStream that is passed to the GZIPInputStream constructor is allowed to have one or more GZIP streams. A stream with more than one GZIP stream is considered a concatenated GZIP stream. The GZIPInputStream will stop processing any further content after an individual GZIP stream's trailer if that content doesn't represent a valid GZIP stream header of the next GZIP stream. 
(Of course, it might need rewording a bit with better grammar)

I think specifying these parts in the javadoc will clarify the current behaviour without getting too much into the implementation details.

> But I'm also running out of steam...

I'm sorry you had to go back and forth on some of these proposals and I understand it can be confusing and demotivating. I think to take this exercise to conclusion, we should only focus on specifying the current behaviour. Separately expanding the test coverage would be the next important thing (I'm not saying you should be doing it).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2265378324