RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v5]

Mon Jul 15 10:05:06 UTC 2024

On Thu, 6 Jun 2024 17:03:57 GMT, Archie Cobbs <acobbs at openjdk.org> wrote:

>> `GZIPInputStream` supports reading data from multiple concatenated GZIP data streams since [JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425). In order to do this, after a GZIP trailer frame is read, it attempts to read a GZIP header frame and, if successful, proceeds onward to decompress the new stream. If the attempt to decode a GZIP header frame fails, or happens to trigger an `IOException`, it just ignores the trailing garbage and/or the `IOException` and returns EOF.
>> 
>> There are several issues with this:
>> 
>> 1. The behaviors of (a) supporting concatenated streams and (b) ignoring trailing garbage are not documented, much less precisely specified.
>> 2. Ignoring trailing garbage is dubious because it could easily hide errors or other data corruption that an application would rather be notified about. Moreover, the API claims that a `ZipException` will be thrown when corrupt data is read, but obviously that doesn't happen in the trailing garbage scenario (so N concatenated streams where the last one has a corrupted header frame is indistinguishable from N-1 valid streams).
>> 3. There's no way to create a `GZIPInputStream` that does _not_ support stream concatenation.
>> 
>> On the other hand, `GZIPInputStream` is an old class with lots of existing usage, so it's important to preserve the existing behavior, warts and all (note: my the definition of "existing behavior" here includes the bug fix in [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)).
>> 
>> So this patch adds a new constructor that takes two new parameters `allowConcatenation` and `allowTrailingGarbage`. The legacy behavior is enabled by setting both to true; otherwise, they do what they sound like. In particular, when `allowTrailingGarbage` is false, then the underlying input must contain exactly one (if `allowConcatenation` is false) or exactly N (if `allowConcatenation` is true) concatenated GZIP data streams, otherwise an exception is guaranteed.
>
> Archie Cobbs has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits:
> 
>  - Bump @since from 23 to 24.
>  - Merge branch 'master' into JDK-8322256
>  - Relabel "trailing garbage" as "extra bytes" to sound less accusatory.
>  - Simplify code by eliminating an impossible case.
>  - Field name change and Javadoc wording tweaks.
>  - Merge branch 'master' into JDK-8322256
>  - Javadoc wording tweaks.
>  - Merge branch 'master' into JDK-8322256
>  - Clarify exceptions: sometimes ZipException, sometimes EOFException.
>  - Merge branch 'master' into JDK-8322256
>  - ... and 2 more: https://git.openjdk.org/jdk/compare/75dc2f85...f845a75b

Hello Archie, sorry it has taken long to review this PR.

The proposal here is to have `java.util.zip.GZIPInputStream` specify (and improve) how it deals with concatenated GZIP stream(s) and also allow for applications instantiating `GZIPInputStream` to decide whether or not the underlying `InputStream` passed to the `GZIPInputStream` instance is expected to contain a concatenated GZIP stream(s).

I'll include here the text that I had sent in a private communication to Archie:

I think this comes down to introducing a new optional boolean parameter to the constructor of GZIPInputStream.

The boolean when "true" will attempt to read a potential additional GZIP stream after the previous GZIP stream's trailer has been read. In that case we need to handle 2 cases - one where we successfully find the header of the next (concatenated) GZIP stream and the other where we don't find a valid GZIP stream header. In the first case where we do find a header successfully, then we continue reading the stream as usual and return the uncompressed data from the "read()" call. In the case where we fail to find a valid header and yet there were bytes past the previous GZIP stream, then I think the GZIPInputStream.read() should throw an IOException, since that stream no longer represents a GZIP stream (remember, we are talking about this only when the GZIPInputStream was constructed with the new boolean parameter = true).

Coming to the case where the GZIPInputStream was constructed using boolean = false - In this case when we reach and read the trailer of a GZIP stream, if there is no more bytes, then we consider this a completed GZIP stream and return the uncompressed data. If there is any more bytes past the GZIP stream's trailer, then I think we should throw a IOException (since we aren't expecting any concatenated GZIP stream).

As for the default value of this optional boolean parameter, I think it should default to "true" implying it will read any concatenated GZIP streams. That would match the current implementation of GZIPInputStream.read() which has the ability to read (valid) GZIP concatenated input stream.

I think this would then allow us to keep the implementation simple as well as allow the calling application to have control over whether or not the passed should be considered as having a concatenated GZIP stream.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2228129529