RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v7]

Fri Jul 26 16:14:34 UTC 2024

On Wed, 24 Jul 2024 14:57:05 GMT, Archie Cobbs <acobbs at openjdk.org> wrote:

>> `GZIPInputStream` supports reading data from multiple concatenated GZIP data streams since [JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425). In order to do this, after a GZIP trailer frame is read, it attempts to read a GZIP header frame and, if successful, proceeds onward to decompress the new stream. If the attempt to decode a GZIP header frame fails, or happens to trigger an `IOException`, it just ignores the trailing garbage and/or the `IOException` and returns EOF.
>> 
>> There are several issues with this:
>> 
>> 1. The behaviors of (a) supporting concatenated streams and (b) ignoring trailing garbage are not documented, much less precisely specified.
>> 2. Ignoring trailing garbage is dubious because it could easily hide errors or other data corruption that an application would rather be notified about. Moreover, the API claims that a `ZipException` will be thrown when corrupt data is read, but obviously that doesn't happen in the trailing garbage scenario (so N concatenated streams where the last one has a corrupted header frame is indistinguishable from N-1 valid streams).
>> 3. There's no way to create a `GZIPInputStream` that does _not_ support stream concatenation.
>> 
>> On the other hand, `GZIPInputStream` is an old class with lots of existing usage, so it's important to preserve the existing behavior, warts and all (note: my the definition of "existing behavior" here includes the bug fix in [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)).
>> 
>> So this patch adds a new constructor that takes two new parameters `allowConcatenation` and `allowTrailingGarbage`. The legacy behavior is enabled by setting both to true; otherwise, they do what they sound like. In particular, when `allowTrailingGarbage` is false, then the underlying input must contain exactly one (if `allowConcatenation` is false) or exactly N (if `allowConcatenation` is true) concatenated GZIP data streams, otherwise an exception is guaranteed.
>
> Archie Cobbs has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits:
> 
>  - Merge branch 'master' into JDK-8322256
>  - Wording tweak.
>  - Change concatenation policy config from booleans to new ConcatPolicy enum.
>  - Merge branch 'master' into JDK-8322256
>  - Bump @since from 23 to 24.
>  - Merge branch 'master' into JDK-8322256
>  - Relabel "trailing garbage" as "extra bytes" to sound less accusatory.
>  - Simplify code by eliminating an impossible case.
>  - Field name change and Javadoc wording tweaks.
>  - Merge branch 'master' into JDK-8322256
>  - ... and 6 more: https://git.openjdk.org/jdk/compare/332df83e...72b60092

Hello Archie, I don't think we need any enum in the API. GZIPInputStream class should only allow parsing GZIP streams. The InputStream passed to the GZIPInputStream constructor must be a InputStream which has either 1 GZIP stream or N GZIP streams (concatenated). The user should be allowed to state whether the passed InputStream is expected contain more than 1 GZIP stream. This is the only addition that I think we should be introducing. 

The way to allow users to state that the InputStream being passed is allowed to have more than 1 GZIP stream is through the new proposed boolean to the constructor. The boolean can be named "allowConcatenated".

When allowConcatenated is true, then during the read() operation on the GZIPInputStream, when we internally read a GZIP stream trailer, we should check if the underlying InputStream has a next GZIP stream header:
	- If there's no bytes present after the trailer, we just return whatever data we have inflated so far and that also would be the end-of-stream of the GZIPInputStream. 
	- If there's bytes after the trailer, then those next bytes MUST be the next GZIP stream header. If those bytes aren't a GZIP stream header, then we throw an IOException. This implies whatever data we had deflated and any additional bytes that we had read after the trailer will not be returned to the user. That's fine because the underlying InputStream was "corrupt" - the GZIPInputStream shouldn't expect any other data other than GZIP stream(s) in the underlying InputStream.

When allowConcatenated is false, then during the read() operation on the GZIPInputStream, when we internally read a GZIP stream trailer, we should check if the underlying InputStream has any more bytes remaining:
	- If there are bytes remaining then we consider that an error and raise an IOException. We do this because we were instructed not to allow concatenated GZIP streams. So presence of any data after the first GZIP stream trailer should be an exception.
	- If there are no more bytes remaining then we have reached the end-of-stream and we return back whatever GZIP data we have inflated during the read operation and that also would be the end-of-stream of the GZIPStream instance.

The default value for allowConcatenated should be true to allow for supporting concatenated GZIP streams (like we do now). There however will be a change in behaviour if the stream content after the trailer isn't a GZIP stream. With this new proposed change we will throw an IOException (unlike previously). I think that's fine and in fact the correct thing to do. It of course will need a CSR and a release note to note this change.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2253075317