RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v7]

Lance Andersen lancea at openjdk.org
Fri Jul 26 13:39:42 UTC 2024


On Wed, 24 Jul 2024 14:57:05 GMT, Archie Cobbs <acobbs at openjdk.org> wrote:

>> `GZIPInputStream` supports reading data from multiple concatenated GZIP data streams since [JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425). In order to do this, after a GZIP trailer frame is read, it attempts to read a GZIP header frame and, if successful, proceeds onward to decompress the new stream. If the attempt to decode a GZIP header frame fails, or happens to trigger an `IOException`, it just ignores the trailing garbage and/or the `IOException` and returns EOF.
>> 
>> There are several issues with this:
>> 
>> 1. The behaviors of (a) supporting concatenated streams and (b) ignoring trailing garbage are not documented, much less precisely specified.
>> 2. Ignoring trailing garbage is dubious because it could easily hide errors or other data corruption that an application would rather be notified about. Moreover, the API claims that a `ZipException` will be thrown when corrupt data is read, but obviously that doesn't happen in the trailing garbage scenario (so N concatenated streams where the last one has a corrupted header frame is indistinguishable from N-1 valid streams).
>> 3. There's no way to create a `GZIPInputStream` that does _not_ support stream concatenation.
>> 
>> On the other hand, `GZIPInputStream` is an old class with lots of existing usage, so it's important to preserve the existing behavior, warts and all (note: my the definition of "existing behavior" here includes the bug fix in [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)).
>> 
>> So this patch adds a new constructor that takes two new parameters `allowConcatenation` and `allowTrailingGarbage`. The legacy behavior is enabled by setting both to true; otherwise, they do what they sound like. In particular, when `allowTrailingGarbage` is false, then the underlying input must contain exactly one (if `allowConcatenation` is false) or exactly N (if `allowConcatenation` is true) concatenated GZIP data streams, otherwise an exception is guaranteed.
>
> Archie Cobbs has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits:
> 
>  - Merge branch 'master' into JDK-8322256
>  - Wording tweak.
>  - Change concatenation policy config from booleans to new ConcatPolicy enum.
>  - Merge branch 'master' into JDK-8322256
>  - Bump @since from 23 to 24.
>  - Merge branch 'master' into JDK-8322256
>  - Relabel "trailing garbage" as "extra bytes" to sound less accusatory.
>  - Simplify code by eliminating an impossible case.
>  - Field name change and Javadoc wording tweaks.
>  - Merge branch 'master' into JDK-8322256
>  - ... and 6 more: https://git.openjdk.org/jdk/compare/332df83e...72b60092

First thank you for your efforts and patience with this PR Archie.

I am trying to better grasp the problem to be solved outside of better clarity surrounding how GZIPInputStream handles concatenated gzip files

- I could see possibly  providing an enum which indicates to stop decompressing after processing the first gzip file or to decompress until the end of input.  
- This would be similar to what  the commons-compress GzipCompressorInputStream class does.
- It is also what Eirik and Jai are suggesting I believe.  
- If support for processing multiple gzip files is enabled, which would be the default, then the existing behavior is not changed.  
- As I have mentioned previously, we should add more testing including taking a couple of concatenated examples created via the gzip tool (or equivalent) give how light our coverage is.

I don't believe we want to do more than is needed given the age of the API and there is been minimal requests for improvements at this time.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2252783484


More information about the core-libs-dev mailing list