RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v5]

Eirik Bjørsnøs eirbjo at openjdk.org
Fri Jul 19 09:35:35 UTC 2024


On Thu, 6 Jun 2024 17:03:57 GMT, Archie Cobbs <acobbs at openjdk.org> wrote:

>> `GZIPInputStream` supports reading data from multiple concatenated GZIP data streams since [JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425). In order to do this, after a GZIP trailer frame is read, it attempts to read a GZIP header frame and, if successful, proceeds onward to decompress the new stream. If the attempt to decode a GZIP header frame fails, or happens to trigger an `IOException`, it just ignores the trailing garbage and/or the `IOException` and returns EOF.
>> 
>> There are several issues with this:
>> 
>> 1. The behaviors of (a) supporting concatenated streams and (b) ignoring trailing garbage are not documented, much less precisely specified.
>> 2. Ignoring trailing garbage is dubious because it could easily hide errors or other data corruption that an application would rather be notified about. Moreover, the API claims that a `ZipException` will be thrown when corrupt data is read, but obviously that doesn't happen in the trailing garbage scenario (so N concatenated streams where the last one has a corrupted header frame is indistinguishable from N-1 valid streams).
>> 3. There's no way to create a `GZIPInputStream` that does _not_ support stream concatenation.
>> 
>> On the other hand, `GZIPInputStream` is an old class with lots of existing usage, so it's important to preserve the existing behavior, warts and all (note: my the definition of "existing behavior" here includes the bug fix in [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)).
>> 
>> So this patch adds a new constructor that takes two new parameters `allowConcatenation` and `allowTrailingGarbage`. The legacy behavior is enabled by setting both to true; otherwise, they do what they sound like. In particular, when `allowTrailingGarbage` is false, then the underlying input must contain exactly one (if `allowConcatenation` is false) or exactly N (if `allowConcatenation` is true) concatenated GZIP data streams, otherwise an exception is guaranteed.
>
> Archie Cobbs has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits:
> 
>  - Bump @since from 23 to 24.
>  - Merge branch 'master' into JDK-8322256
>  - Relabel "trailing garbage" as "extra bytes" to sound less accusatory.
>  - Simplify code by eliminating an impossible case.
>  - Field name change and Javadoc wording tweaks.
>  - Merge branch 'master' into JDK-8322256
>  - Javadoc wording tweaks.
>  - Merge branch 'master' into JDK-8322256
>  - Clarify exceptions: sometimes ZipException, sometimes EOFException.
>  - Merge branch 'master' into JDK-8322256
>  - ... and 2 more: https://git.openjdk.org/jdk/compare/75dc2f85...f845a75b

I'm late to the table for this PR, but could we stop a moment to consider whether boolean constructor arguments is the best way to represent and compose optional parser configuration?

Boolean options by themselves can be tricky in that it requires knowing/understanding what the negative and positive is. Multiple constructor arguments of the same type makes it easy to mistake the order of the options while writing code. Reading the constructor with new GZIPInputStream(in, size, true, true), forces you to read the constructor documentation, figure out which of the boolean options is which, then what they represent. What if someone wants to introduce other options in the future, how will that scale?

I'm not sure what a better alternative would be, but I'm sure we must have prior art in parser configuration in the JDK?

Perhaps enums for the options, or composing constant int fields would be an improvement? 

Again, not sure what the optimal API solution is here, just wanted to raise the concern :-)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2238752137


More information about the core-libs-dev mailing list