RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v8]

Lance Andersen lancea at openjdk.org
Mon Jul 29 13:08:34 UTC 2024


On Sat, 27 Jul 2024 15:00:51 GMT, Archie Cobbs <acobbs at openjdk.org> wrote:

>> `GZIPInputStream` supports reading data from multiple concatenated GZIP data streams since [JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425). In order to do this, after a GZIP trailer frame is read, it attempts to read a GZIP header frame and, if successful, proceeds onward to decompress the new stream. If the attempt to decode a GZIP header frame fails, or happens to trigger an `IOException`, it just ignores the trailing garbage and/or the `IOException` and returns EOF.
>> 
>> There are several issues with this:
>> 
>> 1. The behaviors of (a) supporting concatenated streams and (b) ignoring trailing garbage are not documented, much less precisely specified.
>> 2. Ignoring trailing garbage is dubious because it could easily hide errors or other data corruption that an application would rather be notified about. Moreover, the API claims that a `ZipException` will be thrown when corrupt data is read, but obviously that doesn't happen in the trailing garbage scenario (so N concatenated streams where the last one has a corrupted header frame is indistinguishable from N-1 valid streams).
>> 3. There's no way to create a `GZIPInputStream` that does _not_ support stream concatenation.
>> 
>> On the other hand, `GZIPInputStream` is an old class with lots of existing usage, so it's important to preserve the existing behavior, warts and all (note: my the definition of "existing behavior" here includes the bug fix in [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)).
>> 
>> So this patch adds a new constructor that takes two new parameters `allowConcatenation` and `allowTrailingGarbage`. The legacy behavior is enabled by setting both to true; otherwise, they do what they sound like. In particular, when `allowTrailingGarbage` is false, then the underlying input must contain exactly one (if `allowConcatenation` is false) or exactly N (if `allowConcatenation` is true) concatenated GZIP data streams, otherwise an exception is guaranteed.
>
> Archie Cobbs has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Refactor to eliminate "lenient" mode (still failng test: GZIPInZip.java).
>  - Add GZIP input test for various gzip(1) command outputs.

> > With this new proposed change we will throw an IOException (unlike previously). I think that's fine and in fact the correct thing to do.
> 
> I've run into a problem.
> 
> With this change, `GZIPInZip.java` fails; the test is annotated like this:
> 
> ```java
> /* @test
>  * @bug 7021870 8023431 8026756
>  * @summary Reading last gzip chain member must not close the input stream.
>  *          Garbage following gzip entry must be ignored.
>  */
> ```
> 
> Note the "Garbage following gzip entry must be ignored" line. This must be what bugs 8023431 8026756 refer to, but those bugs are not accessible ("You can't view this issue. It may have been deleted or you don't have permission to view it.") so I can't see what the problem was that required ignoring the extra garbage. Maybe you guys have access or know what this was about? Thanks. See also [this stackoverflow question](https://stackoverflow.com/questions/4928560/how-can-i-work-with-gzip-files-which-contain-extra-data) - it seems to be a thing with ZIP files.


I don't know the full history,  but [JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425) added support for concatenated gzip files and also the code for ignoring extra trailing garbage bytes.

As part of the fix for  [JDK-8026756](https://bugs.openjdk.org/browse/JDK-8026756) a test was added for validating that trailing garbage  bytes after the end header are ignored,  but was not mentioned in the bug.

In [gzip manual](https://www.gnu.org/software/gzip/manual/gzip.html):

> 6 Using gzip on tapes
> 
> When writing compressed data to a tape, it is generally necessary to pad the output with zeroes up to a block > boundary. When the data is read and the whole block is passed to gunzip for decompression, gunzip detects
 > that there is extra trailing garbage after the compressed data and emits a warning by default if the garbage
 > contains nonzero bytes. You can use the --quiet option to suppress the warning.

Based on the above and the test that was added via [JDK-8026756](https://bugs.openjdk.org/browse/JDK-8026756),  the existing behavior for the handling of extra garbage bytes should be left as is.

So where does that leave us:

- Keep the code as is and document the current behavior
- Continue to add additional test coverage for the current API
- We probably do not need a new constructor given it probably adds no new additional value the existing API

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2255888787


More information about the core-libs-dev mailing list