RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v8]

Archie Cobbs acobbs at openjdk.org
Fri Aug 2 18:22:33 UTC 2024


On Fri, 2 Aug 2024 13:12:00 GMT, Jaikiran Pai <jpai at openjdk.org> wrote:

>> Archie Cobbs has updated the pull request incrementally with two additional commits since the last revision:
>> 
>>  - Refactor to eliminate "lenient" mode (still failng test: GZIPInZip.java).
>>  - Add GZIP input test for various gzip(1) command outputs.
>
> Hello Archie,
> 
>> It seems like we're going around in circles a bit here... but maybe the circles are getting smaller?
> 
> We try to avoid/reduce changes in zip, gzip and other similar areas because of several open to interpretation parts of the specification and the fact that there are tools outside of the JDK (like the one we found out from Lance's example) which behave in a certain manner. I agree we sometimes do end up having to re-evaluate the original proposal. It's not a bad thing though. In fact the current discussion in this PR and the experiments have been useful to go back and check why the code behaves the way it currently does.
> 
>> IMO doing nothing is a bad option, because as it stands today the class is fundamentally unreliable
> 
> Lance, me and others have been discussing the changes in this PR for the past several days. We started this off with an intention to introduce a new constructor to allow more tighter control over not processing the "corrupt" bytes after a GZIP trailer. Experiments with the other tools (gunzip, winzip and even 7zip) has shown us that any garbage/corrupt data after a GZIP trailer is considered equivalent to end of stream and any deflated content thus far is returned from that stream - just like what the `GZIPInputStream` currently does in mainline. So its behaviour is at least consistent with these other commonly used tools. 
> 
> 
>> Are you sure we want to do that for all of the current behavior?
>>
>> I would think maybe for some we do and for some we don't:
>>
>>    Swallowing IOException's - this we should not specify, so that we may someday eliminate it to make this class reliable
>>    Concatenated GZIP's - this would be good to document
>>    Reading extra bytes - the only thing to document would be that "there is a chance extra bytes will be read" which is not very useful, so what's the point?
>>
>> What of that would you want to specify?
> 
> The `GZIPInputStream` currently states:
> 
>> This class implements a stream filter for reading compressed data in the GZIP file format.
> 
> and that's it, nothing more. What we want to do is specify that the InputStream that is passed to the GZIPInputStream constructor is allowed to have one or more GZIP streams. A stream with more than one GZIP stream is considered a concatenated GZIP stream. The GZIPInputStream will stop processing any further content after an individual GZIP stream's trailer if that content doesn't represent a valid GZIP stream header of the next GZIP stream. 
> (Of course, it mi...

Hi @jaikiran,

Thanks. Sorry to sound frustrated. It's not due to the back & forth with you guys but more because there are several subtle, overlapping technical issues (and judgement calls) and hashing it all out over the simplex channel of github PR comments is inherently cumbersome.

I do want to clarify one thing...

> Experiments with the other tools (gunzip, winzip and even 7zip) has shown us that any garbage/corrupt data after a GZIP trailer is considered equivalent to end of stream and any deflated content thus far is returned from that stream - just like what the GZIPInputStream currently does in mainline. So its behaviour is at least consistent with these other commonly used tools.

Yes I agree with that. My main worry is with the swallowing of `IOException`'s from the underlying stream. In theory a carefully timed malicious TCP reset could result in truncated data (even encrypted data) being read without any error reported. For some reason I'm morally offended by this :) I'm also skeptical there is code out there that actually relies on this part of the current behavior. As a result I think this particular behavior should be at least a candidate for deprecation, if not outright fixing or an alternative provided. But of course I could be wrong and realize my opinion is just one of many.

> What we want to do is specify that the InputStream that is passed to the GZIPInputStream constructor is allowed to have one or more GZIP streams. A stream with more than one GZIP stream is considered a concatenated GZIP stream. The GZIPInputStream will stop processing any further content after an individual GZIP stream's trailer if that content doesn't represent a valid GZIP stream header of the next GZIP stream.

That sounds reasonable. Importantly it doesn't specify the swallowing of `IOException`'s, which means hopefully someday we can stop doing that...

Would you guys recommend including some kind of additional statement like this?

> If an `IOException` is thrown while attempting to read or decode a GZIP header following a previous GZIP stream's trailer, it is indeterminate whether that exception is propagated to the caller, or just EOF is returned.

Once you guys let me know what you think about that point I'll update this PR to just be a Javadoc change.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2265914557


More information about the core-libs-dev mailing list