RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v5]

Fri Jul 19 10:05:35 UTC 2024

On Mon, 15 Jul 2024 10:01:47 GMT, Jaikiran Pai <jpai at openjdk.org> wrote:

>> Archie Cobbs has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits:
>> 
>>  - Bump @since from 23 to 24.
>>  - Merge branch 'master' into JDK-8322256
>>  - Relabel "trailing garbage" as "extra bytes" to sound less accusatory.
>>  - Simplify code by eliminating an impossible case.
>>  - Field name change and Javadoc wording tweaks.
>>  - Merge branch 'master' into JDK-8322256
>>  - Javadoc wording tweaks.
>>  - Merge branch 'master' into JDK-8322256
>>  - Clarify exceptions: sometimes ZipException, sometimes EOFException.
>>  - Merge branch 'master' into JDK-8322256
>>  - ... and 2 more: https://git.openjdk.org/jdk/compare/75dc2f85...f845a75b
>
> Hello Archie, sorry it has taken long to review this PR.
> 
> The proposal here is to have `java.util.zip.GZIPInputStream` specify (and improve) how it deals with concatenated GZIP stream(s) and also allow for applications instantiating `GZIPInputStream` to decide whether or not the underlying `InputStream` passed to the `GZIPInputStream` instance is expected to contain a concatenated GZIP stream(s).
> 
> I'll include here the text that I had sent in a private communication to Archie:
> 
> I think this comes down to introducing a new optional boolean parameter to the constructor of GZIPInputStream.
> 
> The boolean when "true" will attempt to read a potential additional GZIP stream after the previous GZIP stream's trailer has been read. In that case we need to handle 2 cases - one where we successfully find the header of the next (concatenated) GZIP stream and the other where we don't find a valid GZIP stream header. In the first case where we do find a header successfully, then we continue reading the stream as usual and return the uncompressed data from the "read()" call. In the case where we fail to find a valid header and yet there were bytes past the previous GZIP stream, then I think the GZIPInputStream.read() should throw an IOException, since that stream no longer represents a GZIP stream (remember, we are talking about this only when the GZIPInputStream was constructed with the new boolean parameter = true).
> 
> Coming to the case where the GZIPInputStream was constructed using boolean = false - In this case when we reach and read the trailer of a GZIP stream, if there is no more bytes, then we consider this a completed GZIP stream and return the uncompressed data. If there is any more bytes past the GZIP stream's trailer, then I think we should throw a IOException (since we aren't expecting any concatenated GZIP stream).
> 
> As for the default value of this optional boolean parameter, I think it should default to "true" implying it will read any concatenated GZIP streams. That would match the current implementation of GZIPInputStream.read() which has the ability to read (valid) GZIP concatenated input stream.
> 
> I think this would then allow us to keep the implementation simple as well as allow the calling application to have control over whether or not the passed should be considered as having a concatenated GZIP stream.

This PR is open, but notifications don't seem to reach the mailing lists. I experienced a similar situation a while back where I believe @jaikiran reached out to the Skara team to get that PR out of the silent state.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2238816809