RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v7]

Fri Jul 26 16:49:34 UTC 2024

On Fri, 26 Jul 2024 16:12:11 GMT, Jaikiran Pai <jpai at openjdk.org> wrote:

>> Archie Cobbs has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits:
>> 
>>  - Merge branch 'master' into JDK-8322256
>>  - Wording tweak.
>>  - Change concatenation policy config from booleans to new ConcatPolicy enum.
>>  - Merge branch 'master' into JDK-8322256
>>  - Bump @since from 23 to 24.
>>  - Merge branch 'master' into JDK-8322256
>>  - Relabel "trailing garbage" as "extra bytes" to sound less accusatory.
>>  - Simplify code by eliminating an impossible case.
>>  - Field name change and Javadoc wording tweaks.
>>  - Merge branch 'master' into JDK-8322256
>>  - ... and 6 more: https://git.openjdk.org/jdk/compare/332df83e...72b60092
>
> Hello Archie, I don't think we need any enum in the API. GZIPInputStream class should only allow parsing GZIP streams. The InputStream passed to the GZIPInputStream constructor must be a InputStream which has either 1 GZIP stream or N GZIP streams (concatenated). The user should be allowed to state whether the passed InputStream is expected contain more than 1 GZIP stream. This is the only addition that I think we should be introducing. 
> 
> The way to allow users to state that the InputStream being passed is allowed to have more than 1 GZIP stream is through the new proposed boolean to the constructor. The boolean can be named "allowConcatenated".
> 
> When allowConcatenated is true, then during the read() operation on the GZIPInputStream, when we internally read a GZIP stream trailer, we should check if the underlying InputStream has a next GZIP stream header:
> 	- If there's no bytes present after the trailer, we just return whatever data we have inflated so far and that also would be the end-of-stream of the GZIPInputStream. 
> 	- If there's bytes after the trailer, then those next bytes MUST be the next GZIP stream header. If those bytes aren't a GZIP stream header, then we throw an IOException. This implies whatever data we had deflated and any additional bytes that we had read after the trailer will not be returned to the user. That's fine because the underlying InputStream was "corrupt" - the GZIPInputStream shouldn't expect any other data other than GZIP stream(s) in the underlying InputStream.
> 
> When allowConcatenated is false, then during the read() operation on the GZIPInputStream, when we internally read a GZIP stream trailer, we should check if the underlying InputStream has any more bytes remaining:
> 	- If there are bytes remaining then we consider that an error and raise an IOException. We do this because we were instructed not to allow concatenated GZIP streams. So presence of any data after the first GZIP stream trailer should be an exception.
> 	- If there are no more bytes remaining then we have reached the end-of-stream and we return back whatever GZIP data we have inflated during the read operation and that also would be the end-of-stream of the GZIPStream instance.
> 
> The default value for allowConcatenated should be true to allow for supporting concatenated GZIP streams (like we do now). There however will be a change in behaviour if the stream content after the trailer isn't a GZIP stream. With this new proposed change we will throw an IOException ...

Hi @jaikiran,

> There however will be a change in behaviour if the stream content after the trailer isn't a GZIP stream. With this new proposed change we will throw an IOException (unlike previously). I think that's fine and in fact the correct thing to do.

Ah! This simplifies everything. I was just assuming that preserving the existing behavior was a hard requirement - but if not, we can make this class "always precise", and the only thing left to configure is whether to allow 1 stream or N streams. I am in complete agreement with you now, especially because this eliminates the aforementioned security concern.

I'm assuming you and @LanceAndersen have communicated and he is OK with this as well. I will update the PR and CSR.

Thanks!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2253125776