RFR: 7036144: GZIPInputStream readTrailer uses faulty available() test for end-of-stream [v6]

Mon Feb 26 15:24:14 UTC 2024

On Mon, 26 Feb 2024 06:51:12 GMT, Jaikiran Pai <jpai at openjdk.org> wrote:

>> Archie Cobbs has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision:
>> 
>>  - Merge branch 'master' into JDK-7036144
>>  - Merge branch 'master' into JDK-7036144
>>  - Address third round of review comments.
>>  - Address second round of review comments.
>>  - Address review comments.
>>  - Fix bug in GZIPInputStream when underlying available() returns short.
>
> Hello Archie, the proposal to not depend on the `available()` method of the underlying `InputStream` to decide whether to read additional bytes from the underlying stream to detect the "next" header seems reasonable.
> 
> What's being proposed here is that we go ahead and read a few additional bytes from the underlying stream to detect the presence or absence of a GZIP member header, and if that attempt fails (with an IOException), we consider that we have reached the end of the GZIP stream and simply return.
> 
> For this change, I think we would also need to consider whether we should "unread" the bytes read from the `InputStream` if they don't correspond to a "next" GZIP member header. That way, any underlying `InputStream` implemented to report 0 from `available()` once it knew the GZIP stream was done, while still having additional (non-GZIP) data to read, would still allow that data to be read after this change. It's arguable whether we should have been doing that "unread" even when we were doing the `available() > 0` check, and the decision that comes out of https://bugs.openjdk.org/browse/JDK-8322256 might cover that.

Hi @jaikiran,

I agree with your comments. My only question is whether we should do all of this in one stage or two stages.

My initial thought is to do this in two stages:
* A narrow fix for the bug described here, as implemented by this PR
* A larger change (requiring a separate bug, CSR, and PR) to:
  * More precisely define and specify the expected behavior, with support for concatenated streams
  * Eliminate situations where we read beyond the end-of-stream (i.e., "unreading" if/when necessary)
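
To make the "unreading" idea in the second stage concrete, here is a minimal sketch, not what this PR or `GZIPInputStream` actually does. It assumes the caller wraps the underlying stream in a `PushbackInputStream`; the class name `GzipHeaderPeek` and the method `nextIsGzipMember` are hypothetical. It only peeks at the two magic bytes, whereas a real implementation would have to push back however much of a candidate header it had consumed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class GzipHeaderPeek {

    /**
     * Hypothetical helper: reports whether the next two bytes are the GZIP
     * magic number (0x1f, 0x8b). Whatever was read while peeking is pushed
     * back, so the stream position is unchanged for the caller.
     * The PushbackInputStream needs a pushback buffer of at least 2 bytes.
     */
    public static boolean nextIsGzipMember(PushbackInputStream in) throws IOException {
        byte[] buf = new byte[2];
        // readNBytes returns fewer than 2 bytes only at end of stream
        int n = in.readNBytes(buf, 0, 2);
        if (n > 0) {
            in.unread(buf, 0, n); // "unread" the peeked bytes
        }
        if (n < 2) {
            return false;
        }
        // The magic number is stored little-endian in the stream
        int magic = (buf[0] & 0xff) | ((buf[1] & 0xff) << 8);
        return magic == GZIPInputStream.GZIP_MAGIC;
    }

    public static void main(String[] args) throws IOException {
        // Non-GZIP data that happens to follow a finished GZIP stream
        byte[] trailing = "TAIL".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(trailing), 2);

        System.out.println(nextIsGzipMember(in)); // false
        // The trailing data is still fully readable after the peek
        System.out.println(new String(in.readAllBytes(), StandardCharsets.US_ASCII)); // TAIL
    }
}
```

Whether the push-back should happen inside `GZIPInputStream` or be left to the caller is exactly the kind of question the separate bug and CSR would need to settle.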

The reason I think this two-stage approach is appropriate is that there is no downside to doing it this way: the problem you describe of reading beyond the end-of-stream is _already_ a problem in the current code, with the exception of the one corner case where this bug fix applies, namely, when `in.available()` returns zero and yet there actually _is_ more data available.
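
For anyone following along, the corner case is easy to demonstrate in isolation. The sketch below uses a hypothetical wrapper (`ZeroAvailableStream`, not part of the JDK or this PR) whose `available()` always returns 0, which the `InputStream` contract permits, while `read()` still delivers data. It deliberately leaves `GZIPInputStream` out, since its behavior here depends on whether the fix is present.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ZeroAvailableDemo {

    /** Hypothetical wrapper: available() always reports 0, reads pass through. */
    static final class ZeroAvailableStream extends FilterInputStream {
        ZeroAvailableStream(InputStream in) {
            super(in);
        }

        @Override
        public int available() {
            return 0; // allowed: available() is only an estimate
        }
    }

    /** Wraps the given bytes in a stream whose available() is always 0. */
    public static InputStream zeroAvailable(byte[] data) {
        return new ZeroAvailableStream(new ByteArrayInputStream(data));
    }

    /** Counts the bytes that can actually be read, ignoring available(). */
    public static int countReadable(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = zeroAvailable(new byte[] {1, 2, 3});
        System.out.println("available() = " + in.available());           // available() = 0
        System.out.println("bytes actually read = " + countReadable(in)); // bytes actually read = 3
    }
}
```

Only a read that returns -1 (or throws `EOFException`) reliably signals end-of-stream, which is why the narrow fix stops relying on `available()`.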

Your thoughts?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17113#issuecomment-1964372772