RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v7]

Fri Jul 26 15:47:37 UTC 2024

On Fri, 26 Jul 2024 13:36:43 GMT, Lance Andersen <lancea at openjdk.org> wrote:

>> Archie Cobbs has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits:
>> 
>>  - Merge branch 'master' into JDK-8322256
>>  - Wording tweak.
>>  - Change concatenation policy config from booleans to new ConcatPolicy enum.
>>  - Merge branch 'master' into JDK-8322256
>>  - Bump @since from 23 to 24.
>>  - Merge branch 'master' into JDK-8322256
>>  - Relabel "trailing garbage" as "extra bytes" to sound less accusatory.
>>  - Simplify code by eliminating an impossible case.
>>  - Field name change and Javadoc wording tweaks.
>>  - Merge branch 'master' into JDK-8322256
>>  - ... and 6 more: https://git.openjdk.org/jdk/compare/332df83e...72b60092
>
> First thank you for your efforts and patience with this PR Archie.
> 
> I am trying to better grasp the problem to be solved outside of better clarity surrounding how GZIPInputStream handles concatenated gzip files
> 
> - I could see possibly  providing an enum which indicates to stop decompressing after processing the first gzip file or to decompress until the end of input.  
> - This would be similar to what  the commons-compress GzipCompressorInputStream class does.
> - It is also what Eirik and Jai are suggesting I believe.  
> - If support for processing multiple gzip files is enabled, which would be the default, then the existing behavior is not changed.  
> - As I have mentioned previously, we should add more testing including taking a couple of concatenated examples created via the gzip tool (or equivalent) give how light our coverage is.
> 
> I don't believe we want to do more than is needed given the age of the API and there is been minimal requests for improvements at this time.

Hi @LanceAndersen,

Thanks for taking a look. Let's achieve consensus on what the goal(s) should be here, and then I think a proper API for that will come naturally. Up until now the discussion has been a shifting blend of API design and API goals which makes things a little bit chaotic (probably my fault).

**Proposed Goal №1:**

This one seems relatively non-controversial: Allow creation of a `GZIPInputStream` that only decompresses one GZIP stream.

OK but what does that mean exactly?
* Option A: The `GZIPInputStream` will never read past the end of the GZIP trailer frame
  * Sounds great, but how exactly would that work?
    * Option A.1: Caller must provide a `BufferedInputStream`, so we can back it up if any extra bytes have been read past the trailer frame
    * Option A.2: `GZIPInputStream` provides a way for caller to access the extra bytes read once EOF is detected
    * Option A.3: `GZIPInputStream` never buffers more than 8 bytes (size of trailer frame)
* Option B: `GZIPInputStream` attempts to read the byte after the trailer frame and throws `IOException` if EOF is not returned

My opinion: All of the A options are unsatisfactory (although A.1 is the least bad), so it seems option B is the answer here.

**Proposed Goal №2:**

This one seems more controversial: Allow the caller to configure `GZIPInputStream` for "strict mode".

Just to be clear and restate what this is referring to. Currently, `GZIPInputStream` is hard-wired for "lenient mode". This means that it is indeterminate (a) how many additional bytes (if any) were read beyond the GZIP trailer frame, and (b) whether reading stopped due to EOF, invalid data, or an underlying`IOException`.

My opinion: this is, at best, a random intermittent bug - **and at worst a giant security hole** - waiting to happen.

Reasoning: If the caller is OK with lenient mode, then the caller obviously doesn't (or shouldn't) care about the data that follows the input stream (if any), because the caller won't even know whether there was any, how much of it has been discarded (if any), or whether it was actually another valid GZIP stream that got corrupted or threw an`IOException`.

So why would any caller actually want to use this feature to concatenate anything useful after a GZIP trailer frame?

In particular: if an underlying `IOException` occurs at just the wrong time, an application that wrote `GZIP-STREAM-1`, `GZIP-STREAM-2`, `GZIP-STREAM-3` could read back `GZIP-STREAM-1`, `GZIP-STREAM-2` and never know that anything was wrong.

How useful is concatenated GZIP stream support if it's not even reliable??

This just seems completely wonky to me (and oh, did I mention it's a potential security hole waiting to happen? :)

It seems like if concatenation is going to be a supported feature, then it should be possible to access a _reliable_ version of it. In other words, there should be a way for an application to say "I want to decode exactly ONE (or exactly N) whole GZIP streams _or else guarantee me an exception_".

Maybe the answer is simply to tell people "Don't use this class, it's not reliable". I'd be OK with that I guess but it seems a bit of a cop out.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2253031572