RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v5]

Mon Jul 15 18:09:52 UTC 2024

On Mon, 15 Jul 2024 13:12:41 GMT, Jaikiran Pai <jpai at openjdk.org> wrote:

>> src/java.base/share/classes/java/util/zip/GZIPInputStream.java line 153:
>> 
>>> 151:      */
>>> 152:     public GZIPInputStream(InputStream in, int size,
>>> 153:             boolean allowConcatenation, boolean ignoreExtraBytes) throws IOException {
>> 
>> I haven't reviewed the javadoc changes. I will do that separately once we settle down on the API and implementation changes.
>> As for this new constructor, I think the only new parameter we should introduce is whether or not the underlying input stream `in` is expected/allowed to have concatenated GZIP stream(s). So I think we should remove the "ignoreExtraBytes" new parameters from this constructor and. Additionally, I think the `allowConcatenation` should instead be named `allowConcatenatedGZIPStream`:
>> 
>> 
>> public GZIPInputStream(InputStream in, int size, boolean allowConcatenatedGZIPStream) throws IOException {
>
> Just to provide a more concrete input, here's a very minimal tested version of what I had in mind:
> 
> 
> --- a/src/java.base/share/classes/java/util/zip/GZIPInputStream.java
> +++ b/src/java.base/share/classes/java/util/zip/GZIPInputStream.java
> @@ -55,6 +55,8 @@ public class GZIPInputStream extends InflaterInputStream {
>  
>      private boolean closed = false;
>  
> +    private final boolean allowConcatenatedGZIPStream;
> +
>      /**
>       * Check to make sure that this stream has not been closed
>       */
> @@ -76,14 +78,7 @@ private void ensureOpen() throws IOException {
>       * @throws    IllegalArgumentException if {@code size <= 0}
>       */
>      public GZIPInputStream(InputStream in, int size) throws IOException {
> -        super(in, createInflater(in, size), size);
> -        usesDefaultInflater = true;
> -        try {
> -            readHeader(in);
> -        } catch (IOException ioe) {
> -            this.inf.end();
> -            throw ioe;
> -        }
> +        this(in, size, true);
>      }
>  
>      /*
> @@ -111,7 +106,28 @@ private static Inflater createInflater(InputStream in, int size) {
>       * @throws    IOException if an I/O error has occurred
>       */
>      public GZIPInputStream(InputStream in) throws IOException {
> -        this(in, 512);
> +        this(in, 512, true);
> +    }
> +
> +    /**
> +     * WIP
> +     * @param in the input stream
> +     * @param size the input buffer size
> +     * @param allowConcatenatedGZIPStream true if the input stream is allowed to contain
> +     *                                    concatenated GZIP streams. false otherwise
> +     * @throws IOException if an I/O error has occurred
> +     */
> +    public GZIPInputStream(InputStream in, int size, boolean allowConcatenatedGZIPStream)
> +            throws IOException {
> +        super(in, createInflater(in, size), size);
> +        this.allowConcatenatedGZIPStream = allowConcatenatedGZIPStream;
> +        usesDefaultInflater = true;
> +        try {
> +            readHeader(in);
> +        } catch (IOException ioe) {
> +            this.inf.end();
> +            throw ioe;
> +        }
>      }
>  
>      /**
> @@ -150,10 +166,45 @@ public int read(byte[] buf, int off, int len) throws IOException {
>          }
>          int n = super.read(buf, off, len);
>          if (n == -1) {
> -            if (readTrailer())
> +            // reading of deflated data is now complete, we now read the GZIP stream trailer
> +            readTrailer();
> +            if (!allowConcatenatedGZIPStream) {
>                  eos = true;
> -            else
> +            ...

Hi @jaikiran,

>  I think the only new parameter we should introduce is whether or not the underlying input stream in is expected/allowed to have concatenated GZIP stream(s).

I disagree with you here. Based on what I think the goal is here, we need two separate flags.

Here's my thinking, please see if this makes sense to you...

The overall problem being addressed here is that this class does not give the user precise control of how the input is to be decoded. In particular, it fails in two ways:
* It doesn't give the user the ability to stop reading (precisely) after the first GZIP stream
* It doesn't give the user the ability to detect (some) erroneous inputs after the first GZIP stream

Detecting erronous input is just a basic capability that any "decoder" should have. For example, we would never accept a class file reader that failed to detect and report unknown bytecode.

While this class does correctly detect invalid GZIP data frames that appear at the start of the overall input, or in the middle of a compressed GZIP stream, in trying to be "smart" by automatically handling concatenated GZIP streams, it ends up failing to detect corruption where extra bytes are added to the end of the input.

Instead, it will randomly do one of the following:
  * Completely ignore the corruption, or
  * Throw an exception, or
  * Misinterpret the extra bytes as the valid start of a concatenated GZIP input stream, and then...
    * Return zero or more bytes that were never actually written, and then...
      * Completely ignore the corruption, or
      * Throw an exception

That behavior is totally bogus!

So what is the goal here?

The first goal is that we should exactly preserve the current behavior so as to not break anything out there.

The second goal is to allow the user to specify that they want to read and decode an _exact whole number_ of concatenated GZIP input streams (either 1 or N), or else get an exception - period.

Do you not agree?

If you agree on that goal, from what I can tell, it doesn't seem possible to achieve it, while also maintaining backward compatibility, without having two flags.

Your patch is an example - it may fail to detect and report trailing garbage when in `allowConcatenatedGZIPStream` mode  (noted by your comment "TODO: somehow verify that this failed when reading the first byte of the header?")

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18385#discussion_r1678212410