[PATCH] JDK-8081450: GZIPInputStream prematurely infers end-of-stream
Jeff Harwell
jharwell at fuller.edu
Thu Jan 5 05:58:31 UTC 2017
Greetings,
## Problem
I have been working with Common Crawl
<http://commoncrawl.org/the-data/get-started/> archives which are files
consisting of 50,000 to 55,000 small GZIP'd files concatenated together and
made available through Amazon S3. When decompressing these files on the fly
using GZIPInputStream, decompression consistently ends prematurely, as described in bug
JDK-8081450 <https://bugs.openjdk.java.net/browse/JDK-8081450>. Details on
what I was trying to do and some demonstration code can be found on Stack
Overflow
<http://stackoverflow.com/questions/41400810/gzipinputstream-closes-prematurely-when-decompressing-httpinputstream>.
## Root Cause
The root cause seems to be line 231 in GZIPInputStream.java
<http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/0e4fc29a5ce4/src/share/classes/java/util/zip/GZIPInputStream.java>
in the method readTrailer():

    if (this.in.available() > 0 || n > 26) {
where *this.in* is the input stream passed to the
constructor. readTrailer() calls available() to determine if there are
bytes left in the input stream that should be checked to see if there is an
additional GZIP file in the stream. Some input streams, like
ByteArrayInputStream, which is used in the jtreg test
GZIPInputStreamRead.java
<http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/0e4fc29a5ce4/test/java/util/zip/GZIP/GZIPInputStreamRead.java>,
return the number of bytes available to read. However, input streams like
SequenceInputStream or HTTPInputStream will return 0 if a blocking IO call
is needed to refill the buffer. For these types of input streams, when the
last data in the read buffer is the end of a GZIP file, available() will
return 0 and GZIPInputStream will end prematurely instead of continuing,
even though the underlying stream is not closed.
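To make the failure mode concrete, here is a minimal, self-contained sketch
of the scenario described above (illustrative only, not the Stack Overflow
demonstration code; the BoundaryInputStream helper is made up for this
example). It concatenates two small GZIP members and serves them through a
wrapper whose reads stop at the end of the first member and whose
available() always returns 0, which a real network stream is allowed to do.
With the unpatched GZIPInputStream only the first member's content comes back.

    import java.io.*;
    import java.util.zip.*;

    public class GzipPrematureEofDemo {

        // Serves data from an underlying stream but (a) never returns bytes
        // past 'boundary' in the read that would cross it and (b) always
        // reports 0 from available(), both of which a network stream may do.
        static class BoundaryInputStream extends FilterInputStream {
            private final long boundary;
            private long pos;

            BoundaryInputStream(InputStream in, long boundary) {
                super(in);
                this.boundary = boundary;
            }

            @Override public int available() { return 0; }

            @Override public int read() throws IOException {
                int b = in.read();
                if (b >= 0) pos++;
                return b;
            }

            @Override public int read(byte[] b, int off, int len) throws IOException {
                if (pos < boundary) {
                    len = (int) Math.min(len, boundary - pos); // stop at the member boundary
                }
                int r = in.read(b, off, len);
                if (r > 0) pos += r;
                return r;
            }
        }

        static byte[] gzip(String s) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(s.getBytes("UTF-8"));
            }
            return bos.toByteArray();
        }

        public static void main(String[] args) throws IOException {
            byte[] first = gzip("hello ");
            byte[] second = gzip("world");
            byte[] both = new byte[first.length + second.length];
            System.arraycopy(first, 0, both, 0, first.length);
            System.arraycopy(second, 0, both, first.length, second.length);

            // GZIPInputStream's internal buffer ends exactly at the end of
            // the first member, so only the 8-byte trailer is left over.
            InputStream raw = new BoundaryInputStream(
                    new ByteArrayInputStream(both), first.length);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPInputStream gzin = new GZIPInputStream(raw)) {
                byte[] buf = new byte[512];
                int r;
                while ((r = gzin.read(buf)) != -1) {
                    out.write(buf, 0, r);
                }
            }
            // Unpatched: prints only "hello "; the second member is silently
            // dropped. With the patch below it should print "hello world".
            System.out.println(out.toString("UTF-8"));
        }
    }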
## Proposed Solution
After a brief review of InputStream subclasses, it seems impossible in
general to check whether more bytes are available in an input stream without
actually trying to read it. So instead of checking available(), just assume
that there is another GZIP file in the stream and call readHeader(). If the
input stream is in fact exhausted, readHeader() throws an EOFException,
which can be caught and treated as the end of the stream.
## Patch and jtreg test
The attached file *GZIPInputStream.java.patch.txt* implements the proposed
solution. The patch was applied against the jdk8u jdk repository changeset
12039:e5b1823a897e (Dec 20).
The attached file *GZIPHttpStreamRead.java* is a jtreg test that is a
modified copy of the existing GZIPInputStreamRead.java test. In this test
the ByteArrayInputStream is wrapped in a very poor mock of an
HTTPInputStream. The mock always returns 0 when available() is called,
which eventually causes the unpatched GZIPInputStream to stop early while
reading the randomly generated concatenated GZIP files, so the test fails.
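For reference, the mock boils down to something like the following wrapper
(a sketch of the idea only; the class name and details differ from the
actual attached test):

    import java.io.ByteArrayInputStream;
    import java.io.FilterInputStream;

    // A deliberately poor HTTP-stream stand-in: the data is all in memory,
    // but available() always reports 0, which a real network stream is
    // allowed to do when the next read would block.
    class ZeroAvailableInputStream extends FilterInputStream {
        ZeroAvailableInputStream(byte[] data) {
            super(new ByteArrayInputStream(data));
        }

        @Override
        public int available() {
            return 0; // push readTrailer() down the "no more data" path
        }
    }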
The attached file *GZIPBufferEndsAtBoundry.java* is a jtreg test using an
input stream that intentionally returns only the first complete GZIP file
in a concatenated GZIP stream, after which available() returns zero. It will
not return any content from the next GZIP file until read() is called
again. This is my attempt to explicitly reproduce the root cause.
Before the patch is applied the two new tests fail. After the patch is
applied the two new tests succeed and the other four tests still succeed.
The patch also addresses the demonstration code submitted in the bug report
for JDK-8081450 <https://bugs.openjdk.java.net/browse/JDK-8081450>. After
the patch is applied the demonstration code returns "hello world" as
expected.
## Work Around
This problem is not too difficult to work around. I created a subclass of
InputStream that wraps the HTTPInputStream but overrides .available() so
that it always returns a number > 0. GZIPInputStream then reads to the
end of the stream correctly. Example workaround code is in my answer on
Stack Overflow
<http://stackoverflow.com/questions/41400810/gzipinputstream-closes-prematurely-when-decompressing-httpinputstream>.
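The essence of that workaround is something like the following sketch
(illustrative only; the class name here is made up and the exact code in
the answer differs):

    import java.io.FilterInputStream;
    import java.io.InputStream;

    // Wraps the HTTP response stream and pretends data is always pending,
    // so the unpatched readTrailer() always tries the concatenated case and
    // only stops when readHeader() actually hits the end of the stream.
    class AlwaysAvailableInputStream extends FilterInputStream {
        AlwaysAvailableInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int available() {
            return 1; // any value > 0 works
        }
    }

    // Usage (e.g. with some URLConnection 'connection'):
    //   InputStream http = connection.getInputStream();
    //   GZIPInputStream gz =
    //       new GZIPInputStream(new AlwaysAvailableInputStream(http));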
Thank you,
Jeff Harwell
-------------- next part --------------
# HG changeset patch
# User Jeff Harwell <jharwell at fuller.edu>
# Date 1483565634 28800
# Wed Jan 04 13:33:54 2017 -0800
# Node ID 97437f51e996766c89de2e0a8662f58e3c0d45cd
# Parent a0c6f393b603a858ea970e3cd1d256ffd3be6789
Patched GZIPInputStream to pass new Test Suite
diff --git a/src/share/classes/java/util/zip/GZIPInputStream.java b/src/share/classes/java/util/zip/GZIPInputStream.java
--- a/src/share/classes/java/util/zip/GZIPInputStream.java
+++ b/src/share/classes/java/util/zip/GZIPInputStream.java
@@ -224,23 +224,28 @@
             (readUInt(in) != (inf.getBytesWritten() & 0xffffffffL)))
             throw new ZipException("Corrupt GZIP trailer");
-        // If there are more bytes available in "in" or
-        // the leftover in the "inf" is > 26 bytes:
-        //    this.trailer(8) + next.header.min(10) + next.trailer(8)
-        // try concatenated case
-        if (this.in.available() > 0 || n > 26) {
-            int m = 8;                  // this.trailer
-            try {
-                m += readHeader(in);    // next.header
-            } catch (IOException ze) {
-                return true;  // ignore any malformed, do nothing
-            }
-            inf.reset();
-            if (n > m)
-                inf.setInput(buf, len - n + m, n - m);
-            return false;
+        // We cannot know if the input stream has remaining bytes so
+        // assume that it does and try to read the next header. If there isn't
+        // one we will get an EOFException (see this.readUByte). Catch
+        // this exception and return true (eos reached)
+        int m = 8;                  // this.trailer
+        try {
+            m += readHeader(in);    // next.header
+        } catch (EOFException ze) {
+            // We are at the end of the stream
+            return true;
+        } catch (IOException ze) {
+            // We were not at the end of the stream but what was
+            // left in the stream didn't make any sense, ignore it
+            // we already have what we came for
+            return true;
         }
-        return true;
+        // We found a valid header for an additional GZIP file
+        inf.reset();
+        if (n > m) {
+            inf.setInput(buf, len - n + m, n - m);
+        }
+        return false; // there is more GZIP data, keep going (concatenated gzip data set)
     }
     /*