RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v8]

Lance Andersen lancea at openjdk.org
Tue Jul 30 17:38:36 UTC 2024


On Sat, 27 Jul 2024 15:00:51 GMT, Archie Cobbs <acobbs at openjdk.org> wrote:

>> `GZIPInputStream` supports reading data from multiple concatenated GZIP data streams since [JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425). In order to do this, after a GZIP trailer frame is read, it attempts to read a GZIP header frame and, if successful, proceeds onward to decompress the new stream. If the attempt to decode a GZIP header frame fails, or happens to trigger an `IOException`, it just ignores the trailing garbage and/or the `IOException` and returns EOF.
>> 
>> There are several issues with this:
>> 
>> 1. The behaviors of (a) supporting concatenated streams and (b) ignoring trailing garbage are not documented, much less precisely specified.
>> 2. Ignoring trailing garbage is dubious because it could easily hide errors or other data corruption that an application would rather be notified about. Moreover, the API claims that a `ZipException` will be thrown when corrupt data is read, but obviously that doesn't happen in the trailing garbage scenario (so N concatenated streams where the last one has a corrupted header frame is indistinguishable from N-1 valid streams).
>> 3. There's no way to create a `GZIPInputStream` that does _not_ support stream concatenation.
>> 
>> On the other hand, `GZIPInputStream` is an old class with lots of existing usage, so it's important to preserve the existing behavior, warts and all (note: my the definition of "existing behavior" here includes the bug fix in [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)).
>> 
>> So this patch adds a new constructor that takes two new parameters `allowConcatenation` and `allowTrailingGarbage`. The legacy behavior is enabled by setting both to true; otherwise, they do what they sound like. In particular, when `allowTrailingGarbage` is false, then the underlying input must contain exactly one (if `allowConcatenation` is false) or exactly N (if `allowConcatenation` is true) concatenated GZIP data streams, otherwise an exception is guaranteed.
>
> Archie Cobbs has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Refactor to eliminate "lenient" mode (still failng test: GZIPInZip.java).
>  - Add GZIP input test for various gzip(1) command outputs.

Consider the following simple gzip test

gz % cat batman.txt 
I am batman
gz % cat hello.txt 
hello
 % cat robin.txt 
Robin Here
gz  % cat Joker.txt                          
I am the Joker

gz % gzip -c batman.txt hello.txt >> bats.gz
gz % gunzip -c bats                         
I am batman
hello

gz % cp bats.gz joker.gz                                           

gz % cat Joker.txt >> joker.gz                                           
gz % gunzip -c joker                                                     
I am batman
hello
gunzip: joker.gz: trailing garbage ignored

gz % cp joker.gz badRobin.gz                                             
gz % gzip -c robin.txt >> badRobin.gz                                             
gz % gunzip -c badRobin                                                           
I am batman
hello
gunzip: badRobin.gz: trailing garbage ignored

Here is the hexdump of the badRobin.gz

gz % hexdump -C badRobin.gz 
00000000  1f 8b 08 08 81 13 a9 66  00 03 62 61 74 6d 61 6e  |.......f..batman|
00000010  2e 74 78 74 00 f3 54 48  cc 55 48 4a 2c c9 4d cc  |.txt..TH.UHJ,.M.|
00000020  e3 02 00 f0 49 dd 72 0c  00 00 00 1f 8b 08 08 8f  |....I.r.........|
00000030  13 a9 66 00 03 68 65 6c  6c 6f 2e 74 78 74 00 cb  |..f..hello.txt..|
00000040  48 cd c9 c9 e7 02 00 20  30 3a 36 06 00 00 00 49  |H...... 0:6....I|
00000050  20 61 6d 20 74 68 65 20  4a 6f 6b 65 72 0a 1f 8b  | am the Joker...|
00000060  08 08 66 0f a9 66 00 03  72 6f 62 69 6e 2e 74 78  |..f..f..robin.tx|
00000070  74 00 0b ca 4f ca cc 53  f0 48 2d 4a e5 02 00 b3  |t...O..S.H-J....|
00000080  74 fc c1 0b 00 00 00                              |t......|
00000087


The following program will use GZIPInputStream and GzipCompressorinputStream to access badRobin.gz

package gzip;

import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.testng.annotations.Test;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

public class GZipInputStreamTest {
    @Test
    public void openWithGZipInputStream() throws IOException {
        String f = "gz/badRobin.gz";
        try (var is = new FileInputStream(f); var gis = new GZIPInputStream(is)) {
            var bytes = gis.readAllBytes();
            System.out.printf("contents: %s%n", new String(bytes));
        }
    }
    @Test
    public void openWithGzipCompressorInputStream() throws IOException {
        String f = "gz/badRobin.gz";;
        try (var is = new FileInputStream(f);
             var giz = new  GzipCompressorInputStream(is, true)) {
            var bytes = giz.readAllBytes();
            System.out.printf("contents: %s%n", new String(bytes));
        }
    }
}


The output from openWithGZipInputStream():


contents: I am batman
hello



The output from openWithGzipCompressorInputStream():


java.io.IOException: Garbage after a valid .gz stream

	at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.init(GzipCompressorInputStream.java:193)
	at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.read(GzipCompressorInputStream.java:356)
	at java.base/java.io.InputStream.readNBytes(InputStream.java:412)
	at java.base/java.io.InputStream.readAllBytes(InputStream.java:349)
	at gzip.GZipInputStreamTest.openWithGzipCompressorInputStream(GZipInputStreamTest.java:31)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
	at org.testng.internal.TestInvoker.invokeMethod(TestInvoker.java:599)
	at org.testng.internal.TestInvoker.invokeTestMethod(TestInvoker.java:174)
	at org.testng.internal.MethodRunner.runInSequence(MethodRunner.java:46)
	at org.testng.internal.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:822)
	at org.testng.internal.TestInvoker.invokeTestMethods(TestInvoker.java:147)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:128)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1597)
	at org.testng.TestRunner.privateRun(TestRunner.java:764)
	at org.testng.TestRunner.run(TestRunner.java:585)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:384)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:378)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:337)
	at org.testng.SuiteRunner.run(SuiteRunner.java:286)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:53)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:96)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1218)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1140)
	at org.testng.TestNG.runSuites(TestNG.java:1069)
	at org.testng.TestNG.run(TestNG.java:1037)
	at com.intellij.rt.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:66)
	at com.intellij.rt.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:105)




The behavior of GZIPInputStream is in-line with gzip and gunzip where it stops processing after encountering bad data after the  8 byte end trailer containing the CRC32 and ISIZE fields .  GzipCompressorInputStream throws an IOException without returning any data.

WinZip on my Mac also opens badRobin.gz  similar to gunzip, that is does not return an error.

Based on the above, I am reluctant to change the current behavior given it appears to have been modeled after gzip/gunzip as well as WinZip.

As far as providing a means to fail whenever there is bad data after the end trailer, I think we would need more input on the merits supporting this and how the behavior would be model if GZIPInputStream was enhanced.

> Did you double check the unviewable issues? What does JDK-8023431 say?

It is just a fix for an internal intermittent test run failure.  No additional details

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2258868069


More information about the core-libs-dev mailing list