RFR: 8322256: Define and document GZIPInputStream concatenated stream semantics [v8]
Lance Andersen
lancea at openjdk.org
Tue Jul 30 17:38:36 UTC 2024
On Sat, 27 Jul 2024 15:00:51 GMT, Archie Cobbs <acobbs at openjdk.org> wrote:
>> `GZIPInputStream` supports reading data from multiple concatenated GZIP data streams since [JDK-4691425](https://bugs.openjdk.org/browse/JDK-4691425). In order to do this, after a GZIP trailer frame is read, it attempts to read a GZIP header frame and, if successful, proceeds onward to decompress the new stream. If the attempt to decode a GZIP header frame fails, or happens to trigger an `IOException`, it just ignores the trailing garbage and/or the `IOException` and returns EOF.
>>
>> There are several issues with this:
>>
>> 1. The behaviors of (a) supporting concatenated streams and (b) ignoring trailing garbage are not documented, much less precisely specified.
>> 2. Ignoring trailing garbage is dubious because it could easily hide errors or other data corruption that an application would rather be notified about. Moreover, the API claims that a `ZipException` will be thrown when corrupt data is read, but obviously that doesn't happen in the trailing garbage scenario (so N concatenated streams where the last one has a corrupted header frame is indistinguishable from N-1 valid streams).
>> 3. There's no way to create a `GZIPInputStream` that does _not_ support stream concatenation.
>>
>> On the other hand, `GZIPInputStream` is an old class with lots of existing usage, so it's important to preserve the existing behavior, warts and all (note: my the definition of "existing behavior" here includes the bug fix in [JDK-7036144](https://bugs.openjdk.org/browse/JDK-7036144)).
>>
>> So this patch adds a new constructor that takes two new parameters `allowConcatenation` and `allowTrailingGarbage`. The legacy behavior is enabled by setting both to true; otherwise, they do what they sound like. In particular, when `allowTrailingGarbage` is false, then the underlying input must contain exactly one (if `allowConcatenation` is false) or exactly N (if `allowConcatenation` is true) concatenated GZIP data streams, otherwise an exception is guaranteed.
>
> Archie Cobbs has updated the pull request incrementally with two additional commits since the last revision:
>
> - Refactor to eliminate "lenient" mode (still failng test: GZIPInZip.java).
> - Add GZIP input test for various gzip(1) command outputs.
Consider the following simple gzip test
gz % cat batman.txt
I am batman
gz % cat hello.txt
hello
% cat robin.txt
Robin Here
gz % cat Joker.txt
I am the Joker
gz % gzip -c batman.txt hello.txt >> bats.gz
gz % gunzip -c bats
I am batman
hello
gz % cp bats.gz joker.gz
gz % cat Joker.txt >> joker.gz
gz % gunzip -c joker
I am batman
hello
gunzip: joker.gz: trailing garbage ignored
gz % cp joker.gz badRobin.gz
gz % gzip -c robin.txt >> badRobin.gz
gz % gunzip -c badRobin
I am batman
hello
gunzip: badRobin.gz: trailing garbage ignored
Here is the hexdump of the badRobin.gz
gz % hexdump -C badRobin.gz
00000000 1f 8b 08 08 81 13 a9 66 00 03 62 61 74 6d 61 6e |.......f..batman|
00000010 2e 74 78 74 00 f3 54 48 cc 55 48 4a 2c c9 4d cc |.txt..TH.UHJ,.M.|
00000020 e3 02 00 f0 49 dd 72 0c 00 00 00 1f 8b 08 08 8f |....I.r.........|
00000030 13 a9 66 00 03 68 65 6c 6c 6f 2e 74 78 74 00 cb |..f..hello.txt..|
00000040 48 cd c9 c9 e7 02 00 20 30 3a 36 06 00 00 00 49 |H...... 0:6....I|
00000050 20 61 6d 20 74 68 65 20 4a 6f 6b 65 72 0a 1f 8b | am the Joker...|
00000060 08 08 66 0f a9 66 00 03 72 6f 62 69 6e 2e 74 78 |..f..f..robin.tx|
00000070 74 00 0b ca 4f ca cc 53 f0 48 2d 4a e5 02 00 b3 |t...O..S.H-J....|
00000080 74 fc c1 0b 00 00 00 |t......|
00000087
The following program will use GZIPInputStream and GzipCompressorinputStream to access badRobin.gz
package gzip;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.testng.annotations.Test;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
public class GZipInputStreamTest {
@Test
public void openWithGZipInputStream() throws IOException {
String f = "gz/badRobin.gz";
try (var is = new FileInputStream(f); var gis = new GZIPInputStream(is)) {
var bytes = gis.readAllBytes();
System.out.printf("contents: %s%n", new String(bytes));
}
}
@Test
public void openWithGzipCompressorInputStream() throws IOException {
String f = "gz/badRobin.gz";;
try (var is = new FileInputStream(f);
var giz = new GzipCompressorInputStream(is, true)) {
var bytes = giz.readAllBytes();
System.out.printf("contents: %s%n", new String(bytes));
}
}
}
The output from openWithGZipInputStream():
contents: I am batman
hello
The output from openWithGzipCompressorInputStream():
java.io.IOException: Garbage after a valid .gz stream
at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.init(GzipCompressorInputStream.java:193)
at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.read(GzipCompressorInputStream.java:356)
at java.base/java.io.InputStream.readNBytes(InputStream.java:412)
at java.base/java.io.InputStream.readAllBytes(InputStream.java:349)
at gzip.GZipInputStreamTest.openWithGzipCompressorInputStream(GZipInputStreamTest.java:31)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
at org.testng.internal.TestInvoker.invokeMethod(TestInvoker.java:599)
at org.testng.internal.TestInvoker.invokeTestMethod(TestInvoker.java:174)
at org.testng.internal.MethodRunner.runInSequence(MethodRunner.java:46)
at org.testng.internal.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:822)
at org.testng.internal.TestInvoker.invokeTestMethods(TestInvoker.java:147)
at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:128)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1597)
at org.testng.TestRunner.privateRun(TestRunner.java:764)
at org.testng.TestRunner.run(TestRunner.java:585)
at org.testng.SuiteRunner.runTest(SuiteRunner.java:384)
at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:378)
at org.testng.SuiteRunner.privateRun(SuiteRunner.java:337)
at org.testng.SuiteRunner.run(SuiteRunner.java:286)
at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:53)
at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:96)
at org.testng.TestNG.runSuitesSequentially(TestNG.java:1218)
at org.testng.TestNG.runSuitesLocally(TestNG.java:1140)
at org.testng.TestNG.runSuites(TestNG.java:1069)
at org.testng.TestNG.run(TestNG.java:1037)
at com.intellij.rt.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:66)
at com.intellij.rt.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:105)
The behavior of GZIPInputStream is in-line with gzip and gunzip where it stops processing after encountering bad data after the 8 byte end trailer containing the CRC32 and ISIZE fields . GzipCompressorInputStream throws an IOException without returning any data.
WinZip on my Mac also opens badRobin.gz similar to gunzip, that is does not return an error.
Based on the above, I am reluctant to change the current behavior given it appears to have been modeled after gzip/gunzip as well as WinZip.
As far as providing a means to fail whenever there is bad data after the end trailer, I think we would need more input on the merits supporting this and how the behavior would be model if GZIPInputStream was enhanced.
> Did you double check the unviewable issues? What does JDK-8023431 say?
It is just a fix for an internal intermittent test run failure. No additional details
-------------
PR Comment: https://git.openjdk.org/jdk/pull/18385#issuecomment-2258868069
More information about the core-libs-dev
mailing list