RFC: JDK-8282648: Problems due to conflicting specification of Inflater::inflate(..) and InflaterInputStream::read(..)

Tue Mar 22 21:35:29 UTC 2022

On Mar 22, 2022, at 12:28 PM, Volker Simonis <volker.simonis at gmail.com<mailto:volker.simonis at gmail.com>> wrote:

On Mon, Mar 21, 2022 at 8:24 PM Lance Andersen
<lance.andersen at oracle.com<mailto:lance.andersen at oracle.com>> wrote:

Hi Lance,

Thanks for looking into this issue. Please find my answers inline:

Hi Volker,

I have read through what you have provided/pointed to, thank you, and on the surface what you are suggesting sounds reasonable.

That being said given that this API dates back to 1997ish, I think we have to be careful not introduce any regressions with existing applications with the proposal you suggest(even though it is just relaxes the spec), as you mentioned their is a jtreg test that  seems to fail.

Have you run the JCK with your ZLIB implementation?  I only skimmed the tests but looks like there might be a couple of tests which validate the array’s contents.

Yes, I did run the JCK tests with the optimized zlib implementations
and couldn't find any problems related to this issue. We've also using
the optimized version of zlib in production for quite a while without
any problems. However, I did found a problem related to a test which
was copied from "test/jdk/java/util/zip/DeflateIn_InflateOut.java".

That was the test on a quick scan I wondered about actually...
That jtreg test was fixed with "8240226: DeflateIn_InflateOut.java
test incorrectly assumes size of compressed" [2] in JDK 15 but
apparently that fix didn't made it into the JCK version. The test has
now been excluded from JCK 17/18.

Finally, the currently failing "nio/zipfs/ZipFSOutputStreamTest.java"
[3] jtreg test is not failing because it specifically checks that the
bytes in the output buffer beyond the last inflated byte remains
untouched. It's just because they use a poor, "lazy" method of
comparing inflated content to the original content and can easily be
fixed. Instead of:
```
while ((numRead = is.read(buf)) != -1) {
   totalRead += numRead;
   // verify the content
   Assert.assertEquals(Arrays.mismatch(buf, chunk), -1, "Unexpected content");
}
```
just use:
```
while ((numRead = is.read(buf)) != -1) {
   totalRead += numRead;
   // verify the content
   Assert.assertEquals(Arrays.mismatch(buf, 0, numRead, chunk, 0,
numRead), -1, "Unexpected content");
}
```

Why don’t you generate a PR to fix this separate from the change in spec proposal?

[1] https://urldefense.com/v3/__https://github.com/openjdk/jdk/blob/master/test/jdk/java/util/zip/DeflateIn_InflateOut.java__;!!ACWV5N9M2RV99hQ!d0HFq-TCYrPXsCP8R_ja46jCDmVAXQCpz58ddo4rzXdRdDikJ27vpsKzBuKmMKov2g$
[2] https://bugs.openjdk.java.net/browse/JDK-8240226
[3] https://urldefense.com/v3/__https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/nio/zipfs/ZipFSOutputStreamTest.java__;!!ACWV5N9M2RV99hQ!d0HFq-TCYrPXsCP8R_ja46jCDmVAXQCpz58ddo4rzXdRdDikJ27vpsKzBuJSeRPJgw$

We don’t have a lot of test coverage so if we were to consider moving forward with your proposal, I believe additional test coverage would be warranted.

Sure, I'll be happy to add some more testing. Do you have specific
ideas? In fact my suggestion relaxes the specification of read(..)
which would be hard to check :)

Just some general coverage of the API would be great not expecting tests for the relaxed specification

I think you are probably at a point where you can craft a draft CSR and PR and we can continue to review and discuss from there

Best
Lance

Thank you and best regards,
Volker

Best
Lance

On Mar 4, 2022, at 5:04 AM, Volker Simonis <volker.simonis at gmail.com<mailto:volker.simonis at gmail.com>> wrote:

`java.util.zip.Inflater` is the Java wrapper class for zlib's inflater
functionality. `Inflater::inflate(byte[] output, int off, int len)`
currently calls zlib's native `inflate(..)` function and passes the
address of `output[off]` and `len` to it via JNI.

The specification of zlib's `inflate(..)` function (i.e. the [API
documentation in the original zlib
implementation](https://urldefense.com/v3/__https://github.com/madler/zlib/blob/cacf7f1d4e3d44d871b605da3b647f07d718623f/zlib.h*L400__;Iw!!ACWV5N9M2RV99hQ!d0HFq-TCYrPXsCP8R_ja46jCDmVAXQCpz58ddo4rzXdRdDikJ27vpsKzBuKV0E1HqA$ ))
doesn't give any guarantees with regard to usage of the output buffer.
It only states that upon completion the function will return the
number of bytes that have been written (i.e. "inflated") into the
output buffer.

The original zlib implementation only wrote as many bytes into the
output buffer as it inflated. However, this is not a hard requirement
and newer, more performant implementations of the zlib library like
[zlib-chromium](https://urldefense.com/v3/__https://chromium.googlesource.com/chromium/src/third_party/zlib/__;!!ACWV5N9M2RV99hQ!d0HFq-TCYrPXsCP8R_ja46jCDmVAXQCpz58ddo4rzXdRdDikJ27vpsKzBuJVkxWFew$ )
or [zlib-cloudflare](https://urldefense.com/v3/__https://github.com/cloudflare/zlib__;!!ACWV5N9M2RV99hQ!d0HFq-TCYrPXsCP8R_ja46jCDmVAXQCpz58ddo4rzXdRdDikJ27vpsKzBuIpQS_D6w$ ) can use more
bytes of the output buffer than they actually inflate as a scratch
buffer. See https://urldefense.com/v3/__https://github.com/simonis/zlib-chromium__;!!ACWV5N9M2RV99hQ!d0HFq-TCYrPXsCP8R_ja46jCDmVAXQCpz58ddo4rzXdRdDikJ27vpsKzBuKrlfpHiw$  for a more
detailed description of their approach and its performance benefit.

These new zlib versions can still be used transparently from Java
(e.g. by putting them into the `LD_LIBRARY_PATH` or by using
`LD_PRELOAD`), because they still fully comply to specification of
`Inflater::inflate(..)`. However, we might run into problems when
using the `Inflater` functionality from the `InflaterInputStream`
class. `InflaterInputStream` is derived from from `InputStream` and as
such, its `read(byte[] b, int off, int len)` method is quite
constrained. It specifically specifies that if *k* bytes have been
read, then "these bytes will be stored in elements `b[off]` through
`b[off+`*k*`-1]`, leaving elements `b[off+`*k*`]` through
`b[off+len-1]` **unaffected**". But `InflaterInputStream::read(byte[]
b, int off, int len)` (which is constrained by
`InputStream::read(..)`'s specification) calls
`Inflater::inflate(byte[] b, int off, int len)` and directly passes
its output buffer down to the native zlib `inflate(..)` method which
is free to change the bytes beyond `b[off+`*k*`]` (where *k* is the
number of inflated bytes).

From a practical point of view, I don't see this as a big problem,
because callers of `InflaterInputStream::read(byte[] b, int off, int
len)` can never know how many bytes will be written into the output
buffer `b` (and in fact its content can always be completely
overwritten). It therefore makes no sense to depend on any data there
being untouched after the call. Also, having used zlib-cloudflare
productively for about two years, we haven't seen real-world issues
because of this behavior yet. However, from a specification point of
view it is easy to artificially construct a program which violates
`InflaterInputStream::read(..)`'s postcondition if using one of the
alterantive zlib implementations. A recently integrated JTreg test
(test/jdk/jdk/nio/zipfs/ZipFSOutputStreamTest.java) "unintentionally"
fails with zlib-chromium but could be easily fixed to run with
alternative implementations as well.

What should/can be done to solve this problem? I think we have three options:

1. Relax the specification of `InflaterInputStream::read(..)` and
specifically note in the API documentation that a call to
`InflaterInputStream::read(byte[] b, int off, int len)` may write more
than *k* characters into `b` where *k* is the returned number of
inflated bytes. Notice that there's a precedence for this approach in
`java.io.ByteArrayInputStream::read(..)` which "*unlike the overridden
method of InputStream, returns -1 instead of zero if the end of the
stream has been reached and len==0*".

2. Tighten the specification of `Inflater::read(byte[] b, int off, int
len)` to explicitly forbid any writes into the output array `b` beyond
the inflated bytes.

3. Change the implementation of `InflaterInputStream::read(..)` to
call `Inflater::read(..)` with a temporary buffer and only copy the
nu,ber of inflated bytes into `InflaterInputStream::read(..)`'s output
buffer. I think we all agree that this is only a theoretic option
because of its unacceptable performance regression.

I personally favor option 1 but I'm interested in your opinions?

Thank you and best regards,
Volker

Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037
Oracle Java Engineering
1 Network Drive
Burlington, MA 01803
Lance.Andersen at oracle.com<mailto:Lance.Andersen at oracle.com>

[cid:E1C4E2F0-ECD0-4C9D-ADB4-B16CA7BCB7FC at home]

Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037
Oracle Java Engineering
1 Network Drive
Burlington, MA 01803
Lance.Andersen at oracle.com<mailto:Lance.Andersen at oracle.com>