RFR: 8282648: Weaken the InflaterInputStream specification in order to allow faster Zip implementations [v11]

Thu Aug 25 20:53:58 UTC 2022

On Wed, 27 Jul 2022 10:47:44 GMT, Volker Simonis <simonis at openjdk.org> wrote:

>> Add an API note to `InflaterInputStream::read(byte[] b, int off, int len)` to highlight that it might  write more bytes than the returned number of inflated bytes into the buffer `b`.
>> 
>> The superclass `java.io.InputStream` specifies that `read(byte[] b, int off, int len)` will leave the content beyond the last read byte in the read buffer `b` unaffected. However, the overridden `read` method in `InflaterInputStream` passes the read buffer `b` to `Inflater::inflate(byte[] b, int off, int len)` which doesn't provide this guarantee. Depending on implementation details, `Inflater::inflate` might write more than the returned number of inflated bytes into the buffer `b`.
>> 
>> ### TL;DR
>> 
>> `java.util.zip.Inflater` is the Java wrapper class for zlib's inflater functionality. `Inflater::inflate(byte[] output, int off, int len)` currently calls zlib's native `inflate(..)` function and passes the address of `output[off]` and `len` to it via JNI.
>> 
>> The specification of zlib's `inflate(..)` function (i.e. the [API documentation in the original zlib implementation](https://github.com/madler/zlib/blob/cacf7f1d4e3d44d871b605da3b647f07d718623f/zlib.h#L400)) doesn't give any guarantees with regard to usage of the output buffer. It only states that upon completion the function will return the number of bytes that have been written (i.e. "inflated") into the output buffer.
>> 
>> The original zlib implementation only wrote as many bytes into the output buffer as it inflated. However, this is not a hard requirement and newer, more performant implementations of the zlib library like [zlib-chromium](https://chromium.googlesource.com/chromium/src/third_party/zlib/) or [zlib-cloudflare](https://github.com/cloudflare/zlib) can use more bytes of the output buffer than they actually inflate as a scratch buffer. See https://github.com/simonis/zlib-chromium for a more detailed description of their approach and its performance benefit.
>> 
>> These new zlib versions can still be used transparently from Java (e.g. by putting them into the `LD_LIBRARY_PATH` or by using `LD_PRELOAD`), because they still fully comply to specification of `Inflater::inflate(..)`. However, we might run into problems when using the `Inflater` functionality from the `InflaterInputStream` class. `InflaterInputStream` is derived from from `InputStream` and as such, its `read(byte[] b, int off, int len)` method is quite constrained. It specifically specifies that if *k* bytes have been read, then "these bytes will be stored in elements `b[off]` through `b[off+`*k*`-1]`, leaving elements `b[off+`*k*`]` through `b[off+len-1]` **unaffected**". But `InflaterInputStream::read(byte[] b, int off, int len)` (which is constrained by `InputStream::read(..)`'s specification) calls `Inflater::inflate(byte[] b, int off, int len)` and directly passes its output buffer down to the native zlib `inflate(..)` method which is free to change the bytes beyond `b[off+`
 *k*`]` (where *k* is the number of inflated bytes).
>> 
>> From a practical point of view, I don't see this as a big problem, because callers of `InflaterInputStream::read(byte[] b, int off, int len)` can never know how many bytes will be written into the output buffer `b` (and in fact its content can always be completely overwritten). It therefore makes no sense to depend on any data there being untouched after the call. Also, having used zlib-cloudflare productively for about two years, we haven't seen real-world issues because of this behavior yet. However, from a specification point of view it is easy to artificially construct a program which violates `InflaterInputStream::read(..)`'s postcondition if using one of the alterantive zlib implementations. A recently integrated JTreg test (test/jdk/jdk/nio/zipfs/ZipFSOutputStreamTest.java) "unintentionally" fails with zlib-chromium but can fixed easily to run with alternative implementations as well (see [JDK-8283756](https://bugs.openjdk.java.net/browse/JDK-8283756)).
>
> Volker Simonis has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
> 
>  - Add apiNote to methods which return InflaterInputStreams as InputStreams
>  - Updated wording based on @JoeDarcy's CSR review
>  - Updated wording based on @LanceAndersen's review
>  - Adapted wording based on @AlanBateman's review, removed implementation note on ZipFile::getInputStream and aligned wording for all ::read methods
>  - Adapted wording and copyrights based on @jaikiran review
>  - Added API-refinement to GZIPInputStream::read()/ZipInputStream::read() and an Implementation note to ZipFile::getInputStream()
>  - Extended API-doc based on reviewer feedback
>  - 8282648: Weaken the InflaterInputStream specification in order to allow faster Zip implementations

Please update the CSR once you’ve finalized the specification changes.

src/java.base/share/classes/java/net/URLConnection.java line 848:

> 846:      * last inflated byte undefined after a read operation (see {@link
> 847:      * java.util.zip.InflaterInputStream#read(byte[], int, int)
> 848:      * InflaterInputStream.read(byte[], int, int)}).

Consider this wording, which makes the danger clearer, here and elsewhere:

     * @apiNote The {@code InputStream} returned by this method can wrap an
     * {@link java.util.zip.InflaterInputStream InflaterInputStream}, whose
     * {@link java.util.zip.InflaterInputStream#read(byte[], int, int)
     * read(byte[], int, int)} method can modify any element of the output
     * buffer.

src/java.base/share/classes/java/util/zip/InflaterInputStream.java line 136:

> 134:      * are undefined (an implementation is free to change them during the inflate
> 135:      * operation). If the return value is -1 or an exception is thrown, then the content of
> 136:      * {@code b[off]} to {@code b[off+}<i>len</i>{@code -1]} is undefined.

Consider this more precise wording, both here and in `GZIPInputStream`, `ZipInputStream`, and `JarInputStream`:

     * If this method returns a nonzero integer <i>n</i> then {@code buf[off]}
     * through {@code buf[off+}<i>n</i>{@code -1]} contain the uncompressed
     * data.  The content of elements {@code buf[off+}<i>n</i>{@code ]} through
     * {@code buf[off+}<i>len</i>{@code -1]} is undefined, contrary to the
     * specification of the {@link java.io.InputStream InputStream} superclass,
     * so an implementation is free to modify these elements during the inflate
     * operation. If this method returns {@code -1} or throws an exception then
     * the content of {@code buf[off]} through {@code buf[off+}<i>len</i>{@code
     * -1]} is undefined.

src/java.base/share/classes/java/util/zip/InflaterInputStream.java line 141:

> 139:      * @param off the start offset in the destination array {@code b}
> 140:      * @param len the maximum number of bytes read
> 141:      * @return the actual number of inflated bytes, or -1 if the end of the

s/inflated bytes/bytes inflated/

-------------

Changes requested by mr (Lead).

PR: https://git.openjdk.org/jdk/pull/7986