RFR(s): 8244659: Improve ZipFile.getInputStream

Volker Simonis volker.simonis at gmail.com
Fri May 8 15:36:11 UTC 2020


 Hi,

can I please have a review for the following small enhancement which
improves the speed of reading from ZipFile.getInputStream() by ~5%:

http://cr.openjdk.java.net/~simonis/webrevs/2020/8244659/
https://bugs.openjdk.java.net/browse/JDK-8244659

ZipFile.getInputStream() tries to find a good size for sizing the internal
buffer of the underlying InflaterInputStream. This buffer is used to read
the compressed data from the associated InputStream. Unfortunately,
ZipFile.getInputStream() uses CENLEN (i.e. the uncompressed size of a
ZipEntry) instead of CENSIZ (i.e. the compressed size of a ZipEntry) to
configure the input buffer and thus unnecessarily wastes memory, because
the corresponding, compressed input data is at most CENSIZ bytes long.

After fixing this and doing some benchmarks, I realized that a much bigger
problem is the continuous allocation of new, temporary input buffers for
each new input stream. Assuming that a zip files usually has hundreds if
not thousands of ZipEntries, I think it makes sense to cache these input
buffers. Fortunately, ZipFile already has a built-in mechanism for such
caching which it uses for caching the Inflaters needed for each new input
stream. In order to cache the buffers as well, I had to add a new ,
package-private constructor to InflaterInputStream. I'm not sure if it
makes sense to make this new constructor public, to enable other users of
InflaterInputStream to pre-allocate the buffer. If you think so, I'd be
happy to do that change and open a CSR for this issue.

Adding a cache for input stream buffers increases the speed of reading
ZipEntries from an InputStream by roughly 5% (see benchmark results below).
More importantly, it also decreases the memory consumption for each call to
ZipFile.getInputStream() which can be quite significant if many ZipEntries
are read from a ZipFile. One visible effect of caching the input buffers is
that the manual JTreg test java/util/zip/ZipFile/TestZipFile.java, which
regularly failed on my desktop with an OutOfMemoryError before, now
reliably passes (this tests calls ZipFile.getInputStream() excessively).

I've experimented with different buffer sizes (even with buffer sizes
depending on the size of the compressed ZipEntries), but couldn't see any
difference so I decided to go with a default buffer size of 65536 which
already was the maximal buffer size in use before my change.

I've also added a shortcut to Inflater which prevents us doing a native
call down to libz's inflate() method every time we call Inflater.inflate()
with "input == ZipUtils.defaultBuf" which is the default for every newly
created Inflater and for Inflaters after "Inflater.reset()" has been called
on them.

Following some JMH benchmark results which show the time and memory used to
read all bytes from a ZipEntry before and after this change. The 'size'
parameter denotes the uncompressed size of the corresponding ZipEntries.

In the "BEFORE" numbers, when looking at the "gc.alloc.rate.norm" values,
you can see the anomaly caused by using CENLEN instead of CENSIZ in
ZipFile.getInputStream(). I.e. getInputStream() chooses to big buffers
because it looks at the uncompressed ZipEntry sizes which are ~ 6 times
bigger than the compressed sizes. Also, the old implementation capped
buffers bigger than 65536 to 8192 bytes.

The memory savings for a call to getInputStream() are obviously the effect
of repetadly calling getInputStream() on the same zip file (becuase only in
that case, the caching of the input buffers pays of). But as I wrote
before, I think it is common to have mor then a few entries in a zip file
and even if not, the overhead of caching is minimal compared to the
situation we had before the change.

Thank you and best regards,
Volker

= BEFORE 8244659 =
Benchmark                                                          (size)
Mode  Cnt      Score     Error   Units
ZipFileGetInputStream.readAllBytes                                   1024
avgt    3     13.577 ±   0.540   us/op
ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm               1024
avgt    3   1872.673 ±   0.317    B/op
ZipFileGetInputStream.readAllBytes:·gc.count                         1024
avgt    3     57.000            counts
ZipFileGetInputStream.readAllBytes:·gc.time                          1024
avgt    3     15.000                ms
ZipFileGetInputStream.readAllBytes                                   4096
avgt    3     20.938 ±   0.577   us/op
ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm               4096
avgt    3   4945.793 ±   0.493    B/op
ZipFileGetInputStream.readAllBytes:·gc.count                         4096
avgt    3    102.000            counts
ZipFileGetInputStream.readAllBytes:·gc.time                          4096
avgt    3     25.000                ms
ZipFileGetInputStream.readAllBytes                                  16384
avgt    3     51.348 ±   2.600   us/op
ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm              16384
avgt    3  17238.030 ±   3.183    B/op
ZipFileGetInputStream.readAllBytes:·gc.count                        16384
avgt    3    144.000            counts
ZipFileGetInputStream.readAllBytes:·gc.time                         16384
avgt    3     33.000                ms
ZipFileGetInputStream.readAllBytes                                  65536
avgt    3    203.082 ±   7.046   us/op
ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm              65536
avgt    3   9035.475 ±   7.426    B/op
ZipFileGetInputStream.readAllBytes:·gc.count                        65536
avgt    3     18.000            counts
ZipFileGetInputStream.readAllBytes:·gc.time                         65536
avgt    3      5.000                ms
ZipFileGetInputStream.readAllBytes                                 262144
avgt    3    801.928 ±  22.474   us/op
ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm             262144
avgt    3   9034.192 ±   0.047    B/op
ZipFileGetInputStream.readAllBytes:·gc.count                       262144
avgt    3      3.000            counts
ZipFileGetInputStream.readAllBytes:·gc.time                        262144
avgt    3      1.000                ms
ZipFileGetInputStream.readAllBytes                                1048576
avgt    3   3154.747 ±  57.588   us/op
ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm            1048576
avgt    3   9032.194 ±   0.004    B/op
ZipFileGetInputStream.readAllBytes:·gc.count                      1048576
avgt    3        ≈ 0            counts

= AFTER 8244659 =
Benchmark                                                          (size)
Mode  Cnt     Score    Error   Units
ZipFileGetInputStream.readAllBytes                                   1024
avgt    3    13.031 ±  0.452   us/op
ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm               1024
avgt    3   824.311 ±  0.027    B/op
ZipFileGetInputStream.readAllBytes:·gc.count                         1024
avgt    3    27.000           counts
ZipFileGetInputStream.readAllBytes:·gc.time                          1024
avgt    3     7.000               ms
ZipFileGetInputStream.readAllBytes                                   4096
avgt    3    20.018 ±  0.805   us/op
ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm               4096
avgt    3   824.289 ±  0.722    B/op
ZipFileGetInputStream.readAllBytes:·gc.count                         4096
avgt    3    15.000           counts
ZipFileGetInputStream.readAllBytes:·gc.time                          4096
avgt    3     4.000               ms
ZipFileGetInputStream.readAllBytes                                  16384
avgt    3    48.916 ±  1.140   us/op
ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm              16384
avgt    3   824.263 ±  0.008    B/op
ZipFileGetInputStream.readAllBytes:·gc.count                        16384
avgt    3     6.000           counts
ZipFileGetInputStream.readAllBytes:·gc.time                         16384
avgt    3     1.000               ms
ZipFileGetInputStream.readAllBytes                                  65536
avgt    3   192.815 ±  4.102   us/op
ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm              65536
avgt    3   824.012 ±  0.001    B/op
ZipFileGetInputStream.readAllBytes:·gc.count                        65536
avgt    3       ≈ 0           counts
ZipFileGetInputStream.readAllBytes                                 262144
avgt    3   755.713 ± 42.408   us/op
ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm             262144
avgt    3   824.047 ±  0.003    B/op
ZipFileGetInputStream.readAllBytes:·gc.count                       262144
avgt    3       ≈ 0           counts
ZipFileGetInputStream.readAllBytes                                1048576
avgt    3  2989.236 ±  8.808   us/op
ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm            1048576
avgt    3   824.184 ±  0.002    B/op
ZipFileGetInputStream.readAllBytes:·gc.count                      1048576
avgt    3       ≈ 0           counts


More information about the core-libs-dev mailing list