RFR(s): 8244659: Improve ZipFile.getInputStream

Lance Andersen lance.andersen at oracle.com
Mon May 11 20:26:58 UTC 2020


Hi Volker,

Could you update your patch now that Claes’s changes are back as I think that would make it easier to review 

Thank you!

> On May 8, 2020, at 11:36 AM, Volker Simonis <volker.simonis at gmail.com> wrote:
> 
> Hi,
> 
> can I please have a review for the following small enhancement which
> improves the speed of reading from ZipFile.getInputStream() by ~5%:
> 
> http://cr.openjdk.java.net/~simonis/webrevs/2020/8244659/
> https://bugs.openjdk.java.net/browse/JDK-8244659
> 
> ZipFile.getInputStream() tries to find a good size for sizing the internal
> buffer of the underlying InflaterInputStream. This buffer is used to read
> the compressed data from the associated InputStream. Unfortunately,
> ZipFile.getInputStream() uses CENLEN (i.e. the uncompressed size of a
> ZipEntry) instead of CENSIZ (i.e. the compressed size of a ZipEntry) to
> configure the input buffer and thus unnecessarily wastes memory, because
> the corresponding, compressed input data is at most CENSIZ bytes long.
> 
> After fixing this and doing some benchmarks, I realized that a much bigger
> problem is the continuous allocation of new, temporary input buffers for
> each new input stream. Assuming that a zip files usually has hundreds if
> not thousands of ZipEntries, I think it makes sense to cache these input
> buffers. Fortunately, ZipFile already has a built-in mechanism for such
> caching which it uses for caching the Inflaters needed for each new input
> stream. In order to cache the buffers as well, I had to add a new ,
> package-private constructor to InflaterInputStream. I'm not sure if it
> makes sense to make this new constructor public, to enable other users of
> InflaterInputStream to pre-allocate the buffer. If you think so, I'd be
> happy to do that change and open a CSR for this issue.
> 
> Adding a cache for input stream buffers increases the speed of reading
> ZipEntries from an InputStream by roughly 5% (see benchmark results below).
> More importantly, it also decreases the memory consumption for each call to
> ZipFile.getInputStream() which can be quite significant if many ZipEntries
> are read from a ZipFile. One visible effect of caching the input buffers is
> that the manual JTreg test java/util/zip/ZipFile/TestZipFile.java, which
> regularly failed on my desktop with an OutOfMemoryError before, now
> reliably passes (this tests calls ZipFile.getInputStream() excessively).
> 
> I've experimented with different buffer sizes (even with buffer sizes
> depending on the size of the compressed ZipEntries), but couldn't see any
> difference so I decided to go with a default buffer size of 65536 which
> already was the maximal buffer size in use before my change.
> 
> I've also added a shortcut to Inflater which prevents us doing a native
> call down to libz's inflate() method every time we call Inflater.inflate()
> with "input == ZipUtils.defaultBuf" which is the default for every newly
> created Inflater and for Inflaters after "Inflater.reset()" has been called
> on them.
> 
> Following some JMH benchmark results which show the time and memory used to
> read all bytes from a ZipEntry before and after this change. The 'size'
> parameter denotes the uncompressed size of the corresponding ZipEntries.
> 
> In the "BEFORE" numbers, when looking at the "gc.alloc.rate.norm" values,
> you can see the anomaly caused by using CENLEN instead of CENSIZ in
> ZipFile.getInputStream(). I.e. getInputStream() chooses to big buffers
> because it looks at the uncompressed ZipEntry sizes which are ~ 6 times
> bigger than the compressed sizes. Also, the old implementation capped
> buffers bigger than 65536 to 8192 bytes.
> 
> The memory savings for a call to getInputStream() are obviously the effect
> of repetadly calling getInputStream() on the same zip file (becuase only in
> that case, the caching of the input buffers pays of). But as I wrote
> before, I think it is common to have mor then a few entries in a zip file
> and even if not, the overhead of caching is minimal compared to the
> situation we had before the change.
> 
> Thank you and best regards,
> Volker
> 
> = BEFORE 8244659 =
> Benchmark                                                          (size)
> Mode  Cnt      Score     Error   Units
> ZipFileGetInputStream.readAllBytes                                   1024
> avgt    3     13.577 ±   0.540   us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm               1024
> avgt    3   1872.673 ±   0.317    B/op
> ZipFileGetInputStream.readAllBytes:·gc.count                         1024
> avgt    3     57.000            counts
> ZipFileGetInputStream.readAllBytes:·gc.time                          1024
> avgt    3     15.000                ms
> ZipFileGetInputStream.readAllBytes                                   4096
> avgt    3     20.938 ±   0.577   us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm               4096
> avgt    3   4945.793 ±   0.493    B/op
> ZipFileGetInputStream.readAllBytes:·gc.count                         4096
> avgt    3    102.000            counts
> ZipFileGetInputStream.readAllBytes:·gc.time                          4096
> avgt    3     25.000                ms
> ZipFileGetInputStream.readAllBytes                                  16384
> avgt    3     51.348 ±   2.600   us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm              16384
> avgt    3  17238.030 ±   3.183    B/op
> ZipFileGetInputStream.readAllBytes:·gc.count                        16384
> avgt    3    144.000            counts
> ZipFileGetInputStream.readAllBytes:·gc.time                         16384
> avgt    3     33.000                ms
> ZipFileGetInputStream.readAllBytes                                  65536
> avgt    3    203.082 ±   7.046   us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm              65536
> avgt    3   9035.475 ±   7.426    B/op
> ZipFileGetInputStream.readAllBytes:·gc.count                        65536
> avgt    3     18.000            counts
> ZipFileGetInputStream.readAllBytes:·gc.time                         65536
> avgt    3      5.000                ms
> ZipFileGetInputStream.readAllBytes                                 262144
> avgt    3    801.928 ±  22.474   us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm             262144
> avgt    3   9034.192 ±   0.047    B/op
> ZipFileGetInputStream.readAllBytes:·gc.count                       262144
> avgt    3      3.000            counts
> ZipFileGetInputStream.readAllBytes:·gc.time                        262144
> avgt    3      1.000                ms
> ZipFileGetInputStream.readAllBytes                                1048576
> avgt    3   3154.747 ±  57.588   us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm            1048576
> avgt    3   9032.194 ±   0.004    B/op
> ZipFileGetInputStream.readAllBytes:·gc.count                      1048576
> avgt    3        ≈ 0            counts
> 
> = AFTER 8244659 =
> Benchmark                                                          (size)
> Mode  Cnt     Score    Error   Units
> ZipFileGetInputStream.readAllBytes                                   1024
> avgt    3    13.031 ±  0.452   us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm               1024
> avgt    3   824.311 ±  0.027    B/op
> ZipFileGetInputStream.readAllBytes:·gc.count                         1024
> avgt    3    27.000           counts
> ZipFileGetInputStream.readAllBytes:·gc.time                          1024
> avgt    3     7.000               ms
> ZipFileGetInputStream.readAllBytes                                   4096
> avgt    3    20.018 ±  0.805   us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm               4096
> avgt    3   824.289 ±  0.722    B/op
> ZipFileGetInputStream.readAllBytes:·gc.count                         4096
> avgt    3    15.000           counts
> ZipFileGetInputStream.readAllBytes:·gc.time                          4096
> avgt    3     4.000               ms
> ZipFileGetInputStream.readAllBytes                                  16384
> avgt    3    48.916 ±  1.140   us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm              16384
> avgt    3   824.263 ±  0.008    B/op
> ZipFileGetInputStream.readAllBytes:·gc.count                        16384
> avgt    3     6.000           counts
> ZipFileGetInputStream.readAllBytes:·gc.time                         16384
> avgt    3     1.000               ms
> ZipFileGetInputStream.readAllBytes                                  65536
> avgt    3   192.815 ±  4.102   us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm              65536
> avgt    3   824.012 ±  0.001    B/op
> ZipFileGetInputStream.readAllBytes:·gc.count                        65536
> avgt    3       ≈ 0           counts
> ZipFileGetInputStream.readAllBytes                                 262144
> avgt    3   755.713 ± 42.408   us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm             262144
> avgt    3   824.047 ±  0.003    B/op
> ZipFileGetInputStream.readAllBytes:·gc.count                       262144
> avgt    3       ≈ 0           counts
> ZipFileGetInputStream.readAllBytes                                1048576
> avgt    3  2989.236 ±  8.808   us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm            1048576
> avgt    3   824.184 ±  0.002    B/op
> ZipFileGetInputStream.readAllBytes:·gc.count                      1048576
> avgt    3       ≈ 0           counts

 <http://oracle.com/us/design/oracle-email-sig-198324.gif>
 <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif>
 <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037
Oracle Java Engineering 
1 Network Drive 
Burlington, MA 01803
Lance.Andersen at oracle.com <mailto:Lance.Andersen at oracle.com>





More information about the core-libs-dev mailing list