RFR(s): 8244659: Improve ZipFile.getInputStream
Lance Andersen
lance.andersen at oracle.com
Mon May 11 20:26:58 UTC 2020
Hi Volker,
Could you update your patch now that Claes’s changes are back? I think that would make it easier to review.
Thank you!
> On May 8, 2020, at 11:36 AM, Volker Simonis <volker.simonis at gmail.com> wrote:
>
> Hi,
>
> can I please have a review for the following small enhancement which
> improves the speed of reading from ZipFile.getInputStream() by ~5%:
>
> http://cr.openjdk.java.net/~simonis/webrevs/2020/8244659/
> https://bugs.openjdk.java.net/browse/JDK-8244659
>
> ZipFile.getInputStream() tries to find a good size for sizing the internal
> buffer of the underlying InflaterInputStream. This buffer is used to read
> the compressed data from the associated InputStream. Unfortunately,
> ZipFile.getInputStream() uses CENLEN (i.e. the uncompressed size of a
> ZipEntry) instead of CENSIZ (i.e. the compressed size of a ZipEntry) to
> configure the input buffer and thus unnecessarily wastes memory, because
> the corresponding, compressed input data is at most CENSIZ bytes long.
>
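To make the sizing issue concrete, here is a minimal sketch, not the actual ZipFile code. The class and method names are made up for illustration. The 8192 cap for sizes above 65536 is taken from the description further down in this mail, the 4096 fallback for unknown sizes is an assumption, and the ratio of ~6 between uncompressed and compressed size is the one reported for the benchmark data.

```java
final class InputBufferSizing {
    /** Picks an input buffer size from an entry size, following the capping rule described in this thread. */
    static int bufferSize(long entrySize) {
        if (entrySize > 65536) {
            return 8192;            // cap mentioned in the mail for large entries
        }
        if (entrySize <= 0) {
            return 4096;            // assumed fallback when the size is unknown
        }
        return (int) entrySize;
    }

    public static void main(String[] args) {
        long uncompressed = 16384;            // what CENLEN would report
        long compressed = uncompressed / 6;   // what CENSIZ would report at a ~6x ratio
        System.out.println("sized by CENLEN: " + bufferSize(uncompressed));
        System.out.println("sized by CENSIZ: " + bufferSize(compressed));
    }
}
```

The buffer only ever holds compressed bytes, so sizing it by the uncompressed length allocates roughly six times more than can be filled in this example.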
> After fixing this and doing some benchmarks, I realized that a much bigger
> problem is the continuous allocation of new, temporary input buffers for
> each new input stream. Assuming that a zip file usually has hundreds if
> not thousands of ZipEntries, I think it makes sense to cache these input
> buffers. Fortunately, ZipFile already has a built-in mechanism for such
> caching which it uses for caching the Inflaters needed for each new input
> stream. In order to cache the buffers as well, I had to add a new,
> package-private constructor to InflaterInputStream. I'm not sure if it
> makes sense to make this new constructor public, to enable other users of
> InflaterInputStream to pre-allocate the buffer. If you think so, I'd be
> happy to do that change and open a CSR for this issue.
>
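The caching idea can be sketched roughly as follows. This is only an illustration of pooling fixed-size byte arrays; the class name and methods are hypothetical and do not reflect how ZipFile or the proposed InflaterInputStream constructor are actually wired together. The 65536 size is the default mentioned in this mail.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class InputBufferCache {
    private static final int BUFFER_SIZE = 65536;
    private final Deque<byte[]> free = new ArrayDeque<>();

    /** Returns a cached buffer if one is available, otherwise allocates a new one. */
    synchronized byte[] acquire() {
        byte[] buf = free.poll();   // null when the cache is empty
        return buf != null ? buf : new byte[BUFFER_SIZE];
    }

    /** Hands a buffer back so the next stream can reuse it. */
    synchronized void release(byte[] buf) {
        if (buf != null && buf.length == BUFFER_SIZE) {
            free.push(buf);
        }
    }

    public static void main(String[] args) {
        InputBufferCache cache = new InputBufferCache();
        byte[] first = cache.acquire();
        cache.release(first);             // e.g. when the input stream is closed
        byte[] second = cache.acquire();
        System.out.println("buffer reused: " + (first == second));
    }
}
```

With a pool like this, opening many entries one after another allocates a single buffer instead of one per stream, which is where the allocation savings come from.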
> Adding a cache for input stream buffers increases the speed of reading
> ZipEntries from an InputStream by roughly 5% (see benchmark results below).
> More importantly, it also decreases the memory consumption for each call to
> ZipFile.getInputStream() which can be quite significant if many ZipEntries
> are read from a ZipFile. One visible effect of caching the input buffers is
> that the manual JTreg test java/util/zip/ZipFile/TestZipFile.java, which
> regularly failed on my desktop with an OutOfMemoryError before, now
> reliably passes (this test calls ZipFile.getInputStream() excessively).
>
> I've experimented with different buffer sizes (even with buffer sizes
> depending on the size of the compressed ZipEntries), but couldn't see any
> difference so I decided to go with a default buffer size of 65536 which
> already was the maximal buffer size in use before my change.
>
> I've also added a shortcut to Inflater which prevents us doing a native
> call down to libz's inflate() method every time we call Inflater.inflate()
> with "input == ZipUtils.defaultBuf" which is the default for every newly
> created Inflater and for Inflaters after "Inflater.reset()" has been called
> on them.
>
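The case the shortcut targets can be observed from the outside: an Inflater that has no input yet (freshly created, or just reset) cannot produce any output, so inflate() returns 0 and needsInput() is true. Below is a small sketch using only the public java.util.zip.Inflater API. The class and method names are illustrative, and it does not show the patch's internal check against ZipUtils.defaultBuf.

```java
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

final class FreshInflaterDemo {
    /** Calls inflate() once and returns the number of bytes produced. */
    static int inflateOnce(Inflater inf) {
        try {
            return inf.inflate(new byte[64]);
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Inflater inf = new Inflater();
        System.out.println("fresh: needsInput=" + inf.needsInput()
                + ", produced=" + inflateOnce(inf));
        inf.reset();
        System.out.println("reset: needsInput=" + inf.needsInput()
                + ", produced=" + inflateOnce(inf));
        inf.end();   // release native resources
    }
}
```

Since the result of such a call is known in advance, answering it without going down to native code avoids one JNI transition per new or recycled Inflater.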
> Following are some JMH benchmark results which show the time and memory used to
> read all bytes from a ZipEntry before and after this change. The 'size'
> parameter denotes the uncompressed size of the corresponding ZipEntries.
>
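For readers without the webrev at hand, the measured operation is roughly the following: open an entry with ZipFile.getInputStream() and read all of its bytes. This is a plain-Java sketch and not the JMH benchmark from the patch; the class name, the entry names, and the test data are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class ReadAllEntries {
    /** Writes a temporary zip file with the given number of entries, each of 'size' uncompressed bytes. */
    static Path writeZip(int entries, int size) {
        try {
            Path zip = Files.createTempFile("entries", ".zip");
            byte[] data = new byte[size];
            for (int i = 0; i < size; i++) {
                data[i] = (byte) (i % 16);   // repetitive data, so it compresses well
            }
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
                for (int i = 0; i < entries; i++) {
                    out.putNextEntry(new ZipEntry("entry-" + i));
                    out.write(data);
                    out.closeEntry();
                }
            }
            return zip;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads every entry through ZipFile.getInputStream() and returns the total number of bytes read. */
    static long readAll(Path zip) {
        long total = 0;
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                try (InputStream in = zf.getInputStream(entries.nextElement())) {
                    total += in.readAllBytes().length;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        Path zip = writeZip(100, 16384);
        System.out.println("bytes read: " + readAll(zip));
        zip.toFile().delete();
    }
}
```

Each getInputStream() call in the loop is a point where, before the change, a new input buffer was allocated.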
> In the "BEFORE" numbers, when looking at the "gc.alloc.rate.norm" values,
> you can see the anomaly caused by using CENLEN instead of CENSIZ in
> ZipFile.getInputStream(). I.e. getInputStream() chooses too big buffers
> because it looks at the uncompressed ZipEntry sizes which are ~6 times
> bigger than the compressed sizes. Also, the old implementation capped
> buffers bigger than 65536 to 8192 bytes.
>
> The memory savings for a call to getInputStream() are obviously the effect
> of repeatedly calling getInputStream() on the same zip file (because only in
> that case, the caching of the input buffers pays off). But as I wrote
> before, I think it is common to have more than a few entries in a zip file
> and even if not, the overhead of caching is minimal compared to the
> situation we had before the change.
>
> Thank you and best regards,
> Volker
>
> = BEFORE 8244659 =
> Benchmark                                                (size)  Mode  Cnt      Score    Error  Units
> ZipFileGetInputStream.readAllBytes                         1024  avgt    3     13.577 ±  0.540  us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm     1024  avgt    3   1872.673 ±  0.317  B/op
> ZipFileGetInputStream.readAllBytes:·gc.count               1024  avgt    3     57.000           counts
> ZipFileGetInputStream.readAllBytes:·gc.time                1024  avgt    3     15.000           ms
> ZipFileGetInputStream.readAllBytes                         4096  avgt    3     20.938 ±  0.577  us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm     4096  avgt    3   4945.793 ±  0.493  B/op
> ZipFileGetInputStream.readAllBytes:·gc.count               4096  avgt    3    102.000           counts
> ZipFileGetInputStream.readAllBytes:·gc.time                4096  avgt    3     25.000           ms
> ZipFileGetInputStream.readAllBytes                        16384  avgt    3     51.348 ±  2.600  us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm    16384  avgt    3  17238.030 ±  3.183  B/op
> ZipFileGetInputStream.readAllBytes:·gc.count              16384  avgt    3    144.000           counts
> ZipFileGetInputStream.readAllBytes:·gc.time               16384  avgt    3     33.000           ms
> ZipFileGetInputStream.readAllBytes                        65536  avgt    3    203.082 ±  7.046  us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm    65536  avgt    3   9035.475 ±  7.426  B/op
> ZipFileGetInputStream.readAllBytes:·gc.count              65536  avgt    3     18.000           counts
> ZipFileGetInputStream.readAllBytes:·gc.time               65536  avgt    3      5.000           ms
> ZipFileGetInputStream.readAllBytes                       262144  avgt    3    801.928 ± 22.474  us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm   262144  avgt    3   9034.192 ±  0.047  B/op
> ZipFileGetInputStream.readAllBytes:·gc.count             262144  avgt    3      3.000           counts
> ZipFileGetInputStream.readAllBytes:·gc.time              262144  avgt    3      1.000           ms
> ZipFileGetInputStream.readAllBytes                      1048576  avgt    3   3154.747 ± 57.588  us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm  1048576  avgt    3   9032.194 ±  0.004  B/op
> ZipFileGetInputStream.readAllBytes:·gc.count            1048576  avgt    3        ≈ 0           counts
>
> = AFTER 8244659 =
> Benchmark                                                (size)  Mode  Cnt      Score    Error  Units
> ZipFileGetInputStream.readAllBytes                         1024  avgt    3     13.031 ±  0.452  us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm     1024  avgt    3    824.311 ±  0.027  B/op
> ZipFileGetInputStream.readAllBytes:·gc.count               1024  avgt    3     27.000           counts
> ZipFileGetInputStream.readAllBytes:·gc.time                1024  avgt    3      7.000           ms
> ZipFileGetInputStream.readAllBytes                         4096  avgt    3     20.018 ±  0.805  us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm     4096  avgt    3    824.289 ±  0.722  B/op
> ZipFileGetInputStream.readAllBytes:·gc.count               4096  avgt    3     15.000           counts
> ZipFileGetInputStream.readAllBytes:·gc.time                4096  avgt    3      4.000           ms
> ZipFileGetInputStream.readAllBytes                        16384  avgt    3     48.916 ±  1.140  us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm    16384  avgt    3    824.263 ±  0.008  B/op
> ZipFileGetInputStream.readAllBytes:·gc.count              16384  avgt    3      6.000           counts
> ZipFileGetInputStream.readAllBytes:·gc.time               16384  avgt    3      1.000           ms
> ZipFileGetInputStream.readAllBytes                        65536  avgt    3    192.815 ±  4.102  us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm    65536  avgt    3    824.012 ±  0.001  B/op
> ZipFileGetInputStream.readAllBytes:·gc.count              65536  avgt    3        ≈ 0           counts
> ZipFileGetInputStream.readAllBytes                       262144  avgt    3    755.713 ± 42.408  us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm   262144  avgt    3    824.047 ±  0.003  B/op
> ZipFileGetInputStream.readAllBytes:·gc.count             262144  avgt    3        ≈ 0           counts
> ZipFileGetInputStream.readAllBytes                      1048576  avgt    3   2989.236 ±  8.808  us/op
> ZipFileGetInputStream.readAllBytes:·gc.alloc.rate.norm  1048576  avgt    3    824.184 ±  0.002  B/op
> ZipFileGetInputStream.readAllBytes:·gc.count            1048576  avgt    3        ≈ 0           counts
Lance Andersen | Principal Member of Technical Staff | +1.781.442.2037
Oracle Java Engineering
1 Network Drive
Burlington, MA 01803
Lance.Andersen at oracle.com
More information about the core-libs-dev
mailing list