Improving ZipFile.getEntry performance using Bloom filters
Eirik Bjørsnøs
eirbjo at gmail.com
Mon Apr 13 19:26:46 UTC 2020
Hi,
While working on improvements to JarFile.getVersionedEntry, it occurred to
me that the missing lookups we are removing there may be just one example
of a more general issue.
To get a better understanding of how ZipFile.getEntry is used, I added some
PerfCounters for the number of hits and misses in calls to
ZipFile.Source.getEntryPos() and the elapsed run times for these hits and
misses.
To test this I used Spring PetClinic. I only care about startup time here,
since that's when most of the Zip lookups happens.
Here's what I found:
Hits 10776, misses: 744642
=> 1.4 percent of lookups are hits
Hits runtime: 7995508 ns, miss runtime: 328360900 ns
=> 2.4 percent of run time is spent on hits.
The startup time reported by Spring Petclinic was 6395 ms. 5 percent of
this startup time was spent on missed getEntry lookups.
If we can improve the performance of lookup misses, I think there is a good
potential for overall performance wins.
One idea I had about how this could be done was to use Bloom filters. Bloom
filters provide a fast, space-efficient, probabilistic data structure which
may be used to determine that an element is definitely not a member of a
set.
I made a sloppy Bloom filter implementation and updated
ZipFile.Source.getEntryPos to check this filter first and return -1 if the
name is definitely not in the jar.
This gave the following results for Spring Petclinic startup:
Hits run time: 10964825 ns, miss runtime: 233868876 ns
We see that while hits are now 1.4x slower, the total time spent on lookups
is only 73 percent compared to that of the baseline.
This translates to a Petclinic startup improvement of 91 ms or 1.4 percent
(assuming single-threaded class loading here).
I expect that this can be improved further with a more clever use of hash
functions. Specifically it would be great to skip earlier in case of a
miss, before the String to byte[] encoding happens. Also, I haven't
analysed the false positive rate in the bloom filter, so that's probably
open for some tuning.
I also expect lookup misses to be even more common on applications with
longer class paths and more complex class loader delegation.
Cheers,
Eirik.
More information about the core-libs-dev
mailing list