Improving ZipFile.getEntry performance using Bloom filters

Mon Apr 13 19:26:46 UTC 2020

Hi,

While working on improvements to JarFile.getVersionedEntry, it occurred to
me that the missing lookups we are removing there may be just one example
of a more general issue.

To get a better understanding of how ZipFile.getEntry is used, I added some
PerfCounters for the number of hits and misses in calls to
ZipFile.Source.getEntryPos() and the elapsed run times for these hits and
misses.

To test this I used Spring PetClinic. I only care about startup time here,
since that's when most of the Zip lookups happens.

Here's what I found:

Hits 10776, misses: 744642
=> 1.4 percent of lookups are hits

Hits runtime: 7995508 ns, miss runtime: 328360900 ns
=> 2.4 percent of run time is spent on hits.

The startup time reported by Spring Petclinic was 6395 ms. 5 percent of
this startup time was spent on missed getEntry lookups.

If we can improve the performance of lookup misses, I think there is a good
potential for overall performance wins.

One idea I had about how this could be done was to use Bloom filters. Bloom
filters provide a fast, space-efficient, probabilistic data structure which
may be used to determine that an element is definitely not a member of a
set.

I made a sloppy Bloom filter implementation and updated
ZipFile.Source.getEntryPos to check this filter first and return -1 if the
name is definitely not in the jar.

This gave the following results for Spring Petclinic startup:

Hits run time: 10964825 ns, miss runtime: 233868876 ns

We see that while hits are now 1.4x slower, the total time spent on lookups
is only 73 percent compared to that of the baseline.

This translates to a Petclinic startup improvement of 91 ms or 1.4 percent
(assuming single-threaded class loading here).

I expect that this can be improved further with a more clever use of hash
functions. Specifically it would be great to skip earlier in case of a
miss, before the String to byte[] encoding happens. Also, I haven't
analysed the false positive rate in the bloom filter, so that's probably
open for some tuning.

I also expect lookup misses to be even more common on applications with
longer class paths and more complex class loader delegation.

Cheers,
Eirik.