Improving ZipFile.getEntry performance using Bloom filters

Tue Apr 14 22:34:41 UTC 2020

On 4/13/20 12:26 PM, Eirik Bjørsnøs wrote:
> Hi,
>
> While working on improvements to JarFile.getVersionedEntry, it occurred to
> me that the missing lookups we are removing there may be just one example
> of a more general issue.
>
> To get a better understanding of how ZipFile.getEntry is used, I added some
> PerfCounters for the number of hits and misses in calls to
> ZipFile.Source.getEntryPos() and the elapsed run times for these hits and
> misses.
>
> To test this I used Spring PetClinic. I only care about startup time here,
> since that's when most of the Zip lookups happens.
>
> Here's what I found:
>
> Hits 10776, misses: 744642
> => 1.4 percent of lookups are hits
>
I am a little worried adding extra overhead unconditionally into the JAR 
reader that people may not need/want.

Maybe we should take a step back and see why there are so many misses? 
Is it because you have a long classpath with many JARs on it, and you 
end up searching a lot of JAR files unnecessarily?

If this is the case, I think converting the app to modules will give you 
the speed up. A package can exist only in a single module so lookup is 
fast. You won't have misses unless you intentionally look for things 
that don't exist.

Or, you can just package the app into one giant JAR file.

Thanks
- Ioi

>
> Hits runtime: 7995508 ns, miss runtime: 328360900 ns
> => 2.4 percent of run time is spent on hits.
>
> The startup time reported by Spring Petclinic was 6395 ms. 5 percent of
> this startup time was spent on missed getEntry lookups.
>
> If we can improve the performance of lookup misses, I think there is a good
> potential for overall performance wins.
>
> One idea I had about how this could be done was to use Bloom filters. Bloom
> filters provide a fast, space-efficient, probabilistic data structure which
> may be used to determine that an element is definitely not a member of a
> set.
>
> I made a sloppy Bloom filter implementation and updated
> ZipFile.Source.getEntryPos to check this filter first and return -1 if the
> name is definitely not in the jar.
>
> This gave the following results for Spring Petclinic startup:
>
> Hits run time: 10964825 ns, miss runtime: 233868876 ns
>
> We see that while hits are now 1.4x slower, the total time spent on lookups
> is only 73 percent compared to that of the baseline.
>
> This translates to a Petclinic startup improvement of 91 ms or 1.4 percent
> (assuming single-threaded class loading here).
>
> I expect that this can be improved further with a more clever use of hash
> functions. Specifically it would be great to skip earlier in case of a
> miss, before the String to byte[] encoding happens. Also, I haven't
> analysed the false positive rate in the bloom filter, so that's probably
> open for some tuning.
>
> I also expect lookup misses to be even more common on applications with
> longer class paths and more complex class loader delegation.
>
> Cheers,
> Eirik.