Re: Improving ZipFile.getEntry performance using Bloom filters

14 Apr 2020

      On 4/13/20 12:26 PM, Eirik Bjørsnøs wrote:
...
Hi,
While working on improvements to JarFile.getVersionedEntry, it occurred to
me that the missing lookups we are removing there may be just one example
of a more general issue.
To get a better understanding of how ZipFile.getEntry is used, I added some
PerfCounters for the number of hits and misses in calls to
ZipFile.Source.getEntryPos() and the elapsed run times for these hits and
misses.
To test this I used Spring PetClinic. I only care about startup time here,
since that's when most of the Zip lookups happens.
Here's what I found:
Hits 10776, misses: 744642
=> 1.4 percent of lookups are hits
I am a little worried adding extra overhead unconditionally into the JAR 
reader that people may not need/want.

Maybe we should take a step back and see why there are so many misses? 
Is it because you have a long classpath with many JARs on it, and you 
end up searching a lot of JAR files unnecessarily?

If this is the case, I think converting the app to modules will give you 
the speed up. A package can exist only in a single module so lookup is 
fast. You won't have misses unless you intentionally look for things 
that don't exist.

Or, you can just package the app into one giant JAR file.

Thanks
- Ioi
...
Hits runtime: 7995508 ns, miss runtime: 328360900 ns
=> 2.4 percent of run time is spent on hits.
The startup time reported by Spring Petclinic was 6395 ms. 5 percent of
this startup time was spent on missed getEntry lookups.
If we can improve the performance of lookup misses, I think there is a good
potential for overall performance wins.
One idea I had about how this could be done was to use Bloom filters. Bloom
filters provide a fast, space-efficient, probabilistic data structure which
may be used to determine that an element is definitely not a member of a
set.
I made a sloppy Bloom filter implementation and updated
ZipFile.Source.getEntryPos to check this filter first and return -1 if the
name is definitely not in the jar.
This gave the following results for Spring Petclinic startup:
Hits run time: 10964825 ns, miss runtime: 233868876 ns
We see that while hits are now 1.4x slower, the total time spent on lookups
is only 73 percent compared to that of the baseline.
This translates to a Petclinic startup improvement of 91 ms or 1.4 percent
(assuming single-threaded class loading here).
I expect that this can be improved further with a more clever use of hash
functions. Specifically it would be great to skip earlier in case of a
miss, before the String to byte[] encoding happens. Also, I haven't
analysed the false positive rate in the bloom filter, so that's probably
open for some tuning.
I also expect lookup misses to be even more common on applications with
longer class paths and more complex class loader delegation.
Cheers,
Eirik.

Re: Improving ZipFile.getEntry performance using Bloom filters

Ioi Lam