Improving ZipFile.getEntry performance using Bloom filters

Eirik Bjørsnøs eirbjo at gmail.com
Tue Apr 14 12:06:21 UTC 2020


>
> I wonder if the overhead you remove here is mainly avoiding the need to
> call byte[] bname = zc.getBytes(name); during lookup.
>

If I got my numbers right, 67% of the saved time was due to less executions
of this method.

The remaining 33% should be explained by the bloom filter providing a
faster negative lookup than the hash table.


> If we refactored ZipCoder to have a zc.getHash(name) method and
> getEntryPos to decode to a byte[] only lazily, we ought to be able to
> get most of the speed-up for the miss case, and none (or very little)
> added cost on hits. And no added footprint for a bloom filter.
>
> What do you think?
>

Looks like a nice way to get 67% of the savings without the extra footprint
(which is ~10 bits per entry IIRC).

The read and write sides would need to agree on their hash functions
though. The read side is looking on strings and the write side on byte
arrays. So you have the same problem I have in my current patch. How would
you calculate the hash value?

Eirik.


More information about the core-libs-dev mailing list