Improving ZipFile.getEntry performance using Bloom filters
Claes Redestad
claes.redestad at oracle.com
Tue Apr 14 12:30:36 UTC 2020
On 2020-04-14 14:06, Eirik Bjørsnøs wrote:
> I wonder if the overhead you remove here is mainly avoiding the need to
> call byte[] bname = zc.getBytes(name); during lookup.
>
>
> If I got my numbers right, 67% of the saved time was due to less
> executions of this method.
>
> The remaining 33% should be explained by the bloom filter providing a
> faster negative lookup than the hash table.
>
> If we refactored ZipCoder to have a zc.getHash(name) method and
> getEntryPos to decode to a byte[] only lazily, we ought to be able to
> get most of the speed-up for the miss case, and none (or very little)
> added cost on hits. And no added footprint for a bloom filter.
>
> What do you think?
>
>
> Looks like a nice way to get 67% of the savings without the extra
> footprint (which is ~10 bits per entry IIRC).
Right, not sure the extra footprint (and complexity) would be worth that
additional saving.
>
> The read and write sides would need to agree on their hash functions
> though. The read side is looking on strings and the write side on byte
> arrays. So you have the same problem I have in my current patch. How
> would you calculate the hash value?
Yeah, the String encoder APIs doesn't lend themselves well to do hash
calculations over a String without taking the overhead of allocating
the byte[] (and ByteBuffer, and ...), and we obviously need to ensure
we use effectively the same hash function on both ends.
One trick is to exploit that the standard UTF-8 encoding is ASCII
compatible (all chars 0-127 will encode unchanged), so we can
speculatively assume the String is ASCII and calculate the hash code
directly using a charAt loop - and return -1 to mark that the String
wasn't ASCII and needs to be encoded. The false positive case when hash
actually is -1 will be handled gracefully:
http://cr.openjdk.java.net/~redestad/scratch/getEntry_ascii_fastpath.00/
Almost all entries in common libraries are likely to have names which
are ASCII-compat, so this should enable the speed-up for most cases with
little complexity.
/Claes
>
> Eirik.
More information about the core-libs-dev
mailing list