Improving ZipFile.getEntry performance using Bloom filters

Tue Apr 14 12:30:36 UTC 2020

On 2020-04-14 14:06, Eirik Bjørsnøs wrote:
>     I wonder if the overhead you remove here is mainly avoiding the need to
>     call byte[] bname = zc.getBytes(name); during lookup.
> 
> 
> If I got my numbers right, 67% of the saved time was due to less 
> executions of this method.
> 
> The remaining 33% should be explained by the bloom filter providing a 
> faster negative lookup than the hash table.
> 
>     If we refactored ZipCoder to have a zc.getHash(name) method and
>     getEntryPos to decode to a byte[] only lazily, we ought to be able to
>     get most of the speed-up for the miss case, and none (or very little)
>     added cost on hits. And no added footprint for a bloom filter.
> 
>     What do you think?
> 
> 
> Looks like a nice way to get 67% of the savings without the extra 
> footprint (which is ~10 bits per entry IIRC).

Right, not sure the extra footprint (and complexity) would be worth that 
additional saving.

> 
> The read and write sides would need to agree on their hash functions 
> though. The read side is looking on strings and the write side on byte 
> arrays. So you have the same problem I have in my current patch. How 
> would you calculate the hash value?

Yeah, the String encoder APIs doesn't lend themselves well to do hash
calculations over a String without taking the overhead of allocating
the byte[] (and ByteBuffer, and ...), and we obviously need to ensure
we use effectively the same hash function on both ends.

One trick is to exploit that the standard UTF-8 encoding is ASCII
compatible (all chars 0-127 will encode unchanged), so we can
speculatively assume the String is ASCII and calculate the hash code
directly using a charAt loop - and return -1 to mark that the String
wasn't ASCII and needs to be encoded. The false positive case when hash
actually is -1 will be handled gracefully:

http://cr.openjdk.java.net/~redestad/scratch/getEntry_ascii_fastpath.00/

Almost all entries in common libraries are likely to have names which
are ASCII-compat, so this should enable the speed-up for most cases with
little complexity.

/Claes

> 
> Eirik.