Improving ZipFile.getEntry performance using Bloom filters
Hi, While working on improvements to JarFile.getVersionedEntry, it occurred to me that the missing lookups we are removing there may be just one example of a more general issue. To get a better understanding of how ZipFile.getEntry is used, I added some PerfCounters for the number of hits and misses in calls to ZipFile.Source.getEntryPos() and the elapsed run times for these hits and misses. To test this I used Spring PetClinic. I only care about startup time here, since that's when most of the Zip lookups happens. Here's what I found: Hits 10776, misses: 744642 => 1.4 percent of lookups are hits Hits runtime: 7995508 ns, miss runtime: 328360900 ns => 2.4 percent of run time is spent on hits. The startup time reported by Spring Petclinic was 6395 ms. 5 percent of this startup time was spent on missed getEntry lookups. If we can improve the performance of lookup misses, I think there is a good potential for overall performance wins. One idea I had about how this could be done was to use Bloom filters. Bloom filters provide a fast, space-efficient, probabilistic data structure which may be used to determine that an element is definitely not a member of a set. I made a sloppy Bloom filter implementation and updated ZipFile.Source.getEntryPos to check this filter first and return -1 if the name is definitely not in the jar. This gave the following results for Spring Petclinic startup: Hits run time: 10964825 ns, miss runtime: 233868876 ns We see that while hits are now 1.4x slower, the total time spent on lookups is only 73 percent compared to that of the baseline. This translates to a Petclinic startup improvement of 91 ms or 1.4 percent (assuming single-threaded class loading here). I expect that this can be improved further with a more clever use of hash functions. Specifically it would be great to skip earlier in case of a miss, before the String to byte[] encoding happens. Also, I haven't analysed the false positive rate in the bloom filter, so that's probably open for some tuning. I also expect lookup misses to be even more common on applications with longer class paths and more complex class loader delegation. Cheers, Eirik.
On Mon, Apr 13, 2020 at 9:26 PM Eirik Bjørsnøs <eirbjo@gmail.com> wrote: This translates to a Petclinic startup improvement of 91 ms or 1.4 percent
(assuming single-threaded class loading here).
I expect that this can be improved further with a more clever use of hash functions. Specifically it would be great to skip earlier in case of a miss, before the String to byte[] encoding happens.
By using the String name hash before encoding it to bytes, savings increased to 270 ms, which represents a 4.6 percent performance improvement on Spring Petclinic startup. Eirik.
Here's a patch implementing the Bloom filter idea for JarFile.getEntry(). The patch yields a 63% time saving in ZipFile.getEntry() calls during Spring Petclinic startup, reducing total application startup time by ~5%. (Hits are 7% slower, misses 65% faster) The current patch decodes the name to a String and as a temporary hack assumes UTF-8. This needs to be improved. Would be great if there was a way to calculate a String.hashCode() equivalent looking at a byte array and its encoding without actually decoding it. diff --git a/src/java.base/share/classes/java/util/zip/ZipFile.java b/src/java.base/share/classes/java/util/zip/ZipFile.java index 274e8788d1..754b1fc868 100644 --- a/src/java.base/share/classes/java/util/zip/ZipFile.java +++ b/src/java.base/share/classes/java/util/zip/ZipFile.java @@ -40,6 +40,7 @@ import java.nio.file.Files; import java.util.ArrayDeque; import java.util.ArrayList; import java.util.Arrays; +import java.util.BitSet; import java.util.Collections; import java.util.Deque; import java.util.Enumeration; @@ -336,6 +337,9 @@ public class ZipFile implements ZipConstants, Closeable { Objects.requireNonNull(name, "name"); synchronized (this) { ensureOpen(); + if(!this.res.zsrc.bloomFilter.mightContain(name.hashCode())) { + return null; + } byte[] bname = zc.getBytes(name); int pos = res.zsrc.getEntryPos(bname, true); if (pos != -1) { @@ -1118,6 +1122,8 @@ public class ZipFile implements ZipConstants, Closeable { // referred by their index of their positions in the {@code entries}. // private int[] entries; // array of hashed cen entry + private BloomFilter bloomFilter; + private int addEntry(int index, int hash, int next, int pos) { entries[index++] = hash; entries[index++] = next; @@ -1422,6 +1428,7 @@ public class ZipFile implements ZipConstants, Closeable { int hash = 0; int next = -1; + bloomFilter = BloomFilter.create(total, 0.01); // list for all meta entries ArrayList<Integer> metanamesList = null; @@ -1430,6 +1437,10 @@ public class ZipFile implements ZipConstants, Closeable { int hsh = 0; int pos = 0; int limit = cen.length - ENDHDR; + + // TODO: Use actual charset or get hash without decoding + final ZipCoder zc = ZipCoder.get(UTF_8.INSTANCE); + while (pos + CENHDR <= limit) { if (i >= total) { // This will only happen if the zip file has an incorrect @@ -1452,6 +1463,8 @@ public class ZipFile implements ZipConstants, Closeable { zerror("invalid CEN header (bad header size)"); // Record the CEN offset and the name hash in our hash cell. hash = hashN(cen, pos + CENHDR, nlen); + // TODO: Get String hash without decoding name? + bloomFilter.add(getEntryName(cen, pos, zc).hashCode()); hsh = (hash & 0x7fffffff) % tablelen; next = table[hsh]; table[hsh] = idx; @@ -1478,6 +1491,16 @@ public class ZipFile implements ZipConstants, Closeable { } } + // Used to produce the String.hashCode() of the entry name + private String getEntryName(byte[] cen, int pos, ZipCoder zc) { + int nlen = CENNAM(cen, pos); + if (!zc.isUTF8() && (CENFLG(cen, pos) & USE_UTF8) != 0) { + return zc.toStringUTF8(cen, pos + CENHDR, nlen); + } else { + return zc.toString(cen, pos + CENHDR, nlen); + } + } + private static void zerror(String msg) throws ZipException { throw new ZipException(msg); } @@ -1572,4 +1595,53 @@ public class ZipFile implements ZipConstants, Closeable { return count; } } + + /** + * See "Less Hashing, Same Performance: Building a Better Bloom Filter" + */ + private static class BloomFilter { + + private final BitSet bits; + private final int numBits; + private final int numHashFunctions; + + private BloomFilter(int numBits, int numHashFunctions) { + this.bits = new BitSet(numBits); + this.numBits = numBits; + this.numHashFunctions = numHashFunctions; + } + + static BloomFilter create(int n, double p) { + // See https://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives + int numBits = (int) (-n * Math.log(Math.max(p, Double.MIN_VALUE)) / (Math.log(2) * Math.log(2))); + int numHashFunctions = Math.max(1, (int) Math.round((double) numBits / n * Math.log(2))); + return new BloomFilter(numBits, numHashFunctions); + } + + void add(int inputHash) { + for (int i = 1; i <= numHashFunctions; i++) { + int hash = inputHash + i * inputHash; + + if (hash < 0) + hash = ~hash; + + bits.set(hash % numBits); + } + } + + boolean mightContain(int inputHash) { + + for (int i = 1; i <= numHashFunctions; i++) { + int hash = inputHash + i * inputHash; + + if (hash < 0) + hash = ~hash; + + if(!bits.get(hash % numBits)) { + return false; + } + } + return true; + } + } } On Mon, Apr 13, 2020 at 10:57 PM Eirik Bjørsnøs <eirbjo@gmail.com> wrote:
On Mon, Apr 13, 2020 at 9:26 PM Eirik Bjørsnøs <eirbjo@gmail.com> wrote:
This translates to a Petclinic startup improvement of 91 ms or 1.4 percent
(assuming single-threaded class loading here).
I expect that this can be improved further with a more clever use of hash functions. Specifically it would be great to skip earlier in case of a miss, before the String to byte[] encoding happens.
By using the String name hash before encoding it to bytes, savings increased to 270 ms, which represents a 4.6 percent performance improvement on Spring Petclinic startup.
Eirik.
Hi Eirik, interesting idea and results. I wonder if the overhead you remove here is mainly avoiding the need to call byte[] bname = zc.getBytes(name); during lookup. If we refactored ZipCoder to have a zc.getHash(name) method and getEntryPos to decode to a byte[] only lazily, we ought to be able to get most of the speed-up for the miss case, and none (or very little) added cost on hits. And no added footprint for a bloom filter. int pos = res.zsrc.getEntryPos(zc, name, true); ... getEntryPos(ZipCoder zc, String name, boolean addSlash) { if (total == 0) { return -1; } int hash = zc.getHash(name); int idx = table[(hsh & 0x7fffffff) % tablelen]; byte[] bname = null; ... while (idx != ZIP_ENDCHAIN) { if (getEntryHash(idx) == hsh) { if (bname == null) { bname = zc.getBytes(name); } if (bname.length == ... What do you think? Thanks! /Claes On 2020-04-14 12:58, Eirik Bjørsnøs wrote:
Here's a patch implementing the Bloom filter idea for JarFile.getEntry().
The patch yields a 63% time saving in ZipFile.getEntry() calls during Spring Petclinic startup, reducing total application startup time by ~5%. (Hits are 7% slower, misses 65% faster)
The current patch decodes the name to a String and as a temporary hack assumes UTF-8. This needs to be improved.
Would be great if there was a way to calculate a String.hashCode() equivalent looking at a byte array and its encoding without actually decoding it.
diff --git a/src/java.base/share/classes/java/util/zip/ZipFile.java b/src/java.base/share/classes/java/util/zip/ZipFile.java index 274e8788d1..754b1fc868 100644 --- a/src/java.base/share/classes/java/util/zip/ZipFile.java +++ b/src/java.base/share/classes/java/util/zip/ZipFile.java @@ -40,6 +40,7 @@ import java.nio.file.Files; import java.util.ArrayDeque; import java.util.ArrayList; import java.util.Arrays; +import java.util.BitSet; import java.util.Collections; import java.util.Deque; import java.util.Enumeration; @@ -336,6 +337,9 @@ public class ZipFile implements ZipConstants, Closeable { Objects.requireNonNull(name, "name"); synchronized (this) { ensureOpen(); + if(!this.res.zsrc.bloomFilter.mightContain(name.hashCode())) { + return null; + } byte[] bname = zc.getBytes(name); int pos = res.zsrc.getEntryPos(bname, true); if (pos != -1) { @@ -1118,6 +1122,8 @@ public class ZipFile implements ZipConstants, Closeable { // referred by their index of their positions in the {@code entries}. // private int[] entries; // array of hashed cen entry + private BloomFilter bloomFilter; + private int addEntry(int index, int hash, int next, int pos) { entries[index++] = hash; entries[index++] = next; @@ -1422,6 +1428,7 @@ public class ZipFile implements ZipConstants, Closeable { int hash = 0; int next = -1;
+ bloomFilter = BloomFilter.create(total, 0.01); // list for all meta entries ArrayList<Integer> metanamesList = null;
@@ -1430,6 +1437,10 @@ public class ZipFile implements ZipConstants, Closeable { int hsh = 0; int pos = 0; int limit = cen.length - ENDHDR; + + // TODO: Use actual charset or get hash without decoding + final ZipCoder zc = ZipCoder.get(UTF_8.INSTANCE); + while (pos + CENHDR <= limit) { if (i >= total) { // This will only happen if the zip file has an incorrect @@ -1452,6 +1463,8 @@ public class ZipFile implements ZipConstants, Closeable { zerror("invalid CEN header (bad header size)"); // Record the CEN offset and the name hash in our hash cell. hash = hashN(cen, pos + CENHDR, nlen); + // TODO: Get String hash without decoding name? + bloomFilter.add(getEntryName(cen, pos, zc).hashCode()); hsh = (hash & 0x7fffffff) % tablelen; next = table[hsh]; table[hsh] = idx; @@ -1478,6 +1491,16 @@ public class ZipFile implements ZipConstants, Closeable { } }
+ // Used to produce the String.hashCode() of the entry name + private String getEntryName(byte[] cen, int pos, ZipCoder zc) { + int nlen = CENNAM(cen, pos); + if (!zc.isUTF8() && (CENFLG(cen, pos) & USE_UTF8) != 0) { + return zc.toStringUTF8(cen, pos + CENHDR, nlen); + } else { + return zc.toString(cen, pos + CENHDR, nlen); + } + } + private static void zerror(String msg) throws ZipException { throw new ZipException(msg); } @@ -1572,4 +1595,53 @@ public class ZipFile implements ZipConstants, Closeable { return count; } } + + /** + * See "Less Hashing, Same Performance: Building a Better Bloom Filter" + */ + private static class BloomFilter { + + private final BitSet bits; + private final int numBits; + private final int numHashFunctions; + + private BloomFilter(int numBits, int numHashFunctions) { + this.bits = new BitSet(numBits); + this.numBits = numBits; + this.numHashFunctions = numHashFunctions; + } + + static BloomFilter create(int n, double p) { + // See https://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives + int numBits = (int) (-n * Math.log(Math.max(p, Double.MIN_VALUE)) / (Math.log(2) * Math.log(2))); + int numHashFunctions = Math.max(1, (int) Math.round((double) numBits / n * Math.log(2))); + return new BloomFilter(numBits, numHashFunctions); + } + + void add(int inputHash) { + for (int i = 1; i <= numHashFunctions; i++) { + int hash = inputHash + i * inputHash; + + if (hash < 0) + hash = ~hash; + + bits.set(hash % numBits); + } + } + + boolean mightContain(int inputHash) { + + for (int i = 1; i <= numHashFunctions; i++) { + int hash = inputHash + i * inputHash; + + if (hash < 0) + hash = ~hash; + + if(!bits.get(hash % numBits)) { + return false; + } + } + return true; + } + } }
On Mon, Apr 13, 2020 at 10:57 PM Eirik Bjørsnøs <eirbjo@gmail.com> wrote:
On Mon, Apr 13, 2020 at 9:26 PM Eirik Bjørsnøs <eirbjo@gmail.com> wrote:
This translates to a Petclinic startup improvement of 91 ms or 1.4 percent
(assuming single-threaded class loading here).
I expect that this can be improved further with a more clever use of hash functions. Specifically it would be great to skip earlier in case of a miss, before the String to byte[] encoding happens.
By using the String name hash before encoding it to bytes, savings increased to 270 ms, which represents a 4.6 percent performance improvement on Spring Petclinic startup.
Eirik.
I wonder if the overhead you remove here is mainly avoiding the need to call byte[] bname = zc.getBytes(name); during lookup.
If I got my numbers right, 67% of the saved time was due to less executions of this method. The remaining 33% should be explained by the bloom filter providing a faster negative lookup than the hash table.
If we refactored ZipCoder to have a zc.getHash(name) method and getEntryPos to decode to a byte[] only lazily, we ought to be able to get most of the speed-up for the miss case, and none (or very little) added cost on hits. And no added footprint for a bloom filter.
What do you think?
Looks like a nice way to get 67% of the savings without the extra footprint (which is ~10 bits per entry IIRC). The read and write sides would need to agree on their hash functions though. The read side is looking on strings and the write side on byte arrays. So you have the same problem I have in my current patch. How would you calculate the hash value? Eirik.
On 2020-04-14 14:06, Eirik Bjørsnøs wrote:
I wonder if the overhead you remove here is mainly avoiding the need to call byte[] bname = zc.getBytes(name); during lookup.
If I got my numbers right, 67% of the saved time was due to less executions of this method.
The remaining 33% should be explained by the bloom filter providing a faster negative lookup than the hash table.
If we refactored ZipCoder to have a zc.getHash(name) method and getEntryPos to decode to a byte[] only lazily, we ought to be able to get most of the speed-up for the miss case, and none (or very little) added cost on hits. And no added footprint for a bloom filter.
What do you think?
Looks like a nice way to get 67% of the savings without the extra footprint (which is ~10 bits per entry IIRC).
Right, not sure the extra footprint (and complexity) would be worth that additional saving.
The read and write sides would need to agree on their hash functions though. The read side is looking on strings and the write side on byte arrays. So you have the same problem I have in my current patch. How would you calculate the hash value?
Yeah, the String encoder APIs doesn't lend themselves well to do hash calculations over a String without taking the overhead of allocating the byte[] (and ByteBuffer, and ...), and we obviously need to ensure we use effectively the same hash function on both ends. One trick is to exploit that the standard UTF-8 encoding is ASCII compatible (all chars 0-127 will encode unchanged), so we can speculatively assume the String is ASCII and calculate the hash code directly using a charAt loop - and return -1 to mark that the String wasn't ASCII and needs to be encoded. The false positive case when hash actually is -1 will be handled gracefully: http://cr.openjdk.java.net/~redestad/scratch/getEntry_ascii_fastpath.00/ Almost all entries in common libraries are likely to have names which are ASCII-compat, so this should enable the speed-up for most cases with little complexity. /Claes
Eirik.
On Tue, Apr 14, 2020 at 2:06 PM Eirik Bjørsnøs <eirbjo@gmail.com> wrote:
The remaining 33% should be explained by the bloom filter providing a faster negative lookup than the hash table.
This made me wonder if a hash map is the optimal data structure for this kind of read-only lookups. There's good amount of pointer chasing and scanning going on and it's probably not very cache friendly. My intuition is that an array sorted by lookup key plus a binary search for lookups could be faster. Eirik.
On 4/13/20 12:26 PM, Eirik Bjørsnøs wrote:
Hi,
While working on improvements to JarFile.getVersionedEntry, it occurred to me that the missing lookups we are removing there may be just one example of a more general issue.
To get a better understanding of how ZipFile.getEntry is used, I added some PerfCounters for the number of hits and misses in calls to ZipFile.Source.getEntryPos() and the elapsed run times for these hits and misses.
To test this I used Spring PetClinic. I only care about startup time here, since that's when most of the Zip lookups happens.
Here's what I found:
Hits 10776, misses: 744642 => 1.4 percent of lookups are hits
I am a little worried adding extra overhead unconditionally into the JAR reader that people may not need/want. Maybe we should take a step back and see why there are so many misses? Is it because you have a long classpath with many JARs on it, and you end up searching a lot of JAR files unnecessarily? If this is the case, I think converting the app to modules will give you the speed up. A package can exist only in a single module so lookup is fast. You won't have misses unless you intentionally look for things that don't exist. Or, you can just package the app into one giant JAR file. Thanks - Ioi
Hits runtime: 7995508 ns, miss runtime: 328360900 ns => 2.4 percent of run time is spent on hits.
The startup time reported by Spring Petclinic was 6395 ms. 5 percent of this startup time was spent on missed getEntry lookups.
If we can improve the performance of lookup misses, I think there is a good potential for overall performance wins.
One idea I had about how this could be done was to use Bloom filters. Bloom filters provide a fast, space-efficient, probabilistic data structure which may be used to determine that an element is definitely not a member of a set.
I made a sloppy Bloom filter implementation and updated ZipFile.Source.getEntryPos to check this filter first and return -1 if the name is definitely not in the jar.
This gave the following results for Spring Petclinic startup:
Hits run time: 10964825 ns, miss runtime: 233868876 ns
We see that while hits are now 1.4x slower, the total time spent on lookups is only 73 percent compared to that of the baseline.
This translates to a Petclinic startup improvement of 91 ms or 1.4 percent (assuming single-threaded class loading here).
I expect that this can be improved further with a more clever use of hash functions. Specifically it would be great to skip earlier in case of a miss, before the String to byte[] encoding happens. Also, I haven't analysed the false positive rate in the bloom filter, so that's probably open for some tuning.
I also expect lookup misses to be even more common on applications with longer class paths and more complex class loader delegation.
Cheers, Eirik.
On 2020-04-15 00:34, Ioi Lam wrote:
I am a little worried adding extra overhead unconditionally into the JAR reader that people may not need/want.
Maybe we should take a step back and see why there are so many misses? Is it because you have a long classpath with many JARs on it, and you end up searching a lot of JAR files unnecessarily?
If this is the case, I think converting the app to modules will give you the speed up. A package can exist only in a single module so lookup is fast. You won't have misses unless you intentionally look for things that don't exist.
Or, you can just package the app into one giant JAR file.
Yes, I think the constraint on any optimizations here is that it shouldn't regress anything we can think of. Especially not already- optimized deployments - be they using modularized apps, fat jars or AppCDS. But I think we should improve the lowest common denominator OOTB experience when we can. And after some back and forth with Eirik off- list I think there are improvements we can do in this area which will be wins for many and a regression for no-one (at least not in any detectable way). A bloom filter might not be it, since it would add noticeable overhead for fat jar deployments. So... stay tuned..? :-) Thanks! /Claes
On 14/04/2020 23:34, Ioi Lam wrote:
I am a little worried adding extra overhead unconditionally into the JAR reader that people may not need/want.
A reasonable worry. We should always try to avoid fixes that benefit a majority if they 'punish' a minority . . .
Maybe we should take a step back and see why there are so many misses? Is it because you have a long classpath with many JARs on it, and you end up searching a lot of JAR files unnecessarily?
Well, there are a fair few as can be identified simply by Googling the pom https://github.com/spring-projects/spring-petclinic/blob/master/pom.xml and reading the dependencies section. n.b. that only shows top-level dependencies but not recursive ones. Unfortunately, this is going to be the reality for a many existing and new apps. Most production Java apps are built from many library jars. Java is the biggest 'divide and conquer' programming eco-system we have ever seen in the history of computing. That's a direct corollary of it being the biggest eco-system we have ever seen -- scale /necessitates/ divide and conquer.
If this is the case, I think converting the app to modules will give you the speed up. A package can exist only in a single module so lookup is fast. You won't have misses unless you intentionally look for things that don't exist.
Well, yes but that also is just not going to fly for the majority of Java developers for the reason given above. Most app developers are not in a position to modularize their apps because the libraries they depend on are not modularized, because the libraries /they/ depend on are not modularized, because the libraries /they/ depend on are not modularized ... and so on. It's rarely going to be one group or organization with one fixed goal that would need to schedule and implement such a change. Now, you may lament that situation (or not) but it /is/ a brute fact and is going to remain the status quo for a very long time. An eco-system of the size of Java is like an ocean-liner. Which means the above advice is going to whistle over the heads of many Java developers.
Or, you can just package the app into one giant JAR file. Again, that completely misses the reality of most developer's circumstances.
Now, I hope I don't come across like I am simply being negative here. I have posted this reply because it's critically important that we, the OpenJDK project devs, understand and keep in mind how most app developers use (are able to use) Java. Suggestions that bypass the realities and limitations of that usage say to me that we are at risk of not meeting those users' needs. regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill
Unfortunately, this is going to be the reality for a many existing and new apps. Most production Java apps are built from many library jars. Java is the biggest 'divide and conquer' programming eco-system we have ever seen in the history of computing.
Andrew, Thanks for providing some context! This resonated well with my 15 years experience Java enterprise development. Just to clarify: My goal with this effort is not to speed up Spring Petclinic. It is to improve life for the larger Java ecosystem as described by Andrew. So while advice on how to speed up Spring Petclinic is nice, it does kind of miss the point. Cheers, Eirik.
participants (4)
-
Andrew Dinn
-
Claes Redestad
-
Eirik Bjørsnøs
-
Ioi Lam