RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos

Mon Feb 6 15:03:41 UTC 2023

On Sun, 5 Feb 2023 22:13:50 GMT, Claes Redestad <redestad at openjdk.org> wrote:

>> After finding a hash match, getEntryPos needs to compare the lookup name up to the encoded entry name in the CEN. This comparison is done by decoding the entry name into a String. The names can then be compared using the String API. This decoding step adds a significat cost to this method.
>> 
>> This PR suggest to update the string comparison such that in the common case where  both the lookup name and the entry name are encoded in ASCII-compatible UTF-8,  decoding can be avoided and the byte arrays can instead be compared direcly. 
>> 
>> ZipCoder is updated with a new method to compare a string with an encoded byte array range.  The default implementation decodes to string (like the current code), while the UTF-8 implementation uses JavaLangAccess.getBytesNoRepl to get the  bytes. Both methods thes uses Arrays.mismatch for comparison with or without matching trailing slashes. 
>> 
>> Additionally, this PR suggest to make the following updates to getEntryPos:
>> 
>> - The try/catch for IAE is redundant and can be safely removed. (initCEN already checks this and will throws IAE for invalid UTF-8). This seems to give a 3-4% speedup on micros)
>> - A new test InvalidBytesInEntryNameOrComment is a added to verify that initCEN does in fact reject invalid UTF-8 in CEN file names and comments. (I found no existing test coverage for this)
>> - The recursion when looking for "name/" matches is replaced with iteration. We keep track of any "name/" match and return it at the end of the search. (I feel this is easier to follow and it also gives a ~30% speedup for addSlash lookups with no regression on regular lookups)
>> 
>> (My though is that including these additional updates in this PR might reduce reviewer overhead given that it touches the exact same code. I might be wrong on this, please advise :)
>> 
>> I'm seeing a ~17% saving on the micro ZipFileGetEntry.getEntryHit (modified to use xalan.jar):
>> 
>> Baseline:
>> 
>> Benchmark                             (size)  Mode  Cnt    Score   Error  Units
>> ZipFileGetEntry.getEntryHit              512  avgt   15   74.941 ± 1.004  ns/op
>> ZipFileGetEntry.getEntryHit             1024  avgt   15   84.943 ± 1.320  ns/op
>> ZipFileGetEntry.getEntryHitUncached      512  avgt   15  120.371 ± 2.386  ns/op
>> ZipFileGetEntry.getEntryHitUncached     1024  avgt   15  126.128 ± 1.075  ns/op
>> ZipFileGetEntry.getEntryMiss             512  avgt   15   23.818 ± 0.838  ns/op
>> ZipFileGetEntry.getEntryMiss            1024  avgt   15   29.762 ± 5.998  ns/op
>> ZipFileGetEntry.getEntryMissUncached     512  avgt   15   59.405 ± 0.545  ns/op
>> ZipFileGetEntry.getEntryMissUncached    1024  avgt   15   71.840 ± 2.455  ns/op
>> ZipFileGetEntry.getEntrySlash            512  avgt   15  135.621 ± 4.341  ns/op
>> ZipFileGetEntry.getEntrySlash           1024  avgt   15  134.190 ± 2.141  ns/op
>> 
>> 
>> 
>> PR:
>> 
>> 
>> Benchmark                             (size)  Mode  Cnt    Score   Error  Units
>> ZipFileGetEntry.getEntryHit              512  avgt   15   62.267 ± 1.329  ns/op
>> ZipFileGetEntry.getEntryHit             1024  avgt   15   72.916 ± 2.428  ns/op
>> ZipFileGetEntry.getEntryHitUncached      512  avgt   15  101.630 ± 1.154  ns/op
>> ZipFileGetEntry.getEntryHitUncached     1024  avgt   15  113.161 ± 0.502  ns/op
>> ZipFileGetEntry.getEntryMiss             512  avgt   15   23.003 ± 1.191  ns/op
>> ZipFileGetEntry.getEntryMiss            1024  avgt   15   23.236 ± 1.114  ns/op
>> ZipFileGetEntry.getEntryMissUncached     512  avgt   15   56.781 ± 1.505  ns/op
>> ZipFileGetEntry.getEntryMissUncached    1024  avgt   15   67.767 ± 1.963  ns/op
>> ZipFileGetEntry.getEntrySlash            512  avgt   15   73.745 ± 2.717  ns/op
>> ZipFileGetEntry.getEntrySlash           1024  avgt   15   75.784 ± 1.051  ns/op
>> 
>> 
>> To assess the impact on startup/warmup, I made a main method class which measures the total time of calling ZipFile.getEntry for all entries in the 109 JAR file dependenies of spring-petclinic. The shows a nice improvement (time in micros):
>> 
>> 
>> Percentile Baseline Patch
>> 50 %          23155 21149
>> 75 %          23598 21454
>> 90 %          23989 21691
>> 95 %          24238 21973
>> 99 %          25270 22446
>> STDEV           792   549
>> Count           500   500
>
> src/java.base/share/classes/java/lang/System.java line 2668:
> 
>> 2666:             @Override
>> 2667:             public int mismatchUTF8(String str, byte[] b, int fromIndex, int toIndex) {
>> 2668:                 byte[] encoded = str.isLatin1() ? str.value() : str.getBytes(UTF_8.INSTANCE);
> 
> I think this is incorrect: latin-1 characters above codepoint 127 (non-ascii) would be represented by 2 bytes in UTF-8. What you want here is probably `str.isAscii() ? ...`. The ASCII check will have to look at the bytes, so will incur a minor penalty.
> 
> Good news is that you should already be able to do this with what's already exposed via `JLA.getBytesNoRepl(str, StandardCharsets.UTF_8)`, so no need for more shared secrets.

Nice, I have updated the PR such that the new shared secret is replaced with using getBytesNoRepl instead. If there is a performance difference, it seems to hide in the noise.

I had expected such a regression to be caught by existing tests, which seems not to be the case. I added TestZipFileEncodings.latin1NotAscii to adress this.

> src/java.base/share/classes/java/lang/System.java line 2671:
> 
>> 2669:                 if (false) {
>> 2670:                     // Arrays.mismatch without the range checks (~5% faster micro getEntryHit)
>> 2671:                     int aLength = encoded.length;
> 
> Part of the difference you're seeing is due to knowing that you'll be matching the entire length of the first array (`encoded, 0, encoded.length`).
> 
> As an experiment I added `Arrays.mismatch(byte[], byte[], int, int)` to mismatch the entire range of the first array argument vs the range of the second and can spot an improvement in affected micros:
> 
> Benchmark                                     (size)  Mode  Cnt   Score   Error  Units
> ArraysMismatch.Char.differentSubrangeMatches      90  avgt   10  12.165 ± 0.074  ns/op # mismatch(a, aFrom, aTo, b, bFrom, bTo)
> ArraysMismatch.Char.subrangeMatches               90  avgt   10  10.748 ± 0.006  ns/op # mismatch(a, b, bFrom, bTo)
> 
> This might be something we can solve in the JITs without having to add new methods to `java.util.Arrays` to deal as efficiently as possible with the case when we're matching against the entirety of one of the arrays.

Interesting. Would be nice to solve this in the JIT!

This disabled code got deleted in my last commit, but it seems like you have a good analysis so we can let it go now.

-------------

PR: https://git.openjdk.org/jdk/pull/12290