RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v9]

Lance Andersen lancea at openjdk.org
Mon Feb 13 21:25:38 UTC 2023


On Thu, 9 Feb 2023 12:07:04 GMT, Eirik Bjorsnos <duke at openjdk.org> wrote:

>> After finding a hash match, getEntryPos needs to compare the lookup name up to the encoded entry name in the CEN. This comparison is done by decoding the entry name into a String. The names can then be compared using the String API. This decoding step adds a significat cost to this method.
>> 
>> This PR suggest to update the string comparison such that in the common case where  both the lookup name and the entry name are encoded in ASCII-compatible UTF-8,  decoding can be avoided and the byte arrays can instead be compared direcly. 
>> 
>> ZipCoder is updated with a new method to compare a string with an encoded byte array range.  The default implementation decodes to string (like the current code), while the UTF-8 implementation uses JavaLangAccess.getBytesNoRepl to get the  bytes. Both methods thes uses Arrays.mismatch for comparison with or without matching trailing slashes. 
>> 
>> Additionally, this PR suggest to make the following updates to getEntryPos:
>> 
>> - The try/catch for IAE is redundant and can be safely removed. (initCEN already checks this and will throws IAE for invalid UTF-8). This seems to give a 3-4% speedup on micros)
>> - A new test InvalidBytesInEntryNameOrComment is a added to verify that initCEN does in fact reject invalid UTF-8 in CEN file names and comments. (I found no existing test coverage for this)
>> - The recursion when looking for "name/" matches is replaced with iteration. We keep track of any "name/" match and return it at the end of the search. (I feel this is easier to follow and it also gives a ~30% speedup for addSlash lookups with no regression on regular lookups)
>> 
>> (My though is that including these additional updates in this PR might reduce reviewer overhead given that it touches the exact same code. I might be wrong on this, please advise :)
>> 
>> I'm seeing a ~17% saving on the micro ZipFileGetEntry.getEntryHit (modified to use xalan.jar):
>> 
>> Baseline:
>> 
>> Benchmark                             (size)  Mode  Cnt    Score   Error  Units
>> ZipFileGetEntry.getEntryHit              512  avgt   15   74.941 ± 1.004  ns/op
>> ZipFileGetEntry.getEntryHit             1024  avgt   15   84.943 ± 1.320  ns/op
>> ZipFileGetEntry.getEntryHitUncached      512  avgt   15  120.371 ± 2.386  ns/op
>> ZipFileGetEntry.getEntryHitUncached     1024  avgt   15  126.128 ± 1.075  ns/op
>> ZipFileGetEntry.getEntryMiss             512  avgt   15   23.818 ± 0.838  ns/op
>> ZipFileGetEntry.getEntryMiss            1024  avgt   15   29.762 ± 5.998  ns/op
>> ZipFileGetEntry.getEntryMissUncached     512  avgt   15   59.405 ± 0.545  ns/op
>> ZipFileGetEntry.getEntryMissUncached    1024  avgt   15   71.840 ± 2.455  ns/op
>> ZipFileGetEntry.getEntrySlash            512  avgt   15  135.621 ± 4.341  ns/op
>> ZipFileGetEntry.getEntrySlash           1024  avgt   15  134.190 ± 2.141  ns/op
>> 
>> 
>> 
>> PR:
>> 
>> 
>> Benchmark                             (size)  Mode  Cnt    Score   Error  Units
>> ZipFileGetEntry.getEntryHit              512  avgt   15   62.267 ± 1.329  ns/op
>> ZipFileGetEntry.getEntryHit             1024  avgt   15   72.916 ± 2.428  ns/op
>> ZipFileGetEntry.getEntryHitUncached      512  avgt   15  101.630 ± 1.154  ns/op
>> ZipFileGetEntry.getEntryHitUncached     1024  avgt   15  113.161 ± 0.502  ns/op
>> ZipFileGetEntry.getEntryMiss             512  avgt   15   23.003 ± 1.191  ns/op
>> ZipFileGetEntry.getEntryMiss            1024  avgt   15   23.236 ± 1.114  ns/op
>> ZipFileGetEntry.getEntryMissUncached     512  avgt   15   56.781 ± 1.505  ns/op
>> ZipFileGetEntry.getEntryMissUncached    1024  avgt   15   67.767 ± 1.963  ns/op
>> ZipFileGetEntry.getEntrySlash            512  avgt   15   73.745 ± 2.717  ns/op
>> ZipFileGetEntry.getEntrySlash           1024  avgt   15   75.784 ± 1.051  ns/op
>> 
>> 
>> To assess the impact on startup/warmup, I made a main method class which measures the total time of calling ZipFile.getEntry for all entries in the 109 JAR file dependenies of spring-petclinic. The shows a nice improvement (time in micros):
>> 
>> 
>> Percentile Baseline Patch
>> 50 %          23155 21149
>> 75 %          23598 21454
>> 90 %          23989 21691
>> 95 %          24238 21973
>> 99 %          25270 22446
>> STDEV           792   549
>> Count           500   500
>
> Eirik Bjorsnos has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Revert accidental removal of UTF8ZipCoder.compare

Thank you Eirik for your efforts here.  I am running your latest changes through our internal Mach5 systems.

Additional comments below which are meant to add clarity for future maintainers

src/java.base/share/classes/java/util/zip/ZipCoder.java line 66:

> 64:         NO_MATCH
> 65:     }
> 66: 

Please add a comment indicating what the values mean

src/java.base/share/classes/java/util/zip/ZipCoder.java line 210:

> 208:      * is known and matches the charset of this ZipCoder.
> 209:      */
> 210:     Comparison compare(String str, byte[] b, int off, int len, boolean addSlash) {

If you could add an `@param` comments, that would be awesome 😎

src/java.base/share/classes/java/util/zip/ZipFile.java line 1665:

> 1663:             }
> 1664:             // No exact match found, will return either slashMatch or -1
> 1665:             return slashMatch;

This gets a bit confusing as we return `pos` when we have an exact match so it would be helpful to had more clarity via additional comments(it might not have been clear with the previous comments but I think if we are going to add `slashMatch` we should take the time to beef up the comments

test/jdk/java/util/zip/ZipFile/InvalidBytesInEntryNameOrComment.java line 90:

> 88:             expectedExceptionsMessageRegExp = BAD_ENTRY_NAME_OR_COMMENT)
> 89:     public void shouldRejectInvalidName() throws IOException {
> 90:         try (ZipFile zf = new ZipFile(invalidName.toFile())) {

If you could please convert to use `expectThrows` to get to validate the message name

test/jdk/java/util/zip/ZipFile/TestZipFileEncodings.java line 127:

> 125:     }
> 126: 
> 127:     @Test

Please add a comment introducing the test

test/jdk/java/util/zip/ZipFile/TestZipFileEncodings.java line 143:

> 141:         }
> 142:     }
> 143:     @Test(dataProvider = "all-charsets")

Please add a comment introducing the test

test/jdk/java/util/zip/ZipFile/TestZipFileEncodings.java line 307:

> 305:                 ze.setCrc(crc.getValue());
> 306:             }
> 307:             ze.setTime(1675862371399L);

Please add a comment indicating what the time is

-------------

PR: https://git.openjdk.org/jdk/pull/12290



More information about the security-dev mailing list