RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v14]
Eirik Bjorsnos
duke at openjdk.org
Wed Feb 15 10:13:51 UTC 2023
> After finding a hash match, getEntryPos needs to compare the lookup name up to the encoded entry name in the CEN. This comparison is done by decoding the entry name into a String. The names can then be compared using the String API. This decoding step adds a significat cost to this method.
>
> This PR suggest to update the string comparison such that in the common case where both the lookup name and the entry name are encoded in ASCII-compatible UTF-8, decoding can be avoided and the byte arrays can instead be compared direcly.
>
> ZipCoder is updated with a new method to compare a string with an encoded byte array range. The default implementation decodes to string (like the current code), while the UTF-8 implementation uses JavaLangAccess.getBytesNoRepl to get the bytes. Both methods thes uses Arrays.mismatch for comparison with or without matching trailing slashes.
>
> Additionally, this PR suggest to make the following updates to getEntryPos:
>
> - The try/catch for IAE is redundant and can be safely removed. (initCEN already checks this and will throws IAE for invalid UTF-8). This seems to give a 3-4% speedup on micros)
> - A new test InvalidBytesInEntryNameOrComment is a added to verify that initCEN does in fact reject invalid UTF-8 in CEN file names and comments. (I found no existing test coverage for this)
> - The recursion when looking for "name/" matches is replaced with iteration. We keep track of any "name/" match and return it at the end of the search. (I feel this is easier to follow and it also gives a ~30% speedup for addSlash lookups with no regression on regular lookups)
>
> (My though is that including these additional updates in this PR might reduce reviewer overhead given that it touches the exact same code. I might be wrong on this, please advise :)
>
> I'm seeing a ~17% saving on the micro ZipFileGetEntry.getEntryHit (modified to use xalan.jar):
>
> Baseline:
>
> Benchmark (size) Mode Cnt Score Error Units
> ZipFileGetEntry.getEntryHit 512 avgt 15 74.941 ± 1.004 ns/op
> ZipFileGetEntry.getEntryHit 1024 avgt 15 84.943 ± 1.320 ns/op
> ZipFileGetEntry.getEntryHitUncached 512 avgt 15 120.371 ± 2.386 ns/op
> ZipFileGetEntry.getEntryHitUncached 1024 avgt 15 126.128 ± 1.075 ns/op
> ZipFileGetEntry.getEntryMiss 512 avgt 15 23.818 ± 0.838 ns/op
> ZipFileGetEntry.getEntryMiss 1024 avgt 15 29.762 ± 5.998 ns/op
> ZipFileGetEntry.getEntryMissUncached 512 avgt 15 59.405 ± 0.545 ns/op
> ZipFileGetEntry.getEntryMissUncached 1024 avgt 15 71.840 ± 2.455 ns/op
> ZipFileGetEntry.getEntrySlash 512 avgt 15 135.621 ± 4.341 ns/op
> ZipFileGetEntry.getEntrySlash 1024 avgt 15 134.190 ± 2.141 ns/op
>
>
>
> PR:
>
>
> Benchmark (size) Mode Cnt Score Error Units
> ZipFileGetEntry.getEntryHit 512 avgt 15 62.267 ± 1.329 ns/op
> ZipFileGetEntry.getEntryHit 1024 avgt 15 72.916 ± 2.428 ns/op
> ZipFileGetEntry.getEntryHitUncached 512 avgt 15 101.630 ± 1.154 ns/op
> ZipFileGetEntry.getEntryHitUncached 1024 avgt 15 113.161 ± 0.502 ns/op
> ZipFileGetEntry.getEntryMiss 512 avgt 15 23.003 ± 1.191 ns/op
> ZipFileGetEntry.getEntryMiss 1024 avgt 15 23.236 ± 1.114 ns/op
> ZipFileGetEntry.getEntryMissUncached 512 avgt 15 56.781 ± 1.505 ns/op
> ZipFileGetEntry.getEntryMissUncached 1024 avgt 15 67.767 ± 1.963 ns/op
> ZipFileGetEntry.getEntrySlash 512 avgt 15 73.745 ± 2.717 ns/op
> ZipFileGetEntry.getEntrySlash 1024 avgt 15 75.784 ± 1.051 ns/op
>
>
> To assess the impact on startup/warmup, I made a main method class which measures the total time of calling ZipFile.getEntry for all entries in the 109 JAR file dependenies of spring-petclinic. The shows a nice improvement (time in micros):
>
>
> Percentile Baseline Patch
> 50 % 23155 21149
> 75 % 23598 21454
> 90 % 23989 21691
> 95 % 24238 21973
> 99 % 25270 22446
> STDEV 792 549
> Count 500 500
Eirik Bjorsnos has updated the pull request incrementally with one additional commit since the last revision:
Update javadocs for ZipCoder.Comparison to first clarify that the "encoded string" refers to those found in the CEN byte array and to use the term "the encoded string" instead of the more convoluted "the string encoded in the byte array".
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/12290/files
- new: https://git.openjdk.org/jdk/pull/12290/files/49ad8d35..9b83002a
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=12290&range=13
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=12290&range=12-13
Stats: 10 lines in 1 file changed: 0 ins; 2 del; 8 mod
Patch: https://git.openjdk.org/jdk/pull/12290.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/12290/head:pull/12290
PR: https://git.openjdk.org/jdk/pull/12290
More information about the security-dev
mailing list