RFR: 8259498: Reduce overhead of MD5 and SHA digests
Claes Redestad
redestad at openjdk.java.net
Fri Jan 15 23:24:07 UTC 2021
On Fri, 15 Jan 2021 22:54:32 GMT, Valerie Peng <valeriep at openjdk.org> wrote:
>> - The MD5 intrinsics added by [JDK-8250902](https://bugs.openjdk.java.net/browse/JDK-8250902) show that the `int[] x` isn't actually needed. This also applies to the SHA intrinsics from which the MD5 intrinsic takes inspiration.
>> - Using VarHandles we can simplify the code in `ByteArrayAccess` enough to make it acceptable to use inline and replace the array in MD5 wholesale. This improves performance both in the presence and the absence of the intrinsic optimization.
>> - Doing the exact same thing in the SHA impls would be unwieldy (64+ element arrays), but allocating the array lazily gets most of the speed-up in the presence of an intrinsic while being neutral in its absence.
>>
>> Baseline:
>> Benchmark (digesterName) (length) Cnt Score Error Units
>> MessageDigests.digest MD5 16 15 2714.307 ± 21.133 ops/ms
>> MessageDigests.digest MD5 1024 15 318.087 ± 0.637 ops/ms
>> MessageDigests.digest SHA-1 16 15 1387.266 ± 40.932 ops/ms
>> MessageDigests.digest SHA-1 1024 15 109.273 ± 0.149 ops/ms
>> MessageDigests.digest SHA-256 16 15 995.566 ± 21.186 ops/ms
>> MessageDigests.digest SHA-256 1024 15 89.104 ± 0.079 ops/ms
>> MessageDigests.digest SHA-512 16 15 803.030 ± 15.722 ops/ms
>> MessageDigests.digest SHA-512 1024 15 115.611 ± 0.234 ops/ms
>> MessageDigests.getAndDigest MD5 16 15 2190.367 ± 97.037 ops/ms
>> MessageDigests.getAndDigest MD5 1024 15 302.903 ± 1.809 ops/ms
>> MessageDigests.getAndDigest SHA-1 16 15 1262.656 ± 43.751 ops/ms
>> MessageDigests.getAndDigest SHA-1 1024 15 104.889 ± 3.554 ops/ms
>> MessageDigests.getAndDigest SHA-256 16 15 914.541 ± 55.621 ops/ms
>> MessageDigests.getAndDigest SHA-256 1024 15 85.708 ± 1.394 ops/ms
>> MessageDigests.getAndDigest SHA-512 16 15 737.719 ± 53.671 ops/ms
>> MessageDigests.getAndDigest SHA-512 1024 15 112.307 ± 1.950 ops/ms
>>
>> GC:
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm MD5 16 15 312.011 ± 0.005 B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-1 16 15 584.020 ± 0.006 B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-256 16 15 544.019 ± 0.016 B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-512 16 15 1056.037 ± 0.003 B/op
>>
>> Target:
>> Benchmark (digesterName) (length) Cnt Score Error Units
>> MessageDigests.digest MD5 16 15 3134.462 ± 43.685 ops/ms
>> MessageDigests.digest MD5 1024 15 323.667 ± 0.633 ops/ms
>> MessageDigests.digest SHA-1 16 15 1418.742 ± 38.223 ops/ms
>> MessageDigests.digest SHA-1 1024 15 110.178 ± 0.788 ops/ms
>> MessageDigests.digest SHA-256 16 15 1037.949 ± 21.214 ops/ms
>> MessageDigests.digest SHA-256 1024 15 89.671 ± 0.228 ops/ms
>> MessageDigests.digest SHA-512 16 15 812.028 ± 39.489 ops/ms
>> MessageDigests.digest SHA-512 1024 15 116.738 ± 0.249 ops/ms
>> MessageDigests.getAndDigest MD5 16 15 2314.379 ± 229.294 ops/ms
>> MessageDigests.getAndDigest MD5 1024 15 307.835 ± 5.730 ops/ms
>> MessageDigests.getAndDigest SHA-1 16 15 1326.887 ± 63.263 ops/ms
>> MessageDigests.getAndDigest SHA-1 1024 15 106.611 ± 2.292 ops/ms
>> MessageDigests.getAndDigest SHA-256 16 15 961.589 ± 82.052 ops/ms
>> MessageDigests.getAndDigest SHA-256 1024 15 88.646 ± 0.194 ops/ms
>> MessageDigests.getAndDigest SHA-512 16 15 775.417 ± 56.775 ops/ms
>> MessageDigests.getAndDigest SHA-512 1024 15 112.904 ± 2.014 ops/ms
>>
>> GC:
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm MD5 16 15 232.009 ± 0.006 B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-1 16 15 584.021 ± 0.001 B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-256 16 15 272.012 ± 0.015 B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm SHA-512 16 15 400.017 ± 0.019 B/op
>>
>> For the `digest` micro, digesting small inputs is faster with all algorithms, ranging from ~1% for SHA-512 up to ~15% for MD5. The gain stems from not allocating and reading into a temporary buffer once outside of the intrinsic. SHA-1 does not see a statistically significant gain because the intrinsic is disabled by default on my HW.
>>
>> For the `getAndDigest` micro, which tests `MessageDigest.getInstance(..).digest(..)`, there are similar gains with this patch. The interesting aspect here is verifying the reduction in allocations per operation when there's an active intrinsic (again, not for SHA-1). JDK-8259065 (#1933) reduced allocations on each of these by 144B/op, which means allocation pressure for SHA-512 is down by two thirds, from 1200B/op to 400B/op, in this contrived test.
>>
>> I've verified there are no regressions in the absence of the intrinsic - which the SHA-1 numbers here help show.
>
> src/java.base/share/classes/sun/security/provider/ByteArrayAccess.java line 214:
>
>
> Why do we remove the index checking from all methods? Isn't it safer to check here in case the caller didn't? Or is such checking already implemented inside the various methods of VarHandle?
Yes, IOOBE checking is done by the VarHandle methods, whereas the Unsafe API is unsafe and needs careful precondition checking. Removing the explicit checks doesn't seem to matter for performance (though interpreted code sees some benefit from the removal).
With the current usage an IOOBE is probably not observable, but there's a test that reflects into ByteArrayAccess and verifies exceptions are thrown as expected on faulty inputs.
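As a minimal sketch of the VarHandle bounds checking described above (this is not the actual `ByteArrayAccess` code; the class and method names `VarHandleBoundsDemo`, `readIntLE` and `throwsIoobe` are hypothetical and only for illustration), a byte-array view VarHandle rejects out-of-range offsets on its own, without any explicit check by the caller:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical demo class, not part of the JDK or of this patch.
final class VarHandleBoundsDemo {
    // Views a byte[] as little-endian ints; the index is a byte offset.
    private static final VarHandle INT_LE =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    // Reads 4 bytes starting at offset. No explicit bounds check here:
    // the VarHandle itself throws IndexOutOfBoundsException when
    // offset < 0 or offset > buf.length - 4.
    static int readIntLE(byte[] buf, int offset) {
        return (int) INT_LE.get(buf, offset);
    }

    // Returns true if reading at the given offset throws an IOOBE.
    static boolean throwsIoobe(byte[] buf, int offset) {
        try {
            readIntLE(buf, offset);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] buf = {0x78, 0x56, 0x34, 0x12, 0, 0, 0, 0};
        System.out.println(Integer.toHexString(readIntLE(buf, 0))); // 12345678
        System.out.println(throwsIoobe(buf, 6));  // true: only 2 bytes remain
        System.out.println(throwsIoobe(buf, -1)); // true: negative offset
    }
}
```

With `Unsafe`, the equivalent read at offset 6 would not be checked at all, which is why the old code needed the explicit precondition checks.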
-------------
PR: https://git.openjdk.java.net/jdk/pull/1855