RFR: 8259498: Reduce overhead of MD5 and SHA digests

Fri Jan 15 23:24:07 UTC 2021

On Fri, 15 Jan 2021 22:54:32 GMT, Valerie Peng <valeriep at openjdk.org> wrote:

>> - The MD5 intrinsics added by [JDK-8250902](https://bugs.openjdk.java.net/browse/JDK-8250902) show that the `int[] x` isn't actually needed. This also applies to the SHA intrinsics from which the MD5 intrinsic takes inspiration.
>> - Using VarHandles we can simplify the code in `ByteArrayAccess` enough to make it acceptable to use inline and replace the array in MD5 wholesale. This improves performance both in the presence and the absence of the intrinsic optimization.
>> - Doing the exact same thing in the SHA impls would be unwieldy (64+ element arrays), but allocating the array lazily gets most of the speed-up in the presence of an intrinsic while being neutral in its absence.
>> 
>> Baseline:
>> Benchmark                                 (digesterName)  (length)    Cnt     Score      Error   Units
>> MessageDigests.digest                                MD5        16     15  2714.307 ±   21.133  ops/ms
>> MessageDigests.digest                                MD5      1024     15   318.087 ±    0.637  ops/ms
>> MessageDigests.digest                              SHA-1        16     15  1387.266 ±   40.932  ops/ms
>> MessageDigests.digest                              SHA-1      1024     15   109.273 ±    0.149  ops/ms
>> MessageDigests.digest                            SHA-256        16     15   995.566 ±   21.186  ops/ms
>> MessageDigests.digest                            SHA-256      1024     15    89.104 ±    0.079  ops/ms
>> MessageDigests.digest                            SHA-512        16     15   803.030 ±   15.722  ops/ms
>> MessageDigests.digest                            SHA-512      1024     15   115.611 ±    0.234  ops/ms
>> MessageDigests.getAndDigest                          MD5        16     15  2190.367 ±   97.037  ops/ms
>> MessageDigests.getAndDigest                          MD5      1024     15   302.903 ±    1.809  ops/ms
>> MessageDigests.getAndDigest                        SHA-1        16     15  1262.656 ±   43.751  ops/ms
>> MessageDigests.getAndDigest                        SHA-1      1024     15   104.889 ±    3.554  ops/ms
>> MessageDigests.getAndDigest                      SHA-256        16     15   914.541 ±   55.621  ops/ms
>> MessageDigests.getAndDigest                      SHA-256      1024     15    85.708 ±    1.394  ops/ms
>> MessageDigests.getAndDigest                      SHA-512        16     15   737.719 ±   53.671  ops/ms
>> MessageDigests.getAndDigest                      SHA-512      1024     15   112.307 ±    1.950  ops/ms
>> 
>> GC:
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm      MD5        16     15   312.011 ±    0.005    B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm    SHA-1        16     15   584.020 ±    0.006    B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-256        16     15   544.019 ±    0.016    B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-512        16     15  1056.037 ±    0.003    B/op
>> 
>> Target:
>> Benchmark                                 (digesterName)  (length)    Cnt     Score      Error   Units
>> MessageDigests.digest                                MD5        16     15  3134.462 ±   43.685  ops/ms
>> MessageDigests.digest                                MD5      1024     15   323.667 ±    0.633  ops/ms
>> MessageDigests.digest                              SHA-1        16     15  1418.742 ±   38.223  ops/ms
>> MessageDigests.digest                              SHA-1      1024     15   110.178 ±    0.788  ops/ms
>> MessageDigests.digest                            SHA-256        16     15  1037.949 ±   21.214  ops/ms
>> MessageDigests.digest                            SHA-256      1024     15    89.671 ±    0.228  ops/ms
>> MessageDigests.digest                            SHA-512        16     15   812.028 ±   39.489  ops/ms
>> MessageDigests.digest                            SHA-512      1024     15   116.738 ±    0.249  ops/ms
>> MessageDigests.getAndDigest                          MD5        16     15  2314.379 ±  229.294  ops/ms
>> MessageDigests.getAndDigest                          MD5      1024     15   307.835 ±    5.730  ops/ms
>> MessageDigests.getAndDigest                        SHA-1        16     15  1326.887 ±   63.263  ops/ms
>> MessageDigests.getAndDigest                        SHA-1      1024     15   106.611 ±    2.292  ops/ms
>> MessageDigests.getAndDigest                      SHA-256        16     15   961.589 ±   82.052  ops/ms
>> MessageDigests.getAndDigest                      SHA-256      1024     15    88.646 ±    0.194  ops/ms
>> MessageDigests.getAndDigest                      SHA-512        16     15   775.417 ±   56.775  ops/ms
>> MessageDigests.getAndDigest                      SHA-512      1024     15   112.904 ±    2.014  ops/ms
>> 
>> GC:
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm      MD5        16     15   232.009 ±    0.006    B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm    SHA-1        16     15   584.021 ±    0.001    B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-256        16     15   272.012 ±    0.015    B/op
>> MessageDigests.getAndDigest:·gc.alloc.rate.norm  SHA-512        16     15   400.017 ±    0.019    B/op
>> 
>> For the `digest` micro, digesting small inputs is faster with all algorithms, ranging from ~1% for SHA-512 up to ~15% for MD5. The gain stems from not allocating and reading into a temporary buffer once outside of the intrinsic. SHA-1 does not see a statistically significant gain because the intrinsic is disabled by default on my HW.
>> 
>> For the `getAndDigest` micro, which tests `MessageDigest.getInstance(..).digest(..)`, there are similar gains with this patch. The interesting aspect here is verifying the reduction in allocations per operation when there's an active intrinsic (again, not for SHA-1). JDK-8259065 (#1933) reduced allocations on each of these by 144B/op, which means allocation pressure for SHA-512 is down two thirds, from 1200B/op to 400B/op, in this contrived test.
>> 
>> I've verified there are no regressions in the absence of the intrinsic - which the SHA-1 numbers here help show.
>
> src/java.base/share/classes/sun/security/provider/ByteArrayAccess.java line 214:
> 
> 
> Why do we remove the index checking from all methods? Isn't it safer to check here in case the caller didn't? Or is such checking already implemented inside the various methods of VarHandle?

Yes, IOOBE checking is done by the VarHandle methods, whereas the Unsafe API performs no such checks and needs careful precondition checking by the caller. Removing the explicit checks doesn't seem to matter for performance (although interpreted code sees some benefit from the removal).
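
To illustrate the bounds checking mentioned above, here is a minimal sketch using a byte-array view VarHandle. This is not the actual `ByteArrayAccess` code; the class `VarHandleBoundsDemo` and its methods `readIntLE` and `throwsIOOBE` are made-up names for this example only. It assumes nothing beyond the public `MethodHandles.byteArrayViewVarHandle` API.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical demo class, not part of the JDK or of the patch under review.
final class VarHandleBoundsDemo {
    // Views a byte[] as little-endian ints (the byte order MD5 uses).
    private static final VarHandle LE_INT =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    // Reads 4 bytes starting at the given byte offset; no explicit range check here.
    static int readIntLE(byte[] buf, int offset) {
        return (int) LE_INT.get(buf, offset);
    }

    // Returns true if the VarHandle itself rejects the offset.
    static boolean throwsIOOBE(byte[] buf, int offset) {
        try {
            readIntLE(buf, offset);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] buf = {0x01, 0x02, 0x03, 0x04, 0x05};
        System.out.println(Integer.toHexString(readIntLE(buf, 0)));
        System.out.println(throwsIOOBE(buf, 2));   // only 3 bytes left at offset 2
        System.out.println(throwsIOOBE(buf, -1));  // negative offset
    }
}
```

In this sketch the out-of-range reads fail with an `IndexOutOfBoundsException` raised by the VarHandle access itself, so a caller-side check would be redundant; an equivalent raw `Unsafe` read would have no such safety net.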

With the current usage an IOOBE is probably not observable, but there's a test that reflects into ByteArrayAccess and verifies exceptions are thrown as expected on faulty inputs.

-------------

PR: https://git.openjdk.java.net/jdk/pull/1855