RFR: 8259498: Reduce overhead of MD5 and SHA digests

Fri Jan 8 22:41:06 UTC 2021

On Thu, 7 Jan 2021 14:45:03 GMT, Claes Redestad <redestad at openjdk.org> wrote:

>> I've identified a number of optimizations to the plumbing behind `MessageDigest.getInstance(..)` over in #1933 that remove 80-90% of the throughput overhead and all of the allocation overhead compared to the `clone()` approach prototyped here. The remaining 20ns/op overhead might not be enough of a concern to warrant a point fix in `UUID::nameUUIDFromBytes`.
>
> Removing the UUID clone cache and running the microbenchmark along with the changes in #1933:
> 
> Benchmark                                                  (size)   Mode  Cnt    Score    Error   Units
> UUIDBench.fromType3Bytes                                    20000  thrpt   12    2.182 ±  0.090  ops/us
> UUIDBench.fromType3Bytes:·gc.alloc.rate                     20000  thrpt   12  439.020 ± 18.241  MB/sec
> UUIDBench.fromType3Bytes:·gc.alloc.rate.norm                20000  thrpt   12  264.022 ±  0.003    B/op
> 
> The goal now is to simplify the digest code and compare alternatives.
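
For context, the operation exercised by `UUIDBench.fromType3Bytes` is roughly an MD5 digest followed by setting the version and variant bits. The sketch below is an illustration written against the public API only, not the JDK's internal implementation; the class and method names (`Type3UuidSketch`, `type3`) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

final class Type3UuidSketch {
    // Hypothetical helper: builds a name-based (version 3) UUID from raw bytes.
    static UUID type3(byte[] name) {
        MessageDigest md;
        try {
            // This lookup is the per-call cost discussed in the thread.
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
        byte[] h = md.digest(name);
        h[6] = (byte) ((h[6] & 0x0f) | 0x30); // version 3
        h[8] = (byte) ((h[8] & 0x3f) | 0x80); // IETF variant
        ByteBuffer bb = ByteBuffer.wrap(h);   // big-endian by default
        return new UUID(bb.getLong(), bb.getLong());
    }

    public static void main(String[] args) {
        byte[] name = "example".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        // Should agree with the library method for the same input.
        System.out.println(type3(name).equals(UUID.nameUUIDFromBytes(name)));
    }
}
```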

I've run various tests and concluded that the `VarHandle`ized code matches or improves upon the `Unsafe`-riddled code in `ByteArrayAccess`. I then went ahead and consolidated `ByteArrayAccess` to use a similar code pattern throughout for consistency, which amounts to a good cleanup.
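
To illustrate the kind of pattern meant here, a minimal sketch of decoding little-endian ints from a `byte[]` through a byte-array view `VarHandle`. This is an assumption about the general shape, not the code in the patch; the class name `LittleEndianView` is hypothetical, and the method name only mirrors the style of the helpers in `ByteArrayAccess`.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class LittleEndianView {
    // Views a byte[] as little-endian ints; the index passed to get() is a byte offset.
    private static final VarHandle LE_INT =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    // Decodes len bytes (assumed to be a multiple of 4) from in[inOfs..] into out[outOfs..].
    static void b2iLittle(byte[] in, int inOfs, int[] out, int outOfs, int len) {
        for (int end = inOfs + len; inOfs < end; inOfs += 4) {
            // Bounds-checked read; no Unsafe and no manual shifting required.
            out[outOfs++] = (int) LE_INT.get(in, inOfs);
        }
    }

    public static void main(String[] args) {
        byte[] in = {0x01, 0x02, 0x03, 0x04, (byte) 0xff, 0, 0, 0};
        int[] out = new int[2];
        b2iLittle(in, 0, out, 0, in.length);
        System.out.println(Integer.toHexString(out[0])); // 4030201
        System.out.println(out[1]);                      // 255
    }
}
```

Plain `get` on a byte-array view is bounds-checked and supports unaligned offsets, which is what lets a single code path stand in for the `Unsafe` fast path and its fallback.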

With MD5 intrinsics disabled, I get this baseline:

Benchmark                                                  (size)   Mode  Cnt    Score    Error   Units
UUIDBench.fromType3Bytes                                    20000  thrpt   12    1.245 ±  0.077  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm                20000  thrpt   12  488.042 ±  0.004    B/op

With the current patch here (not including #1933):

Benchmark                                                  (size)   Mode  Cnt    Score    Error   Units
UUIDBench.fromType3Bytes                                    20000  thrpt   12    1.431 ±  0.106  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm                20000  thrpt   12  408.035 ±  0.006    B/op

If I isolate the `ByteArrayAccess` changes, I get performance-neutral or slightly better numbers compared to the baseline for these tests:

Benchmark                                                  (size)   Mode  Cnt    Score    Error   Units
UUIDBench.fromType3Bytes                                    20000  thrpt   12    1.317 ±  0.092  ops/us
UUIDBench.fromType3Bytes:·gc.alloc.rate.norm                20000  thrpt   12  488.042 ±  0.004    B/op
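
As a quick sanity check on how to read these tables, the sketch below computes the relative changes against the baseline and whether the reported score ± error ranges overlap. The numbers are copied from the tables above; the helper names (`BenchDelta`, `percentChange`, `intervalsOverlap`) are hypothetical and this is only arithmetic, not a statistical test.

```java
import java.util.Locale;

final class BenchDelta {
    // Relative change of value versus baseline, in percent.
    static double percentChange(double baseline, double value) {
        return (value - baseline) / baseline * 100.0;
    }

    // True if [a - aErr, a + aErr] and [b - bErr, b + bErr] intersect.
    static boolean intervalsOverlap(double a, double aErr, double b, double bErr) {
        return a - aErr <= b + bErr && b - bErr <= a + aErr;
    }

    public static void main(String[] args) {
        // Throughput (ops/us): baseline vs. full patch vs. ByteArrayAccess-only.
        System.out.println(String.format(Locale.ROOT, "%.1f", percentChange(1.245, 1.431))); // 14.9
        System.out.println(String.format(Locale.ROOT, "%.1f", percentChange(1.245, 1.317))); // 5.8
        // Allocation (B/op): baseline vs. full patch.
        System.out.println(String.format(Locale.ROOT, "%.1f", percentChange(488.042, 408.035))); // -16.4
        // Do the throughput ranges overlap with the baseline?
        System.out.println(intervalsOverlap(1.245, 0.077, 1.431, 0.106)); // false
        System.out.println(intervalsOverlap(1.245, 0.077, 1.317, 0.092)); // true
    }
}
```

Read this way, the full patch is about 15% faster than the baseline and allocates about 80 B/op less, while the isolated `ByteArrayAccess` change sits within the error margin of the baseline, consistent with calling it performance neutral.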

-------------

PR: https://git.openjdk.java.net/jdk/pull/1855