RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access

Thu May 25 11:40:28 UTC 2023

UUID is a very important class that is used to track identities of objects in large-scale systems. On some of our systems, `UUID.randomUUID` takes >1% of total CPU time, and is frequently a scalability bottleneck due to `SecureRandom` synchronization.

The major issue with the UUID code itself is that it reads from a single `SecureRandom` instance 16 bytes at a time. So the heavily contended `SecureRandom` is bashed with very small requests. This also has a chilling effect on other users of `SecureRandom` when there is heavy UUID generation traffic.

We can improve this by doing bulk reads from the backing `SecureRandom` and possibly striping the reads across many instances of it.
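To illustrate the idea (this is not the code in the PR), here is a minimal sketch of bulk reads combined with striping. The class name `StripedUuidSketch`, the stripe count, the buffer size, and the choice of stripe by thread id are all assumptions made for the example.

```java
import java.security.SecureRandom;
import java.util.UUID;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only; names and sizes are hypothetical, not taken from the patch.
final class StripedUuidSketch {
    static final int STRIPES = 8;            // assumed number of independent stripes
    static final int UUIDS_PER_BUFFER = 256; // assumed: one bulk read serves 256 UUIDs

    private static final Stripe[] STRIPE_TABLE = new Stripe[STRIPES];
    static {
        for (int i = 0; i < STRIPES; i++) {
            STRIPE_TABLE[i] = new Stripe();
        }
    }

    private static final class Stripe {
        final ReentrantLock lock = new ReentrantLock();
        final SecureRandom random = new SecureRandom();
        final byte[] buf = new byte[16 * UUIDS_PER_BUFFER];
        int pos = buf.length; // start "empty" so the first call refills

        UUID next() {
            lock.lock();
            try {
                if (pos == buf.length) {
                    random.nextBytes(buf); // one large request instead of many 16-byte ones
                    pos = 0;
                }
                long msb = 0;
                long lsb = 0;
                for (int i = 0; i < 8; i++) {
                    msb = (msb << 8) | (buf[pos + i] & 0xff);
                }
                for (int i = 8; i < 16; i++) {
                    lsb = (lsb << 8) | (buf[pos + i] & 0xff);
                }
                pos += 16;
                msb = (msb & ~0xF000L) | 0x4000L;                         // set version 4
                lsb = (lsb & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;  // set IETF variant
                return new UUID(msb, lsb);
            } finally {
                lock.unlock();
            }
        }
    }

    // Picks a stripe from the calling thread's id, so different threads
    // tend to contend on different locks and different SecureRandom instances.
    static UUID randomUUID() {
        int idx = (int) (Thread.currentThread().getId() % STRIPES);
        return STRIPE_TABLE[idx].next();
    }
}
```

Each stripe pays the `SecureRandom` cost once per refill, and threads that map to different stripes do not block each other. Whether the underlying provider still serializes internally is a separate question, which the numbers below address.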

Benchmark               Mode  Cnt  Score   Error   Units

### AArch64 (m6g.4xlarge, Graviton, 16 cores)

# Before
UUIDRandomBench.single  thrpt   15  3.545 ± 0.058  ops/us
UUIDRandomBench.max     thrpt   15  1.832 ± 0.059  ops/us ; negative scaling

# After
UUIDRandomBench.single  thrpt   15  4.421 ± 0.047  ops/us 
UUIDRandomBench.max     thrpt   15  6.658 ± 0.092  ops/us ; positive scaling, ~1.5x

### x86_64 (c6.8xlarge, Xeon, 18 cores)

# Before
UUIDRandomBench.single  thrpt   15  2.710 ± 0.038  ops/us
UUIDRandomBench.max     thrpt   15  1.880 ± 0.029  ops/us  ; negative scaling 

# After
UUIDRandomBench.single  thrpt   15  3.099 ± 0.022  ops/us
UUIDRandomBench.max     thrpt   15  3.555 ± 0.062  ops/us  ; positive scaling, ~1.2x

Note that there is still a scalability bottleneck in the current default random (`NativePRNG`), because it synchronizes over a singleton instance. This PR adds a system property to select the implementation, and there we can clearly see the benefit:

Benchmark               Mode  Cnt   Score   Error   Units

### x86_64 (c6.8xlarge, Xeon, 18 cores)

# Before, hacked `new SecureRandom()` to `SecureRandom.getInstance("SHA1PRNG")`
UUIDRandomBench.single  thrpt   15  3.661 ± 0.008  ops/us
UUIDRandomBench.max     thrpt   15  2.400 ± 0.031  ops/us  ; faster than NativePRNG, but still negative scalability

# After, -Djava.util.UUID.prngName=SHA1PRNG
UUIDRandomBench.single  thrpt   15   3.522 ± 0.009  ops/us
UUIDRandomBench.max     thrpt   15  50.506 ± 1.734  ops/us ;  positive scaling, ~14x

Other scalable random number providers would improve in a similar way. Note that just changing to `SHA1PRNG` right now, without this patch, would not help much, because it would still be very contended. This PR does not change the default PRNG provider; that would need a larger discussion. It only provides the means to select another one.
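For readers unfamiliar with how such a selection usually works, here is a rough sketch of resolving a `SecureRandom` from the `java.util.UUID.prngName` property shown above. The class `PrngSelectSketch` and the fallback to the default provider on an unknown name are assumptions for illustration; the patch may handle that case differently.

```java
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Illustrative sketch only; not the selection logic from the patch.
final class PrngSelectSketch {
    // Returns the named PRNG, or the platform default if no name is given.
    static SecureRandom create(String name) {
        if (name == null || name.isEmpty()) {
            return new SecureRandom();
        }
        try {
            return SecureRandom.getInstance(name);
        } catch (NoSuchAlgorithmException e) {
            // Assumed behavior: fall back to the default rather than fail.
            return new SecureRandom();
        }
    }

    // Reads the property name used in the benchmark runs above.
    static SecureRandom fromProperty() {
        return create(System.getProperty("java.util.UUID.prngName"));
    }
}
```

With this shape, running with `-Djava.util.UUID.prngName=SHA1PRNG` would yield a `SHA1PRNG` instance, and running without the property would keep the default.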

Since the buffers are allocated on demand and stay permanently, there are allocation rate improvements: generating a UUID now takes 80 bytes per op instead of 120 bytes per op. The buffer cache also takes memory. Back-of-the-envelope: for a large 192-core machine that takes UUIDs in all threads, the default settings add up to 768K of additional memory.
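The estimate above can be unpacked with a little arithmetic. The sketch below assumes that "768K" means 768 KiB and that the footprint is spread evenly across cores; the actual buffer layout in the patch may differ, and the class name is hypothetical.

```java
// Illustrative arithmetic only; assumes an even per-core split of the quoted total.
final class BufferFootprintSketch {
    static final long TOTAL_BYTES = 768L * 1024; // 768K, read as KiB (assumption)
    static final int CORES = 192;                // machine size quoted above
    static final int UUID_BYTES = 16;            // random bytes consumed per UUID

    // Buffer memory attributable to each core under the even-split assumption.
    static long bytesPerCore() {
        return TOTAL_BYTES / CORES;
    }

    // How many UUIDs that much buffer can serve before a refill.
    static long uuidsPerCoreBuffer() {
        return bytesPerCore() / UUID_BYTES;
    }
}
```

Under these assumptions the figure works out to 4 KiB per core, which is enough random data for 256 UUIDs between refills.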

Additional testing:
 - [x] Updated tests from #14134 
 - [ ] Linux AArch64 fastdebug `tier1 tier2 tier3`

The new options are not strictly speaking required for this work to be useful, but it would be convenient to have them around for field tuning and diagnostics.

-------------

Commit messages:
 - More touchups
 - Comment updates
 - Runtime options and touchups
 - Add benchmark
 - Initial work

Changes: https://git.openjdk.org/jdk/pull/14135/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14135&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8308804
  Stats: 237 lines in 2 files changed: 221 ins; 9 del; 7 mod
  Patch: https://git.openjdk.org/jdk/pull/14135.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/14135/head:pull/14135

PR: https://git.openjdk.org/jdk/pull/14135