RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access

Fri May 26 00:20:03 UTC 2023

On Wed, 24 May 2023 19:36:44 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

> UUID is the very important class that is used to track identities of objects in large scale systems. On some of our systems, `UUID.randomUUID` takes >1% of total CPU time, and is frequently a scalability bottleneck due to `SecureRandom` synchronization.
> 
> The major issue with UUID code itself is that it reads from the single `SecureRandom` instance by 16 bytes. So the heavily contended `SecureRandom` is bashed with very small requests. This also has a chilling effect on other users of `SecureRandom`, when there is a heavy UUID generation traffic.
> 
> We can improve this by doing the bulk reads from the backing SecureRandom and possibly striping the reads across many instances of it. 
> 
> 
> Benchmark               Mode  Cnt  Score   Error   Units
> 
> ### AArch64 (m6g.4xlarge, Graviton, 16 cores)
> 
> # Before
> UUIDRandomBench.single  thrpt   15  3.545 ± 0.058  ops/us
> UUIDRandomBench.max     thrpt   15  1.832 ± 0.059  ops/us ; negative scaling
> 
> # After
> UUIDRandomBench.single  thrpt   15  4.421 ± 0.047  ops/us 
> UUIDRandomBench.max     thrpt   15  6.658 ± 0.092  ops/us ; positive scaling, ~1.5x
> 
> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
> 
> # Before
> UUIDRandomBench.single  thrpt   15  2.710 ± 0.038  ops/us
> UUIDRandomBench.max     thrpt   15  1.880 ± 0.029  ops/us  ; negative scaling 
> 
> # After
> Benchmark                Mode  Cnt  Score   Error   Units
> UUIDRandomBench.single  thrpt   15  3.099 ± 0.022  ops/us
> UUIDRandomBench.max     thrpt   15  3.555 ± 0.062  ops/us  ; positive scaling, ~1.2x
> 
> 
> Note that there is still a scalability bottleneck in current default random (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure out the synchronization story there. The scalability fix in current default `SecureRandom` would be much more intrusive and risky, since it would change a core crypto class with unknown bug fanout.
> 
> Using the bulk reads even when the underlying PRNG is heavily synchronized is still a win. A more scalable PRNG would benefit from this as well. This PR adds a system property to select the PRNG implementation, and there we can clearly see the benefit with more scalable PRNG sources:
> 
> 
> Benchmark               Mode  Cnt   Score   Error   Units
> 
> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
> 
> # Before, hacked `new SecureRandom()` to `SecureRandom.getInstance("SHA1PRNG")`
> UUIDRandomBench.single  thrpt   15  3.661 ± 0.008  ops/us
> UUIDRandomBench...

src/java.base/share/classes/java/util/UUID.java line 224:

> 222:                     // Try to pull the UUID from the current buffer at current position.
> 223:                     if (stamp != 0) {
> 224:                         int p = (int)VH_POS.getAndAdd(this, UUID_CHUNK);

An atomic update together with an optimistic lock looks non-idiomatic use of StampedLock to me. Won't a simple CAS loop be simpler? E.g. in pseudocode:

while ((p = this.pos) + 16) < buf.length) {
    long msb = getLong(buf, p);
    long lsb = getLong(buf, p + 8);
    if (cas(this.pos, p, p + 16)) {
        return new UUID(msb, lsb);
    }
}

// refill buffer under lock

src/java.base/share/classes/java/util/UUID.java line 226:

> 224:                         int p = (int)VH_POS.getAndAdd(this, UUID_CHUNK);
> 225:                         if (p < BUF_SIZE) {
> 226:                             uuid = new UUID(buf, p);

We can read msb/lsb from the buffer here and move object allocation outside the lock to reduce the length of the critical section.

src/java.base/share/classes/java/util/UUID.java line 260:

> 258:                         buf[c + 8] &= 0x3f;  /* clear variant        */
> 259:                         buf[c + 8] |= (byte) 0x80;  /* set to IETF variant  */
> 260:                     }

I'm not sure I understand the point about initialization. Why not just setting the required version bits right in UUID constructor without updating the array? This will be faster and simpler IMO.

src/java.base/share/classes/java/util/UUID.java line 286:

> 284:         long lsb = 0;
> 285:         for (int i = start; i < start + 8; i++) {
> 286:             msb = (msb << 8) | (data[i] & 0xff);

Can we use VarHandle/ByteBuffer to read 64 bits at a time?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206104940
PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206105877
PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206096052
PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206097261