RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access

Aleksey Shipilev shade at openjdk.org
Fri May 26 08:42:57 UTC 2023


On Fri, 26 May 2023 00:13:57 GMT, Andrei Pangin <apangin at openjdk.org> wrote:

>> UUID is the very important class that is used to track identities of objects in large scale systems. On some of our systems, `UUID.randomUUID` takes >1% of total CPU time, and is frequently a scalability bottleneck due to `SecureRandom` synchronization.
>> 
>> The major issue with UUID code itself is that it reads from the single `SecureRandom` instance by 16 bytes. So the heavily contended `SecureRandom` is bashed with very small requests. This also has a chilling effect on other users of `SecureRandom`, when there is a heavy UUID generation traffic.
>> 
>> We can improve this by doing the bulk reads from the backing SecureRandom and possibly striping the reads across many instances of it. 
>> 
>> 
>> Benchmark               Mode  Cnt  Score   Error   Units
>> 
>> ### AArch64 (m6g.4xlarge, Graviton, 16 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  3.545 ± 0.058  ops/us
>> UUIDRandomBench.max     thrpt   15  1.832 ± 0.059  ops/us ; negative scaling
>> 
>> # After
>> UUIDRandomBench.single  thrpt   15  4.421 ± 0.047  ops/us 
>> UUIDRandomBench.max     thrpt   15  6.658 ± 0.092  ops/us ; positive scaling, ~1.5x
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  2.710 ± 0.038  ops/us
>> UUIDRandomBench.max     thrpt   15  1.880 ± 0.029  ops/us  ; negative scaling 
>> 
>> # After
>> Benchmark                Mode  Cnt  Score   Error   Units
>> UUIDRandomBench.single  thrpt   15  3.099 ± 0.022  ops/us
>> UUIDRandomBench.max     thrpt   15  3.555 ± 0.062  ops/us  ; positive scaling, ~1.2x
>> 
>> 
>> Note that there is still a scalability bottleneck in current default random (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure out the synchronization story there. The scalability fix in current default `SecureRandom` would be much more intrusive and risky, since it would change a core crypto class with unknown bug fanout.
>> 
>> Using the bulk reads even when the underlying PRNG is heavily synchronized is still a win. A more scalable PRNG would benefit from this as well. This PR adds a system property to select the PRNG implementation, and there we can clearly see the benefit with more scalable PRNG sources:
>> 
>> 
>> Benchmark               Mode  Cnt   Score   Error   Units
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before, hacked `new SecureRandom()` to `SecureRandom.getInstance("SHA1PRNG")`
>> UUIDRandomBench.single  thrpt  ...
>
> src/java.base/share/classes/java/util/UUID.java line 224:
> 
>> 222:                     // Try to pull the UUID from the current buffer at current position.
>> 223:                     if (stamp != 0) {
>> 224:                         int p = (int)VH_POS.getAndAdd(this, UUID_CHUNK);
> 
> An atomic update together with an optimistic lock looks non-idiomatic use of StampedLock to me. Won't a simple CAS loop be simpler? E.g. in pseudocode:
> 
> while ((p = this.pos) + 16) < buf.length) {
>     long msb = getLong(buf, p);
>     long lsb = getLong(buf, p + 8);
>     if (cas(this.pos, p, p + 16)) {
>         return new UUID(msb, lsb);
>     }
> }
> 
> // refill buffer under lock

Nope, I don't think you cannot do this, because there is a race on concurrently-updating `buf`. The StampedLock (RW lock), protects the buffer-reading threads from seeing the `buf` undergoing the initialization by buffer-writing threads. Atomic pos only arbitrates the concurrent buffer-threads. That atomic does not help to resolve the buf races.

> src/java.base/share/classes/java/util/UUID.java line 260:
> 
>> 258:                         buf[c + 8] &= 0x3f;  /* clear variant        */
>> 259:                         buf[c + 8] |= (byte) 0x80;  /* set to IETF variant  */
>> 260:                     }
> 
> I'm not sure I understand the point about initialization. Why not just setting the required version bits right in UUID constructor without updating the array? This will be faster and simpler IMO.

Right, we can do things straight on locals, without ever touching the heap, let's see...

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206430447
PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206433588


More information about the core-libs-dev mailing list