Performance of pooling virtual threads vs. semaphores

Alan Bateman Alan.Bateman at oracle.com
Thu May 30 09:47:02 UTC 2024



On 28/05/2024 23:43, Liam Miller-Cushon wrote:
> Hello,
>
> JEP 444 [1] and the docs in [2] mention that virtual threads should 
> not be pooled, and suggest semaphores as one possible alternative.
>
> My colleague Chi Wang has been investigating virtual thread 
> performance, and has found that creating one virtual thread per task 
> and synchronizing on a semaphore can result in worse performance on 
> machines with large numbers of cores.
>
> A benchmark run on a 128 core machine is included below. It submits 
> numTasks tasks to the executor determined by the strategy. The task 
> itself is mixed with CPU and I/O work (simulated by fibonacci and 
> sleep). The parallelism is set to 600 for all strategies.
>
> * Strategy 1 is the baseline where it just submits all the tasks to a 
> ForkJoinPool whose pool size is 600.
> * Strategy 2 uses the method suggested by JEP 444.
> * Strategy 3 uses a fixed thread pool of 600 backed by virtual threads.
>
> Note that, with 100K and 1M tasks, strategy 2 has a CPU time 
> regression that seems to increase with the number of cores. This 
> result can be reproduced on the real-world program that is being 
> migrated to a virtual-thread-per-task model.
>
> Diffing the cpu profile between strategy 1 and strategy 2 showed that 
> most of the CPU regression comes from method 
> `java.util.concurrent.ForkJoinPool.scan(java.util.concurrent.ForkJoinPool$WorkQueue, 
> long, int)`.

If the concurrency for the virtual thread run is limited to the same 
value as the thread count in the thread pool runs then you are unlikely 
to see a benefit. The increased CPU time probably isn't too surprising 
either. In the two thread pool runs, the N tasks are each queued once. 
In the virtual thread run, the tasks for the N virtual threads may be 
queued up to 4 times: once for the initial submit, once when waiting 
for a semaphore permit, and twice for the two sleeps. Also, when CPU 
utilization is low (as I assume it is here), the FJP scan does tend 
to show up in profiles.
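
To make the queue counts concrete, here is a rough sketch of what a 
semaphore-limited, thread-per-task run might look like. The benchmark 
source isn't reproduced in this message, so the class name, the fib 
helper, the sleep durations and the ordering of the task body are all 
assumptions. The numbered comments mark the points where a virtual 
thread can be handed to the scheduler.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the benchmark from the original report.
final class SemaphoreLimitedTasks {

    // Assumed stand-in for the CPU-bound part of each task.
    static long fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    // Starts one virtual thread per task and lets at most `permits`
    // of them run the task body at once. Returns the completed count.
    static int run(int numTasks, int permits) {
        Semaphore semaphore = new Semaphore(permits);
        AtomicInteger completed = new AtomicInteger();
        try (ExecutorService executor =
                 Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < numTasks; i++) {
                // (1) starting the virtual thread queues it once
                executor.submit(() -> {
                    // (2) parks if no permit is free; queued again
                    //     when a permit is released to it
                    semaphore.acquire();
                    try {
                        // (3) parks; queued again when the sleep ends
                        Thread.sleep(Duration.ofMillis(1));
                        fib(15);
                        // (4) parks; queued again when the sleep ends
                        Thread.sleep(Duration.ofMillis(1));
                        completed.incrementAndGet();
                    } finally {
                        semaphore.release();
                    }
                    return null; // Callable, so acquire/sleep may throw
                });
            }
        } // close() waits for all submitted tasks to finish
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(run(1_000, 600));
    }
}
```

With a pool of 600 threads, each task passes through a queue once and 
the sleeps block the pool thread instead of going back to the 
scheduler.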

Has Chi looked into increasing the concurrency so that it's not limited 
to 600? Concurrency may need to be limited at a finer grain in the 
"real world program", but perhaps not the number of threads.
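
As an illustration of limiting at a finer grain, the sketch below 
leaves the number of virtual threads unbounded and only guards the 
operation that actually needs a limit. Everything here is made up for 
the example: the backend call, the limit of 50, and the counters used 
to observe the peak concurrency.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: limit a scarce resource, not the thread count.
final class FineGrainedLimit {

    // Assumed limit on concurrent calls to some backend.
    static final int MAX_BACKEND_CALLS = 50;

    static final Semaphore backendPermits =
        new Semaphore(MAX_BACKEND_CALLS);
    static final AtomicInteger inFlight = new AtomicInteger();
    static final AtomicInteger maxInFlight = new AtomicInteger();

    // Only this call is limited; the sleep stands in for real I/O.
    static void callBackend() throws InterruptedException {
        backendPermits.acquire();
        try {
            int now = inFlight.incrementAndGet();
            maxInFlight.accumulateAndGet(now, Math::max);
            Thread.sleep(Duration.ofMillis(1));
        } finally {
            inFlight.decrementAndGet();
            backendPermits.release();
        }
    }

    // One virtual thread per task, with no cap on how many are
    // running the unguarded parts of the task at the same time.
    static int run(int numTasks) {
        AtomicInteger completed = new AtomicInteger();
        try (ExecutorService executor =
                 Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < numTasks; i++) {
                executor.submit(() -> {
                    // unguarded work would go here
                    callBackend();
                    completed.incrementAndGet();
                    return null;
                });
            }
        } // close() waits for all submitted tasks to finish
        return completed.get();
    }

    public static void main(String[] args) {
        int done = run(10_000);
        System.out.println(done + " tasks, peak backend calls: "
            + maxInFlight.get());
    }
}
```

Tasks only contend for a permit around the guarded call, so work 
outside it is not held back by the limit.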

-Alan


