Performance of pooling virtual threads vs. semaphores
Alan Bateman
Alan.Bateman at oracle.com
Thu May 30 09:47:02 UTC 2024
On 28/05/2024 23:43, Liam Miller-Cushon wrote:
> Hello,
>
> JEP 444 [1] and the docs in [2] mention that virtual threads should
> not be pooled, and suggest semaphores as one possible alternative.
>
> My colleague Chi Wang has been investigating virtual thread
> performance, and has found that creating one virtual thread per task
> and synchronizing on a semaphore can result in worse performance on
> machines with large numbers of cores.
>
> A benchmark run on a 128 core machine is included below. It submits
> numTasks tasks to the executor determined by the strategy. The task
> itself is mixed with CPU and I/O work (simulated by fibonacci and
> sleep). The parallelism is set to 600 for all strategies.
>
> * Strategy 1 is the baseline where it just submits all the tasks to a
> ForkJoinPool whose pool size is 600.
> * Strategy 2 uses the method suggested by JEP 444.
> * Strategy 3 uses a fixed thread pool of 600 backed by virtual threads.
>
> Note that, with 100K and 1M tasks, strategy 2 has a CPU time
> regression that seems to increase with the number of cores. This
> result can be reproduced on the real-world program that is being
> migrated to a virtual-thread-per-task model.
>
> Diffing the cpu profile between strategy 1 and strategy 2 showed that
> most of the CPU regression comes from method
> `java.util.concurrent.ForkJoinPool.scan(java.util.concurrent.ForkJoinPool$WorkQueue,
> long, int)`.
If the concurrency for the virtual thread run is limited to the same
value as the thread count in the thread pool runs then you are unlikely
to see a benefit. The increased CPU time probably isn't too surprising
either. In the two thread pool runs, the N tasks are each queued once. In
the virtual thread run, the tasks for the N virtual threads may each be
queued up to 4 times: once for the initial submit, once when waiting for a
semaphore permit, and twice for the two sleeps. Also, when CPU
utilization is low (as I assume it is here), the FJP scan does tend
to show up in profiles.
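
To make the queuing points above concrete, here is a minimal sketch of the
virtual-thread-per-task-plus-semaphore strategy, with comments marking where a
virtual thread may be handed back to the scheduler. It is not Chi's benchmark:
the class name, the task shape (sleep, fibonacci, sleep), the sleep durations
and the fibonacci argument are all assumptions made for illustration. It
requires JDK 21 or later.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class SemaphoreLimitedTasks {

    // Hypothetical stand-in for the CPU-bound part of a task.
    public static long fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    // Runs numTasks tasks, one virtual thread each, with at most `permits`
    // of them inside the guarded section at a time. Returns the number of
    // tasks that completed.
    public static int run(int numTasks, int permits) {
        Semaphore semaphore = new Semaphore(permits);
        AtomicInteger completed = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < numTasks; i++) {
                // (1) Starting the virtual thread submits it to the scheduler.
                executor.submit(() -> {
                    // (2) If no permit is available the virtual thread parks,
                    //     and is submitted again when a permit is released.
                    semaphore.acquire();
                    try {
                        Thread.sleep(1);   // (3) parks, resubmitted on wakeup
                        fib(15);           // CPU work (assumed size)
                        Thread.sleep(1);   // (4) parks, resubmitted on wakeup
                        completed.incrementAndGet();
                    } finally {
                        semaphore.release();
                    }
                    return null;           // makes the lambda a Callable
                });
            }
        } // close() waits for all submitted tasks to finish
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed: " + run(1_000, 600));
    }
}
```

In the thread pool runs, by contrast, a task is queued once and then runs to
completion on a pooled thread.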
Has Chi looked into increasing the concurrency so that it's not limited
to 600? Concurrency may need to be limited at a finer grain in the "real
world program", but maybe not the number of threads.
-Alan