Performance of pooling virtual threads vs. semaphores

Thu May 30 18:22:41 UTC 2024

> On 28 May 2024, at 23:43, Liam Miller-Cushon <cushon at google.com> wrote:
> 
> Hello,
> 
> JEP 444 [1] and the docs in [2] mention that virtual threads should not be pooled, and suggest semaphores as one possible alternative.
> 
> My colleague Chi Wang has been investigating virtual thread performance, and has found that creating one virtual thread per task and synchronizing on a semaphore can result in worse performance on machines with large numbers of cores.
> 
> A benchmark run on a 128 core machine is included below. It submits numTasks tasks to the executor determined by the strategy. The task itself is mixed with CPU and I/O work (simulated by fibonacci and sleep). The parallelism is set to 600 for all strategies.
> 
> * Strategy 1 is the baseline where it just submits all the tasks to a ForkJoinPool whose pool size is 600.
> * Strategy 2 uses the method suggested by JEP 444.
> * Strategy 3 uses a fixed thread pool of 600 backed by virtual threads.
> 
> Note that, with 100K and 1M tasks, strategy 2 has a CPU time regression that seems to increase with the number of cores. This result can be reproduced on the real-world program that is being migrated to a virtual-thread-per-task model.
> 
> Diffing the cpu profile between strategy 1 and strategy 2 showed that most of the CPU regression comes from method `java.util.concurrent.ForkJoinPool.scan(java.util.concurrent.ForkJoinPool$WorkQueue, long, int)`.
> 
> Are there any ideas for why the semaphore strategy uses more CPU than pooling virtual threads on machines with a large number of cores?

My first guess would be that what’s being compared here is not the same algorithm. In strategy 2, you spawn a bunch of different threads that do a tiny bit of work before starting the task, while in the other strategies the algorithm is different.

In a realistic workload you will have a number of threads serving requests, and then they may want to access a concurrency-limited resource. I think you can try to simulate that by simulating, say, 1000 tasks running in some large thread pool and then each of them would submit one subtasks to a fixed thread pool of size 10. For virtual threads, have 1000 virtual threads that try to acquire a semaphore with 10 leases.

Can you try that? At least then we’ll be comparing oranges to oranges.

— Ron