[External] : Re: Project Loom VirtualThreads hang

Robert Engels rengels at ix.netcom.com
Wed Jan 4 17:19:56 UTC 2023


To clarify, this is not a contrived test. This is a reproduction of a real-world system. With vthreads it is more efficient to remove the intermediate queues and have the producers “write through” into the consumers’ network connections. 

This isn’t ideal, as a slow consumer can block everyone - intermediate queues isolate that problem (to a degree - the queue can still back up). The alternative is to set a write timeout and drop the consumer if it triggers. 
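As a rough illustration of that write-through-with-timeout approach (the names ConsumerHandle and writeThrough are hypothetical, not the actual code): producers write directly into a per-consumer connection guarded by a lock, wait at most a bounded time for a slow consumer, and drop it if the wait times out. A real system would also put a write timeout on the socket itself, since the lock timeout below bounds only the wait for the lock.

import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical per-consumer handle: producers "write through" into the
// consumer's network connection, guarded by a lock with a bounded wait.
final class ConsumerHandle {
    private final OutputStream out;                 // the consumer's connection
    private final ReentrantLock writeLock = new ReentrantLock();

    ConsumerHandle(OutputStream out) { this.out = out; }

    // Returns false if the consumer was too slow and should be dropped.
    boolean writeThrough(byte[] msg, long timeout, TimeUnit unit)
            throws InterruptedException, IOException {
        if (!writeLock.tryLock(timeout, unit)) {
            return false;                           // slow consumer: caller drops it
        }
        try {
            out.write(msg);                         // blocking write parks the
            out.flush();                            // virtual thread, not the carrier
            return true;
        } finally {
            writeLock.unlock();
        }
    }
}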

> On Jan 4, 2023, at 10:07 AM, Ron Pressler <ron.pressler at oracle.com> wrote:
> 
>  P.S.
> 
> To explain my hint about benchmarking with many queues, let me say that often what makes the scheduler work harder is not the context-switching itself but finding a task (in this case, a runnable thread) to run. When the amount of contention is very high compared to the total number of threads, this may be hard and require expensive inter-core chatter. But in more realistic workloads, the level of contention is significantly lower than the total number of threads, so it’s easier for the scheduler to find a thread to schedule. I.e. in common conditions with, say, 50,000 threads, they will not all be contending for the same data structure, but small groups of them may be contending over multiple data structures, and there will be sufficiently many runnable threads to keep the scheduler from working hard to find things to run on other cores’ queues.
> 
>>> On 4 Jan 2023, at 15:27, Ron Pressler <ron.pressler at oracle.com> wrote:
>>> 
>>> Thank you. 
>>> 
>>> To make your benchmark more interesting, I would suggest varying both the number of producers and consumers as well as the number of queues they contend over (e.g. 100,000 queues with 1 producer and 1 consumer each, 1000 queues with 100 producers and 100 consumers each etc.). This would also give you a sense of the kinds of benchmarks we’re using.
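For what it's worth, a parameterized harness along those lines might look roughly like the sketch below (class and constant names are made up for illustration; this is not the benchmark being described): QUEUES independent queues, each contended by PRODUCERS producers and CONSUMERS consumers, all on virtual threads.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sweep point: 1000 queues x (100 producers + 100 consumers) each.
public class QueueBenchSketch {
    static final int QUEUES = 1_000, PRODUCERS = 100, CONSUMERS = 100;
    static final int MSGS_PER_PRODUCER = 1_000;

    public static void main(String[] args) {
        try (ExecutorService ex = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int q = 0; q < QUEUES; q++) {
                BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1024);
                for (int p = 0; p < PRODUCERS; p++) {
                    ex.submit(() -> {
                        for (int i = 0; i < MSGS_PER_PRODUCER; i++) queue.put(i);
                        return null;
                    });
                }
                int perConsumer = PRODUCERS * MSGS_PER_PRODUCER / CONSUMERS;
                for (int c = 0; c < CONSUMERS; c++) {
                    ex.submit(() -> {
                        for (int i = 0; i < perConsumer; i++) queue.take();
                        return null;
                    });
                }
            }
        } // close() waits for all 200,000 tasks to complete
    }
}

Varying QUEUES, PRODUCERS and CONSUMERS (while keeping total threads roughly constant) shifts how much contention each queue sees, which is the dimension being suggested.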
>>> 
>>> BTW, the impact on throughput that context switching has on message passing systems is related to the ratio between the average context switching duration and the average wait-time between messages, i.e. context-switching needs to be efficient only to the point that it is significantly smaller than the wait-time between messages. Once it’s small enough *in comparison*, reducing it further has little impact (see calculation here: https://inside.java/2020/08/07/loom-performance/).
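To put illustrative numbers on that ratio: if the average wait between messages is 10 µs and a context switch costs 1 µs, each hand-off takes roughly 11 µs, i.e. about 91k messages/s per thread; shrinking the switch to 0.1 µs only improves that to about 99k messages/s. Once the switch is small relative to the wait, further reductions buy very little.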
>>> 
>>> — Ron
>>> 
>>> On 4 Jan 2023, at 04:59, robert engels <rengels at ix.netcom.com> wrote:
>>> 
>>> Some further data points, using VthreadTest here (essentially a message-passing system):
>>> 
>>> With 32 producers, 32 consumers, and 500k messages on a 4/8 core machine:
>>> 
>>> 1a. native threads w/ ABQ: 60-65% system cpu, 20% user cpu, 15-20% idle, total time 189 seconds
>>> 1b. vthreads w/ ABQ: 5-10% system cpu, 75% user cpu, 15% idle, total time 63 seconds
>>> 2a. native threads w/ RingBuffer, spin=1: 70% system cpu, 30% user cpu, 0% idle, total time 174 seconds
>>> 2b. vthreads w/ RingBuffer, spin=1: 13% system cpu, 85% user cpu, 2% idle, total time 37 seconds
>>> 3a. native threads w/ RingBuffer, spin=32: 68% system cpu, 30% user cpu, 2% idle, total time 164 seconds
>>> 3b. vthreads w/ RingBuffer, spin=32: 13% system cpu, 85% user cpu, 3% idle, total time 40 seconds
>>> 
>>> (ABQ is stdlib ArrayBlockingQueue)
>>> 
>>> The above times have a lot of variance that is not fully accounted for, but the interesting thing is how large a difference the RingBuffer makes between cases 1 and 2.
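The VthreadTest RingBuffer itself isn't shown here, but the general idea behind a "spin=N" hand-off - spin briefly before parking, rather than immediately blocking on a lock the way ArrayBlockingQueue does - can be sketched roughly as follows (illustrative only, not the actual VthreadTest code):

import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

// Single-slot hand-off with a bounded spin before parking briefly.
// Spinning avoids a park/unpark round trip when the peer is about to act.
final class SpinHandoff<T> {
    private final AtomicReference<T> slot = new AtomicReference<>();
    private final int spins;

    SpinHandoff(int spins) { this.spins = spins; }

    void put(T value) {
        int n = spins;
        while (!slot.compareAndSet(null, value)) {
            if (n-- > 0) Thread.onSpinWait();       // cheap busy-wait first
            else LockSupport.parkNanos(1_000);      // then back off and park
        }
    }

    T take() {
        int n = spins;
        for (;;) {
            T v = slot.getAndSet(null);
            if (v != null) return v;
            if (n-- > 0) Thread.onSpinWait();
            else LockSupport.parkNanos(1_000);
        }
    }
}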
>>> 
>>> Even in 2b, 13% is taken up by the OS - I assume due to thread switching, as there is no IO in the test - which suggests the scheduling can probably be improved.
>>> 
>>> I would expect a green-thread system to approach 0% idle and 0% system utilization in this type of test. I am “fairly certain” the code should be able to use all carrier threads at 100%. Maybe the system % is going to something else? (You can use the SpinTest - comment out the println - and see that a 100% cpu-bound “do nothing” test that allocates no objects still uses more than 25% system cpu, which seems odd.)
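For reference, the shape of such a “do nothing” spin test is roughly the following (a sketch, not the actual SpinTest source): one virtual thread per core doing pure CPU-bound work with no allocation and no blocking, so any system CPU observed comes from the runtime/scheduler rather than the workload.

// Sketch only - not the actual SpinTest source.
public class SpinTestSketch {
    static volatile boolean stop;

    public static void main(String[] args) throws InterruptedException {
        int n = Runtime.getRuntime().availableProcessors();
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            threads[i] = Thread.ofVirtual().start(() -> {
                long x = 0;
                while (!stop) x++;                  // pure spin, no allocation
                // System.out.println(x);           // println commented out, as noted
            });
        }
        Thread.sleep(10_000);                       // run for 10 seconds
        stop = true;
        for (Thread t : threads) t.join();
    }
}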
>>> 
>>> Here is an async profile capture of 3b:
>>> 
>>> <PastedGraphic-3.png>
>>> 
>>> Notice that the vast majority of the time is used in internal context switching.
>>> 
>>> I can “fairly agree” with the project’s stated bias towards server systems with thousands of threads (I do think thousands of threads are enough vs. millions), but I hope this can be addressed moving forward. I think the CSP (communicating sequential processes) model (close to the Actor model) simplifies a lot of concurrent programming concerns, but it requires highly efficient context switching and queues to work well.
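As a concrete picture of the CSP style being described: each “process” is a virtual thread and each “channel” a bounded blocking queue, so the whole model stands or falls on how cheaply blocked threads can be parked and resumed. A minimal illustrative pipeline (not code from VthreadTest):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal CSP-style pipeline on virtual threads (illustrative only).
public class CspPipelineSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(64);

        Thread producer = Thread.ofVirtual().start(() -> {
            try {
                for (int i = 0; i < 1_000; i++) channel.put(i);
                channel.put(-1);                    // poison pill ends the consumer
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread consumer = Thread.ofVirtual().start(() -> {
            try {
                for (int v; (v = channel.take()) != -1; ) {
                    // handle message v; a blocking take() just parks this virtual thread
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        producer.join();
        consumer.join();
    }
}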
>>> 
>> 
> 

