[External] : Re: Project Loom VirtualThreads hang
robert engels
rengels at ix.netcom.com
Thu Jan 5 01:01:03 UTC 2023
Understood. Thanks.
For a large swath of “message systems” you are talking about hundreds of clients, not thousands, and certainly not millions. I am particularly referring to financial/trading systems.
These usually use a fan-out structure, so any given node only has < 100 clients, and often far fewer than that. Latency, however, is a huge concern. This can be done with async IO and native threads, but it doesn’t “feel like Java” - and is certainly harder to understand/maintain in my opinion.
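As an aside, the “one virtual thread per client” style the paragraph alludes to can be sketched roughly as below. This is a hypothetical illustration, not code from any real trading system - the `Client` record and the fan-out loop are made up for the example:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;

// Hypothetical sketch of a fan-out "router" node: one cheap virtual thread
// per downstream client, each draining its own queue with a plain blocking
// take() - straight-line Java instead of async/callback machinery.
public class FanOutRouter {
    record Client(String id, BlockingQueue<String> inbox) {}

    public static void main(String[] args) throws InterruptedException {
        List<Client> clients = List.of(
                new Client("a", new ArrayBlockingQueue<>(1024)),
                new Client("b", new ArrayBlockingQueue<>(1024)));

        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            // One virtual thread per client consumer.
            for (Client c : clients) {
                executor.submit(() -> {
                    while (true) {
                        String msg = c.inbox().take();
                        if (msg.equals("POISON")) return null; // shutdown marker
                        System.out.println(c.id() + " <- " + msg);
                    }
                });
            }
            // Fan a message out to every client, then signal shutdown.
            for (Client c : clients) c.inbox().put("tick");
            for (Client c : clients) c.inbox().put("POISON");
        } // executor close() waits for the consumer threads to finish
    }
}
```

The point is only that the blocking style stays readable; the async-IO equivalent would need completion handlers or a state machine per client.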
If Loom allowed Go-like performance, it would be a killer solution for these “router” type nodes.
But please don’t take this as strong criticism - in my testing Loom is within 10% of Go performance as it is today, which is a remarkable achievement, especially given the backwards compatibility and familiar programming model.
> On Jan 4, 2023, at 9:27 AM, Ron Pressler <ron.pressler at oracle.com> wrote:
>
> Thank you.
>
> To make your benchmark more interesting, I would suggest varying both the number of producers and consumers as well as the number of queues they contend over (e.g. 100,000 queues with 1 producer and 1 consumer each, 1000 queues with 100 producers and 100 consumers each etc.). This would also give you a sense of the kinds of benchmarks we’re using.
>
> BTW, the impact on throughput that context switching has on message passing systems is related to the ratio between the average context switching duration and the average wait-time between messages, i.e. context-switching needs to be efficient only to the point that it is significantly smaller than the wait-time between messages. Once it’s small enough *in comparison*, reducing it further has little impact (see calculation here: https://inside.java/2020/08/07/loom-performance/ <https://inside.java/2020/08/07/loom-performance/>).
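[The ratio argument above can be illustrated with back-of-envelope numbers; the figures below are hypothetical, chosen only to show the shape of the curve - see the linked inside.java post for the real analysis:]

```java
// Back-of-envelope sketch of the ratio argument: per-message cost is roughly
// wait-time + context-switch cost, so throughput ~ 1 / (wait + switch).
// Once switch << wait, shrinking the switch further barely moves throughput.
// All numbers below are hypothetical.
public class SwitchCostSketch {
    static double throughput(double waitUs, double switchUs) {
        return 1_000_000.0 / (waitUs + switchUs); // messages per second
    }

    public static void main(String[] args) {
        double waitUs = 100.0;                   // avg wait between messages
        double slow   = throughput(waitUs, 10.0);  // 10us switch
        double fast   = throughput(waitUs, 1.0);   // 1us switch
        double faster = throughput(waitUs, 0.1);   // 0.1us switch
        // First 10x switch improvement buys ~9%; the next 10x buys ~1%.
        System.out.printf("%.0f %.0f %.0f%n", slow, fast, faster);
    }
}
```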
>
> — Ron
>
>> On 4 Jan 2023, at 04:59, robert engels <rengels at ix.netcom.com <mailto:rengels at ix.netcom.com>> wrote:
>>
>> For some further data points, using VthreadTest here <https://github.com/robaho/vthread_test> (essentially a message-passing system):
>>
>> With 32 producers, 32 consumers, and 500k messages on a 4/8-core machine:
>>
>> 1a. native threads w ABQ: 60-65% system cpu, 20% user cpu, 15-20% idle, total time 189 seconds
>> 1b. vthreads w ABQ: 5-10% system cpu, 75% user cpu, 15% idle, total time 63 seconds
>> 2a. native threads w RingBuffer, spin=1: 70% system cpu, 30% user cpu, 0% idle, total time 174 seconds
>> 2b. vthreads w RingBuffer, spin=1: 13% system cpu, 85% user cpu, 2% idle, total time 37 seconds
>> 3a. native threads w RingBuffer, spin=32: 68% system cpu, 30% user cpu, 2% idle, total time 164 seconds
>> 3b. vthreads w RingBuffer, spin=32: 13% system cpu, 85% user cpu, 3% idle, total time 40 seconds
>>
>> (ABQ is stdlib ArrayBlockingQueue)
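[The full harness is the linked VthreadTest; the ABQ variant boils down to a pattern like the scaled-down sketch below. The producer/consumer counts and message volume here are illustrative, not the benchmark's actual parameters:]

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.LongAdder;

// Scaled-down sketch of the ABQ benchmark shape: P producers and C consumers
// contending on one ArrayBlockingQueue, all running on virtual threads.
// Counts are illustrative; the real harness is the linked VthreadTest.
public class AbqSketch {
    public static void main(String[] args) throws Exception {
        final int producers = 4, consumers = 4, messagesPerProducer = 10_000;
        var queue = new ArrayBlockingQueue<Integer>(1024);
        var received = new LongAdder();

        long start = System.nanoTime();
        try (var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int p = 0; p < producers; p++)
                exec.submit(() -> {
                    for (int i = 0; i < messagesPerProducer; i++) queue.put(i);
                    return null;
                });
            for (int c = 0; c < consumers; c++)
                exec.submit(() -> {
                    // Each consumer drains its share of the total volume.
                    int share = producers * messagesPerProducer / consumers;
                    for (int i = 0; i < share; i++) {
                        queue.take();
                        received.increment();
                    }
                    return null;
                });
        } // close() waits for all producer/consumer tasks
        System.out.println("received=" + received.sum()
                + " ms=" + (System.nanoTime() - start) / 1_000_000);
    }
}
```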
>>
>> The above times show a lot of variance that is not fully accounted for, but the interesting point is how large a difference the RingBuffer makes between 1 and 2.
>>
>> Even in 2b, 13% of the time is taken up by the OS - I assume due to thread switching, as there is no IO in the test - which suggests the scheduling can probably be improved.
>>
>> I would expect a green-thread system to approach 0% idle and 0% system utilization in this type of test. I am “fairly certain” the code should be able to use all carrier threads at 100%. Maybe the system % is going to something else? (You can use the SpinTest - comment out the println - and see that a 100% CPU-bound “do nothing” test that allocates no objects still uses more than 25% system cpu - which seems odd.)
>>
>> Here is an async-profiler capture of 3b:
>>
>> [attached image: PastedGraphic-3.png - async-profiler capture of run 3b]
>>
>> Notice that the vast majority of the time is used in internal context switching.
>>
>> I can “fairly agree” with the project’s stated bias towards server systems with 1000s of threads (I do think 1000s of threads is enough vs. millions of threads), but I hope this can be addressed moving forward. I think the CSP (communicating sequential processes) model (close to the Actor model) simplifies a lot of concurrent programming concerns, but it requires highly efficient context switching and queues to work well.
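[The CSP style mentioned above - sequential stages communicating only through channels - can be sketched in Java with virtual threads and `SynchronousQueue` standing in for Go-style unbuffered channels. The stage names and values are made up for illustration:]

```java
import java.util.concurrent.Executors;
import java.util.concurrent.SynchronousQueue;

// Sketch of a CSP pipeline: each stage is a sequential process on its own
// virtual thread, communicating only over channels (SynchronousQueue acts
// as an unbuffered, Go-like channel). Stage names are illustrative.
public class CspPipeline {
    public static void main(String[] args) throws Exception {
        var toSquare = new SynchronousQueue<Integer>();
        var toPrint  = new SynchronousQueue<Integer>();

        try (var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            exec.submit(() -> {                    // stage 1: generate 1..3
                for (int i = 1; i <= 3; i++) toSquare.put(i);
                return null;
            });
            exec.submit(() -> {                    // stage 2: square each value
                for (int i = 0; i < 3; i++) {
                    int n = toSquare.take();
                    toPrint.put(n * n);
                }
                return null;
            });
            exec.submit(() -> {                    // stage 3: print results
                for (int i = 0; i < 3; i++)
                    System.out.println(toPrint.take());
                return null;
            });
        } // close() waits for all three stages
    }
}
```

Every `put`/`take` pair here is a rendezvous, i.e. a context switch - which is why the paragraph's point about switch efficiency matters so much for this style.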
>>
>