[External] : Re: Project Loom VirtualThreads hang
Rob Bygrave
robin.bygrave at gmail.com
Wed Jan 4 23:07:42 UTC 2023
*> An application employing 200 virtual threads will not perform
significantly better than one employing 200 platform threads*
It doesn't have to, but we do want it to not be materially worse. That is,
we don't want to end up in conversations that go like "When we are > 1000
VT we use Nima, JDBC, and just write simple blocking code, but when we are
under 200 VT we need to switch to PT and use WebFlux, R2DBC, Rx libraries,
write using callbacks, and deal with errors/exceptions differently, etc".
*> The main difference between virtual threads and platform threads is that
you can have many more virtual threads.*
In terms of developer impact, I see it more like "With VT, blocking is
cheap", which then affects the fundamental library choices developers make
(server libs and db drivers) and even the style of code. As developers we
ideally want these choices to work at all the levels of scale we care
about, at "competitive performance", without having to change our
fundamental choices (which imo come down to "Reactive" vs "Blocking").
For myself, I'm not looking for VT to be significantly better in this
sub-200-thread range, just competitive. Having a *"wide sweet spot"*
matters when choosing a library like Nima (which does not have a PT
option, unlike say Jetty).
FWIW: In my testing to date we see a material performance improvement with
VT over PT when measuring *jitter* once load exceeds the PT thread pool
size limit. There is some argument that VT provides "smoother scalability"
behavior in this range.
On Thu, 5 Jan 2023 at 07:40, Robert Engels <rengels at ix.netcom.com> wrote:
> The real system uses its own native buffers per connection so an OOM due
> to pass through isn’t possible.
>
> On Jan 4, 2023, at 11:31 AM, Francesco Nigro <nigro.fra at gmail.com> wrote:
>
>
> > With vthreads it is more efficient to remove the intermediate queues
> and have the producers “write through” into the consumers’ network
> connections.
>
> Are such producers using "blocking" connections? If so, beware the
> congested-network use case, both with and without batching:
>
>    - if they batch, they amortize syscalls, but if the network is
>    congested they will likely be descheduled by Loom while still holding
>    the allocated buffer reference (which cannot be used by others): with
>    many of them this can easily cause OOM (or worse!)
>    - if they don't batch and use small buffers, they will die due to
>    syscall costs/soft IRQs while underutilizing the (already congested)
>    network
>
> The first case, to get back on topic, is what concerns me a bit about
> using batching with Loom (and no, application-level flow control
> mechanisms won't help with congested networks :/)
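>
> To make that first case concrete, a minimal sketch of the hazard (pool
> and fillBatch are hypothetical):
>
>     import java.nio.ByteBuffer;
>     import java.nio.channels.SocketChannel;
>
>     void produce(SocketChannel ch) throws Exception {
>         ByteBuffer buf = pool.acquire(); // hypothetical buffer pool
>         fillBatch(buf);                  // batching amortizes syscalls
>         buf.flip();
>         while (buf.hasRemaining())
>             ch.write(buf);  // congested peer => the vthread parks here,
>                             // still holding buf; multiply by thousands
>                             // of producers and memory pins up to OOM
>         pool.release(buf);
>     }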
>
>
> On Wed, 4 Jan 2023 at 18:23, Robert Engels <rengels at ix.netcom.com>
> wrote:
>
>> To clarify, this is not a contrived test; it is a reproduction of a
>> real-world system. With vthreads it is more efficient to remove the
>> intermediate queues and have the producers “write through” into the
>> consumers’ network connections.
>>
>> This isn’t ideal, as a slow consumer can block everyone - intermediary
>> queues isolate that problem (to a degree - the queue can still back up).
>> The alternative is to set a write timeout and drop the consumer if it
>> triggers.
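>>
>> A sketch of that alternative (closeQuietly and WRITE_TIMEOUT_MS are
>> hypothetical; a blocking socket write has no native timeout, so the
>> watchdog unblocks a stalled writer by closing the socket):
>>
>>     import java.io.IOException;
>>     import java.net.Socket;
>>     import java.util.concurrent.Executors;
>>     import java.util.concurrent.ScheduledExecutorService;
>>     import java.util.concurrent.TimeUnit;
>>
>>     // One shared scheduler arms a deadline per write.
>>     ScheduledExecutorService watchdog =
>>         Executors.newSingleThreadScheduledExecutor();
>>
>>     void writeOrDrop(Socket s, byte[] msg) throws IOException {
>>         var deadline = watchdog.schedule(() -> closeQuietly(s),
>>             WRITE_TIMEOUT_MS, TimeUnit.MILLISECONDS);
>>         try {
>>             s.getOutputStream().write(msg); // parks if consumer is slow;
>>                                             // throws if the socket closes
>>         } finally {
>>             deadline.cancel(false);         // fast consumer: disarm
>>         }
>>     }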
>>
>> On Jan 4, 2023, at 10:07 AM, Ron Pressler <ron.pressler at oracle.com>
>> wrote:
>>
>> P.S.
>>
>> To explain my hint about benchmarking with many queues, let me say that
>> often what makes the scheduler work harder is not the context-switching
>> itself but finding a task (in this case, a runnable thread) to run. When
>> the amount of contention is very high compared to the total number of
>> threads, this may be hard and require expensive inter-core chatter. But in
>> more realistic workloads, the level of contention is significantly lower
>> than the total number of threads, so it’s easier for the scheduler to find
>> a thread to schedule. I.e. in common conditions with, say, 50,000 threads,
>> they will not all be contending for the same data structure, but small
>> groups of them may be contending over multiple data structures, and there
>> will be sufficiently many runnable threads to keep the scheduler from
>> working hard to find things to run on other cores’ queues.
>>
>> On 4 Jan 2023, at 15:27, Ron Pressler <ron.pressler at oracle.com> wrote:
>>
>> Thank you.
>>
>> To make your benchmark more interesting, I would suggest varying both the
>> number of producers and consumers as well as the number of queues they
>> contend over (e.g. 100,000 queues with 1 producer and 1 consumer each, 1000
>> queues with 100 producers and 100 consumers each etc.). This would also
>> give you a sense of the kinds of benchmarks we’re using.
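>>
>> A sketch of that benchmark shape (Q, P, C, and MSGS are the parameters
>> to vary, e.g. Q=100_000 with P=C=1, or Q=1_000 with P=C=100; keeping
>> P == C keeps total puts and takes balanced):
>>
>>     import java.util.concurrent.ArrayBlockingQueue;
>>     import java.util.concurrent.Executors;
>>
>>     // Q independent queues, each contended by P producers and C
>>     // consumers, all virtual threads; close() waits for completion.
>>     try (var exec = Executors.newVirtualThreadPerTaskExecutor()) {
>>         for (int q = 0; q < Q; q++) {
>>             var queue = new ArrayBlockingQueue<Integer>(1024);
>>             for (int p = 0; p < P; p++)
>>                 exec.submit(() -> {
>>                     try { for (int i = 0; i < MSGS; i++) queue.put(i); }
>>                     catch (InterruptedException e) { }
>>                 });
>>             for (int c = 0; c < C; c++)
>>                 exec.submit(() -> {
>>                     try { for (int i = 0; i < MSGS; i++) queue.take(); }
>>                     catch (InterruptedException e) { }
>>                 });
>>         }
>>     }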
>>
>> BTW, the impact on throughput that context switching has on message
>> passing systems is related to the ratio between the average context
>> switching duration and the average wait-time between messages, i.e.
>> context-switching needs to be efficient only to the point that it is
>> significantly smaller than the wait-time between messages. Once it’s small
>> enough *in comparison*, reducing it further has little impact (see
>> calculation here: https://inside.java/2020/08/07/loom-performance/).
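>>
>> To put illustrative numbers on that ratio: if the average wait between
>> messages is 10µs and a context switch costs 1µs, a message is turned
>> around every ~11µs; cutting the switch to 0.1µs only shrinks that to
>> ~10.1µs, i.e. under 10% more throughput. The same 0.9µs saving against
>> a 1µs wait-time would nearly double it.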
>>
>> — Ron
>>
>> On 4 Jan 2023, at 04:59, robert engels <rengels at ix.netcom.com> wrote:
>>
>> For some further data points, using VthreadTest here
>> <https://github.com/robaho/vthread_test> (essentially a message passing
>> system):
>>
>> With 32 producers, 32 consumers, and 500k messages on a 4/8-core machine:
>>
>> 1a. native threads w/ ABQ: 60-65% system cpu, 20% user cpu, 15-20% idle,
>> total time 189 seconds
>> 1b. vthreads w/ ABQ: 5-10% system cpu, 75% user cpu, 15% idle, total
>> time 63 seconds
>> 2a. native threads w/ RingBuffer, spin=1: 70% system cpu, 30% user cpu,
>> 0% idle, total time 174 seconds
>> 2b. vthreads w/ RingBuffer, spin=1: 13% system cpu, 85% user cpu, 2%
>> idle, total time 37 seconds
>> 3a. native threads w/ RingBuffer, spin=32: 68% system cpu, 30% user cpu,
>> 2% idle, total time 164 seconds
>> 3b. vthreads w/ RingBuffer, spin=32: 13% system cpu, 85% user cpu, 3%
>> idle, total time 40 seconds
>>
>> (ABQ is the stdlib ArrayBlockingQueue)
>>
>> The above times have a lot of variance that is not fully accounted for,
>> but the interesting thing is how huge a difference the RingBuffer makes
>> between cases 1 and 2.
>>
>> Even in 2b, there is 13% taken up by the OS - I assume due to thread
>> switching as there is no IO in the test, which means the scheduling can
>> probably be improved.
>>
>> I would expect a green thread system to approach 0% idle and 0% system
>> utilization in this type of test. I am “fairly certain” the code should
>> be able to use all carrier threads 100%. Maybe the system % is going to
>> something else? (You can use the SpinTest - comment out the println -
>> and see that a 100% cpu-bound “do nothing” test that allocates no
>> objects still uses more than 25% system cpu - which seems odd.)
>>
>> Here is an async-profiler capture of 3b:
>>
>> <PastedGraphic-3.png>
>>
>> Notice that the vast majority of the time is used in internal context
>> switching.
>>
>> I can “fairly agree” with the project’s stated bias towards server
>> systems with 1000s of threads (I do think 1000s of threads is enough
>> vs. millions of threads), but I hope this can be addressed moving
>> forward. I think the CSP (communicating sequential processes) model
>> (close to the Actor model) simplifies a lot of concurrent programming
>> concerns, but it requires highly efficient context switching and queues
>> to work well.
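>>
>> For reference, a minimal sketch of the CSP shape I mean, using only the
>> stdlib (SynchronousQueue as an unbuffered channel):
>>
>>     import java.util.concurrent.SynchronousQueue;
>>
>>     public class CspDemo {
>>         public static void main(String[] args) throws InterruptedException {
>>             // Two sequential processes rendezvous on an unbuffered
>>             // channel; every put/take pair implies context switches,
>>             // which is why switch cost dominates this style.
>>             var channel = new SynchronousQueue<String>();
>>             Thread.startVirtualThread(() -> {
>>                 try { channel.put("msg"); }
>>                 catch (InterruptedException e) { }
>>             });
>>             var consumer = Thread.startVirtualThread(() -> {
>>                 try { System.out.println(channel.take()); }
>>                 catch (InterruptedException e) { }
>>             });
>>             consumer.join(); // wait for the hand-off to complete
>>         }
>>     }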
>>
>>
>>
>>