<div dir="ltr"><div></div><div><i>> An application employing 200 virtual threads will 

not perform significantly better than one employing 200 platform threads <br></i></div><div><br></div><div>It doesn't have to, but we do want it to be not materially worse. That is, we don't want to end up in conversations that go like "When we have more than 1000 VTs we use Nima and JDBC and just write simple blocking code, but when we have fewer than 200 VTs we need to switch to PTs, use WebFlux, R2DBC, and Rx libraries, write code using callbacks, and handle errors/exceptions differently, etc."</div><div><br></div><div><br></div><div><div><i>> The main difference between virtual threads and platform threads is that

 you can have many more virtual threads.</i></div><div><br></div><div>In terms of developer impact, I see it more like "With VT, blocking is cheap", which then affects the fundamental library choices that developers make (server libs and db drivers) and even the style of code. As developers, we ideally want these choices to work at all the desired levels of scale with "competitive performance", without having to change our fundamental choices (which IMO come down roughly to "Reactive" vs "Blocking"). <br></div></div><br><div>For myself, I'm not looking for VT to be significantly better in this sub-200-thread range, just competitive. Having a <i>"wide sweet spot"</i> affects the choice of a library like Nima (which does not have a PT option, unlike, say, Jetty).<br></div><div><br></div><div>FWIW: In my testing to date we can see a material performance improvement from VT over PT when measuring <i>jitter</i> once load goes over a PT thread pool size resource limit. There is some argument that VT can provide "smoother scalability" behavior in this range. <br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 5 Jan 2023 at 07:40, Robert Engels <<a href="mailto:rengels@ix.netcom.com">rengels@ix.netcom.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto"><div dir="ltr"></div><div dir="ltr">The real system uses its own native buffers per connection, so an OOM due to pass-through isn’t possible. </div><div dir="ltr"><br><blockquote type="cite">On Jan 4, 2023, at 11:31 AM, Francesco Nigro <<a href="mailto:nigro.fra@gmail.com" target="_blank">nigro.fra@gmail.com</a>> wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr"><div dir="ltr">> <span style="color:rgb(0,0,0)">With vthreads it is more efficient to remove the intermediate queues and have the producers “write through” into the consumers’ network connection. 
</span><div><span style="color:rgb(0,0,0)"><br></span></div><div><font color="#000000">Are such producers using "blocking" connections? If so, beware of two cases on congested networks, with and without batching:</font></div><div><ul><li><font color="#000000">if they batch, they will amortize syscalls but, if the network is congested, they will likely be descheduled by Loom while still holding the allocated buffer reference (which cannot be used by others): with many of them this can easily cause OOM (or worse!)</font></li><li><font color="#000000">if they are not batching and are using small buffers, they will die from syscall costs/soft IRQs while underutilizing the (already congested) network</font></li></ul><div><font color="#000000">The first case, to get back on topic, is what concerns me a bit about using batching with Loom (and no, application-level flow control mechanisms won't help with congested networks :/)</font></div></div><div><font color="#000000"><br></font></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, 4 Jan 2023 at 18:23, Robert Engels <<a href="mailto:rengels@ix.netcom.com" target="_blank">rengels@ix.netcom.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto"><div dir="ltr"></div><div dir="ltr">To clarify, this is not a contrived test. This is a reproduction of a real-world system. With vthreads it is more efficient to remove the intermediate queues and have the producers “write through” into the consumers’ network connection. </div><div dir="ltr"><br></div><div dir="ltr">This isn’t ideal, as a slow consumer can block everyone; intermediary queues isolate that problem (to a degree, since the queue can still back up). The alternative is to set a write timeout and drop the consumer if it triggers. 
</div><div dir="ltr"><br><blockquote type="cite">On Jan 4, 2023, at 10:07 AM, Ron Pressler <<a href="mailto:ron.pressler@oracle.com" target="_blank">ron.pressler@oracle.com</a>> wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr">

P.S.

<div><br>

</div>

<div>To explain my hint about benchmarking with many queues, let me say that often what makes the scheduler work harder is not the context-switching itself but finding a task (in this case, a runnable thread) to run. When the amount of contention is

 very high compared to the total number of threads, this may be hard and require expensive inter-core chatter. But in more realistic workloads, the level of contention is significantly lower than the total number of threads, so it’s easier for the scheduler

 to find a thread to schedule. I.e. in common conditions with, say, 50,000 threads, they will not all be contending for the same data structure, but small groups of them may be contending over multiple data structures, and there will be sufficiently many runnable

 threads to keep the scheduler from working hard to find things to run on other cores’ queues.<br>

<div><br>

<blockquote type="cite">

<div>On 4 Jan 2023, at 15:27, Ron Pressler <<a href="mailto:ron.pressler@oracle.com" target="_blank">ron.pressler@oracle.com</a>> wrote:</div>

<br>

<div>

<div>

Thank you. 

<div><br>

</div>

<div>To make your benchmark more interesting, I would suggest varying both the number of producers and consumers as well as the number of queues they contend over (e.g. 100,000 queues with 1 producer and 1 consumer each, 1000 queues with 100 producers

 and 100 consumers each etc.). This would also give you a sense of the kinds of benchmarks we’re using.
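<div>A minimal sketch of such a parameter sweep, for illustration only. The class name <code>QueueSweep</code>, the queue capacity of 64, and the message counts are assumptions, not part of the benchmark discussed in this thread. It uses a cached platform-thread pool so it runs on any recent JDK; on JDK 21+ the supplier could be swapped for <code>Executors::newVirtualThreadPerTaskExecutor</code> to compare the two.</div>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Hypothetical harness: varies queues, producers and consumers independently.
final class QueueSweep {

    /** Runs one configuration and returns the total number of messages consumed. */
    static long run(int queues, int producersPerQueue, int consumersPerQueue,
                    int messagesPerProducer, Supplier<ExecutorService> executors) throws Exception {
        int perQueue = producersPerQueue * messagesPerProducer;
        if (perQueue % consumersPerQueue != 0) {
            throw new IllegalArgumentException("messages per queue must divide evenly among consumers");
        }
        int share = perQueue / consumersPerQueue; // each consumer takes a fixed share
        ExecutorService pool = executors.get();
        try {
            List<Future<Long>> consumers = new ArrayList<>();
            for (int q = 0; q < queues; q++) {
                BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(64); // assumed capacity
                for (int c = 0; c < consumersPerQueue; c++) {
                    consumers.add(pool.submit(() -> {
                        long taken = 0;
                        for (int i = 0; i < share; i++) {
                            queue.take();
                            taken++;
                        }
                        return taken;
                    }));
                }
                for (int p = 0; p < producersPerQueue; p++) {
                    pool.submit(() -> {
                        for (int i = 0; i < messagesPerProducer; i++) {
                            queue.put(i);
                        }
                        return null;
                    });
                }
            }
            long total = 0;
            for (Future<Long> f : consumers) {
                total += f.get(); // wait for every consumer to finish
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Same total thread count in each row, different amounts of contention per queue.
        int[][] configs = { {200, 1, 1}, {20, 10, 10}, {2, 100, 100} };
        for (int[] cfg : configs) {
            long start = System.nanoTime();
            long consumed = run(cfg[0], cfg[1], cfg[2], 1_000, Executors::newCachedThreadPool);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("queues=%d producers=%d consumers=%d messages=%d time=%dms%n",
                    cfg[0], cfg[1], cfg[2], consumed, millis);
        }
    }
}
```

<div>Keeping the total number of threads constant while changing the number of queues separates the cost of contention on a single data structure from the cost of simply having many threads. Timings from a sketch like this are rough; a serious comparison would need warm-up and repeated runs.</div>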

<div><br>

</div>

<div>BTW, the impact on throughput that context switching has on message passing systems is related to the ratio between the average context switching duration and the average wait-time between messages, i.e. context-switching needs to be efficient

 only to the point that it is significantly smaller than the wait-time between messages. Once it’s small enough *in comparison*, reducing it further has little impact (see calculation here:

<a href="https://inside.java/2020/08/07/loom-performance/" target="_blank">https://inside.java/2020/08/07/loom-performance/</a>).</div>
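<div>To illustrate the ratio argument with numbers, here is a toy model. It is an assumption made for this example, not the calculation from the linked article: each message is taken to cost one wait plus one context switch, so throughput relative to a zero-cost switch is wait / (wait + switch). The class name <code>SwitchCost</code> and the 100 µs wait time are hypothetical.</div>

```java
// Toy model (assumed): per-message time = wait time + context-switch time.
final class SwitchCost {

    /** Fraction of the ideal (zero-cost switch) throughput that is achieved. */
    static double relativeThroughput(double waitMicros, double switchMicros) {
        return waitMicros / (waitMicros + switchMicros);
    }

    public static void main(String[] args) {
        double waitMicros = 100.0; // hypothetical average wait between messages
        for (double switchMicros : new double[] {10.0, 1.0, 0.1}) {
            System.out.printf("switch=%.1fus -> %.1f%% of ideal throughput%n",
                    switchMicros, 100.0 * relativeThroughput(waitMicros, switchMicros));
        }
    }
}
```

<div>Under this model, cutting the switch cost from 10 µs to 1 µs moves throughput from about 91% to about 99% of ideal, while a further tenfold cut to 0.1 µs gains less than one percentage point.</div>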

<div><br>

</div>

<div>— Ron<br>

<div><br>

<blockquote type="cite">

<div>On 4 Jan 2023, at 04:59, robert engels <<a href="mailto:rengels@ix.netcom.com" target="_blank">rengels@ix.netcom.com</a>> wrote:</div>

<br>

<div>

<div>

<div dir="auto">

<div>

Some further data points, using VthreadTest <a href="https://urldefense.com/v3/__https://github.com/robaho/vthread_test__;!!ACWV5N9M2RV99hQ!LFEPXV7dMYdKXhq8nPYKdugapqZ6YjxQ4-sIlUyqDBSyiNDaUEdmYFml7BXnDnv5A_d0yPXv-AoIPZanNQ$" target="_blank">here</a> (essentially a message-passing system):

<div><br>

</div>

<div>With 32 producers, 32 consumers, 500k messages on a 4/8 core machine:</div>

<div><br>

</div>

<div>1a. native threads w ABQ: 60-65% system cpu, 20% user cpu, 15-20% idle, total time 189 seconds</div>

<div>1b. vthreads w ABQ: 5-10% system cpu, 75% user cpu, 15% idle, total time 63 seconds</div>

<div>2a. native threads w RingBuffer, spin=1: 70% system cpu, 30% user cpu, 0% idle, total time 174 seconds</div>

<div>2b. vthreads w RingBuffer, spin=1: 13% system cpu, 85% user cpu, 2% idle, total time 37 seconds</div>

<div>

<div>3a. native threads w RingBuffer, spin=32: 68% system cpu, 30% user cpu, 2% idle, total time 164 seconds</div>

<div>3b. vthreads w RingBuffer, spin=32: 13% system cpu, 85% user cpu, 3% idle, total time 40 seconds</div>

</div>

<div><br>

</div>

<div>(ABQ is stdlib ArrayBlockingQueue)</div>

<div><br>

</div>

<div>The above times have a lot of variance, which is not fully accounted for, but the interesting thing is that the RingBuffer makes such a huge difference between 1 and 2.</div>

<div><br>

</div>

<div>Even in 2b, 13% of the CPU is taken up by the OS. I assume this is due to thread switching, as there is no I/O in the test, which means the scheduling can probably be improved.</div>

<div><br>

</div>

<div>I would expect a green thread system to approach 0% idle and 0% system utilization in this type of test. I am “fairly certain” the code should be able to use all carrier threads 100%. Maybe the system % is going to something else? (You can use the SpinTest, with the println commented out, to see that a 100% CPU-bound “do nothing” test that allocates no objects still uses more than 25% system CPU, which seems odd.)</div>

<div><br>

</div>

<div>Here is an async-profiler capture of 3b:</div>

<div><br>

</div>

<div><span id="m_5762623515515026242m_-7339871472142555340cid:CE85336B-FD0B-40A9-9616-4944823FCAAC">[Image: PastedGraphic-3.png]</span></div>

<div><br>

</div>

<div>Notice that the vast majority of the time is used in internal context switching.</div>

<div><br>

</div>

<div>I can “fairly agree” with the project’s stated bias towards server systems with thousands of threads (I do think thousands of threads is enough, vs. millions of threads), but I hope this can be addressed moving forward. I think the CSP (communicating sequential processes) model (close to the Actor model) simplifies a lot of concurrent programming concerns, but it requires highly efficient context switching and queues to work well.</div>

<div><br>

</div>

</div>

</div>

</div>

</div>

</blockquote>

</div>

<br>

</div>

</div>

</div>

</div>

</blockquote>

</div>

<br>

</div>

</div></blockquote></div></blockquote></div>

</div></blockquote></div></blockquote></div>