Re: Project Loom VirtualThreads hang

Ron Pressler ron.pressler at oracle.com
Fri Jan 6 15:50:00 UTC 2023



On 6 Jan 2023, at 14:39, Arnaud Masson <arnaud.masson at fr.ibm.com> wrote:

Ron,

My concern is that Loom (in its current form) greatly improves scalability for IO-bound requests but introduces a new fairness problem for CPU-bound requests.


Why are you concerned about a problem no one has reported yet? As I said, we are very interested in fixing problems, but we don’t know how to fix non-problems. If someone shows us a problem — a server that misbehaves under some loads — we’ll try to address it. Even if there is a problem with scheduling, and even if time-sharing could solve it, we still don’t know what kind of time-sharing algorithm to employ until we see the actual problem, so there’s nothing we can do about it until we know what it actually is.

It’s not that I’m resistant to adding time-sharing. Far from it. But until we have a problem-reproducer that we can test and mark as fixed, we really can’t tell whether the time-sharing we might introduce today is the time-sharing that would solve the problem we have yet to see.
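
To make that concrete, here is a minimal sketch of the kind of reproducer I mean: a few virtual threads that spin on the CPU without ever blocking, started alongside one IO-bound task whose latency we measure. The class name, the 50 ms “IO” sleep and the 2 s spin duration are invented for illustration, and whether the latency actually degrades depends on the machine and the JDK build; run with --enable-preview on the preview releases.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class FairnessSketch {
        public static void main(String[] args) throws Exception {
            try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
                // An "IO-bound request": unmounts while sleeping, then needs a free carrier to finish.
                Instant start = Instant.now();
                Future<?> io = exec.submit(() -> {
                    try {
                        Thread.sleep(50);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    System.out.println("IO-bound latency: "
                            + Duration.between(start, Instant.now()).toMillis() + " ms");
                });

                // "CPU-bound requests": spin for ~2 s each without blocking or yielding,
                // enough of them to keep every carrier thread busy.
                int hogs = Runtime.getRuntime().availableProcessors() * 2;
                for (int i = 0; i < hogs; i++) {
                    exec.submit(() -> {
                        long deadline = System.nanoTime() + 2_000_000_000L;
                        long x = 0;
                        while (System.nanoTime() < deadline) {
                            x = 31 * x + 1;
                        }
                        return x; // keep the result so the loop isn't trivially dead code
                    });
                }

                io.get(); // without time-sharing this can print far more than 50 ms
            }
        }
    }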

Of course, on web apps most requests are generally IO-bound (http, jdbc), but I have seen CPU-bound requests in prod (sometimes accidental).

Have you encountered a problem with virtual threads in servers with occasional CPU-bound requests?


I don’t use Loom in prod today (I guess not a lot of people do, since it’s still a preview), so if you are asking whether I see a production issue, the answer is no.

Have you seen a problem with a server misbehaving in testing, then?

You need to understand that I’m not teasing you. I actually want people to report problems so that we can make the product better by fixing them. But merely saying that there could be a problem and that some sketch of a solution could address it doesn’t give us anything actionable that would help us improve the product.


I suppose if I want to migrate to Loom and be safe, I can increase the number of native carrier threads in the underlying pool (N >> core count).
It’s just that if there were time-sharing in Loom, I don’t see why vthreads would not be used systematically (almost blindly, for both CPU-bound and IO-bound requests).
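
In the current preview builds I suppose that would mean raising the default scheduler’s carrier pool through the unsupported (and subject to change) system properties, something like this at launch; the value 64 and the MyServer class are just placeholders:

    java --enable-preview \
         -Djdk.virtualThreadScheduler.parallelism=64 \
         -Djdk.virtualThreadScheduler.maxPoolSize=64 \
         MyServer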

But servers based on thread pools have *worse* fairness than virtual threads: they don’t share as well even when not at 100% CPU. I don’t understand why you think more workers would make you safer when you’re not sure whether or not time-sharing helps server workloads at all.


I’m curious, when do you think preemptive time sharing (as implemented in various OSes for decades) is useful?

Time-sharing in *non*-realtime kernels is crucially useful to keep a system responsive to operator interaction in the presence of a few background tasks that can saturate the CPU; without it, the operator isn’t even able to terminate resource-hungry processes (as happened in early Windows versions). But transaction processing in servers is different. You have tens of thousands of tasks; if they consume a significant amount of CPU, then you’re overcommitted by orders of magnitude (indeed, OS time-sharing is not able to make servers well-behaved at 100% CPU), yet responsiveness to operator intervention is still preserved thanks to the OS’s time-sharing.

— Ron