Re: Project Loom VirtualThreads hang

Arnaud Masson arnaud.masson at fr.ibm.com
Fri Jan 6 16:56:10 UTC 2023


My scenario is not one with tens of thousands of CPU-bound tasks.
The scenario I’m talking about is when you have thousands of IO-bound tasks and just enough CPU-bound tasks (in the same scheduler) to pin most of the carrier threads for some time.

When I talk about a CPU-bound task here, I mean the worst case, where there is not even a yield() or an occasional small IO call, so, if I understand correctly, the carrier thread cannot switch to another virtual thread until the task has fully completed.

Example:
A webapp with a Loom scheduler with 4 native threads.
The server continuously receives, on average, 100 concurrent IO-bound requests: ok
Then it gets a one-shot group of 4 CPU-bound requests, pure CPU (no yield) work taking 10 secs each.
Won’t the 4 carrier threads stick to the 4 CPU-bound requests, preventing progress on the other IO-bound requests for 10 secs, and only then resume work on the IO-bound requests?

(So not very different from your time-sharing example: time-sharing on the minority of CPU-bound tasks would maintain responsiveness for the majority of IO-bound ones.)
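
A minimal sketch of the scenario above, scaled down so it runs in a couple of seconds, might look like the following. The class and method names (CarrierStarvationSketch, spin, ioLatenciesMillis) are hypothetical, Thread.sleep(10) is only a stand-in for a real IO call, and the sketch assumes a JDK where virtual threads are available without preview flags (21 or later). The number of carriers is whatever the default scheduler uses; to match the "4 native threads" in the example, the JVM can be started with -Djdk.virtualThreadScheduler.parallelism=4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CarrierStarvationSketch {

    // Pure CPU work: no blocking call and no yield, so the virtual thread
    // keeps its carrier until the loop ends.
    static void spin(long millis) {
        long deadline = System.nanoTime() + millis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            Thread.onSpinWait();
        }
    }

    // Submits cpuTasks spinning tasks, then ioTasks short sleeping tasks, and
    // returns the submit-to-completion latency (in ms) of each sleeping task.
    static List<Long> ioLatenciesMillis(int cpuTasks, long cpuMillis, int ioTasks) {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < cpuTasks; i++) {
                executor.submit(() -> spin(cpuMillis));
            }
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i < ioTasks; i++) {
                long submitted = System.nanoTime();
                Future<Long> f = executor.submit(() -> {
                    Thread.sleep(10); // stand-in for a short blocking IO call
                    return (System.nanoTime() - submitted) / 1_000_000L;
                });
                futures.add(f);
            }
            List<Long> latencies = new ArrayList<>();
            for (Future<Long> f : futures) {
                latencies.add(f.get());
            }
            return latencies;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // One spinning task per available processor, which matches the default
        // carrier count only if the scheduler has not been reconfigured.
        int cpuTasks = Runtime.getRuntime().availableProcessors();
        List<Long> latencies = ioLatenciesMillis(cpuTasks, 2_000, 100);
        long worst = latencies.stream().mapToLong(Long::longValue).max().orElse(0);
        System.out.println("worst IO-style latency (ms): " + worst);
    }
}
```

If the concern in the example holds, the worst latency reported for the 10 ms tasks should be close to the spin duration rather than to 10 ms; the number is worth measuring rather than assuming, since it depends on the carrier count and on how the scheduler orders the submitted tasks.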

Thanks
Arnaud


Why are you concerned about a problem no one has reported yet? As I said, we are very interested in fixing problems, but we don’t know how to fix non-problems. If someone shows us a problem — a server that misbehaves under some loads — we’ll try to address it. Even if there is a problem with scheduling, and even if time-sharing could solve it, we still don’t know what kind of time-sharing algorithm to employ until we see the actual problem, so there’s nothing we can do about it until we know what it actually is.

It’s not that I’m resistant to adding time-sharing. Far from it. But until we have a problem-reproducer that we can test and mark as fixed, we really can’t tell if the time-sharing that we were to introduce today is the time-sharing that would solve the problem we are yet to see.


Of course, in web apps most requests are generally IO-bound (http, jdbc), but I have seen CPU-bound requests in prod (sometimes accidental).

Have you encountered a problem with virtual threads in servers with occasional CPU-bound requests?



I don’t use Loom in prod today (I guess not many people do, since it’s still in preview), so if you are asking whether I see a production issue, the answer is no.

Have you seen a problem with a server misbehaving in testing, then?

You need to understand that I’m not teasing you. I actually want people to report problems so that we could make the product better by fixing them. But merely saying that there could be a problem and that some sketch of a solution could address it doesn’t give us anything actionable that would help us improve the product.



I suppose if I want to migrate to Loom and be safe, I can increase the number of native carriers in the underlying pool (N >> core count).
It’s just that if there were time-sharing in Loom, I don’t see why vthreads would not be used systematically (almost blindly, for both CPU-bound and IO-bound work).
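
As a rough illustration of the "more carriers" workaround: the default virtual thread scheduler reads the system property jdk.virtualThreadScheduler.parallelism at JVM startup, so N is normally chosen on the command line rather than in code. The sketch below only reports what was requested; the class name CarrierCountSketch and the helper configuredParallelism are hypothetical, and the fallback to the processor count when the property is unset is an assumption about the default, not something the helper reads back from the scheduler.

```java
public class CarrierCountSketch {

    // Assumption: when the property is unset, the default scheduler uses one
    // carrier per available processor.
    static int configuredParallelism(String propertyValue, int availableProcessors) {
        if (propertyValue == null || propertyValue.isBlank()) {
            return availableProcessors;
        }
        return Integer.parseInt(propertyValue.trim());
    }

    public static void main(String[] args) {
        // Launch with, for example:
        //   java -Djdk.virtualThreadScheduler.parallelism=64 CarrierCountSketch.java
        String value = System.getProperty("jdk.virtualThreadScheduler.parallelism");
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("cores = " + cores
                + ", requested carriers = " + configuredParallelism(value, cores));
    }
}
```

With N well above the core count, a handful of spinning virtual threads no longer occupies every carrier, and the OS scheduler time-shares the carrier threads among the cores.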

But servers based on thread pools have *worse* fairness than virtual threads: they don’t share as well even when not at 100% CPU. I don’t understand why you think more workers would make you safer when you’re not sure whether or not time-sharing helps server workloads at all.



I’m curious, when do you think preemptive time sharing (as implemented in various OSes for decades) is useful?

Time-sharing in *non*-realtime kernels is crucial for keeping a system responsive to operator interaction in the presence of a few background tasks that can saturate the CPU; without it, the operator isn’t even able to terminate resource-hungry processes (as would happen in early Windows versions). But transaction processing in servers is different. You have tens of thousands of tasks; if they consume a significant amount of CPU, then you’re overcommitted by orders of magnitude (indeed, OS time-sharing is not able to make servers well-behaved at 100% CPU), but responsiveness to operator intervention is still preserved thanks to the OS time-sharing.

— Ron
