: Re: Project Loom VirtualThreads hang
Arnaud Masson
arnaud.masson at fr.ibm.com
Fri Jan 6 19:11:55 UTC 2023
I don’t think having 100% CPU usage on a pod is enough to justify a “stop-the-world” effect on Loom scheduling for the other tasks.
Also, 100% is the extreme case; there could be 75% CPU usage, meaning only 1 carrier is left for all the other tasks in my example.
Again, not a blocker I guess; you just have to increase the carrier count to mitigate, but that's good old native-thread sizing again, where it should not really be needed.
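For reference, the default scheduler's carrier count can be raised with the JDK system properties, e.g. (the jar name is just a placeholder):

    java -Djdk.virtualThreadScheduler.parallelism=16 \
         -Djdk.virtualThreadScheduler.maxPoolSize=16 \
         -jar myserver.jar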
“Time-sharing would make those expensive tasks complete in a lot more than 10 seconds”:
I understand there would be switching overhead (so it's slower), but I don't understand why it would be much slower if there are only a few of them, as in my example.
thanks
Arnaud
On 6 Jan 2023, at 16:56, Arnaud Masson <arnaud.masson at fr.ibm.com> wrote:
My scenario is not when you have tens of thousands of CPU-bound tasks.
The scenario I'm talking about is when you have thousands of IO-bound tasks and just enough CPU-bound tasks (in the same scheduler) to pin most of the carrier threads for some time.
That scenario still doesn't amount to a problem, let alone one we can investigate, solve, and know we've solved. There are many reasons why this hypothetical scenario is unlikely to arise in practice, and/or why, if it does, time-sharing won't be able to help.
When I talk about a CPU-bound task here, I mean the worst case where there is not even a yield() or an occasional small IO, so (if I understand correctly) the carrier thread cannot switch to another task until the CPU-bound task has fully completed.
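To make that concrete, this is the kind of loop I mean (illustrative only):

    // Pure CPU work with no scheduling point: the virtual thread stays
    // mounted on its carrier until the loop completes.
    static long spin() {
        long x = 0;
        for (long i = 0; i < 5_000_000_000L; i++) {
            x += i;
        }
        return x;
    }

Adding an occasional Thread.yield() (or any small blocking call) inside the loop would give the scheduler a chance to run other virtual threads on that carrier.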
Example:
A webapp with a Loom scheduler with 4 native threads.
The server continuously receives, on average, 100 concurrent IO-bound requests: ok.
Then it gets a one-shot group of 4 CPU-bound requests: pure CPU (no yield) work taking 10 secs each.
Won't the 4 carrier threads stick to the 4 CPU-bound requests, preventing progress on the other IO-bound requests for 10 secs, and only then resume work on the IO-bound requests?
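A minimal sketch of this scenario (names and numbers are illustrative; assumes the default scheduler, run with -Djdk.virtualThreadScheduler.parallelism=4):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PinDemo {

        public static void main(String[] args) {
            try (ExecutorService vts = Executors.newVirtualThreadPerTaskExecutor()) {
                // ~100 concurrent IO-bound requests: mostly blocked, each prints when done.
                for (int i = 0; i < 100; i++) {
                    int id = i;
                    vts.submit(() -> ioBoundRequest(id));
                }
                // One-shot group of 4 pure CPU-bound requests (~10 s each, no yield, no IO).
                // With parallelism=4 they occupy all 4 carriers, so the IO-bound
                // requests above stop making progress until these finish.
                for (int i = 0; i < 4; i++) {
                    vts.submit(PinDemo::cpuBoundRequest);
                }
            } // close() waits for all submitted tasks
        }

        static void ioBoundRequest(int id) {
            try {
                for (int i = 0; i < 20; i++) {
                    Thread.sleep(Duration.ofMillis(100)); // simulated IO; unmounts the virtual thread
                }
                System.out.println(Instant.now() + "  io request " + id + " done");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        static void cpuBoundRequest() {
            long deadline = System.nanoTime() + 10_000_000_000L; // spin for ~10 s
            long x = 0;
            while (System.nanoTime() < deadline) {
                x += 1; // busy work, no scheduling point
            }
            System.out.println(Instant.now() + "  cpu request done (" + x + ")");
        }
    }

With that setting, the 4 busy loops hold all 4 carriers for about 10 seconds, during which none of the sleeping requests manages to print its completion line.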
The issue here is how such a server is expected to work in the first place. Because of all the short tasks, time-sharing would make those expensive tasks complete in a lot more than 10 seconds, and how is the careful balance between the task types maintained to ensure such a server could work? For time-sharing to improve matters, you'd need a number of big CPU-bound tasks high enough to saturate the CPU, yet low enough that those tasks don't end up overwhelming the server.
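For illustration, with assumed numbers: the 4 CPU-bound requests need 4 x 10 = 40 CPU-seconds in total; if the steady stream of short requests needed, say, half of the 4 cores on average, fair time-sharing would leave roughly 2 cores' worth of CPU for the big requests, so they would finish after about 40 / 2 = 20 seconds of wall-clock time rather than 10, and the more CPU the short requests demand, the longer the big ones take.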
A server of the kind you're describing would occasionally but consistently reach 100% CPU for at least 10 seconds even if platform threads were used. So, if it turns out that virtual threads pose a problem for servers that regularly, but only occasionally, reach 100% CPU for 10 seconds and yet behave well under some other scheduling regime, we would certainly address that problem.
BTW, a large company has reported an actual problem that time-sharing could solve, but only with custom schedulers. Their problem is batch tasks that can consume CPU for many minutes at a time; they want to pause them and re-run them during evening downtimes, because they want to spread out their batch jobs and offer better latency, but we're talking very high latencies here (they don't have a good solution today, but are hoping that virtual threads with custom schedulers could help). That's an actual problem that has been observed and that a certain kind of time-sharing could address, but it's not the kind of time-sharing you have in mind.
Reporting a problem is very helpful to us, so when you find one, please report it here!
— Ron