Re: Project Loom VirtualThreads hang

Ron Pressler ron.pressler at oracle.com
Fri Jan 6 18:32:50 UTC 2023



On 6 Jan 2023, at 16:56, Arnaud Masson <arnaud.masson at fr.ibm.com> wrote:

My scenario is not one where you have tens of thousands of CPU-bound tasks.
The scenario I’m talking about is when you have thousands of IO-bound tasks and just enough CPU-bound tasks (in the same scheduler) to pin most of the carrier threads for some time.


That scenario still doesn’t amount to a problem, let alone one we can investigate, solve, and know we’ve solved. There are many reasons why this hypothetical scenario is unlikely to arise in practice, and why, even if it does, time-sharing won’t be able to help.

When I talk about a CPU-bound task here, I mean the worst case, where there is not even a yield() or an occasional small IO, so the carrier thread cannot be released until the task fully completes, if I understand correctly.

Example:
A webapp with a Loom scheduler with 4 native threads.
The server continuously handles, on average, 100 concurrent IO-bound requests: ok.
Then it gets a one-shot group of 4 CPU-bound requests, pure CPU (no yield), taking 10 seconds each.
Won’t the 4 carrier threads stick to the 4 CPU-bound requests, preventing progress on the other IO-bound requests for 10 seconds, and only then resume work on the IO-bound requests?
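
A minimal sketch of that scenario (illustrative only; the class name and timings are made up, IO is simulated with Thread.sleep, and it assumes the default virtual-thread scheduler is capped at 4 carriers, e.g. by running with -Djdk.virtualThreadScheduler.parallelism=4, an internal JDK property):

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;

public class CarrierPinningSketch {
    public static void main(String[] args) throws Exception {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {

            // ~100 IO-bound "requests": each blocks for 50 ms at a time and
            // reports any iteration that took noticeably longer to resume.
            for (int i = 0; i < 100; i++) {
                int id = i;
                executor.submit(() -> {
                    Instant end = Instant.now().plusSeconds(15);
                    while (Instant.now().isBefore(end)) {
                        Instant t0 = Instant.now();
                        Thread.sleep(Duration.ofMillis(50));   // simulated IO wait
                        long elapsed = Duration.between(t0, Instant.now()).toMillis();
                        if (elapsed > 500) {
                            System.out.printf("io-%d delayed: %d ms%n", id, elapsed);
                        }
                    }
                    return null;
                });
            }

            Thread.sleep(Duration.ofSeconds(2));               // let the IO tasks settle

            // 4 CPU-bound "requests": ~10 s of computation with no yield and no IO,
            // so each keeps its carrier thread mounted until it finishes.
            for (int i = 0; i < 4; i++) {
                executor.submit(() -> {
                    long deadline = System.nanoTime() + 10_000_000_000L;
                    long x = 0;
                    while (System.nanoTime() < deadline) {
                        x ^= x + 1;                            // busy work, never unmounts
                    }
                    return x;
                });
            }
        } // executor.close() waits for all tasks to finish
    }
}

While the 4 busy loops are mounted, IO tasks whose sleeps have completed cannot be rescheduled, so their iteration times spike until the CPU-bound tasks finish.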

The issue here is how such a server is expected to work in the first place. Because of all the short tasks, time-sharing would make those expensive tasks take far longer than 10 seconds to complete, so how is the careful balance between the task types maintained to ensure such a server could work at all? For time-sharing to improve matters, the number of big CPU-bound tasks would need to be just high enough to saturate the CPU, yet low enough that those tasks don’t end up overwhelming the server.

A server of the kind you’re describing would occasionally but consistently reach 100% CPU for at least 10 seconds even if platform threads were used. So, if it turns out that virtual threads pose a problem for servers that regularly but only occasionally reach 100% CPU for 10 seconds, and yet behave well under some other scheduling regime, we would certainly address that problem.
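
As rough arithmetic: 4 tasks x 10 seconds of pure computation is 40 core-seconds of work, and on a 4-core machine no scheduler, preemptive or not, can finish it in less than about 10 seconds of fully saturated CPU. Time-sharing only changes which tasks wait during that window, not how long the window lasts.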

BTW, a large company has reported an actual problem that time-sharing could solve, but only with custom schedulers. Their problem is batch tasks that can consume CPU for many minutes at a time; they want to pause those tasks and re-run them during evening downtimes, because they want to spread out their batch jobs and offer better latency, though we’re talking about very high latencies here (they don’t have a good solution today, but are hoping that virtual threads with custom schedulers could help). That’s an actual, observed problem that a certain kind of time-sharing could address, but it’s not the kind of time-sharing you have in mind.

Reporting a problem is very helpful to us, so when you find one, please report it here!

— Ron