effectiveness of jdk.virtualThreadScheduler.maxPoolSize
Ron Pressler
ron.pressler at oracle.com
Mon Jan 9 00:07:19 UTC 2023
I don’t know how to evaluate and compare solutions before knowing what problem they are supposed to solve, so I have no way of knowing whether time-sharing for virtual threads or adding scheduler workers (if either) is the better solution to a problem that hasn’t been reported.
If servers employing virtual threads tend to reach conditions where time-sharing could help, then when the problem is reported we would be more than happy to fix it with that solution. What I’m trying to convey is not that I think your hypothesis must be wrong, but that it is not necessarily correct either, and we simply cannot fix hypothetical bugs. It really doesn’t matter how strongly anyone feels that some problem *could* arise. Bugs have to be reported before we can assess their severity, prioritise them, and consider solutions. If you do find a problem with virtual threads, please report it to this mailing list.
— Ron
On 8 Jan 2023, at 23:19, Robert Engels <rengels at ix.netcom.com> wrote:
We’ll have to agree to disagree. I think servers routinely hit 100% CPU and rely on the scheduler to deprioritize tasks to keep things fair - maybe not “forever” but for extended periods.
This is not dissimilar to the various background tools that search the cosmos using whatever cycles are available - as long as nothing else needs them.
I am guessing a lot of long-running simulations work similarly.
As I’ve said, though, I don’t think it’s a huge deal - move the things that have to keep running to their own native thread pool. Maybe that’s a better and simpler solution than trying to add time-slicing to vthreads anyway.
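A minimal sketch of that split, assuming Java 19+ (the class and variable names are illustrative, not from this thread): keep the mostly-blocking work on virtual threads, and give the long-running CPU-bound work its own small pool of platform threads so the OS scheduler can time-slice it independently.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SplitExecutors {
    public static void main(String[] args) {
        // Virtual threads for latency-sensitive, mostly-blocking work;
        // a small fixed pool of platform ("native") threads for CPU-bound work.
        try (ExecutorService blockingWork = Executors.newVirtualThreadPerTaskExecutor();
             ExecutorService cpuWork = Executors.newFixedThreadPool(
                     Math.max(1, Runtime.getRuntime().availableProcessors() / 2))) {

            blockingWork.submit(() -> {
                // e.g. call a remote service, read a file, wait on a queue
            });

            cpuWork.submit(() -> {
                // e.g. a long-running simulation step or batch computation
            });
        } // try-with-resources closes both executors and waits for the tasks
    }
}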
On Jan 8, 2023, at 5:05 PM, Ron Pressler <ron.pressler at oracle.com> wrote:
On 8 Jan 2023, at 18:58, robert engels <rengels at ix.netcom.com> wrote:
But even if not using spin locks - with fair scheduling you expect shorter-runtime tasks to be completed before long-running CPU-bound tasks. The Linux scheduler will lower the priority of threads that hog the CPU too much to facilitate this even further (or you can use a scheduler type of ‘batch/idle’ - i.e. only run when nothing else needs to run).
If you use spin locks and have significantly more threads than cores, you may well experience an orders-of-magnitude slow-down. I.e., you can’t use spin locks and rely on time-sharing to make them work well; there won’t be a deadlock, but you won’t get acceptable behaviour, either.
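To make the failure mode concrete, here is a minimal sketch of such a spin lock on virtual threads (illustrative code, not from this thread): each contender busy-waits on its carrier thread, so with far more threads than cores the spinners burn exactly the CPU the lock holder needs, and the virtual-thread scheduler does not time-slice them.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// A deliberately naive spin lock: contenders busy-wait instead of parking.
class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean();

    void lock() {
        // Spins on the carrier thread; the virtual thread never unmounts here,
        // so each waiting spinner occupies a core for the whole wait.
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait();
        }
    }

    void unlock() {
        locked.set(false);
    }
}

public class SpinLockDemo {
    public static void main(String[] args) {
        SpinLock lock = new SpinLock();
        int tasks = Runtime.getRuntime().availableProcessors() * 100; // far more threads than cores
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                exec.submit(() -> {
                    lock.lock();
                    try {
                        // short critical section; if the holder blocked (unmounted) here,
                        // spinners could keep every carrier busy and stall progress
                    } finally {
                        lock.unlock();
                    }
                });
            }
        } // close() waits for the submitted tasks
    }
}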
So if very short tasks get stuck behind long-running CPU-bound tasks, this is unexpected behavior - it is not necessarily incorrect. If you spawned more carrier threads (i.e. when the scheduler feels the tasks are not making “progress”), you give the OS scheduler more of a chance to give CPU time to the short-lived tasks. I think that is what the OP was trying to say.
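For reference, the size of the default scheduler’s carrier pool is controlled by the property in the subject line and its companion; a minimal sketch for checking what a JVM was launched with (the property names are as documented for the JDK’s built-in scheduler; the launch flags and values in the comment are only an example). When the properties are not set, the default parallelism tracks the number of available processors.

// Example launch (values are illustrative):
//   java -Djdk.virtualThreadScheduler.parallelism=16 \
//        -Djdk.virtualThreadScheduler.maxPoolSize=32 SchedulerConfigCheck
public class SchedulerConfigCheck {
    public static void main(String[] args) {
        System.out.println("jdk.virtualThreadScheduler.parallelism = "
                + System.getProperty("jdk.virtualThreadScheduler.parallelism", "(not set)"));
        System.out.println("jdk.virtualThreadScheduler.maxPoolSize = "
                + System.getProperty("jdk.virtualThreadScheduler.maxPoolSize", "(not set)"));
        System.out.println("available processors = "
                + Runtime.getRuntime().availableProcessors());
    }
}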
This may seem obvious at first, but the easiest way to explain why it’s hard to find an actual problem caused by this (and why you don’t actually rely on the “expected behaviour”) is to remember that the OS will do this (pretty much) only when your CPU is at 100%. But at that point, there’s very little the OS can do to make your server behave well. When you’re at 100% CPU for any significant duration, requests keep coming, they pile up, and your server is now unstable. In other words, time sharing only kicks in when it can’t do much good.
In non-realtime kernels, time-sharing is mostly used to run a small number of background tasks or to keep the machine responsive to an operator in cases of emergency. Clever scheduling is not enough to compensate for a lack of resources. Time sharing can also be useful to smooth over the latency of CPU-bound batch-processing tasks, but since it only makes sense to run a few of them in parallel, virtual threads can’t give you any benefit there anyway.
However, as I’ve said multiple times already, we don’t rule out the possibility of a workload where time sharing could solve an actual problem. So if you encounter a problem, please report it.
— Ron