effectiveness of jdk.virtualThreadScheduler.maxPoolSize
Ron Pressler
ron.pressler at oracle.com
Mon Jan 9 19:35:44 UTC 2023
On 9 Jan 2023, at 19:02, Arnaud Masson <arnaud.masson at fr.ibm.com<mailto:arnaud.masson at fr.ibm.com>> wrote:
My concern right now is not a constant high rate of slowCPURequests, or having Loom threads handle arbitrarily high CPU load for a long time.
It’s just that Loom (with the default Executor) makes some real scenarios worse compared to a classic Executor, like the modest burst of slowCPURequests causing a brutal pause on fastIORequests in the code example.
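A minimal sketch of the kind of scenario being described (hypothetical task mix; assumes JDK 21+ virtual threads): a burst of CPU-bound tasks that never park can occupy every carrier thread of the default scheduler, so a short IO-style request must wait behind them.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BurstDemo {

    // Measures how long a short IO-style request takes while `burst`
    // CPU-bound tasks are monopolising the default scheduler's carriers.
    static long ioLatencyMillis(int burst) throws Exception {
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < burst; i++) {
                exec.submit(() -> {
                    long sum = 0;
                    for (long j = 0; j < 500_000_000L; j++) sum += j; // never parks
                    return sum;
                });
            }
            long start = System.nanoTime();
            Future<String> io = exec.submit(() -> {
                Thread.sleep(10); // parks, releasing its carrier -- once it gets one
                return "done";
            });
            io.get();
            return (System.nanoTime() - start) / 1_000_000;
        }
    }

    public static void main(String[] args) throws Exception {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("no burst:   " + ioLatencyMillis(0) + " ms");
        System.out.println("with burst: " + ioLatencyMillis(cpus) + " ms");
    }
}
```

With a burst of `availableProcessors()` spinning tasks, the second measurement is expected to be much larger than the first, since the default scheduler does not time-slice among runnable virtual threads.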
I don’t think that’s true. The only scenarios where we’ve seen that situation be “worse” (under the assumption that you want to increase latency for long tasks and decrease it for short tasks) are those where it is trivial to pick a different scheduler once the programmer decides they have a preference for one scheduling policy over another. We have not seen situations where virtual threads are the obvious choice and yet you see a problem. We would be very interested in finding them so we could study them. But please assume that any *hypothetical* scenario that could be considered already has been.
I was hoping to clarify why, after spending years considering hypotheses, we’re at a point where we need actual data, not the same hypotheses we’ve already considered. Every single hypothetical case you bring up has already been considered, and we concluded that the only thing we can do is wait until someone actually encounters it in the field.
(Again, there are workarounds, but it makes Loom transition more complex/sensitive for _some_ webapps.)
A workaround implies someone has encountered a problem. We first need to establish that some webapps find this more complex, and we can’t even do that until they report a problem.
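For what it’s worth, the kind of workaround being alluded to can be sketched like this (hypothetical method names; assumes the CPU-bound/IO-bound split is known at the call site): route long CPU-bound work to a small platform-thread pool, which the OS time-slices, and keep virtual threads for the IO-bound requests.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SplitSchedulers {
    // IO-bound requests: one cheap virtual thread each.
    static final ExecutorService ioPool = Executors.newVirtualThreadPerTaskExecutor();
    // Long CPU-bound requests: a small platform-thread pool that the OS
    // time-slices, leaving carriers free for the IO-bound virtual threads.
    static final ExecutorService cpuPool = Executors.newFixedThreadPool(
            Math.max(1, Runtime.getRuntime().availableProcessors() / 2));

    static String fastIoRequest() throws Exception {
        return ioPool.submit(() -> {
            Thread.sleep(5); // stand-in for a blocking IO call; parks the virtual thread
            return "fastIO done";
        }).get();
    }

    static long slowCpuRequest() throws Exception {
        return cpuPool.submit(() -> {
            long sum = 0;
            for (long j = 0; j < 100_000_000L; j++) sum += j; // pure CPU work
            return sum;
        }).get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fastIoRequest());
        System.out.println("slowCPU = " + slowCpuRequest());
        ioPool.shutdown();
        cpuPool.shutdown();
    }
}
```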
About the operational band:
Let’s say C is the number of available CPUs (or the Pod CPU limit),
and M is the max number of active native threads beyond which your pod becomes useless.
You assume any concurrent CPU activity between C and M is irrelevant, but that doesn’t make sense to me: why would it be better to be stuck suddenly with long pauses at C rather than scaling gracefully until M?
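For reference, the two knobs involved here correspond to system properties in current JDKs, read once at startup: jdk.virtualThreadScheduler.parallelism (which defaults to the number of available CPUs, i.e. C above) and jdk.virtualThreadScheduler.maxPoolSize (which defaults to 256). A small sketch that reports the effective values (the stated defaults are my understanding, not guaranteed across releases):

```java
public class SchedulerConfig {
    public static void main(String[] args) {
        int c = Runtime.getRuntime().availableProcessors();
        // Both properties are only consulted when the default scheduler
        // is created; setting them later has no effect.
        String par = System.getProperty("jdk.virtualThreadScheduler.parallelism",
                String.valueOf(c));
        String max = System.getProperty("jdk.virtualThreadScheduler.maxPoolSize", "256");
        System.out.println("C (available CPUs)    = " + c);
        System.out.println("scheduler parallelism = " + par);
        System.out.println("scheduler maxPoolSize = " + max);
    }
}
```

They would typically be set on the command line, e.g. with `-Djdk.virtualThreadScheduler.maxPoolSize=64`.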
Time-sharing doesn’t “scale gracefully”. It decreases the latency for some tasks at the cost of increasing it for others (and sometimes increasing the average), while having no impact on scalability (scheduling cannot affect scalability).
Interestingly, the scenario you consider (postponing a batch job to run at a much later time) is, in my opinion, less relevant in a modern cloud environment, where you want your pod to migrate smoothly but quickly (save state/stop/restart/reload state in a few minutes max) during cluster scaling, for example.
Quite possibly, but given that a big company has actually presented us with a real problem they encountered, while no one has reported a problem of the kind we’re discussing, our only course of action is to prioritise the problem that we know exists. There really is absolutely nothing we can do about problems that have not been reported, no matter how many times we consider hypotheticals.
If you can’t accept it as the only reasonable course of action, please accept it as an arbitrary axiom: until someone reports a problem they’ve actually encountered in some reasonable program, we cannot consider requests for changes. If the problem is likely to occur, then someone will surely report it soon enough and we could start studying it.
There’s no point in arguing over this axiom, and the only insight that can help us move forward is real data from the field.
— Ron