effectiveness of jdk.virtualThreadScheduler.maxPoolSize
Ron Pressler
ron.pressler at oracle.com
Mon Jan 9 18:19:02 UTC 2023
I think it would be interesting to explain in more detail the effects of scheduling, and why the question of time-sharing is not obvious and so crucially depends on real-world data.
Suppose you are given 10 tasks: 5 of them have a processing duration of 0ms and 5 have a duration of 100ms. For simplicity, let’s assume we have no parallelism. Both a shortest-task-first and a longest-task-first scheduler will complete all tasks in 500ms, but their average task latency will be quite different. That of the shortest-task-first scheduler will be 150ms (= (5*0 + 100 + 200 + 300 + 400 + 500)/10), while that of the longest-task-first scheduler will be 400ms (= (100 + 200 + 300 + 400 + 500 + 5*500)/10). A perfect time-sharing scheduler (with zero overhead and infinitesimal granularity) would yield an average latency of 250ms (= (5*0 + 5*500)/10). Those are big differences!
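To make the arithmetic concrete, here is a tiny sketch (mine, not part of the original argument) that replays the three schedules for those 10 tasks on a single CPU; it can be pasted into jshell, and only the printed averages matter.

    // 5 tasks of 0ms and 5 of 100ms, already sorted ascending by duration.
    double[] work = {0, 0, 0, 0, 0, 100, 100, 100, 100, 100};

    // Shortest-task-first: run in ascending order of duration.
    double clock = 0, total = 0;
    for (double w : work) { clock += w; total += clock; }
    System.out.println("shortest-first avg = " + total / work.length + " ms");   // 150.0

    // Longest-task-first: run in descending order of duration.
    clock = 0; total = 0;
    for (int i = work.length - 1; i >= 0; i--) { clock += work[i]; total += clock; }
    System.out.println("longest-first  avg = " + total / work.length + " ms");   // 400.0

    // Perfect time-sharing: the 0ms tasks finish immediately, and the five
    // 100ms tasks share the CPU and all finish together at 500ms.
    System.out.println("time-sharing   avg = " + (5 * 0 + 5 * 500) / 10.0 + " ms"); // 250.0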
But now let’s consider a server where an infinite stream of requests arrives from the outside, half of them with a processing duration of 0ms and half with a duration of 100ms. Because the average request duration is 50ms, the server will be stable as long as the request rate is less than or equal to 20 req/s, regardless of how we schedule the requests in the queue that may form. If the rate is higher than that, the server will become unstable, with requests piling up in an ever-growing queue and the latency climbing to infinity, again regardless of scheduling policy.
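The stability claim is plain utilization arithmetic: the server keeps up only while (arrival rate) x (mean work per request) stays at or below 1. A quick check with the numbers above (again jshell-style, my own sketch):

    double meanWorkSeconds = 0.5 * 0.000 + 0.5 * 0.100;                          // 50ms on average
    System.out.println("max stable rate = " + 1.0 / meanWorkSeconds + " req/s"); // 20.0
    System.out.println("utilization at 20 req/s = " + 20 * meanWorkSeconds);     // 1.0: right on the edge
    System.out.println("utilization at 21 req/s = " + 21 * meanWorkSeconds);     // 1.05: the queue grows without bound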
What about latency? The average latency will depend on the distribution of requests. Without time-sharing, it can range between 50ms and 100ms; with perfect time-sharing it may be much higher (i.e. worse). Perfect time-sharing will decrease the latency of the short tasks (to 0ms!) at the expense of increasing the latency of the long tasks.
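For anyone who wants to experiment with that trade-off, below is a rough single-server simulation, entirely my own sketch rather than anything measured here: it compares run-to-completion FIFO with round-robin time slicing. The 18 req/s arrival rate, the 1ms "short" tasks (standing in for the 0ms ones, since no request is really free), the 1ms quantum and all the names are illustrative assumptions; the point is to watch the short-task and long-task averages move in opposite directions, and to see what happens as the rate approaches 20 req/s.

    import java.util.*;

    public class SchedulingSim {

        record Task(double arrival, double work) {}          // times in milliseconds

        static final class Job {                             // a task's position in the run queue
            final int idx; double remaining;
            Job(int idx, double remaining) { this.idx = idx; this.remaining = remaining; }
        }

        public static void main(String[] args) {
            List<Task> tasks = generate(200_000, 18.0);      // ~18 req/s, below the ~20 req/s limit
            report("FIFO (run to completion)", tasks, fifo(tasks));
            report("Round-robin, 1ms quantum", tasks, roundRobin(tasks, 1.0));
        }

        // Half the requests need 1ms of work, half need 100ms (mean ~50.5ms).
        static List<Task> generate(int n, double ratePerSecond) {
            Random rnd = new Random(42);
            List<Task> tasks = new ArrayList<>(n);
            double t = 0;
            for (int i = 0; i < n; i++) {
                t += -Math.log(1 - rnd.nextDouble()) / ratePerSecond * 1000.0;   // exponential gap, in ms
                tasks.add(new Task(t, rnd.nextBoolean() ? 1.0 : 100.0));
            }
            return tasks;
        }

        // First-come-first-served on a single "CPU": each task runs to completion.
        static double[] fifo(List<Task> tasks) {
            double[] latency = new double[tasks.size()];
            double clock = 0;
            for (int i = 0; i < tasks.size(); i++) {
                Task t = tasks.get(i);
                clock = Math.max(clock, t.arrival()) + t.work();
                latency[i] = clock - t.arrival();
            }
            return latency;
        }

        // Round-robin with a fixed quantum: a crude stand-in for time sharing.
        static double[] roundRobin(List<Task> tasks, double quantum) {
            double[] latency = new double[tasks.size()];
            Deque<Job> ready = new ArrayDeque<>();
            double clock = 0;
            int next = 0;
            while (next < tasks.size() || !ready.isEmpty()) {
                if (ready.isEmpty()) {
                    clock = Math.max(clock, tasks.get(next).arrival());          // idle until the next arrival
                }
                while (next < tasks.size() && tasks.get(next).arrival() <= clock) {
                    ready.add(new Job(next, tasks.get(next).work()));
                    next++;
                }
                Job job = ready.poll();
                double slice = Math.min(quantum, job.remaining);
                clock += slice;
                job.remaining -= slice;
                while (next < tasks.size() && tasks.get(next).arrival() <= clock) {
                    ready.add(new Job(next, tasks.get(next).work()));            // arrivals during the slice
                    next++;
                }
                if (job.remaining > 1e-9) ready.add(job);                        // back of the queue
                else latency[job.idx] = clock - tasks.get(job.idx).arrival();
            }
            return latency;
        }

        // Mean latency overall and broken out by task length.
        static void report(String name, List<Task> tasks, double[] latency) {
            double all = 0, shortSum = 0, longSum = 0;
            int shortCount = 0, longCount = 0;
            for (int i = 0; i < latency.length; i++) {
                all += latency[i];
                if (tasks.get(i).work() < 50) { shortSum += latency[i]; shortCount++; }
                else                          { longSum  += latency[i]; longCount++;  }
            }
            System.out.printf("%-26s mean %8.1f ms | short %8.1f ms | long %8.1f ms%n",
                    name, all / latency.length, shortSum / shortCount, longSum / longCount);
        }
    }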
But say we conclude that reducing the latency of short tasks at the expense of long tasks is what everyone always wants; that’s not entirely obvious, but not completely unreasonable, either. Suppose that at the same request rate of 20 per second, the probability of a long task were not 0.5 but 0.55, or that a long task took 110ms instead of 100ms. In that situation time-sharing can no longer help: the server will destabilise. Alternatively, suppose that the probability of a long task is 0.05, or that its duration is 50ms; time-sharing is no longer effective, either. So at 20 req/s, within the band of 50-100ms durations or 0.05-0.5 probability, time-sharing can help; above it or below it, it doesn’t.
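The same utilization arithmetic shows why those perturbations fall outside the band (probabilities and durations as in the paragraph above, jshell-style):

    double rate = 20.0;                                       // requests per second
    System.out.println(rate * (0.55 * 0.100 + 0.45 * 0.000)); // p(long) = 0.55        -> ~1.1 (> 1): unstable
    System.out.println(rate * (0.50 * 0.110 + 0.50 * 0.000)); // long task takes 110ms -> ~1.1 (> 1): unstable
    System.out.println(rate * (0.05 * 0.100 + 0.95 * 0.000)); // p(long) = 0.05        -> ~0.1: ample headroom
    System.out.println(rate * (0.50 * 0.050 + 0.50 * 0.000)); // long task takes 50ms  -> ~0.5: ample headroom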
Keeping in mind that time-sharing can’t actually be perfect and that no request can actually have a duration of zero, I hope it is now clearer why we’re so curious to find real-world cases and why simulations provide little insight. It’s easy to construct an artificial simulation within the operational band where time-sharing is effective. But precisely because, in practice, time-sharing is most effective when the server is on the edge of stability and becomes gradually less effective the further away we are from that tipping point, the most important questions are: how often do servers operate within that operational band, where exactly along that band do they commonly find themselves, and how does that situation arise in real-world scenarios? Only when we get real-world data can we answer those questions and weigh the pros and cons, and only then can we either conclude that the work isn’t worth it or be able to satisfactorily justify it.
(Note that for the classic case where time-sharing helps, namely some fixed set of background processing operations, there is no need to add time-sharing to the virtual thread scheduler, as a better solution is already available.)
— Ron