Cache topology aware scheduling

Fri Oct 4 06:13:29 UTC 2024

After quite a bit of experimentation, I can at least say that
last-level-cache-aware task placement on 4th Generation EPYC (Genoa) is a
real boon. I generalised my original approach so that it doesn't involve
customizing the nodes-per-socket setting (which we can't do on AWS anyway,
NPS = 1), introduce the risks/complexity of processor isolation and
per-thread affinity, or make the scheduler's life too difficult:

https://github.com/DanielThomas/virtual-threads-cluster-aware/blob/main/src/main/java/com/netflix/sandbox/ClusteredExecutors.java

With virtual-to-virtual-thread submissions and particularly structured
concurrency providing a heuristic for locality, I'm convinced there's a
significant opportunity here. I've still got some more real-world
experiments to run, but will get a TechBlog post up when I have something
to share.
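
As a rough illustration of the "choice of two" placement mentioned in the
quoted message below, here is a minimal sketch. It is not the code in the
linked repository: ChoiceOfTwo and its methods are hypothetical names, the
queue depths are assumed to come from something like
ForkJoinPool.getQueuedTaskCount() on one pool per cache cluster, and the
index of the current processor's cluster is assumed to be supplied by the
caller.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: picks which per-cluster pool should receive a task.
final class ChoiceOfTwo {
    private ChoiceOfTwo() {}

    /**
     * Compares the preferred cluster against one alternative and returns the
     * index with the shorter queue. Ties go to the preferred cluster so that
     * tasks stay close to the data the submitter has already touched.
     */
    static int choose(long[] queued, int preferred, int alternative) {
        return queued[alternative] < queued[preferred] ? alternative : preferred;
    }

    /**
     * Same comparison, with the alternative drawn uniformly at random from
     * the clusters other than the preferred one.
     */
    static int choose(long[] queued, int preferred) {
        if (queued.length == 1) {
            return preferred; // nothing else to compare against
        }
        // Draw from length - 1 slots, then skip over the preferred index.
        int alternative = ThreadLocalRandom.current().nextInt(queued.length - 1);
        if (alternative >= preferred) {
            alternative++;
        }
        return choose(queued, preferred, alternative);
    }
}
```

Breaking ties towards the preferred cluster is the design choice that keeps
placement sticky; a task only moves when the local queue is strictly
deeper than the sampled alternative.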

On Sat, Sep 14, 2024 at 2:35 AM Alan Bateman <alan.bateman at oracle.com>
wrote:

> On 13/09/2024 04:55, Danny Thomas wrote:
> > Even with 10s of thousands of tasks queued, it looks like it's more
> > than fast enough as a heuristic. I'm now doing a choice of two, with
> > the current processor's pool being the preferred choice. For the
> > simple external submit, starting a virtual thread that spawns another
> > sharing data, I see up to a 25% improvement in throughput (pleasingly,
> > the default scheduler occasionally accidentally lands workers close to
> > each other and comes within a few percent).
> >
> > I think we want to be as sticky as we can to the current
> > worker/cluster, so ForkJoinWorkerThread.hasKnownQueuedWork is probably
> > too conservative as a heuristic, but thanks for the heads up.
> >
> > Have you gotten as far as thinking about how yielding and compensation
> > will be exposed?
>
> I think Francesco's experiment may be based on the prototype
> VirtualThreadTask interface that we had temporarily exposed in EA builds
> some time ago. That gave the mapping of task to virtual Thread and thus
> thread state and park blocker when yielding.
>
> There isn't much need for compensating right now, at least not since the
> changes to Object.wait to preempt when waiting. There is still a need to
> support reverse DNS lookups but that has an SPI now [1] so a different
> resolver can be deployed if needed.
>
> As to your question, then this project hasn't decided whether to expose
> anything. There are at least 3 exploration efforts going on right now, two
> with implClass, the other (I think) with prototype API, and we want to
> see what we can learn from these experiments.
>
> -Alan
>
> [1] https://openjdk.org/jeps/418
>