[External] : Re: Cache topology aware scheduling
Dave Dice
dave.dice at oracle.com
Thu Sep 5 17:51:25 UTC 2024
On Sep 5, 2024, at 1:18 PM, robert engels <robaho at icloud.com> wrote:
Sounds interesting. I would think you need a bit more context than a single lock can provide, particularly when a thread of execution accesses multiple locks.
We know the NUMA node ID for each of the threads queued on the lock, and the current owner, which is all we needed. We manage admission order for each lock independently.
For user-mode, sched_getcpu() is a fast VDSO system call, and on x86 we can also get the CPU ID and NUMA node ID cheaply via RDTSCP. Also, the kernel only rarely migrates threads between nodes (so rarely it’s not even work conserving, but that’s a distinct concern), so it’s reasonable to cache the value over short periods.
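From Java, the same query is reachable with the FFM API; below is a minimal sketch calling glibc’s sched_getcpu(3) (Linux-only; the class name and error handling are just illustrative, and mapping the CPU to a NUMA node via e.g. libnuma is not shown). On a virtual thread the result describes the current carrier and can go stale at the next reschedule, so treat it as a hint.

    import java.lang.foreign.FunctionDescriptor;
    import java.lang.foreign.Linker;
    import java.lang.foreign.ValueLayout;
    import java.lang.invoke.MethodHandle;

    // Run with --enable-native-access=ALL-UNNAMED to silence the restricted-method warning.
    public final class CurrentCpu {
        // sched_getcpu(3) is exported by glibc and backed by the vDSO on Linux.
        private static final MethodHandle SCHED_GETCPU;
        static {
            Linker linker = Linker.nativeLinker();
            SCHED_GETCPU = linker.downcallHandle(
                    linker.defaultLookup().find("sched_getcpu").orElseThrow(),
                    FunctionDescriptor.of(ValueLayout.JAVA_INT));
        }

        /** CPU the calling thread is currently running on, or -1 on error. */
        public static int currentCpu() {
            try {
                return (int) SCHED_GETCPU.invokeExact();
            } catch (Throwable t) {
                throw new AssertionError(t);
            }
        }

        public static void main(String[] args) {
            System.out.println("running on cpu " + currentCpu());
        }
    }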
You can refine the admission decisions to take shared L3 into account by deriving a distance function from the kind of topology information you could manually query via lscpu or lstopo.
It might also be useful to take P/E (fire/ice) cores into account as well.
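As a rough sketch of such a distance function (not code from the paper), assuming the per-CPU maps are populated once at startup from lscpu -p or sysfs, with the parsing omitted; a P/E-core tier could be folded in the same way.

    import java.util.Map;

    // Sketch only: 0 = same L3 complex, 1 = same NUMA node, 2 = remote; lower means "nearer".
    final class Topology {
        private final Map<Integer, Integer> cpuToNode;  // cpu id -> NUMA node id
        private final Map<Integer, Integer> cpuToL3;    // cpu id -> shared-L3 group id

        Topology(Map<Integer, Integer> cpuToNode, Map<Integer, Integer> cpuToL3) {
            this.cpuToNode = cpuToNode;
            this.cpuToL3 = cpuToL3;
        }

        int distance(int cpuA, int cpuB) {
            if (cpuToL3.get(cpuA).equals(cpuToL3.get(cpuB))) return 0;
            if (cpuToNode.get(cpuA).equals(cpuToNode.get(cpuB))) return 1;
            return 2;                                   // a P/E-core tier could slot in here too
        }
    }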
Regards
Dave
I can’t read the paper at the moment but will for sure!
On Sep 5, 2024, at 12:13 PM, Dave Dice <dave.dice at oracle.com> wrote:
On Sep 5, 2024, at 1:06 PM, robert engels <robaho at icloud.com> wrote:
I think to get optimum performance in most cases you need to be able to control the core affinity based on X. It seems a custom scheduler would be a very cool way to address this.
For the compact NUMA-aware locks approach we decided to tolerate whatever the ambient placement happened to be, and just shift the order of admission. This worked well for our purposes, was relatively non-invasive, and didn’t subvert any of the existing placement or affinity policies.
Regards
Dave
On Sep 5, 2024, at 11:50 AM, Dave Dice <dave.dice at oracle.com> wrote:
It’s possible some of the work done in Oracle Labs on NUMA-aware locks might be applicable in this context. Briefly, we reorder the list of threads waiting on a lock to prefer “near” handovers in the short term, but still preserve long-term fairness. Most of the key ideas map cleanly over to schedulers. The paper appeared in EuroSys 2019, but the link below is to the non-paywalled arXiv version. Some of the things we needed to do (making all the operations constant-time, dealing with concurrently arriving threads, lock-free list manipulation) could likely be relaxed in the context of user-level scheduling.
Regards
Dave
<https://arxiv.org/pdf/1810.05600>
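To give a flavour of the admission-order idea above, here is a deliberately simplified sketch (it is not the constant-time, lock-free CNA algorithm from the paper): waiters queue in arrival order, unlock() prefers a waiter on the current owner’s NUMA node, and a cap on consecutive same-node handovers keeps remote waiters from starving. The cap value and the plumbing that supplies myNode are assumptions.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    final class NodeAwareLock {
        private static final int LOCAL_HANDOVER_CAP = 64;   // fairness bound, arbitrary here

        private static final class Waiter {
            final int node;            // NUMA node the waiter was on when it arrived
            boolean granted;
            Waiter(int node) { this.node = node; }
        }

        private final ReentrantLock m = new ReentrantLock();
        private final Condition handover = m.newCondition();
        private final Deque<Waiter> waiters = new ArrayDeque<>();   // arrival (FIFO) order
        private boolean held;
        private int ownerNode = -1;
        private int consecutiveLocal;

        /** myNode would come from something like sched_getcpu() plus a cpu-to-node map. */
        void lock(int myNode) {
            m.lock();
            try {
                if (!held) {                        // uncontended fast path
                    held = true;
                    ownerNode = myNode;
                    consecutiveLocal = 0;
                    return;
                }
                Waiter w = new Waiter(myNode);
                waiters.addLast(w);
                while (!w.granted) {
                    handover.awaitUninterruptibly();
                }
                // ownership was handed to us directly in unlock()
            } finally {
                m.unlock();
            }
        }

        void unlock() {
            m.lock();
            try {
                Waiter next = pickNext();
                if (next == null) {
                    held = false;
                    ownerNode = -1;
                } else {
                    waiters.remove(next);
                    consecutiveLocal = (next.node == ownerNode) ? consecutiveLocal + 1 : 0;
                    ownerNode = next.node;
                    next.granted = true;            // lock stays held; passed straight to 'next'
                    handover.signalAll();           // woken waiters re-check their 'granted' flag
                }
            } finally {
                m.unlock();
            }
        }

        /** Prefer a same-node waiter until the cap is hit, then fall back to FIFO order. */
        private Waiter pickNext() {
            if (waiters.isEmpty()) return null;
            if (consecutiveLocal < LOCAL_HANDOVER_CAP) {
                for (Waiter w : waiters) {
                    if (w.node == ownerNode) return w;   // O(n) scan; the real lock avoids this
                }
            }
            return waiters.peekFirst();
        }
    }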
On Sep 5, 2024, at 12:16 PM, Francesco Nigro <nigro.fra at gmail.com> wrote:
Hi @Danny Thomas,
We're working (nudge nudge, Andrew Haley) on a custom scheduler API, as mentioned by Alan, which enables (expert) users and framework devs to implement something like this, and more :)
Cheers,
Franz
On Mon, 2 Sep 2024 at 10:41, Alan Bateman <alan.bateman at oracle.com> wrote:
On 02/09/2024 07:23, Danny Thomas wrote:
> Hi folks,
>
> I was giving some thought to our adoption of Zen 4 coinciding with
> virtual threads becoming available, and it occurred to me that, with an
> increasing number of architectures clustering L3 and L2 caches between
> groups of cores on a die, scheduling virtual threads in user space
> could make them particularly well suited to these architectures,
> if the scheduler were topology aware.
>
> Have you given any thought to worker CPU affinity and/or locality to
> an existing worker when a virtual thread is started by another? Would
> you consider this something to be proved out by custom schedulers, or
> is this enough of a trend to justify future investment in the default
> scheduler?
To date, we've put CPU and node affinity into the "custom scheduler"
topic, which is still TBD on whether to expose. If you have data from
any experiments with the current EA builds then it would be useful to
see. The current EA builds allow the default FJP-based scheduler to
be replaced for experimentation purposes.
In a system with a mix of schedulers, a virtual thread started without an
explicitly configured scheduler will "inherit" the scheduler of the thread
that starts it. That seems a sensible default.
-Alan
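If the custom-scheduler hook ends up being Executor-shaped (the exact EA API is not shown in this thread and may differ), a topology-aware experiment could look roughly like the hypothetical sketch below: one carrier pool per L3 cluster, with work dispatched to the submitting carrier’s cluster and a default cluster for external submissions. Pinning each pool’s threads to its cluster’s CPUs (via taskset, cgroups, or libnuma) is assumed to happen out of band and is not shown.

    import java.util.concurrent.Executor;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical sketch only; the names and the Executor shape are assumptions, not the EA API.
    final class ClusterAffinityScheduler implements Executor {
        private static final ThreadLocal<Integer> MY_CLUSTER = ThreadLocal.withInitial(() -> -1);
        private final ExecutorService[] clusterPools;

        ClusterAffinityScheduler(int clusters, int carriersPerCluster) {
            clusterPools = new ExecutorService[clusters];
            for (int c = 0; c < clusters; c++) {
                final int cluster = c;
                clusterPools[c] = Executors.newFixedThreadPool(carriersPerCluster, r -> {
                    Thread t = new Thread(() -> {
                        MY_CLUSTER.set(cluster);   // remember which cluster this carrier serves
                        r.run();
                    });
                    t.setDaemon(true);
                    return t;
                });
            }
        }

        @Override
        public void execute(Runnable task) {
            int cluster = MY_CLUSTER.get();        // cluster of the submitting carrier, if any
            if (cluster < 0) cluster = 0;          // external submission: pick a default cluster
            clusterPools[cluster].execute(task);
        }
    }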