[PATCH] Linux NUMA support for HotSpot

Mon Mar 3 04:19:13 PST 2008

On Monday 03 March 2008 14:30:17 Andi Kleen wrote:
> On Mon, Mar 03, 2008 at 12:52:40PM +0300, Igor Veresov wrote:
> > I haven't studied your changes in detail but I have a NUMA-aware
> > allocator for Linux in the works
>
> Ok maybe you can do something with my patch then.
>
> > and I do see speedups, which are similar to what I was able to
> > get from Solaris. About 8% for specjbb2005 on a dual-socket Opteron. So
> > it's
>
> Ok I only did micro benchmarks. Maybe they were not strong enough.
> For some simple allocations I didn't get any numa local placement
> at least according to the benchmark numbers.
>

Well, obviously on microbenchmarks you should see even more speedup.
To be exact, there is a 30% difference in latency on a 2-socket Opteron (1 HT
hop) system, and even more on 2- and 3-hop systems.

> > all working quite well, actually.
>
> One obvious issue I found in the Solaris code was that it neither
> binds threads to nodes (afaik), nor tries to keep up with a thread
> migrating to another node. It just assumes always the same thread:node
> mapping which surely cannot be correct?

It is, however, correct. Solaris assigns a home locality group (a node) to
each lwp. And while lwps can be temporarily migrated to a remote group, page
allocation still happens on the home node and the lwp is predominantly
scheduled to run in its home lgroup. For more information you could refer to
the NUMA chapter of the "Solaris Internals" book or to the blogs of Jonathan
Chew and Alexander Kolbasov from the Solaris CMT/NUMA team.

>
> On the Linux implementation I solved that by using getcpu() on
> each allocation (on recent Linux it is a special optimized fast path
> that is quite fast)

I doubt that executing a syscall on every allocation (even if it's a TLAB
allocation) is a good idea. It's many times slower than the original "bump
the pointer with the CAS spin" allocator. The Linux scheduler is quite
reluctant to move lwps between nodes, so checking the lwp position every,
say, 64 TLAB allocations proved to be adequate. On Solaris even that is not
necessary.
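
For illustration only, the periodic check could look roughly like the sketch
below. The names (CachedNode, NodeQuery, current) are hypothetical and are not
taken from HotSpot or from either patch; the node lookup is passed in as a
plain function pointer so the caching logic stays separate from however the
node is actually obtained.

```cpp
// Hypothetical callback: returns the NUMA node the calling thread runs on.
using NodeQuery = int (*)();

// Per-thread cache of the node id. The (possibly expensive) query is made
// only once per `interval` TLAB allocations, not on every allocation.
class CachedNode {
public:
    explicit CachedNode(NodeQuery query, unsigned interval = 64)
        : query_(query), interval_(interval), remaining_(0), node_(-1) {}

    // Call once per TLAB allocation; returns the node to allocate from.
    int current() {
        if (remaining_ == 0) {
            node_ = query_();        // refresh the cached position
            remaining_ = interval_;  // the next `interval_` calls reuse it
        }
        --remaining_;
        return node_;
    }

private:
    NodeQuery query_;
    unsigned  interval_;
    unsigned  remaining_;
    int       node_;
};
```

On Linux the callback might be built from sched_getcpu() plus a CPU-to-node
lookup such as libnuma's numa_node_of_cpu(), but that part is an assumption.
The trade-off is that a migrated thread keeps allocating on its old node for
at most one interval before the cache catches up.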

igor