[PATCH] Linux NUMA support for HotSpot
Andi Kleen
andi at firstfloor.org
Mon Mar 3 04:57:19 PST 2008
On Mon, Mar 03, 2008 at 03:19:13PM +0300, Igor Veresov wrote:
>
> > > all working quite well, actually.
> >
> > One obvious issue I found in the Solaris code was that it neither
> > binds threads to nodes (afaik), nor tries to keep up with a thread
> > migrating to another node. It just assumes always the same thread:node
> > mapping which surely cannot be correct?
>
> It is however correct. Solaris assigns a home locality group (a node) to each
> lwp. And while lwps can be temporarily migrated to a remote group, page
> allocation still happens on a home node and the lwp is predominantly
> scheduled to run in its home lgroup. For more information you could refer to
Interesting. I tried a similar scheduling algorithm on Linux a long time ago
(it was called the "homenode scheduler") and it was a general loss
in my testing on smaller AMD systems. But maybe Solaris does it all differently.
Anyway, on Linux that won't work because the kernel doesn't have the concept
of a home node.
The other problem is that the code seemed to always assume the threads
will consume the whole system and so sets up for all nodes, which seemed dodgy.
> the NUMA chapter of the "Solaris Internals" book or to blogs of Jonathan Chew
> and Alexander Kolbasov from Solaris CMT/NUMA team.
>
> >
> > On the Linux implementation I solved that by using getcpu() on
> > each allocation (on recent Linux it is a special optimized fast path
> > that is quite fast)
>
> I doubt that executing a syscall on every allocation (even if it's a TLAB
> allocation) is a good idea. It's many times slower than the original "bump
A vsyscall is not a real syscall. It keeps running in ring 3 and is just
some code the kernel maps into each user process. It's no more
expensive than any indirect function call.
The getcpu() vsyscall was specifically designed for use by such NUMA-aware
allocators.
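
To make that concrete, here is a rough sketch (not the actual patch) of how
an allocator can use it on Linux: ask sched_getcpu() -- the glibc wrapper
around the getcpu() fast path -- for the current CPU, map it to a node with
libnuma's numa_node_of_cpu(), and refill from a per-node arena. The
arena_alloc_on_node() helper below is hypothetical.

#define _GNU_SOURCE
#include <sched.h>      /* sched_getcpu() */
#include <numa.h>       /* numa_node_of_cpu(), link with -lnuma */
#include <stddef.h>

/* hypothetical per-node arena refill */
extern void *arena_alloc_on_node(int node, size_t size);

void *numa_local_alloc(size_t size)
{
        int cpu  = sched_getcpu();   /* stays in user space, no kernel entry */
        int node = (cpu >= 0) ? numa_node_of_cpu(cpu) : 0;
        return arena_alloc_on_node(node, size);
}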
> the pointer with the CAS spin" allocator. The Linux scheduler is quite reluctant
> to move lwps between the nodes, so checking the lwp position every, say, 64
> TLAB allocations proved to be adequate. On Solaris even that is not
> necessary.
getcpu() already does this internally: it keeps a cached result with a time
stamp so the position is only re-checked once per clock tick. Or rather it
used to; in the latest kernels the fast path is so cheap that even that
caching was removed.
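
For illustration, the user-level equivalent of the "every 64 TLAB
allocations" check Igor describes could look roughly like this (the interval
and helper names are made up, and with the current getcpu() fast path this
extra layer is largely redundant):

#define _GNU_SOURCE
#include <sched.h>

#define RECHECK_INTERVAL 64     /* re-query the position every 64 refills */

static __thread int cached_cpu = -1;
static __thread unsigned refill_count;

int current_cpu_cached(void)
{
        if (cached_cpu < 0 || ++refill_count % RECHECK_INTERVAL == 0)
                cached_cpu = sched_getcpu();
        return cached_cpu;
}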
-Andi