[PATCH] Linux NUMA support for HotSpot
Igor Veresov
Igor.Veresov at Sun.COM
Mon Mar 3 06:20:26 PST 2008
On Monday 03 March 2008 16:32:05 Andi Kleen wrote:
> On Mon, Mar 03, 2008 at 04:25:53PM +0300, Igor Veresov wrote:
> > On Monday 03 March 2008 15:57:19 Andi Kleen wrote:
> > > On Mon, Mar 03, 2008 at 03:19:13PM +0300, Igor Veresov wrote:
> > > > > > all working quite well, actually.
> > > > >
> > > > > One obvious issue I found in the Solaris code was that it neither
> > > > > binds threads to nodes (afaik), nor tries to keep up with a thread
> > > > > migrating to another node. It just assumes always the same
> > > > > thread:node mapping which surely cannot be correct?
> > > >
> > > > It is however correct. Solaris assigns a home locality group (a node)
> > > > to each lwp. And while lwps can be temporarily migrated to a remote
> > > > group, page allocation still happens on the home node and the lwp is
> > > > predominantly scheduled to run in its home lgroup. For more
> > > > information you could refer to
> > >
> > > Interesting. I tried a similar scheduling algorithm on Linux a long
> > > time ago (it was called the "homenode scheduler") and it was a general
> > > loss on my testing on smaller AMD systems. But maybe Solaris does it
> > > all different.
> > >
> > > Anyways on Linux that won't work because it doesn't have the concept
> > > of a homenode.
> >
> > Yes, but it has static memory binding instead, which alleviates this
> > problem.
>
> That would require statically binding the threads too which is by default
> not a good idea without explicit user configuration
Not necessarily. It works fine without static CPU binding. Keep in mind that
most data we have in the young generation is short-lived anyway, and if the
scheduler is reluctant enough to move threads between nodes, the application
will have enough time to manipulate the data locally.
For long-lived data, yes, this won't work.
>
> The reasoning is that not using a CPU is always worse than using
> remote memory at least on systems with reasonable small NUMA factor.
>
> (that is what killed the homenode scheduler too)
As I've mentioned before, Solaris will run a thread remotely if there is a
significant load imbalance, because indeed it's better to run remotely than
not to run at all. But the thread will return to its home node at the first
opportunity.
>
> > > The other problems is that it seemed to always assume all the threads
> > > will consume the whole system and set up for all nodes, which seemed
> > > dodgy.
> >
> > You mean the allocator? Actually it is adaptive to the allocation rate on
> > a node, which in effect makes the whole eden space usable for
> > applications with asymmetric per-thread allocation rate. This of course
> > also helps with the case when the number of threads is less than the
> > number of nodes.
>
> It didn't seem to adapt though. Or maybe I'm misremembering the code,
> it was some time ago.
It will start adapting after five minor GCs or so, once it has enough
statistics to make a decision. Try running with -XX:+UseNUMA
-XX:+PrintGCDetails -XX:+PrintHeapAtGC on Solaris and you'll see how the heap
is being reshaped.
igor