[PATCH] JDK-8205051 (UseNUMA memory interleaving vs cpunodebind & localalloc)
Thomas Schatzl
thomas.schatzl at oracle.com
Tue Oct 2 08:22:03 UTC 2018
Hi Roshan,
On Fri, 2018-09-28 at 15:01 +0530, roshan mangal wrote:
> Hi Thomas,
>
> > Hi Roshan,
> >
> > On Tue, 2018-09-25 at 12:18 +0530, roshan mangal wrote:
> > > Hi All,
> > >
> > > This Patch is for
> > > https://bugs.openjdk.java.net/browse/JDK-8205051
> > >
> > > Issue:
> > >
> > > If the JVM isn't allowed to run on all of the nodes (by numactl,
> > > cgroups, docker, etc), then a significant fraction of the Java
> > > heap will be unusable, causing early GC.
> > >
> > > Every thread captures its locality group (lgrp) and allocates
> > > memory from that lgrp.
> > >
> > > The lgrp id is the same as the NUMA node id.
Is there a compelling reason to have two identifiers for the same
thing? I am just asking, because it is confusing to use both interchangeably in the same code.
> > >
[...]
> > > Thread's lgrp   Order of Allocation in NUMA node
> > >
> > > lgrp0  [ numaNode0->numaNode1->numaNode2->numaNode3->numaNode4->numaNode5->numaNode6->numaNode7 ]
> > > lgrp1  [ numaNode1->numaNode0->numaNode2->numaNode3->numaNode4->numaNode5->numaNode6->numaNode7 ]
> > > lgrp2  [ numaNode2->numaNode0->numaNode1->numaNode3->numaNode4->numaNode5->numaNode6->numaNode7 ]
> > > lgrp3  [ numaNode3->numaNode0->numaNode1->numaNode2->numaNode4->numaNode5->numaNode6->numaNode7 ]
> > > lgrp4  [ numaNode4->numaNode5->numaNode6->numaNode7->numaNode0->numaNode1->numaNode2->numaNode3 ]
> > > lgrp5  [ numaNode5->numaNode4->numaNode6->numaNode7->numaNode0->numaNode1->numaNode2->numaNode3 ]
> > > lgrp6  [ numaNode6->numaNode4->numaNode5->numaNode7->numaNode0->numaNode1->numaNode2->numaNode3 ]
> > > lgrp7  [ numaNode7->numaNode4->numaNode5->numaNode6->numaNode0->numaNode1->numaNode2->numaNode3 ]
> >
> > I have a question about this: lgrps often have the same distance
> > from each other, and this order-of-allocation list seems to be
> > deterministic. So in this case nodes with lower lgrp id (but the
> > same distance) are preferred to ones with higher lgrp id.
> >
> > Do you expect some imbalance because of that? If so, couldn't it be
> > useful to randomize lgrps with the same distance in this list, and
> > regularly change them?
>
> Yes, I agree there will be an imbalance because of that.
> Another option would be to select the lgrp based on the largest amount
> of free memory available. What is your opinion on this?
>
> For example:
> node    0    1    2    3
>   0:   10   16   16   16
>   1:   16   10   16   16
>   2:   16   16   10   16
>   3:   16   16   16   10
>
> Threads T0, T1, T2 and T3 are running on different NUMA nodes with
> lgrp0, lgrp1, lgrp2 and lgrp3 respectively.
> Allocation within a node has distance 10, allocation outside has 16.
> Suppose lgrp0 runs out of memory and the memory available at that time
> is lgrp0 (0%), lgrp1 (10%), lgrp2 (5%), lgrp3 (50%).
> If we choose the lgrp randomly, e.g. lgrp2 over lgrp3 for T0, lgrp2
> will be filled faster, and once lgrp2 is full, T2 also needs to go to
> another lgrp at distance 16.
> In this scenario it is better for T0 to choose lgrp3 with 50% free
> memory, which allows T2 to run longer on lgrp2 with its lower memory
> latency (10).
I was actually more concerned about performance in that case: as
mentioned, in the current implementation all threads are somewhat
likely to end up accessing the memory of the same node, and the same
holds for the idea of selecting the node with the largest amount of
free memory.
This means that (assuming some very naive hardware implementation) the
memory controller on that node will need to handle requests from "all"
other nodes, which means some potential contention.
If the accesses were spread out more across nodes, the memory traffic
would be spread out more across the interconnect, allowing use of the
higher aggregate memory bandwidth for as long as possible.
That is of course a rather naive model of the interconnect :)
I am not so worried about the case you mention: with the use of
NUMAAllocationDistanceLimit you extend the time until a GC at most
until all nodes are full anyway.
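Just to make the earlier suggestion of randomizing equally-distant
nodes concrete, here is a minimal sketch (illustrative only, not a
patch proposal; build_fallback_order and the distance matrix layout are
made up for this example):

  // Illustrative only: build a fallback order for 'from_node' in which nodes
  // at the same distance are shuffled, so that equally-distant nodes are not
  // always tried in ascending node id order.
  #include <algorithm>
  #include <map>
  #include <random>
  #include <vector>

  static std::vector<int> build_fallback_order(int from_node,
                                               const std::vector<std::vector<int>>& distance) {
    std::map<int, std::vector<int>> by_distance;   // distance -> node ids
    for (int node = 0; node < (int)distance.size(); node++) {
      by_distance[distance[from_node][node]].push_back(node);
    }
    std::vector<int> order;
    std::mt19937 rng(std::random_device{}());
    for (auto& group : by_distance) {              // iterate in ascending distance
      std::shuffle(group.second.begin(), group.second.end(), rng);
      order.insert(order.end(), group.second.begin(), group.second.end());
    }
    return order;  // 'from_node' ends up first, as it has the smallest distance to itself
  }

Regenerating such an order per thread or from time to time would spread
the fallback traffic across equally-distant nodes.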
> Selecting the best possible lgrp dynamically would be a better option
> than the deterministic selection (as in my patch), but that will add
> slightly more overhead during memory allocation.
> Will that be OK?
Since memory allocation happens on a TLAB basis, and we expect TLABs to
be reasonably large, spending a bit more time there should not matter.
The MutableSpace allocation path is only used with NUMA enabled anyway.
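Purely as an illustration of what such a dynamic selection could look
like, here is a hedged sketch against Linux libnuma (not shared code,
and pick_fallback_node/distance_limit are names made up for this
example):

  // Illustrative only: pick the node with the most free memory among the
  // nodes whose distance from 'from_node' does not exceed 'distance_limit'.
  // Uses libnuma's numa_distance() and numa_node_size64().
  #include <numa.h>  // compile and link with -lnuma

  static int pick_fallback_node(int from_node, int distance_limit) {
    if (numa_available() < 0) return -1;  // no NUMA support on this system
    int best_node = -1;
    long long best_free = 0;
    for (int node = 0; node <= numa_max_node(); node++) {
      if (numa_distance(from_node, node) > distance_limit) continue;
      long long free_bytes = 0;
      if (numa_node_size64(node, &free_bytes) < 0) continue;  // node not present
      if (free_bytes > best_free) {
        best_free = free_bytes;
        best_node = node;
      }
    }
    return best_node;  // -1 if no eligible node has free memory left
  }

In the HotSpot code the equivalent information would of course have to
come through the os::numa_* abstraction rather than directly from
libnuma.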
> > > Allocation on a NUMA node that is far from the CPU can lead to
> > > performance issues. Sometimes triggering a GC is a better option
> > > than allocating from a NUMA node at a large distance, i.e. with
> > > high memory latency.
> > >
> > > For this, I have added the option "NumaAllocationDistanceLimit",
> > > which restricts memory allocation from far nodes.
> > >
> > > In the above system we can, for example, set
> > > -XX:NumaAllocationDistanceLimit=16.
> >
> > That makes sense imho, although it is a bit sad that this number is
> > specific to the machine.
[...]
> >
> > Could you send me the patch as a webrev so I can put it on
> > cr.openjdk.java.net? (Or maybe sending the patch as an attachment
> > would help too.) It got mangled by your email program, which added
> > many line breaks.
> >
>
> Please find the patch attached.
I created a webrev at
http://cr.openjdk.java.net/~tschatzl/8205051/webrev from it.
Some comments:
- please add range constraints for NUMAAllocationDistanceLimit to
disallow negative limits. Or are there any reasons to allow negative
distance limits?
I am aware that int is not the perfect type for this, but on the other
hand the API returns ints for it, so we need to stick with it.
Examples can be found in runtime/globals.hpp (look for "range").
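Something roughly along these lines would do; this is only a sketch of
the existing globals.hpp convention, and the default value, bounds and
description text are placeholders rather than a suggestion:

  // Hypothetical flag declaration inside the flag table macro in
  // runtime/globals.hpp; note the trailing backslashes and the range() clause.
  product(intx, NUMAAllocationDistanceLimit, 0,                             \
          "Maximum NUMA distance from which the VM may allocate memory. "   \
          "0 means no limit.")                                              \
          range(0, max_intx)                                                \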
- please make sure that the code complies with the HotSpot coding
style, particularly for the "numaNode" class. Indentation as well as
member, local variable and other naming are off.
https://wiki.openjdk.java.net/display/HotSpot/StyleGuide
Also look at existing code, e.g. in mutableSpace.hpp.
- would it be possible to factor out an "allocate_from_node()" (or
"try_allocate_from_node()") method from the MutableSpace::allocate code
and use it? Basically the code that is now in the while-loop; see the
sketch below. That would probably make the code a lot more readable.
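To show the shape I have in mind, here is a minimal, self-contained
sketch with simplified stand-in types (it is not the actual
MutableSpace code; NodeSpace, try_allocate_from_node and
nodes_in_distance_order are made up for this illustration):

  #include <cstddef>
  #include <vector>

  // Stand-in for the per-node space; the real code bumps the top pointer
  // with a CAS, which is omitted here for brevity.
  struct NodeSpace {
    char* top;
    char* end;
    char* try_allocate(size_t size) {
      if (top + size > end) return NULL;  // does not fit on this node
      char* result = top;
      top += size;
      return result;
    }
  };

  // The factored-out helper: attempt an allocation on exactly one node.
  static char* try_allocate_from_node(NodeSpace& node, size_t size) {
    return node.try_allocate(size);
  }

  // The caller walks the distance-ordered (and distance-limited) node list
  // and falls back to the next node; returning NULL lets it trigger a GC.
  static char* allocate(std::vector<NodeSpace>& nodes_in_distance_order, size_t size) {
    for (NodeSpace& node : nodes_in_distance_order) {
      char* result = try_allocate_from_node(node, size);
      if (result != NULL) return result;
    }
    return NULL;
  }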
- I am almost sure that the code does not compile on Solaris. Since
MutableSpace is shared code, it needs to still compile and run there
too. Please at least add suitable functional stubs for Solaris (and any
other OSes) to be filled out if you can't provide them yourself.
By the way, that code is also the one mixing the "NumaNode" and "lgrp"
identifiers :( Maybe "numaNode" can be replaced by something more
suitable.
- some typos in the identifier names: "avaliable" -> "available"
> I have run a SPECjbb2015 composite run with "numactl -N 0 <composite
> program>" on an 8 NUMA node system.
> It has improved the score:
> composite Max-jOPS: +29%
> composite Critical-jOPS: +24%
You mean that when bound to a single node, we do not regress as much
any more?
Are there any changes in performance, before and after your change, in
the case when all nodes are in use with NUMA support?
Thanks,
Thomas