[PATCH] JDK-8205051 (UseNUMA memory interleaving vs cpunodebind & localalloc)
Roshan Mangal
roshanmangal at gmail.com
Fri Sep 28 09:31:23 UTC 2018
Hi Thomas,
> Hi Roshan,
>
> On Tue, 2018-09-25 at 12:18 +0530, roshan mangal wrote:
> > Hi All,
> >
> > This Patch is for https://bugs.openjdk.java.net/browse/JDK-8205051
> >
> > Issue:
> >
> > If the JVM isn't allowed to run on all of the NUMA nodes (restricted
> > by numactl, cgroups, Docker, etc.), then a significant fraction of
> > the Java heap will be unusable, causing premature GCs.
> >
> > Every thread captures its locality group (lgrp) and allocates memory
> > from that lgrp.
> >
> > The lgrp id is the same as the NUMA node id.
> >
> > A thread running on a CPU belonging to NUMA node 0 will capture
> > Thread->lgrp as lgrp0 and will allocate memory from NUMA node 0. Once
> > NUMA node 0 is full, it will trigger a GC irrespective of other NUMA
> > nodes having free memory.
> >
> > Solution proposed:
> >
> > Create a list of NUMA nodes for each lgrp, ordered by distance, and
> > allocate memory from the next-nearest NUMA node when the closer
> > nodes are full.
> >
> > The system below has eight NUMA nodes, with the following distance
> > table.
> >
> > node distances:
> >
> > node   0   1   2   3   4   5   6   7
> >   0:  10  16  16  16  32  32  32  32
> >   1:  16  10  16  16  32  32  32  32
> >   2:  16  16  10  16  32  32  32  32
> >   3:  16  16  16  10  32  32  32  32
> >   4:  32  32  32  32  10  16  16  16
> >   5:  32  32  32  32  16  10  16  16
> >   6:  32  32  32  32  16  16  10  16
> >   7:  32  32  32  32  16  16  16  10
> >
> > The corresponding list for each lgrp will look like this:
> >
> > Thread's lgrp   Order of allocation in NUMA nodes
> >
> > lgrp0 [ numaNode0->numaNode1->numaNode2->numaNode3->
> > numaNode4->numaNode5->numaNode6->numaNode7 ]
> > lgrp1 [ numaNode1->numaNode0->numaNode2->numaNode3->
> > numaNode4->numaNode5->numaNode6->numaNode7 ]
> > lgrp2 [ numaNode2->numaNode0->numaNode1->numaNode3->
> > numaNode4->numaNode5->numaNode6->numaNode7 ]
> > lgrp3 [ numaNode3->numaNode0->numaNode1->numaNode2->
> > numaNode4->numaNode5->numaNode6->numaNode7 ]
> > lgrp4 [ numaNode4->numaNode5->numaNode6->numaNode7->
> > numaNode0->numaNode1->numaNode2->numaNode3 ]
> > lgrp5 [ numaNode5->numaNode4->numaNode6->numaNode7->
> > numaNode0->numaNode1->numaNode2->numaNode3 ]
> > lgrp6 [ numaNode6->numaNode4->numaNode5->numaNode7->
> > numaNode0->numaNode1->numaNode2->numaNode3 ]
> > lgrp7 [ numaNode7->numaNode4->numaNode5->numaNode6->
> > numaNode0->numaNode1->numaNode2->numaNode3 ]
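
For illustration, here is a minimal standalone sketch of how such an
order-of-allocation list can be derived from the distance table. It
uses libnuma directly rather than HotSpot's internal wrappers, and the
function name allocation_order is mine, not from the patch (compile
with g++ -std=c++11 order.cpp -lnuma):

#include <numa.h>      // numa_available, numa_max_node, numa_distance
#include <algorithm>
#include <cstdio>
#include <vector>

// Build the allocation order for one lgrp: all nodes, nearest first.
// numa_distance(a, b) returns the SLIT distance (10 = local node).
static std::vector<int> allocation_order(int lgrp) {
  std::vector<int> nodes;
  for (int n = 0; n <= numa_max_node(); n++) {
    nodes.push_back(n);
  }
  // A stable sort keeps lower node ids first among equally distant
  // nodes, reproducing the deterministic order in the lists above.
  std::stable_sort(nodes.begin(), nodes.end(), [lgrp](int a, int b) {
    return numa_distance(lgrp, a) < numa_distance(lgrp, b);
  });
  return nodes;
}

int main() {
  if (numa_available() < 0) return 1;  // no NUMA support / no libnuma
  for (int lgrp = 0; lgrp <= numa_max_node(); lgrp++) {
    printf("lgrp%d:", lgrp);
    for (int n : allocation_order(lgrp)) printf(" numaNode%d", n);
    printf("\n");
  }
  return 0;
}
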
>
> I have a question about this: lgrps often have the same distance from
> each other, and this order-of-allocation list seems to be
> deterministic. So in this case nodes with lower lgrp id (but the same
> distance) are preferred to ones with higher lgrp id.
>
> Do you expect some imbalance because of that? If so, couldn't it be
> useful to randomize lgrps with the same distance in this list, and
> regularly change them?
Yes, I agree there will be an imbalance because of that.
Another option would be to select the lgrp with the largest free
memory available. What is your opinion on this?
For example:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10
Threads T0, T1, T2 and T3 are running on different NUMA nodes with
lgrp0, lgrp1, lgrp2 and lgrp3 respectively.
Allocation within a node has distance 10; outside the node it is 16.
Suppose lgrp0 runs out of memory when the free memory is: lgrp0 (0%),
lgrp1 (10%), lgrp2 (5%), lgrp3 (50%).
If we choose an lgrp randomly for T0, e.g. lgrp2 over lgrp3, lgrp2
will fill up faster, and once lgrp2 is full, T2 will also have to go
to another lgrp at distance 16.
In this scenario, T0 choosing lgrp3, with 50% free memory, is the
better option; it lets T2 run longer on lgrp2 with the lower memory
latency (10).
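
As a sketch of that "largest free memory" policy (plain libnuma again;
pick_by_free_memory is a made-up name for illustration, not code from
the patch):

#include <numa.h>   // numa_node_size64

// Among candidate nodes (e.g. those at equal distance), pick the one
// currently reporting the most free memory; returns -1 if none does.
static int pick_by_free_memory(const int* candidates, int count) {
  int best = -1;
  long long best_free = 0;
  for (int i = 0; i < count; i++) {
    long long free_bytes = 0;
    numa_node_size64(candidates[i], &free_bytes);  // fills free_bytes
    if (free_bytes > best_free) {
      best_free = free_bytes;
      best = candidates[i];
    }
  }
  return best;
}

In the example above this would send T0 to lgrp3 (50% free) rather
than lgrp2 (5% free). The extra per-allocation query is also where the
overhead mentioned below would come from.
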
> Long ago I was implementing some NUMA support for G1 and had that
> issue (in that case the distribution was a nice lattice, with
> everyone connected to everyone else within two hops), with the
> above-mentioned solution to that "problem".
>
> Do you think something like this would make sense (not necessarily in
> this change)?
>
Selecting the best possible lgrp dynamically would be a better option
than the deterministic selection in my patch, but it will add slightly
more overhead during memory allocation.
Would that be acceptable?
> > Allocation on a NUMA node that is far from the CPU can lead to
> > performance issues. Sometimes triggering a GC is a better option
> > than allocating from a NUMA node at a large distance, i.e. with
> > high memory latency.
> >
> > For this, I have added the option "NumaAllocationDistanceLimit",
> > which restricts memory allocation from the far nodes.
> >
> > In the above system, suppose we set
> > -XX:NumaAllocationDistanceLimit=16.
>
> That makes sense imho, although it is a bit sad that this number is
> specific to the machine.
>
> >
> > The corresponding list for each lgrp will then look like this:
> >
> > Thread's lgrp   Order of allocation in NUMA nodes
> > lgrp0 [ numaNode0->numaNode1->numaNode2->numaNode3 ]
> > lgrp1 [ numaNode1->numaNode0->numaNode2->numaNode3 ]
> > lgrp2 [ numaNode2->numaNode0->numaNode1->numaNode3 ]
> > lgrp3 [ numaNode3->numaNode0->numaNode1->numaNode2 ]
> > lgrp4 [ numaNode4->numaNode5->numaNode6->numaNode7 ]
> > lgrp5 [ numaNode5->numaNode4->numaNode6->numaNode7 ]
> > lgrp6 [ numaNode6->numaNode4->numaNode5->numaNode7 ]
> > lgrp7 [ numaNode7->numaNode4->numaNode5->numaNode6 ]
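
A sketch of how such a distance cutoff can prune the per-lgrp list:
with the limit set to 16, the remote nodes at distance 32 are dropped,
so allocation fails over to a GC instead of a distant node. This is
again illustrative libnuma code (limited_order is my name for it), not
the patch itself:

#include <numa.h>   // numa_max_node, numa_distance
#include <vector>

// Keep only the nodes within the latency budget; sorting the result
// by distance, as in the earlier sketch, gives the lists shown above.
static std::vector<int> limited_order(int lgrp, int distance_limit) {
  std::vector<int> nodes;
  for (int n = 0; n <= numa_max_node(); n++) {
    if (numa_distance(lgrp, n) <= distance_limit) {
      nodes.push_back(n);
    }
  }
  return nodes;
}
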
> >
> > ########################### PATCH ###########################
>
> Could you send me the patch as a webrev so I can put it on
> cr.openjdk.java.net? (Or maybe sending the patch as an attachment
> would help too.) It got mangled by your email program, which added
> many line breaks.
>
Please find the patch attached.
I have run a SPECjbb2015 composite run with "numactl -N 0 <composite
program>" on the 8-NUMA-node system.
It improved the score:
composite Max-jOPS:      +29%
composite Critical-jOPS: +24%
> Thanks,
> Thomas
Thanks,
Roshan Mangal