RFR(XS) 8181055: "mbind: Invalid argument" still seen after 8175813

Gustavo Romero gromero at linux.vnet.ibm.com
Sat May 27 00:34:50 UTC 2017


Hi Zhengyu,

Thanks a lot for taking care of this corner case on PPC64.

On 26-05-2017 10:41, Zhengyu Gu wrote:
> This is a quick way to kill the symptom (or low risk?). I am not sure if disabling NUMA is a better solution for this circumstance? does 1 NUMA node = UMA?

On PPC64, 1 (configured) NUMA node does not necessarily imply UMA. Take the
POWER7 machine where you found the corner case (I copy below the data you
provided in the JBS - thanks for the additional information):

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus:
node 1 size: 7680 MB
node 1 free: 1896 MB
node distances:
node 0 1
   0: 10 40
   1: 40 10

CPUs in node0 have no alternative but to allocate memory from node1. In that
case the CPUs in node0 always access remote memory in node1 at a constant
distance (40), so in that particular case we could say that 1 NUMA (configured)
node == UMA. Nonetheless, if you add CPUs to node1 (by populating the other
socket present on the board) you end up with CPUs at different distances from
the node that has configured memory (in that case, node1), which yields a
configuration where 1 NUMA (configured) node != UMA (i.e. the distances are not
all equal to a single value).

On the other hand, the POWER7 machine configuration in question is bad (and
rare). It is indeed hurting the whole system's performance, and it would be
reasonable to open the machine and move the memory module from the bank
associated with node1 to the bank associated with node0, because all CPUs are
accessing remote memory without any apparent necessity. Once that is changed,
all CPUs will have local memory (distance = 10).


> Thanks,
> 
> -Zhengyu
> 
> On 05/26/2017 09:14 AM, Zhengyu Gu wrote:
>> Hi,
>>
>> There is a corner case that still failed after JDK-8175813.
>>
>> The system shows that it has multiple NUMA nodes, but only one is
>> configured. Under this scenario, numa_interleave_memory() call will
>> result "mbind: Invalid argument" message.
>>
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8181055
>> Webrev: http://cr.openjdk.java.net/~zgu/8181055/webrev.00/

It looks like even for that rare POWER7 NUMA topology numa_interleave_memory()
should succeed without "mbind: Invalid argument", since the 'mask' argument
should already be a mask containing only nodes from which memory can be
allocated, i.e. a mask of configured nodes only (even if the mask contains just
one configured node, as in
http://cr.openjdk.java.net/~gromero/logs/numa_only_one_node.txt).

Inspecting a little bit more, it looks like the problem boils down to the fact
that the JVM passes 'numa_all_nodes' [1] to numa_interleave_memory() in
Linux::numa_interleave_memory().

One would expect 'numa_all_nodes' (which is API v1) to track the same
information as 'numa_all_nodes_ptr' (API v2) [2]; however, there is a subtle but
important difference:

'numa_all_nodes' is constructed assuming a consecutive node distribution [3]:

100         max = numa_num_configured_nodes();
101         for (i = 0; i < max; i++)
102                 nodemask_set_compat((nodemask_t *)&numa_all_nodes, i);


whilst 'numa_all_nodes_ptr' is constructed by parsing /proc/self/status [4]:

499                 if (strncmp(buffer,"Mems_allowed:",13) == 0) {
500                         numprocnode = read_mask(mask, numa_all_nodes_ptr);

Thus for a topology like:

available: 4 nodes (0-1,16-17)
node 0 cpus: 0 8 16 24 32
node 0 size: 130706 MB
node 0 free: 145 MB
node 1 cpus: 40 48 56 64 72
node 1 size: 0 MB
node 1 free: 0 MB
node 16 cpus: 80 88 96 104 112
node 16 size: 130630 MB
node 16 free: 529 MB
node 17 cpus: 120 128 136 144 152
node 17 size: 0 MB
node 17 free: 0 MB
node distances:
node 0 1 16 17
   0: 10 20 40 40
   1: 20 10 40 40
  16: 40 40 10 20
  17: 40 40 20 10

numa_all_nodes=0x3         => 0b11                (node0 and node1)
numa_all_nodes_ptr=0x10001 => 0b10000000000000001 (node0 and node16)

(Please, see details in the following gdb log: http://cr.openjdk.java.net/~gromero/logs/numa_api_v1_vs_api_v2.txt)
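For completeness, a quick way to reproduce that comparison without gdb would be
a tiny program like the sketch below (untested on that exact topology; it
rebuilds the v1-style mask by hand, following the consecutive-node assumption in
libnuma.c lines 100-102 quoted above, and compares it to the API v2
'numa_all_nodes_ptr' mask; the file name is only illustrative, build with
"gcc maskcmp.c -lnuma"):

/*
 * Sketch only: compare the node set the API v1 logic would produce
 * (consecutive ids 0..numa_num_configured_nodes()-1) with the API v2
 * mask numa_all_nodes_ptr (nodes with memory the process may allocate from).
 */
#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma unavailable\n");
        return 1;
    }

    /* v1-style mask: assumes node ids are consecutive and start at 0 */
    printf("v1-style (consecutive) nodes:");
    for (int i = 0; i < numa_num_configured_nodes(); i++)
        printf(" %d", i);
    printf("\n");

    /* v2 mask: nodes with memory from which this process can allocate */
    printf("numa_all_nodes_ptr nodes:    ");
    for (int i = 0; i <= numa_max_node(); i++)
        if (numa_bitmask_isbitset(numa_all_nodes_ptr, i))
            printf(" %d", i);
    printf("\n");
    return 0;
}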

For that 4-node topology, passing node0 and node1, although suboptimal, does not
bother mbind() since the following requirement is satisfied:

"[nodemask] must contain at least one node that is on-line, allowed by the
process's current cpuset context, and contains memory."

So back to the POWER7 case, I suppose that for:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus:
node 1 size: 7680 MB
node 1 free: 1896 MB
node distances:
node 0 1
   0: 10 40
   1: 40 10

numa_all_nodes=0x1         => 0b01 (node0)
numa_all_nodes_ptr=0x2     => 0b10 (node1)

and hence numa_interleave_memory() gets nodemask = 0x1 (node0), which indeed
contains no memory. That said, I don't know for sure whether passing just node1
in the 'nodemask' would satisfy mbind(), as in that case there are no CPUs
available in node1.

Summing up, it looks like the root cause is not that numa_interleave_memory()
does not accept only one configured node, but that the configured node being
passed is the wrong one. I could not find a similar NUMA topology in my pool of
machines to test further, but it might be worth writing a small test using API
v2 and 'numa_all_nodes_ptr' instead of 'numa_all_nodes' to see how
numa_interleave_memory() behaves on that machine :) If it behaves well, updating
to API v2 would be a solution.
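In case it helps, the small test I have in mind would be roughly like the sketch
below (again, untested on that topology; it just interleaves a test mapping
using the API v2 mask 'numa_all_nodes_ptr'; on failure libnuma itself prints the
"mbind: ..." message via numa_error(), and the errno check is only a rough extra
hint; the file name is only illustrative, build with "gcc interleave_v2.c
-lnuma"):

/*
 * Sketch only: call numa_interleave_memory() (API v2 signature) with
 * numa_all_nodes_ptr instead of the v1 numa_all_nodes mask.
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma unavailable\n");
        return 1;
    }

    size_t len = 4 << 20;  /* 4 MB anonymous test mapping */
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    errno = 0;
    /* v2 call: mask of nodes with memory this process may allocate from */
    numa_interleave_memory(mem, len, numa_all_nodes_ptr);
    if (errno != 0)
        printf("numa_interleave_memory: %s\n", strerror(errno));
    else
        printf("no error reported with numa_all_nodes_ptr\n");

    munmap(mem, len);
    return 0;
}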

HTH

Regards,
Gustavo


[1] http://hg.openjdk.java.net/jdk10/hs/hotspot/file/4b93e1b1d5b7/src/os/linux/vm/os_linux.hpp#l274
[2] from libnuma.c:608 numa_all_nodes_ptr: "it only tracks nodes with memory from which the calling process can allocate."
[3] https://github.com/numactl/numactl/blob/master/libnuma.c#L100-L102
[4] https://github.com/numactl/numactl/blob/master/libnuma.c#L499-L500

>>
>> The system NUMA configuration:
>>
>> Architecture: ppc64
>> CPU op-mode(s): 32-bit, 64-bit
>> Byte Order: Big Endian
>> CPU(s): 8
>> On-line CPU(s) list: 0-7
>> Thread(s) per core: 4
>> Core(s) per socket: 1
>> Socket(s): 2
>> NUMA node(s): 2
>> Model: 2.1 (pvr 003f 0201)
>> Model name: POWER7 (architected), altivec supported
>> L1d cache: 32K
>> L1i cache: 32K
>> NUMA node0 CPU(s): 0-7
>> NUMA node1 CPU(s):
>>
>> Thanks,
>>
>> -Zhengyu
> 


