RFR(XS) 8181055: "mbind: Invalid argument" still seen after 8175813

Zhengyu Gu zgu at redhat.com
Mon May 29 02:08:41 UTC 2017


Hi Gustavo,

Thanks for the detailed analysis and suggestion. I did not realize the 
difference between bitmask and nodemask.

As you suggested, numa_interleave_memory_v2 works under this configuration.
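
For reference, this is roughly what the v2 binding amounts to (a minimal
sketch, not the actual webrev change; the "libnuma_1.2" symbol version string
is an assumption about how libnuma publishes its v2 interface):

/* Sketch only: check that the v2 numa_interleave_memory, which takes a
 * struct bitmask* instead of a nodemask_t*, can be resolved at runtime.
 * Build: gcc sketch.c -ldl */
#define _GNU_SOURCE           /* for dlvsym() */
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

struct bitmask;               /* opaque here; defined in <numa.h> */
typedef void (*numa_interleave_memory_v2_t)(void *, size_t, struct bitmask *);

int main(void) {
    void *handle = dlopen("libnuma.so.1", RTLD_LAZY);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }
    /* Assumption: libnuma exports the v2 entry point under "libnuma_1.2". */
    numa_interleave_memory_v2_t v2 = (numa_interleave_memory_v2_t)
        dlvsym(handle, "numa_interleave_memory", "libnuma_1.2");
    printf("v2 numa_interleave_memory %s\n", v2 != NULL ? "found" : "not found");
    return 0;
}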


Updated webrev: http://cr.openjdk.java.net/~zgu/8181055/webrev.01/


Thanks,

-Zhengyu



On 05/26/2017 08:34 PM, Gustavo Romero wrote:
> Hi Zhengyu,
>
> Thanks a lot for taking care of this corner case on PPC64.
>
> On 26-05-2017 10:41, Zhengyu Gu wrote:
>> This is a quick way to kill the symptom (or the low-risk one?). I am not sure if
>> disabling NUMA is a better solution for this circumstance. Does 1 NUMA node = UMA?
>
> On PPC64, 1 configured NUMA node does not necessarily imply UMA. In the POWER7
> machine where you found the corner case (I copy below the data you provided in the
> JBS - thanks for the additional information):
>
> $ numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus:
> node 1 size: 7680 MB
> node 1 free: 1896 MB
> node distances:
> node   0   1
>   0:  10  40
>   1:  40  10
>
> CPUs in node0 have no alternative besides allocating memory from node1. They are
> always accessing remote memory from node1 at a constant distance (40), so in that
> case we could say that 1 configured NUMA node == UMA.
> Nonetheless, if you add CPUs to node1 (by filling up the other socket present on
> the board) you will end up with CPUs at different distances from the node that
> has configured memory (in this case, node1), which yields a configuration where
> 1 configured NUMA node != UMA (i.e. distances are not always equal to a single
> value).
>
> On the other hand, the POWER7 machine configuration in question is bad (and
> rare). It is indeed hurting the performance of the whole system, and it would be
> reasonable to open the machine and move the memory module from the bank associated
> with node1 to the bank associated with node0, because all CPUs are accessing remote
> memory without any apparent necessity. Once you change that, all CPUs will have
> local memory (distance = 10).
>
>
>> Thanks,
>>
>> -Zhengyu
>>
>> On 05/26/2017 09:14 AM, Zhengyu Gu wrote:
>>> Hi,
>>>
>>> There is a corner case that still failed after JDK-8175813.
>>>
>>> The system shows that it has multiple NUMA nodes, but only one is
>>> configured. Under this scenario, the numa_interleave_memory() call
>>> results in an "mbind: Invalid argument" message.
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8181055
>>> Webrev: http://cr.openjdk.java.net/~zgu/8181055/webrev.00/
>
> It looks like even for that rare POWER7 NUMA topology numa_interleave_memory()
> should succeed without "mbind: Invalid argument", since the 'mask' argument
> should already be a mask containing only nodes from which memory can be allocated,
> i.e. only a mask of configured nodes (even if the mask contains only one configured
> node, as in http://cr.openjdk.java.net/~gromero/logs/numa_only_one_node.txt).
>
> Inspecting a little bit more, it looks like the problem boils down to the
> fact that the JVM is passing 'numa_all_nodes' [1] to numa_interleave_memory() in
> Linux::numa_interleave_memory().
>
> One would expect that 'numa_all_nodes' (api v1) would track the same
> information as 'numa_all_nodes_ptr' (api v2) [2]; however, there is a subtle but
> important difference:
>
> 'numa_all_nodes' is constructed assuming a consecutive node distribution [3]:
>
> 100         max = numa_num_configured_nodes();
> 101         for (i = 0; i < max; i++)
> 102                 nodemask_set_compat((nodemask_t *)&numa_all_nodes, i);
>
>
> whilst 'numa_all_nodes_ptr' is constructed parsing /proc/self/status [4]:
>
> 499                 if (strncmp(buffer,"Mems_allowed:",13) == 0) {
> 500                         numprocnode = read_mask(mask, numa_all_nodes_ptr);
>
> Thus for a topology like:
>
> available: 4 nodes (0-1,16-17)
> node 0 cpus: 0 8 16 24 32
> node 0 size: 130706 MB
> node 0 free: 145 MB
> node 1 cpus: 40 48 56 64 72
> node 1 size: 0 MB
> node 1 free: 0 MB
> node 16 cpus: 80 88 96 104 112
> node 16 size: 130630 MB
> node 16 free: 529 MB
> node 17 cpus: 120 128 136 144 152
> node 17 size: 0 MB
> node 17 free: 0 MB
> node distances:
> node   0   1  16  17
>   0:  10  20  40  40
>   1:  20  10  40  40
>  16:  40  40  10  20
>  17:  40  40  20  10
>
> numa_all_nodes=0x3         => 0b11                (node0 and node1)
> numa_all_nodes_ptr=0x10001 => 0b10000000000000001 (node0 and node16)
>
> (Please, see details in the following gdb log: http://cr.openjdk.java.net/~gromero/logs/numa_api_v1_vs_api_v2.txt)
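>
> The same comparison can be made without gdb with a few lines against the v2 API
> (just a sketch; instead of reading the deprecated 'numa_all_nodes' symbol, the v1
> value is recomputed here the same way libnuma.c lines 100-102 build it):
>
> /* Sketch: recompute what the v1 numa_all_nodes mask assumes (consecutive nodes
>  * 0..numa_num_configured_nodes()-1) and compare it with the v2
>  * numa_all_nodes_ptr, which is parsed from /proc/self/status.
>  * Build: gcc sketch.c -lnuma */
> #include <numa.h>
> #include <stdio.h>
>
> int main(void) {
>     if (numa_available() < 0) {
>         fprintf(stderr, "libnuma not available\n");
>         return 1;
>     }
>     unsigned long v1 = 0;
>     for (int i = 0; i < numa_num_configured_nodes(); i++)
>         v1 |= 1UL << i;                 /* mirrors libnuma.c lines 100-102 */
>     /* the first word of each mask is enough for node numbers < 64 */
>     printf("v1-style numa_all_nodes = 0x%lx\n", v1);
>     printf("numa_all_nodes_ptr      = 0x%lx\n", *numa_all_nodes_ptr->maskp);
>     return 0;
> }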
>
> In that case passing node0 and node1, although suboptimal, does not bother
> mbind(), since the following requirement is satisfied:
>
> "[nodemask] must contain at least one node that is on-line, allowed by the
> process's current cpuset context, and contains memory."
>
> So back to the POWER7 case, I suppose that for:
>
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus:
> node 1 size: 7680 MB
> node 1 free: 1896 MB
> node distances:
> node   0   1
>   0:  10  40
>   1:  40  10
>
> numa_all_nodes=0x1         => 0b01 (node0)
> numa_all_nodes_ptr=0x2     => 0b10 (node1)
>
> and hence numa_interleave_memory() gets nodemask = 0x1 (node0), which indeed
> contains no memory. That said, I don't know for sure if passing just node1 in the
> 'nodemask' will satisfy mbind(), as in that case there are no cpus available in
> node1.
>
> Summing up, it looks like the root cause is not that numa_interleave_memory()
> does not accept only one configured node, but that the configured node being
> passed is wrong. I could not find a similar NUMA topology in my pool to test
> further, but it might be worth writing a small test using api v2 and
> 'numa_all_nodes_ptr' instead of 'numa_all_nodes' to see how numa_interleave_memory()
> behaves on that machine, for instance along the lines of the sketch below :) If it
> behaves well, updating to api v2 would be a solution.
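>
> Something like this (an untested sketch, just to illustrate the idea) could be
> compiled with 'gcc test.c -lnuma' and run on that machine:
>
> /* Interleave an anonymous mapping using the v2 API and numa_all_nodes_ptr.
>  * If mbind() rejects the mask, libnuma itself prints the
>  * "mbind: Invalid argument" message via numa_error(). */
> #include <numa.h>
> #include <stdio.h>
> #include <string.h>
> #include <sys/mman.h>
>
> int main(void) {
>     if (numa_available() < 0) {
>         fprintf(stderr, "libnuma not available\n");
>         return 1;
>     }
>     size_t sz = 64UL * 1024 * 1024;
>     void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
>                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>     if (p == MAP_FAILED) { perror("mmap"); return 1; }
>
>     /* v2 signature: (void *mem, size_t size, struct bitmask *mask) */
>     numa_interleave_memory(p, sz, numa_all_nodes_ptr);
>     memset(p, 1, sz);   /* touch the pages so the policy is actually exercised */
>
>     printf("interleaved %zu bytes over mask 0x%lx\n",
>            sz, *numa_all_nodes_ptr->maskp);
>     return 0;
> }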
>
> HTH
>
> Regards,
> Gustavo
>
>
> [1] http://hg.openjdk.java.net/jdk10/hs/hotspot/file/4b93e1b1d5b7/src/os/linux/vm/os_linux.hpp#l274
> [2] from libnuma.c:608 numa_all_nodes_ptr: "it only tracks nodes with memory from which the calling process can allocate."
> [3] https://github.com/numactl/numactl/blob/master/libnuma.c#L100-L102
> [4] https://github.com/numactl/numactl/blob/master/libnuma.c#L499-L500
>
>>>
>>> The system NUMA configuration:
>>>
>>> Architecture: ppc64
>>> CPU op-mode(s): 32-bit, 64-bit
>>> Byte Order: Big Endian
>>> CPU(s): 8
>>> On-line CPU(s) list: 0-7
>>> Thread(s) per core: 4
>>> Core(s) per socket: 1
>>> Socket(s): 2
>>> NUMA node(s): 2
>>> Model: 2.1 (pvr 003f 0201)
>>> Model name: POWER7 (architected), altivec supported
>>> L1d cache: 32K
>>> L1i cache: 32K
>>> NUMA node0 CPU(s): 0-7
>>> NUMA node1 CPU(s):
>>>
>>> Thanks,
>>>
>>> -Zhengyu
>>
>

