RFR(XS) 8181055: "mbind: Invalid argument" still seen after 8175813

David Holmes david.holmes at oracle.com
Mon May 29 04:34:28 UTC 2017


Hi Zhengyu,

On 29/05/2017 12:08 PM, Zhengyu Gu wrote:
> Hi Gustavo,
> 
> Thanks for the detailed analysis and suggestion. I did not realize the 
> difference between bitmask and nodemask.
> 
> As you suggested, numa_interleave_memory_v2 works under this configuration.
> 
> Please see the updated webrev: http://cr.openjdk.java.net/~zgu/8181055/webrev.01/

The addition of support for the "v2" API seems okay, though I think this 
comment in the existing code now needs some clarification:

2837 // If we are running with libnuma version > 2, then we should
2838 // be trying to use symbols with versions 1.1
2839 // If we are running with earlier version, which did not have symbol versions,
2840 // we should use the base version.
2841 void* os::Linux::libnuma_dlsym(void* handle, const char *name) {

given that we now explicitly load the v1.2 symbol if present.
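
For reference, the two lookups amount to roughly the following (just a sketch
to illustrate the versioned symbol handling, not the exact webrev code;
dlvsym() is the GNU extension used to request a specific symbol version):

#define _GNU_SOURCE              /* for dlvsym() */
#include <dlfcn.h>

/* Existing behaviour: prefer the "libnuma_1.1" binding and fall back to
   the unversioned base symbol for libnuma builds without symbol versions. */
static void* libnuma_dlsym(void* handle, const char* name) {
  void* f = dlvsym(handle, name, "libnuma_1.1");
  if (f == NULL) {
    f = dlsym(handle, name);
  }
  return f;
}

/* New lookup for the v2 API: return the "libnuma_1.2" binding if present,
   NULL otherwise, so the caller can fall back to the v1 code path. */
static void* libnuma_v2_dlsym(void* handle, const char* name) {
  return dlvsym(handle, name, "libnuma_1.2");
}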

Gustavo: can you vouch for the suitability of using the v2 API in all 
cases, if it exists?

I'm running this through JPRT now.

Thanks,
David

> 
> Thanks,
> 
> -Zhengyu
> 
> 
> 
> On 05/26/2017 08:34 PM, Gustavo Romero wrote:
>> Hi Zhengyu,
>>
>> Thanks a lot for taking care of this corner case on PPC64.
>>
>> On 26-05-2017 10:41, Zhengyu Gu wrote:
>>> This is a quick way to kill the symptom (or is it low risk?). I am not 
>>> sure if disabling NUMA would be a better solution for this circumstance? 
>>> Does 1 NUMA node = UMA?
>>
>> On PPC64, 1 (configured) NUMA node does not necessarily imply UMA. In the
>> POWER7 machine where you found the corner case (I copy below the data you
>> provided in the JBS - thanks for the additional information):
>>
>> $ numactl -H
>> available: 2 nodes (0-1)
>> node 0 cpus: 0 1 2 3 4 5 6 7
>> node 0 size: 0 MB
>> node 0 free: 0 MB
>> node 1 cpus:
>> node 1 size: 7680 MB
>> node 1 free: 1896 MB
>> node distances:
>> node   0   1
>>   0:  10  40
>>   1:  40  10
>>
>> CPUs in node0 have no other alternative besides allocating memory from node1.
>> In that case CPUs in node0 are always accessing remote memory from node1 at a
>> constant distance (40), so we could say that 1 NUMA (configured) node == UMA.
>> Nonetheless, if you add CPUs in node1 (by filling up the other socket present
>> in the board) you will end up with CPUs at different distances from the node
>> that has configured memory (in that case, node1), so it yields a configuration
>> where 1 NUMA (configured) node != UMA (i.e. distances are not always equal to
>> a single value).
>>
>> On the other hand, the POWER7 machine configuration in question is bad (and
>> rare). It's indeed impacting the whole system's performance, and it would be
>> reasonable to open the machine and move the memory module from the bank
>> related to node1 to the bank related to node0, because all CPUs are accessing
>> remote memory without any apparent necessity. Once you change that, all CPUs
>> will have local memory (distance = 10).
>>
>>
>>> Thanks,
>>>
>>> -Zhengyu
>>>
>>> On 05/26/2017 09:14 AM, Zhengyu Gu wrote:
>>>> Hi,
>>>>
>>>> There is a corner case that still failed after JDK-8175813.
>>>>
>>>> The system shows that it has multiple NUMA nodes, but only one is
>>>> configured. Under this scenario, the numa_interleave_memory() call
>>>> results in an "mbind: Invalid argument" message.
>>>>
>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8181055
>>>> Webrev: http://cr.openjdk.java.net/~zgu/8181055/webrev.00/
>>
>> It looks like even for that rare POWER7 NUMA topology numa_interleave_memory()
>> should succeed without "mbind: Invalid argument", since the 'mask' argument
>> should already be a mask containing only nodes from which memory can be
>> allocated, i.e. only a mask of configured nodes (even if the mask contains
>> only one configured node, as in
>> http://cr.openjdk.java.net/~gromero/logs/numa_only_one_node.txt).
>>
>> Inspecting a little bit more, it looks like the problem boils down to the
>> fact that the JVM is passing 'numa_all_nodes' [1] to numa_interleave_memory()
>> in Linux::numa_interleave_memory().
>>
>> One would expect that 'numa_all_nodes' (which is API v1) tracks the same
>> information as 'numa_all_nodes_ptr' (API v2) [2]; however, there is a subtle
>> but important difference:
>>
>> 'numa_all_nodes' is constructed assuming a consecutive node 
>> distribution [3]:
>>
>> 100         max = numa_num_configured_nodes();
>> 101         for (i = 0; i < max; i++)
>> 102                 nodemask_set_compat((nodemask_t *)&numa_all_nodes, i);
>>
>>
>> whilst 'numa_all_nodes_ptr' is constructed by parsing /proc/self/status [4]:
>>
>> 499                 if (strncmp(buffer,"Mems_allowed:",13) == 0) {
>> 500                         numprocnode = read_mask(mask, numa_all_nodes_ptr);
>>
>> Thus for a topology like:
>>
>> available: 4 nodes (0-1,16-17)
>> node 0 cpus: 0 8 16 24 32
>> node 0 size: 130706 MB
>> node 0 free: 145 MB
>> node 1 cpus: 40 48 56 64 72
>> node 1 size: 0 MB
>> node 1 free: 0 MB
>> node 16 cpus: 80 88 96 104 112
>> node 16 size: 130630 MB
>> node 16 free: 529 MB
>> node 17 cpus: 120 128 136 144 152
>> node 17 size: 0 MB
>> node 17 free: 0 MB
>> node distances:
>> node   0   1  16  17
>>   0:  10  20  40  40
>>   1:  20  10  40  40
>>  16:  40  40  10  20
>>  17:  40  40  20  10
>>
>> numa_all_nodes=0x3         => 0b11                (node0 and node1)
>> numa_all_nodes_ptr=0x10001 => 0b10000000000000001 (node0 and node16)
>>
>> (Please, see details in the following gdb log: 
>> http://cr.openjdk.java.net/~gromero/logs/numa_api_v1_vs_api_v2.txt)
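>>
>> A tiny diagnostic along these lines prints both masks per node for
>> comparison (rough, untested sketch; it assumes the installed numa.h still
>> exposes the v1 'numa_all_nodes' symbol and the *_compat helpers; build
>> with -lnuma):
>>
>> #include <numa.h>
>> #include <stdio.h>
>>
>> /* Dump, for every node, the bit from the v1 'numa_all_nodes' mask next
>>    to the bit from the v2 'numa_all_nodes_ptr' mask. */
>> int main(void) {
>>     if (numa_available() < 0) {
>>         printf("libnuma not available\n");
>>         return 1;
>>     }
>>     for (int i = 0; i <= numa_max_node(); i++) {
>>         printf("node %d: v1=%d v2=%d\n", i,
>>                nodemask_isset_compat(&numa_all_nodes, i),
>>                numa_bitmask_isbitset(numa_all_nodes_ptr, i));
>>     }
>>     return 0;
>> }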
>>
>> In that case passing node0 and node1, although suboptimal, does not bother
>> mbind() since the following is satisfied:
>>
>> "[nodemask] must contain at least one node that is on-line, allowed by 
>> the
>> process's current cpuset context, and contains memory."
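>>
>> That requirement can be checked with a raw mbind() call, which is roughly
>> what libnuma ends up issuing (rough, untested sketch; build with -lnuma).
>> On the POWER7 box above, a mask naming only the memoryless node0 should
>> make it fail with EINVAL, i.e. the very message libnuma prints:
>>
>> #include <numaif.h>      /* mbind(), MPOL_INTERLEAVE */
>> #include <sys/mman.h>
>> #include <stdio.h>
>> #include <string.h>
>> #include <errno.h>
>>
>> int main(void) {
>>     size_t sz = 4UL * 1024 * 1024;
>>     void* mem = mmap(NULL, sz, PROT_READ | PROT_WRITE,
>>                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>     if (mem == MAP_FAILED)
>>         return 1;
>>     /* Only bit 0 set, i.e. only node0 - a node without memory on the
>>        topology above. */
>>     unsigned long mask = 0x1;
>>     if (mbind(mem, sz, MPOL_INTERLEAVE, &mask, sizeof(mask) * 8, 0) != 0)
>>         printf("mbind: %s\n", strerror(errno));
>>     munmap(mem, sz);
>>     return 0;
>> }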
>>
>> So back to the POWER7 case, I suppose that for:
>>
>> available: 2 nodes (0-1)
>> node 0 cpus: 0 1 2 3 4 5 6 7
>> node 0 size: 0 MB
>> node 0 free: 0 MB
>> node 1 cpus:
>> node 1 size: 7680 MB
>> node 1 free: 1896 MB
>> node distances:
>> node   0   1
>>   0:  10  40
>>   1:  40  10
>>
>> numa_all_nodes=0x1         => 0b01 (node0)
>> numa_all_nodes_ptr=0x2     => 0b10 (node1)
>>
>> and hence numa_interleave_memory() gets nodemask = 0x1 (node0), which indeed
>> contains no memory. That said, I don't know for sure whether passing just
>> node1 in the 'nodemask' will satisfy mbind(), as in that case there are no
>> CPUs available in node1.
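>>
>> To mimic what the JVM does today, a small reproducer can bind to the v1.1
>> symbol directly (rough, untested sketch; build with -ldl -lnuma; the
>> 'libnuma.so.1' name and the mask value reflect the box above). On that
>> machine it should trigger the same "mbind: Invalid argument" message:
>>
>> #define _GNU_SOURCE              /* for dlvsym() */
>> #include <dlfcn.h>
>> #include <numa.h>                /* nodemask_t, nodemask_*_compat, numa_alloc */
>> #include <string.h>
>>
>> /* v1 signature: numa_interleave_memory(void *mem, size_t size,
>>                                         const nodemask_t *mask) */
>> typedef void (*interleave_v1_t)(void*, size_t, const nodemask_t*);
>>
>> int main(void) {
>>     if (numa_available() < 0)
>>         return 1;
>>     void* handle = dlopen("libnuma.so.1", RTLD_LAZY);
>>     if (handle == NULL)
>>         return 1;
>>     /* Bind to the versioned v1.1 symbol, as the JVM's libnuma_dlsym does. */
>>     interleave_v1_t interleave =
>>         (interleave_v1_t) dlvsym(handle, "numa_interleave_memory", "libnuma_1.1");
>>     if (interleave == NULL)
>>         return 1;
>>     size_t sz = 16UL * 1024 * 1024;
>>     void* mem = numa_alloc(sz);
>>     if (mem == NULL)
>>         return 1;
>>     /* Same mask shape as 'numa_all_nodes' on this box: only bit 0, i.e.
>>        the memoryless node0. */
>>     nodemask_t mask;
>>     nodemask_zero_compat(&mask);
>>     nodemask_set_compat(&mask, 0);
>>     interleave(mem, sz, &mask);  /* expected to report the mbind error */
>>     memset(mem, 0, sz);
>>     numa_free(mem, sz);
>>     return 0;
>> }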
>>
>> Summing up, it looks like the root cause is not that numa_interleave_memory()
>> does not accept only one configured node, but that the configured node being
>> passed is wrong. I could not find a similar NUMA topology in my pool to test
>> further, but it might be worth trying to write a small test using API v2 and
>> 'numa_all_nodes_ptr' instead of 'numa_all_nodes' to see how
>> numa_interleave_memory() behaves on that machine :) If it behaves well,
>> updating to API v2 would be a solution.
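>>
>> Something along these lines could serve as a starting point for that test
>> (rough, untested sketch; build with -lnuma):
>>
>> #include <numa.h>
>> #include <stdio.h>
>> #include <string.h>
>>
>> int main(void) {
>>     if (numa_available() < 0) {
>>         printf("libnuma not available\n");
>>         return 1;
>>     }
>>     size_t sz = 64UL * 1024 * 1024;
>>     void* mem = numa_alloc(sz);
>>     if (mem == NULL)
>>         return 1;
>>     /* v2 call: numa_interleave_memory() takes a struct bitmask* here,
>>        and 'numa_all_nodes_ptr' only contains nodes with memory the
>>        process can allocate from. */
>>     numa_interleave_memory(mem, sz, numa_all_nodes_ptr);
>>     memset(mem, 0, sz);  /* touch the pages so the policy is exercised */
>>     numa_free(mem, sz);
>>     printf("done - no mbind error message expected\n");
>>     return 0;
>> }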
>>
>> HTH
>>
>> Regards,
>> Gustavo
>>
>>
>> [1] 
>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/4b93e1b1d5b7/src/os/linux/vm/os_linux.hpp#l274 
>>
>> [2] from libnuma.c:608 numa_all_nodes_ptr: "it only tracks nodes with 
>> memory from which the calling process can allocate."
>> [3] https://github.com/numactl/numactl/blob/master/libnuma.c#L100-L102
>> [4] https://github.com/numactl/numactl/blob/master/libnuma.c#L499-L500
>>
>>
>>>>
>>>> The system NUMA configuration:
>>>>
>>>> Architecture: ppc64
>>>> CPU op-mode(s): 32-bit, 64-bit
>>>> Byte Order: Big Endian
>>>> CPU(s): 8
>>>> On-line CPU(s) list: 0-7
>>>> Thread(s) per core: 4
>>>> Core(s) per socket: 1
>>>> Socket(s): 2
>>>> NUMA node(s): 2
>>>> Model: 2.1 (pvr 003f 0201)
>>>> Model name: POWER7 (architected), altivec supported
>>>> L1d cache: 32K
>>>> L1i cache: 32K
>>>> NUMA node0 CPU(s): 0-7
>>>> NUMA node1 CPU(s):
>>>>
>>>> Thanks,
>>>>
>>>> -Zhengyu
>>>
>>

