RFR(XS) 8181055: "mbind: Invalid argument" still seen after 8175813
Gustavo Romero
gromero at linux.vnet.ibm.com
Mon May 29 23:06:37 UTC 2017
Hi David,
On 29-05-2017 01:34, David Holmes wrote:
> Hi Zhengyu,
>
> On 29/05/2017 12:08 PM, Zhengyu Gu wrote:
>> Hi Gustavo,
>>
>> Thanks for the detailed analysis and suggestion. I did not realize the difference between bitmask and nodemask.
>>
>> As you suggested, numa_interleave_memory_v2 works under this configuration.
>>
>> Please see the updated webrev: http://cr.openjdk.java.net/~zgu/8181055/webrev.01/
>
> The addition of support for the "v2" API seems okay, though I think this comment needs some clarification for the existing code:
>
> 2837 // If we are running with libnuma version > 2, then we should
> 2838 // be trying to use symbols with versions 1.1
> 2839 // If we are running with earlier version, which did not have symbol versions,
> 2840 // we should use the base version.
> 2841 void* os::Linux::libnuma_dlsym(void* handle, const char *name) {
>
> given that we now explicitly load the v1.2 symbol if present.
>
> Gustavo: can you vouch for the suitability of using the v2 API in all cases, if it exists?
My understanding is that, in the transition to API v2, only the JVM's use of
numa_node_to_cpus() in os::Linux::rebuild_cpu_to_node_map() will have to be adapted.
The remaining functions (excluding numa_interleave_memory(), which Zhengyu has already
addressed) preserve the same functionality and signatures [1].
Currently the JVM NUMA support requires the following libnuma functions:
1. numa_node_to_cpus v1 != v2 (using v1, JVM has to adapt; see the sketch after this list)
2. numa_max_node v1 == v2 (using v1, transition is straightforward)
3. numa_num_configured_nodes v2 (added by gromero: 8175813)
4. numa_available v1 == v2 (using v1, transition is straightforward)
5. numa_tonode_memory v1 == v2 (using v1, transition is straightforward)
6. numa_interleave_memory v1 != v2 (updated by zhengyu: 8181055. Defaults to v2, falls back to v1)
7. numa_set_bind_policy v1 == v2 (using v1, transition is straightforward)
8. numa_bitmask_isbitset v2 (added by gromero: 8175813)
9. numa_distance v1 == v2 (added by gromero: 8175813. Using v1, transition is straightforward)
v1 != v2: function signature in version 1 is different from version 2
v1 == v2: function signature in version 1 is equal to version 2
v2 : function is only present in API v2
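For case 1, a minimal sketch of what the adaptation amounts to (not the actual
HotSpot code; the function-pointer names below are illustrative, since the JVM
resolves these symbols via dlsym/dlvsym rather than calling libnuma directly):

  #include <numa.h>   /* struct bitmask, numa_allocate_cpumask(), ... */
  #include <stdio.h>

  typedef int (*node_to_cpus_v1_t)(int node, unsigned long *buffer, int bufferlen);
  typedef int (*node_to_cpus_v2_t)(int node, struct bitmask *mask);

  static node_to_cpus_v1_t node_to_cpus_v1;  /* resolved from symbol version 1.1 */
  static node_to_cpus_v2_t node_to_cpus_v2;  /* resolved from symbol version 1.2 */

  static void print_cpus_of_node(int node) {
    if (node_to_cpus_v2 != NULL) {
      /* API v2: the result comes back in a struct bitmask, queried with
         numa_bitmask_isbitset() (itself a v2-only function). */
      struct bitmask *cpus = numa_allocate_cpumask();
      if (node_to_cpus_v2(node, cpus) == 0) {
        for (unsigned int i = 0; i < cpus->size; i++)
          if (numa_bitmask_isbitset(cpus, i))
            printf("node%d cpu%u\n", node, i);
      }
      numa_free_cpumask(cpus);
    } else if (node_to_cpus_v1 != NULL) {
      /* API v1: the result comes back as raw bits in an unsigned long
         array; bufferlen is in bytes. */
      unsigned long cpus[16] = { 0 };
      if (node_to_cpus_v1(node, cpus, sizeof(cpus)) == 0) {
        for (unsigned int i = 0; i < sizeof(cpus) * 8; i++)
          if (cpus[i / (8 * sizeof(long))] & (1UL << (i % (8 * sizeof(long)))))
            printf("node%d cpu%u\n", node, i);
      }
    }
  }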
Thus, to the best of my knowledge, except for case 1 (which the JVM needs to adapt to),
all other cases are suitable for the v2 API. We could either use a fallback mechanism as
proposed by Zhengyu or update directly to API v2 (risky?), given that I can't see
how the v2 API would not be available on current (non-EOL) Linux distro releases.
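For illustration, the fallback idea could look something like the sketch below
(not Zhengyu's exact patch; the _numa_* names just mirror the HotSpot
function-pointer style):

  #include <numa.h>   /* nodemask_t, struct bitmask */

  typedef void (*interleave_v1_t)(void *start, size_t size, nodemask_t *nodemask);
  typedef void (*interleave_v2_t)(void *start, size_t size, struct bitmask *mask);

  static interleave_v1_t  _numa_interleave_memory;     /* symbol version 1.1 */
  static interleave_v2_t  _numa_interleave_memory_v2;  /* symbol version 1.2 */
  static nodemask_t      *_numa_all_nodes;
  static struct bitmask  *_numa_all_nodes_ptr;

  static void interleave_memory(void *start, size_t size) {
    if (_numa_interleave_memory_v2 != NULL && _numa_all_nodes_ptr != NULL) {
      /* Prefer API v2: the mask tracks only nodes we can allocate from. */
      _numa_interleave_memory_v2(start, size, _numa_all_nodes_ptr);
    } else if (_numa_interleave_memory != NULL && _numa_all_nodes != NULL) {
      /* Fall back to API v1 on old libnuma without versioned symbols. */
      _numa_interleave_memory(start, size, _numa_all_nodes);
    }
  }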
Regarding the comment, I agree, it needs an update since we are no longer tied
to version 1.1 (we are in effect already using v2 for some functions). We could
delete the comment atop libnuma_dlsym() and add something like:
"Handle a request to load libnuma symbol version 1.1 (API v1). If it fails, load the symbol from the base version instead."
and to libnuma_v2_dlsym() add:
"Handle a request to load libnuma symbol version 1.2 (API v2) only. If it fails, no symbol from any other version - even if present - is loaded."
I've opened a bug to track the transition to API v2 (I also discussed it with Volker):
https://bugs.openjdk.java.net/browse/JDK-8181196
Regards,
Gustavo
[1] API v1 vs API v2 (functions prefixed with '-' are absent from API v1):
API v1
======
int numa_node_to_cpus(int node, unsigned long *buffer, int bufferlen);
int numa_max_node(void);
- int numa_num_configured_nodes(void);
int numa_available(void);
void numa_tonode_memory(void *start, size_t size, int node);
void numa_interleave_memory(void *start, size_t size, nodemask_t *nodemask);
void numa_set_bind_policy(int strict);
- int numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int n);
int numa_distance(int node1, int node2);
API v2
======
int numa_node_to_cpus(int node, struct bitmask *mask);
int numa_max_node(void);
int numa_num_configured_nodes(void);
int numa_available(void);
void numa_tonode_memory(void *start, size_t size, int node);
void numa_interleave_memory(void *start, size_t size, struct bitmask *nodemask);
void numa_set_bind_policy(int strict);
int numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int n);
int numa_distance(int node1, int node2);
> I'm running this through JPRT now.
>
> Thanks,
> David
>
>>
>> Thanks,
>>
>> -Zhengyu
>>
>>
>>
>> On 05/26/2017 08:34 PM, Gustavo Romero wrote:
>>> Hi Zhengyu,
>>>
>>> Thanks a lot for taking care of this corner case on PPC64.
>>>
>>> On 26-05-2017 10:41, Zhengyu Gu wrote:
>>>> This is a quick way to kill the symptom (or low risk?). I am not sure if disabling NUMA would be a better solution for this circumstance. Does 1 NUMA node = UMA?
>>>
>>> On PPC64, 1 (configured) NUMA node does not necessarily imply UMA. In the POWER7
>>> machine where you found the corner case (I copy below the data you provided in the
>>> JBS - thanks for the additional information):
>>>
>>> $ numactl -H
>>> available: 2 nodes (0-1)
>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>> node 0 size: 0 MB
>>> node 0 free: 0 MB
>>> node 1 cpus:
>>> node 1 size: 7680 MB
>>> node 1 free: 1896 MB
>>> node distances:
>>> node 0 1
>>> 0: 10 40
>>> 1: 40 10
>>>
>>> CPUs in node0 have no alternative besides allocating memory from node1. In
>>> that case CPUs in node0 always access remote memory from node1 at a constant
>>> distance (40), so we could say that 1 NUMA (configured) node == UMA.
>>> Nonetheless, if you add CPUs to node1 (by filling up the other socket present on
>>> the board) you will end up with CPUs at different distances from the node that
>>> has configured memory (in that case, node1), yielding a configuration where
>>> 1 NUMA (configured) node != UMA (i.e. distances are not always equal to a single
>>> value).
>>>
>>> On the other hand, the POWER7 machine configuration in question is bad (and
>>> rare). It is indeed impacting the whole system's performance, and it would be
>>> reasonable to open the machine and move the memory module from the bank associated
>>> with node1 to the bank associated with node0, because all CPUs are accessing
>>> remote memory without any apparent necessity. Once that is changed, all CPUs
>>> will have local memory (distance = 10).
>>>
>>>
>>>> Thanks,
>>>>
>>>> -Zhengyu
>>>>
>>>> On 05/26/2017 09:14 AM, Zhengyu Gu wrote:
>>>>> Hi,
>>>>>
>>>>> There is a corner case that still failed after JDK-8175813.
>>>>>
>>>>> The system shows that it has multiple NUMA nodes, but only one is
>>>>> configured. Under this scenario, the numa_interleave_memory() call
>>>>> results in an "mbind: Invalid argument" message.
>>>>>
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8181055
>>>>> Webrev: http://cr.openjdk.java.net/~zgu/8181055/webrev.00/
>>>
>>> It looks like even for that rare POWER7 NUMA topology numa_interleave_memory()
>>> should succeed without "mbind: Invalid argument", since the 'mask' argument
>>> should already be a mask containing only nodes from which memory can be allocated,
>>> i.e. only a mask of configured nodes (even if the mask contains only one configured
>>> node, as in http://cr.openjdk.java.net/~gromero/logs/numa_only_one_node.txt).
>>>
>>> Inspecting a little further, it looks like the problem boils down to the
>>> fact that the JVM passes 'numa_all_nodes' [1] to numa_interleave_memory() in
>>> Linux::numa_interleave_memory().
>>>
>>> One would expect that 'numa_all_nodes' (which is API v1) would track the same
>>> information as 'numa_all_nodes_ptr' (API v2) [2]; however, there is a subtle but
>>> important difference:
>>>
>>> 'numa_all_nodes' is constructed assuming a consecutive node distribution [3]:
>>>
>>> 100 max = numa_num_configured_nodes();
>>> 101 for (i = 0; i < max; i++)
>>> 102 nodemask_set_compat((nodemask_t *)&numa_all_nodes, i);
>>>
>>>
>>> whilst 'numa_all_nodes_ptr' is constructed by parsing /proc/self/status [4]:
>>>
>>> 499 if (strncmp(buffer,"Mems_allowed:",13) == 0) {
>>> 500 numprocnode = read_mask(mask, numa_all_nodes_ptr);
>>>
>>> Thus for a topology like:
>>>
>>> available: 4 nodes (0-1,16-17)
>>> node 0 cpus: 0 8 16 24 32
>>> node 0 size: 130706 MB
>>> node 0 free: 145 MB
>>> node 1 cpus: 40 48 56 64 72
>>> node 1 size: 0 MB
>>> node 1 free: 0 MB
>>> node 16 cpus: 80 88 96 104 112
>>> node 16 size: 130630 MB
>>> node 16 free: 529 MB
>>> node 17 cpus: 120 128 136 144 152
>>> node 17 size: 0 MB
>>> node 17 free: 0 MB
>>> node distances:
>>> node 0 1 16 17
>>> 0: 10 20 40 40
>>> 1: 20 10 40 40
>>> 16: 40 40 10 20
>>> 17: 40 40 20 10
>>>
>>> numa_all_nodes=0x3 => 0b11 (node0 and node1)
>>> numa_all_nodes_ptr=0x10001 => 0b10000000000000001 (node0 and node16)
>>>
>>> (Please, see details in the following gdb log: http://cr.openjdk.java.net/~gromero/logs/numa_api_v1_vs_api_v2.txt)
>>>
>>> In that case passing node0 and node1, although suboptimal, does not bother
>>> mbind() since the following is satisfied:
>>>
>>> "[nodemask] must contain at least one node that is on-line, allowed by the
>>> process's current cpuset context, and contains memory."
>>>
>>> So back to the POWER7 case, I suppose that for:
>>>
>>> available: 2 nodes (0-1)
>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>> node 0 size: 0 MB
>>> node 0 free: 0 MB
>>> node 1 cpus:
>>> node 1 size: 7680 MB
>>> node 1 free: 1896 MB
>>> node distances:
>>> node 0 1
>>> 0: 10 40
>>> 1: 40 10
>>>
>>> numa_all_nodes=0x1 => 0b01 (node0)
>>> numa_all_nodes_ptr=0x2 => 0b10 (node1)
>>>
>>> and hence numa_interleave_memory() gets nodemask = 0x1 (node0), which indeed
>>> contains no memory. That said, I don't know for sure whether passing just node1
>>> in the 'nodemask' will satisfy mbind(), as in that case there are no CPUs
>>> available in node1.
>>>
>>> Summing up, it looks like the root cause is not that numa_interleave_memory()
>>> does not accept only one configured node, but that the configured node being
>>> passed is wrong. I could not find a similar NUMA topology in my pool to test
>>> further, but it might be worth writing a small test using API v2 and
>>> 'numa_all_nodes_ptr' instead of 'numa_all_nodes' to see how numa_interleave_memory()
>>> behaves on that machine :) If it behaves well, updating to API v2 would be a
>>> solution.
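>>>
>>> Something along these lines (an untested sketch; compile against the API v2
>>> headers and link with -lnuma) could show whether mbind() accepts the mask on
>>> that machine:
>>>
>>>   #include <numa.h>
>>>   #include <sys/mman.h>
>>>   #include <stdio.h>
>>>   #include <string.h>
>>>
>>>   int main(void) {
>>>     if (numa_available() == -1) {
>>>       fprintf(stderr, "no NUMA support\n");
>>>       return 1;
>>>     }
>>>     size_t size = 8 * 1024 * 1024;
>>>     void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
>>>                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>     if (mem == MAP_FAILED)
>>>       return 1;
>>>     /* v2 call: interleave over the nodes the process can actually
>>>        allocate from, i.e. numa_all_nodes_ptr, not numa_all_nodes. */
>>>     numa_interleave_memory(mem, size, numa_all_nodes_ptr);
>>>     memset(mem, 0, size);  /* touch the pages to exercise the policy */
>>>     printf("done; any \"mbind: Invalid argument\" would appear on stderr\n");
>>>     return 0;
>>>   }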
>>>
>>> HTH
>>>
>>> Regards,
>>> Gustavo
>>>
>>>
>>> [1] http://hg.openjdk.java.net/jdk10/hs/hotspot/file/4b93e1b1d5b7/src/os/linux/vm/os_linux.hpp#l274
>>> [2] from libnuma.c:608 numa_all_nodes_ptr: "it only tracks nodes with memory from which the calling process can allocate."
>>> [3] https://github.com/numactl/numactl/blob/master/libnuma.c#L100-L102
>>> [4] https://github.com/numactl/numactl/blob/master/libnuma.c#L499-L500
>>>
>>>
>>>>>
>>>>> The system NUMA configuration:
>>>>>
>>>>> Architecture: ppc64
>>>>> CPU op-mode(s): 32-bit, 64-bit
>>>>> Byte Order: Big Endian
>>>>> CPU(s): 8
>>>>> On-line CPU(s) list: 0-7
>>>>> Thread(s) per core: 4
>>>>> Core(s) per socket: 1
>>>>> Socket(s): 2
>>>>> NUMA node(s): 2
>>>>> Model: 2.1 (pvr 003f 0201)
>>>>> Model name: POWER7 (architected), altivec supported
>>>>> L1d cache: 32K
>>>>> L1i cache: 32K
>>>>> NUMA node0 CPU(s): 0-7
>>>>> NUMA node1 CPU(s):
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -Zhengyu
>>>>
>>>
>