RFR(XS) 8181055: "mbind: Invalid argument" still seen after 8175813

Zhengyu Gu zgu at redhat.com
Tue May 30 11:59:33 UTC 2017


Hi David and Gustavo,

Thanks for the review.

The webrev has been updated according to your comments:

http://cr.openjdk.java.net/~zgu/8181055/webrev.02/

Thanks,

-Zhengyu


On 05/29/2017 07:06 PM, Gustavo Romero wrote:
> Hi David,
>
> On 29-05-2017 01:34, David Holmes wrote:
>> Hi Zhengyu,
>>
>> On 29/05/2017 12:08 PM, Zhengyu Gu wrote:
>>> Hi Gustavo,
>>>
>>> Thanks for the detailed analysis and suggestion. I did not realize the difference between bitmask and nodemask.
>>>
>>> As you suggested, numa_interleave_memory_v2 works under this configuration.
>>>
>>> Please see the updated webrev: http://cr.openjdk.java.net/~zgu/8181055/webrev.01/
>>
>> The addition of support for the "v2" API seems okay. Though I think this comment needs some clarification for the existing code:
>>
>> 2837 // If we are running with libnuma version > 2, then we should
>> 2838 // be trying to use symbols with versions 1.1
>> 2839 // If we are running with earlier version, which did not have symbol versions,
>> 2840 // we should use the base version.
>> 2841 void* os::Linux::libnuma_dlsym(void* handle, const char *name) {
>>
>> given that we now explicitly load the v1.2 symbol if present.
>>
>> Gustavo: can you vouch for the suitability of using the v2 API in all cases, if it exists?
>
> My understanding is that in the transition to API v2 only the usage of
> numa_node_to_cpus() by the JVM will have to be adapted, in
> os::Linux::rebuild_cpu_to_node_map(). The remaining functions (excluding
> numa_interleave_memory(), which Zhengyu already addressed) preserve the same
> functionality and signatures [1].
>
> Currently JVM NUMA API requires the following libnuma functions:
>
> 1. numa_node_to_cpus            v1 != v2 (using v1, JVM has to adapt)
> 2. numa_max_node                v1 == v2 (using v1, transition is straightforward)
> 3. numa_num_configured_nodes    v2       (added by gromero: 8175813)
> 4. numa_available               v1 == v2 (using v1, transition is straightforward)
> 5. numa_tonode_memory           v1 == v2 (using v1, transition is straightforward)
> 6. numa_interleave_memory       v1 != v2 (updated by zhengyu: 8181055. Default use of v2, fallback to v1)
> 7. numa_set_bind_policy         v1 == v2 (using v1, transition is straightforward)
> 8. numa_bitmask_isbitset        v2       (added by gromero: 8175813)
> 9. numa_distance                v1 == v2 (added by gromero: 8175813. Using v1, transition is straightforward)
>
> v1 != v2: function signature in version 1 is different from version 2
> v1 == v2: function signature in version 1 is equal to version 2
> v2      : function is only present in API v2
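>
> Case 1 is the only signature change the JVM has to absorb. A rough sketch of
> what the v2-style call could look like (hypothetical helper, assuming the v2
> <numa.h>; not the actual HotSpot code):
>
>     #include <numa.h>
>     #include <stdio.h>
>
>     /* List the CPUs of a node via the API v2 signature, which fills a
>      * struct bitmask instead of an unsigned long buffer. */
>     static void list_cpus_of_node(int node) {
>       struct bitmask *cpus = numa_allocate_cpumask();
>       if (numa_node_to_cpus(node, cpus) == 0) {
>         for (unsigned int i = 0; i < cpus->size; i++)
>           if (numa_bitmask_isbitset(cpus, i))
>             printf("node %d has cpu %u\n", node, i);
>       }
>       numa_bitmask_free(cpus);
>     }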
>
> Thus, to the best of my knowledge, except for case 1. (which the JVM needs to
> adapt to, as sketched above), all other cases are suitable for the v2 API, and
> we could either use a fallback mechanism as proposed by Zhengyu or update
> directly to API v2 (risky?), given that I can't see how the v2 API would not
> be available on current (not-EOL) Linux distro releases.
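>
> For reference, a minimal sketch of the versioned lookup with fallback that
> such a mechanism implies (assuming dlvsym() and the libnuma_1.1/libnuma_1.2
> version tags; it mirrors, but is not, the exact HotSpot code):
>
>     #define _GNU_SOURCE   /* for dlvsym() */
>     #include <dlfcn.h>
>
>     /* Prefer the v1.1 versioned symbol; fall back to the unversioned
>      * base symbol for libnuma builds without symbol versioning. */
>     static void *libnuma_dlsym(void *handle, const char *name) {
>       void *f = dlvsym(handle, name, "libnuma_1.1");
>       return f != NULL ? f : dlsym(handle, name);
>     }
>
>     /* Load the v1.2 (API v2) symbol only; a NULL result lets the
>      * caller keep using the v1 code path. */
>     static void *libnuma_v2_dlsym(void *handle, const char *name) {
>       return dlvsym(handle, name, "libnuma_1.2");
>     }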
>
> Regarding the comment, I agree, it needs an update since we are no longer tied
> to version 1.1 (we are in effect already using v2 for some functions). We
> could delete the comment atop libnuma_dlsym() and add something like:
>
> "Handle request to load libnuma symbol version 1.1 (API v1). If it fails load symbol from base version instead."
>
> and to libnuma_v2_dlsym() add:
>
> "Handle request to load libnuma symbol version 1.2 (API v2) only. If it fails no symbol from any other version - even if present - is loaded."
>
> I've opened a bug to track the transitions to API v2 (I also discussed that with Volker):
> https://bugs.openjdk.java.net/browse/JDK-8181196
>
>
> Regards,
> Gustavo
>
> [1] API v1 vs API v2:
>
> API v1
> ======
> (a leading '-' marks a function not present in API v1)
>
> int numa_node_to_cpus(int node, unsigned long *buffer, int bufferlen);
> int numa_max_node(void);
> - int numa_num_configured_nodes(void);
> int numa_available(void);
> void numa_tonode_memory(void *start, size_t size, int node);
> void numa_interleave_memory(void *start, size_t size, nodemask_t *nodemask);
> void numa_set_bind_policy(int strict);
> - int numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int n);
> int numa_distance(int node1, int node2);
>
>
> API v2
> ======
>
> int numa_node_to_cpus(int node, struct bitmask *mask);
> int numa_max_node(void);
> int numa_num_configured_nodes(void);
> int numa_available(void);
> void numa_tonode_memory(void *start, size_t size, int node);
> void numa_interleave_memory(void *start, size_t size, struct bitmask *nodemask);
> void numa_set_bind_policy(int strict);
> int numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int n);
> int numa_distance(int node1, int node2);
>
>
>> I'm running this through JPRT now.
>>
>> Thanks,
>> David
>>
>>>
>>> Thanks,
>>>
>>> -Zhengyu
>>>
>>>
>>>
>>> On 05/26/2017 08:34 PM, Gustavo Romero wrote:
>>>> Hi Zhengyu,
>>>>
>>>> Thanks a lot for taking care of this corner case on PPC64.
>>>>
>>>> On 26-05-2017 10:41, Zhengyu Gu wrote:
>>>>> This is a quick (and low-risk?) way to kill the symptom. I am not sure whether disabling NUMA would be a better solution for this circumstance: does 1 NUMA node = UMA?
>>>>
>>>> On PPC64, 1 (configured) NUMA node does not necessarily imply UMA. In the
>>>> POWER7 machine where you found the corner case (I copy below the data you
>>>> provided in the JBS, thanks for the additional information):
>>>>
>>>> $ numactl -H
>>>> available: 2 nodes (0-1)
>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>> node 0 size: 0 MB
>>>> node 0 free: 0 MB
>>>> node 1 cpus:
>>>> node 1 size: 7680 MB
>>>> node 1 free: 1896 MB
>>>> node distances:
>>>> node   0   1
>>>>   0:  10  40
>>>>   1:  40  10
>>>>
>>>> CPUs in node0 have no alternative besides allocating memory from node1. In
>>>> that case CPUs in node0 always access remote memory from node1 at a constant
>>>> distance (40), so we could say that 1 NUMA (configured) node == UMA.
>>>> Nonetheless, if you add CPUs to node1 (by filling up the other socket present
>>>> on the board) you will end up with CPUs at different distances from the node
>>>> that has configured memory (in that case, node1), yielding a configuration
>>>> where 1 NUMA (configured) node != UMA (i.e. distances are not all equal to a
>>>> single value).
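>>>>
>>>> A quick way to check whether a given topology is effectively UMA is to
>>>> verify that all inter-node distances are equal. A minimal sketch of mine
>>>> (not JVM code), using only numa_distance() and numa_max_node(); compile
>>>> with -lnuma:
>>>>
>>>>     #include <numa.h>
>>>>     #include <stdio.h>
>>>>
>>>>     int main(void) {
>>>>       if (numa_available() == -1) return 1;
>>>>       int max = numa_max_node(), first = -1, uma = 1;
>>>>       for (int i = 0; i <= max; i++)
>>>>         for (int j = i + 1; j <= max; j++) {
>>>>           int d = numa_distance(i, j);
>>>>           if (d == 0) continue;          /* node hole or error, skip */
>>>>           if (first == -1) first = d;
>>>>           else if (d != first) uma = 0;  /* distances differ */
>>>>         }
>>>>       printf("effectively UMA: %s\n", uma ? "yes" : "no");
>>>>       return 0;
>>>>     }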
>>>>
>>>> On the other hand, the POWER7 machine configuration in question is bad (and
>>>> rare). It is indeed impacting the whole system's performance, and it would be
>>>> reasonable to open the machine and move the memory module from the bank
>>>> associated with node1 to the bank associated with node0, because all CPUs are
>>>> accessing remote memory without any apparent necessity. Once you change that,
>>>> all CPUs will have local memory (distance = 10).
>>>>
>>>>
>>>>> Thanks,
>>>>>
>>>>> -Zhengyu
>>>>>
>>>>> On 05/26/2017 09:14 AM, Zhengyu Gu wrote:
>>>>>> Hi,
>>>>>>
>>>>>> There is a corner case that still failed after JDK-8175813.
>>>>>>
>>>>>> The system shows that it has multiple NUMA nodes, but only one is
>>>>>> configured. Under this scenario, the numa_interleave_memory() call results
>>>>>> in an "mbind: Invalid argument" message.
>>>>>>
>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8181055
>>>>>> Webrev: http://cr.openjdk.java.net/~zgu/8181055/webrev.00/
>>>>
>>>> It looks like even for that rare POWER7 NUMA topology numa_interleave_memory()
>>>> should succeed without "mbind: Invalid argument", since the 'mask' argument
>>>> should already be a mask with only nodes from which memory can be allocated,
>>>> i.e. only a mask of configured nodes (even if the mask contains only one
>>>> configured node, as in http://cr.openjdk.java.net/~gromero/logs/numa_only_one_node.txt).
>>>>
>>>> Inspecting a little bit more, it looks like the problem boils down to the
>>>> fact that the JVM passes 'numa_all_nodes' [1] to numa_interleave_memory() in
>>>> Linux::numa_interleave_memory().
>>>>
>>>> One would expect 'numa_all_nodes' (which is API v1) to track the same
>>>> information as 'numa_all_nodes_ptr' (API v2) [2]; however, there is a subtle
>>>> but important difference:
>>>>
>>>> 'numa_all_nodes' is constructed assuming a consecutive node distribution [3]:
>>>>
>>>> 100         max = numa_num_configured_nodes();
>>>> 101         for (i = 0; i < max; i++)
>>>> 102                 nodemask_set_compat((nodemask_t *)&numa_all_nodes, i);
>>>>
>>>>
>>>> whilst 'numa_all_nodes_ptr' is constructed by parsing /proc/self/status [4]:
>>>>
>>>> 499                 if (strncmp(buffer,"Mems_allowed:",13) == 0) {
>>>> 500                         numprocnode = read_mask(mask, numa_all_nodes_ptr);
>>>>
>>>> Thus for a topology like:
>>>>
>>>> available: 4 nodes (0-1,16-17)
>>>> node 0 cpus: 0 8 16 24 32
>>>> node 0 size: 130706 MB
>>>> node 0 free: 145 MB
>>>> node 1 cpus: 40 48 56 64 72
>>>> node 1 size: 0 MB
>>>> node 1 free: 0 MB
>>>> node 16 cpus: 80 88 96 104 112
>>>> node 16 size: 130630 MB
>>>> node 16 free: 529 MB
>>>> node 17 cpus: 120 128 136 144 152
>>>> node 17 size: 0 MB
>>>> node 17 free: 0 MB
>>>> node distances:
>>>> node   0   1  16  17
>>>>   0:  10  20  40  40
>>>>   1:  20  10  40  40
>>>>  16:  40  40  10  20
>>>>  17:  40  40  20  10
>>>>
>>>> numa_all_nodes=0x3         => 0b11                (node0 and node1)
>>>> numa_all_nodes_ptr=0x10001 => 0b10000000000000001 (node0 and node16)
>>>>
>>>> (Please, see details in the following gdb log: http://cr.openjdk.java.net/~gromero/logs/numa_api_v1_vs_api_v2.txt)
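>>>>
>>>> The same comparison can be reproduced without gdb. A small sketch of mine
>>>> that dumps the first word of each mask (assuming the v2 <numa.h>, which
>>>> still declares the deprecated v1 symbol 'numa_all_nodes'):
>>>>
>>>>     #include <numa.h>
>>>>     #include <stdio.h>
>>>>
>>>>     int main(void) {
>>>>       if (numa_available() == -1) return 1;
>>>>       /* v1 mask, built assuming consecutive node numbering */
>>>>       printf("numa_all_nodes     = 0x%lx\n", numa_all_nodes.n[0]);
>>>>       /* v2 mask, built from Mems_allowed in /proc/self/status */
>>>>       printf("numa_all_nodes_ptr = 0x%lx\n", *numa_all_nodes_ptr->maskp);
>>>>       return 0;
>>>>     }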
>>>>
>>>> In that case passing node0 and node1, although suboptimal, does not bother
>>>> mbind(), since the following requirement is satisfied:
>>>>
>>>> "[nodemask] must contain at least one node that is on-line, allowed by the
>>>> process's current cpuset context, and contains memory."
>>>>
>>>> So back to the POWER7 case, I suppose that for:
>>>>
>>>> available: 2 nodes (0-1)
>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>> node 0 size: 0 MB
>>>> node 0 free: 0 MB
>>>> node 1 cpus:
>>>> node 1 size: 7680 MB
>>>> node 1 free: 1896 MB
>>>> node distances:
>>>> node   0   1
>>>>   0:  10  40
>>>>   1:  40  10
>>>>
>>>> numa_all_nodes=0x1         => 0b01 (node0)
>>>> numa_all_nodes_ptr=0x2     => 0b10 (node1)
>>>>
>>>> and hence numa_interleave_memory() gets nodemask = 0x1 (node0), which indeed
>>>> contains no memory. That said, I don't know for sure whether passing just
>>>> node1 in the 'nodemask' would satisfy mbind(), as in that case there are no
>>>> CPUs available in node1.
>>>>
>>>> Summing up, it looks like the root cause is not that numa_interleave_memory()
>>>> does not accept only one configured node, but that the configured node being
>>>> passed is wrong. I could not find a similar NUMA topology in my pool to test
>>>> further, but it might be worth writing a small test using API v2 and
>>>> 'numa_all_nodes_ptr' instead of 'numa_all_nodes' to see how
>>>> numa_interleave_memory() behaves on that machine :) If it behaves well,
>>>> updating to API v2 would be a solution.
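>>>>
>>>> Something along these lines (a sketch of the suggested test, assuming the
>>>> v2 <numa.h>, where numa_interleave_memory() resolves to the v1.2 symbol
>>>> taking a struct bitmask *):
>>>>
>>>>     #include <numa.h>
>>>>     #include <stdio.h>
>>>>
>>>>     int main(void) {
>>>>       if (numa_available() == -1) return 1;
>>>>       size_t sz = 8 * 1024 * 1024;
>>>>       void *p = numa_alloc(sz);    /* page-aligned, fine for mbind */
>>>>       if (p == NULL) return 1;
>>>>       /* Interleave over the nodes the process can allocate from; a
>>>>        * failing mbind prints to stderr, as in the JVM case. */
>>>>       numa_interleave_memory(p, sz, numa_all_nodes_ptr);
>>>>       ((char *)p)[0] = 1;          /* touch to force allocation */
>>>>       numa_free(p, sz);
>>>>       puts("done; check stderr for mbind errors");
>>>>       return 0;
>>>>     }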
>>>>
>>>> HTH
>>>>
>>>> Regards,
>>>> Gustavo
>>>>
>>>>
>>>> [1] http://hg.openjdk.java.net/jdk10/hs/hotspot/file/4b93e1b1d5b7/src/os/linux/vm/os_linux.hpp#l274
>>>> [2] from libnuma.c:608 numa_all_nodes_ptr: "it only tracks nodes with memory from which the calling process can allocate."
>>>> [3] https://github.com/numactl/numactl/blob/master/libnuma.c#L100-L102
>>>> [4] https://github.com/numactl/numactl/blob/master/libnuma.c#L499-L500
>>>>
>>>>
>>>>>>
>>>>>> The system NUMA configuration:
>>>>>>
>>>>>> Architecture: ppc64
>>>>>> CPU op-mode(s): 32-bit, 64-bit
>>>>>> Byte Order: Big Endian
>>>>>> CPU(s): 8
>>>>>> On-line CPU(s) list: 0-7
>>>>>> Thread(s) per core: 4
>>>>>> Core(s) per socket: 1
>>>>>> Socket(s): 2
>>>>>> NUMA node(s): 2
>>>>>> Model: 2.1 (pvr 003f 0201)
>>>>>> Model name: POWER7 (architected), altivec supported
>>>>>> L1d cache: 32K
>>>>>> L1i cache: 32K
>>>>>> NUMA node0 CPU(s): 0-7
>>>>>> NUMA node1 CPU(s):
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -Zhengyu
>>>>>
>>>>
>>
>

