RFR(XS) 8181055: "mbind: Invalid argument" still seen after 8175813
Gustavo Romero
gromero at linux.vnet.ibm.com
Wed May 31 12:47:07 UTC 2017
Hi Zhengyu,
On 30-05-2017 21:37, Zhengyu Gu wrote:
> Hi David,
>
> Thanks for the review.
>
> Gustavo, might I count you as a reviewer?
Formally speaking (according to the community Bylaws) I'm not a reviewer, so
I guess not.
Kind regards,
Gustavo
> Thanks,
>
> -Zhengyu
>
>
>
> On 05/30/2017 05:30 PM, David Holmes wrote:
>> Looks fine to me.
>>
>> Thanks,
>> David
>>
>> On 30/05/2017 9:59 PM, Zhengyu Gu wrote:
>>> Hi David and Gustavo,
>>>
>>> Thanks for the review.
>>>
>>> Webrev is updated according to your comments:
>>>
>>> http://cr.openjdk.java.net/~zgu/8181055/webrev.02/
>>>
>>> Thanks,
>>>
>>> -Zhengyu
>>>
>>>
>>> On 05/29/2017 07:06 PM, Gustavo Romero wrote:
>>>> Hi David,
>>>>
>>>> On 29-05-2017 01:34, David Holmes wrote:
>>>>> Hi Zhengyu,
>>>>>
>>>>> On 29/05/2017 12:08 PM, Zhengyu Gu wrote:
>>>>>> Hi Gustavo,
>>>>>>
>>>>>> Thanks for the detailed analysis and suggestion. I did not realize
>>>>>> the difference between bitmask and nodemask.
>>>>>>
>>>>>> As you suggested, numa_interleave_memory_v2 works under this
>>>>>> configuration.
>>>>>>
>>>>>> Please see the updated webrev:
>>>>>> http://cr.openjdk.java.net/~zgu/8181055/webrev.01/
>>>>>
>>>>> The addition of support for the "v2" API seems okay. Though I think
>>>>> this comment needs some clarification for the existing code:
>>>>>
>>>>> 2837 // If we are running with libnuma version > 2, then we should
>>>>> 2838 // be trying to use symbols with versions 1.1
>>>>> 2839 // If we are running with earlier version, which did not have symbol versions,
>>>>> 2840 // we should use the base version.
>>>>> 2841 void* os::Linux::libnuma_dlsym(void* handle, const char *name) {
>>>>>
>>>>> given that we now explicitly load the v1.2 symbol if present.
>>>>>
>>>>> Gustavo: can you vouch for the suitability of using the v2 API in
>>>>> all cases, if it exists?
>>>>
>>>> My understanding is that in the transition to API v2 only the usage of
>>>> numa_node_to_cpus() by the JVM will have to be adapted in
>>>> os::Linux::rebuild_cpu_to_node_map().
>>>> The remaining functions (excluding numa_interleave_memory() as
>>>> Zhengyu already addressed it)
>>>> preserve the same functionality and signatures [1].
>>>>
>>>> Currently JVM NUMA API requires the following libnuma functions:
>>>>
>>>> 1. numa_node_to_cpus         v1 != v2 (using v1, JVM has to adapt)
>>>> 2. numa_max_node              v1 == v2 (using v1, transition is straightforward)
>>>> 3. numa_num_configured_nodes  v2       (added by gromero: 8175813)
>>>> 4. numa_available             v1 == v2 (using v1, transition is straightforward)
>>>> 5. numa_tonode_memory         v1 == v2 (using v1, transition is straightforward)
>>>> 6. numa_interleave_memory     v1 != v2 (updated by zhengyu: 8181055. Default use of v2, fallback to v1)
>>>> 7. numa_set_bind_policy       v1 == v2 (using v1, transition is straightforward)
>>>> 8. numa_bitmask_isbitset      v2       (added by gromero: 8175813)
>>>> 9. numa_distance              v1 == v2 (added by gromero: 8175813. Using v1, transition is straightforward)
>>>>
>>>> v1 != v2: function signature in version 1 differs from version 2
>>>> v1 == v2: function signature in version 1 is equal to that in version 2
>>>> v2      : function is only present in API v2
>>>>
>>>> Thus, to the best of my knowledge, except for case 1. (which the JVM
>>>> needs to adapt to), all other cases are suitable for the v2 API, and we
>>>> could either use a fallback mechanism as proposed by Zhengyu or update
>>>> directly to API v2 (risky?), given that I can't see how the v2 API
>>>> would not be available on current (not-EOL) Linux distro releases.
>>>>
>>>> Regarding the comment, I agree, it needs an update, since we are no
>>>> longer tied to version 1.1 (we are in effect already using v2 for some
>>>> functions). We could delete the comment atop libnuma_dlsym() and add
>>>> something like:
>>>>
>>>> "Handle a request to load libnuma symbol version 1.1 (API v1). If it
>>>> fails, load the symbol from the base version instead."
>>>>
>>>> and to libnuma_v2_dlsym() add:
>>>>
>>>> "Handle a request to load libnuma symbol version 1.2 (API v2) only. If
>>>> it fails, no symbol from any other version - even if present - is
>>>> loaded."
>>>>
>>>> I've opened a bug to track the transitions to API v2 (I also
>>>> discussed that with Volker):
>>>> https://bugs.openjdk.java.net/browse/JDK-8181196
>>>>
>>>>
>>>> Regards,
>>>> Gustavo
>>>>
>>>> [1] API v1 vs API v2:
>>>>
>>>> API v1
>>>> ======
>>>>
>>>> int numa_node_to_cpus(int node, unsigned long *buffer, int bufferlen);
>>>> int numa_max_node(void);
>>>> - int numa_num_configured_nodes(void);
>>>> int numa_available(void);
>>>> void numa_tonode_memory(void *start, size_t size, int node);
>>>> void numa_interleave_memory(void *start, size_t size, nodemask_t *nodemask);
>>>> void numa_set_bind_policy(int strict);
>>>> - int numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int n);
>>>> int numa_distance(int node1, int node2);
>>>>
>>>>
>>>> API v2
>>>> ======
>>>>
>>>> int numa_node_to_cpus(int node, struct bitmask *mask);
>>>> int numa_max_node(void);
>>>> int numa_num_configured_nodes(void);
>>>> int numa_available(void);
>>>> void numa_tonode_memory(void *start, size_t size, int node);
>>>> void numa_interleave_memory(void *start, size_t size, struct bitmask *nodemask);
>>>> void numa_set_bind_policy(int strict);
>>>> int numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int n);
>>>> int numa_distance(int node1, int node2);
>>>>
>>>>
>>>>> I'm running this through JPRT now.
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -Zhengyu
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 05/26/2017 08:34 PM, Gustavo Romero wrote:
>>>>>>> Hi Zhengyu,
>>>>>>>
>>>>>>> Thanks a lot for taking care of this corner case on PPC64.
>>>>>>>
>>>>>>> On 26-05-2017 10:41, Zhengyu Gu wrote:
>>>>>>>> This is a quick way to kill the symptom (or low risk?). I am not
>>>>>>>> sure whether disabling NUMA would be a better solution for this
>>>>>>>> circumstance. Does 1 NUMA node = UMA?
>>>>>>>
>>>>>>> On PPC64, 1 (configured) NUMA node does not necessarily imply UMA. In
>>>>>>> the POWER7 machine where you found the corner case (I copy below the
>>>>>>> data you provided in the JBS - thanks for the additional information):
>>>>>>>
>>>>>>> $ numactl -H
>>>>>>> available: 2 nodes (0-1)
>>>>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>>>>> node 0 size: 0 MB
>>>>>>> node 0 free: 0 MB
>>>>>>> node 1 cpus:
>>>>>>> node 1 size: 7680 MB
>>>>>>> node 1 free: 1896 MB
>>>>>>> node distances:
>>>>>>> node 0 1
>>>>>>> 0: 10 40
>>>>>>> 1: 40 10
>>>>>>>
>>>>>>> CPUs in node0 have no alternative besides allocating memory from
>>>>>>> node1. In that case CPUs in node0 are always accessing remote memory
>>>>>>> from node1 at a constant distance (40), so we could say that 1 NUMA
>>>>>>> (configured) node == UMA. Nonetheless, if you add CPUs to node1 (by
>>>>>>> filling up the other socket present in the board) you will end up
>>>>>>> with CPUs at different distances from the node that has configured
>>>>>>> memory (in that case, node1), which yields a configuration where
>>>>>>> 1 NUMA (configured) node != UMA (i.e. distances are not always equal
>>>>>>> to a single value).
>>>>>>>
>>>>>>> On the other hand, the POWER7 machine configuration in question is
>>>>>>> bad (and rare). It's indeed impacting the whole system's performance,
>>>>>>> and it would be reasonable to open the machine and move the memory
>>>>>>> module from the bank associated with node1 to the bank associated
>>>>>>> with node0, because all CPUs are accessing remote memory without any
>>>>>>> apparent necessity. Once you change that, all CPUs will have local
>>>>>>> memory (distance = 10).
>>>>>>>
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> -Zhengyu
>>>>>>>>
>>>>>>>> On 05/26/2017 09:14 AM, Zhengyu Gu wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> There is a corner case that still fails after JDK-8175813.
>>>>>>>>>
>>>>>>>>> The system shows that it has multiple NUMA nodes, but only one is
>>>>>>>>> configured. Under this scenario, a numa_interleave_memory() call
>>>>>>>>> results in an "mbind: Invalid argument" message.
>>>>>>>>>
>>>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8181055
>>>>>>>>> Webrev: http://cr.openjdk.java.net/~zgu/8181055/webrev.00/
>>>>>>>
>>>>>>> It looks like even for that rare POWER7 NUMA topology
>>>>>>> numa_interleave_memory() should succeed without "mbind: Invalid
>>>>>>> argument", since the 'mask' argument should already be a mask
>>>>>>> containing only nodes from which memory can be allocated, i.e. only a
>>>>>>> mask of configured nodes (even if the mask contains only one
>>>>>>> configured node, as in
>>>>>>> http://cr.openjdk.java.net/~gromero/logs/numa_only_one_node.txt).
>>>>>>>
>>>>>>> Inspecting a little bit more, it looks like the problem boils down
>>>>>>> to the fact that the JVM is passing 'numa_all_nodes' [1] to
>>>>>>> numa_interleave_memory() in Linux::numa_interleave_memory().
>>>>>>>
>>>>>>> One would expect that 'numa_all_nodes' (which is API v1) would track
>>>>>>> the same information as 'numa_all_nodes_ptr' (API v2) [2]; however,
>>>>>>> there is a subtle but important difference:
>>>>>>>
>>>>>>> 'numa_all_nodes' is constructed assuming a consecutive node
>>>>>>> distribution [3]:
>>>>>>>
>>>>>>> 100 max = numa_num_configured_nodes();
>>>>>>> 101 for (i = 0; i < max; i++)
>>>>>>> 102 nodemask_set_compat((nodemask_t
>>>>>>> *)&numa_all_nodes, i);
>>>>>>>
>>>>>>>
>>>>>>> whilst 'numa_all_nodes_ptr' is constructed parsing
>>>>>>> /proc/self/status [4]:
>>>>>>>
>>>>>>> 499 if (strncmp(buffer,"Mems_allowed:",13) == 0) {
>>>>>>> 500 numprocnode = read_mask(mask,
>>>>>>> numa_all_nodes_ptr);
>>>>>>>
>>>>>>> Thus for a topology like:
>>>>>>>
>>>>>>> available: 4 nodes (0-1,16-17)
>>>>>>> node 0 cpus: 0 8 16 24 32
>>>>>>> node 0 size: 130706 MB
>>>>>>> node 0 free: 145 MB
>>>>>>> node 1 cpus: 40 48 56 64 72
>>>>>>> node 1 size: 0 MB
>>>>>>> node 1 free: 0 MB
>>>>>>> node 16 cpus: 80 88 96 104 112
>>>>>>> node 16 size: 130630 MB
>>>>>>> node 16 free: 529 MB
>>>>>>> node 17 cpus: 120 128 136 144 152
>>>>>>> node 17 size: 0 MB
>>>>>>> node 17 free: 0 MB
>>>>>>> node distances:
>>>>>>> node 0 1 16 17
>>>>>>> 0: 10 20 40 40
>>>>>>> 1: 20 10 40 40
>>>>>>> 16: 40 40 10 20
>>>>>>> 17: 40 40 20 10
>>>>>>>
>>>>>>> numa_all_nodes=0x3 => 0b11 (node0 and node1)
>>>>>>> numa_all_nodes_ptr=0x10001 => 0b10000000000000001 (node0 and node16)
>>>>>>>
>>>>>>> (Please, see details in the following gdb log:
>>>>>>> http://cr.openjdk.java.net/~gromero/logs/numa_api_v1_vs_api_v2.txt)
>>>>>>>
>>>>>>> In that case passing node0 and node1, although suboptimal, does not
>>>>>>> bother mbind(), since the following requirement is satisfied:
>>>>>>>
>>>>>>> "[nodemask] must contain at least one node that is on-line, allowed
>>>>>>> by the process's current cpuset context, and contains memory."
>>>>>>>
>>>>>>> So back to the POWER7 case, I suppose that for:
>>>>>>>
>>>>>>> available: 2 nodes (0-1)
>>>>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>>>>> node 0 size: 0 MB
>>>>>>> node 0 free: 0 MB
>>>>>>> node 1 cpus:
>>>>>>> node 1 size: 7680 MB
>>>>>>> node 1 free: 1896 MB
>>>>>>> node distances:
>>>>>>> node 0 1
>>>>>>> 0: 10 40
>>>>>>> 1: 40 10
>>>>>>>
>>>>>>> numa_all_nodes=0x1 => 0b01 (node0)
>>>>>>> numa_all_nodes_ptr=0x2 => 0b10 (node1)
>>>>>>>
>>>>>>> and hence numa_interleave_memory() gets nodemask = 0x1 (node0),
>>>>>>> which indeed contains no memory. That said, I don't know for sure
>>>>>>> whether passing just node1 in the 'nodemask' would satisfy mbind(),
>>>>>>> as in that case there are no CPUs available in node1.
>>>>>>>
>>>>>>> Summing up, it looks like the root cause is not that
>>>>>>> numa_interleave_memory() does not accept only one configured node,
>>>>>>> but that the configured node being passed is wrong. I could not find
>>>>>>> a similar NUMA topology in my pool to test further, but it might be
>>>>>>> worth writing a small test using API v2 and 'numa_all_nodes_ptr'
>>>>>>> instead of 'numa_all_nodes' to see how numa_interleave_memory()
>>>>>>> behaves on that machine :) If it behaves well, updating to API v2
>>>>>>> would be a solution.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Regards,
>>>>>>> Gustavo
>>>>>>>
>>>>>>>
>>>>>>> [1]
>>>>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/4b93e1b1d5b7/src/os/linux/vm/os_linux.hpp#l274
>>>>>>>
>>>>>>> [2] from libnuma.c:608 numa_all_nodes_ptr: "it only tracks nodes
>>>>>>> with memory from which the calling process can allocate."
>>>>>>> [3]
>>>>>>> https://github.com/numactl/numactl/blob/master/libnuma.c#L100-L102
>>>>>>> [4]
>>>>>>> https://github.com/numactl/numactl/blob/master/libnuma.c#L499-L500
>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> The system NUMA configuration:
>>>>>>>>>
>>>>>>>>> Architecture: ppc64
>>>>>>>>> CPU op-mode(s): 32-bit, 64-bit
>>>>>>>>> Byte Order: Big Endian
>>>>>>>>> CPU(s): 8
>>>>>>>>> On-line CPU(s) list: 0-7
>>>>>>>>> Thread(s) per core: 4
>>>>>>>>> Core(s) per socket: 1
>>>>>>>>> Socket(s): 2
>>>>>>>>> NUMA node(s): 2
>>>>>>>>> Model: 2.1 (pvr 003f 0201)
>>>>>>>>> Model name: POWER7 (architected), altivec supported
>>>>>>>>> L1d cache: 32K
>>>>>>>>> L1i cache: 32K
>>>>>>>>> NUMA node0 CPU(s): 0-7
>>>>>>>>> NUMA node1 CPU(s):
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> -Zhengyu
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>