RFR(XS) 8181055: "mbind: Invalid argument" still seen after 8175813

Zhengyu Gu zgu at redhat.com
Wed May 31 00:37:11 UTC 2017


Hi David,

Thanks for the review.

Gustavo, might I count you as a reviewer?

Thanks,

-Zhengyu



On 05/30/2017 05:30 PM, David Holmes wrote:
> Looks fine to me.
>
> Thanks,
> David
>
> On 30/05/2017 9:59 PM, Zhengyu Gu wrote:
>> Hi David and Gustavo,
>>
>> Thanks for the review.
>>
>> Webrev is updated according to your comments:
>>
>> http://cr.openjdk.java.net/~zgu/8181055/webrev.02/
>>
>> Thanks,
>>
>> -Zhengyu
>>
>>
>> On 05/29/2017 07:06 PM, Gustavo Romero wrote:
>>> Hi David,
>>>
>>> On 29-05-2017 01:34, David Holmes wrote:
>>>> Hi Zhengyu,
>>>>
>>>> On 29/05/2017 12:08 PM, Zhengyu Gu wrote:
>>>>> Hi Gustavo,
>>>>>
>>>>> Thanks for the detailed analysis and suggestion. I did not realize
>>>>> the difference between bitmask and nodemask.
>>>>>
>>>>> As you suggested, numa_interleave_memory_v2 works under this
>>>>> configuration.
>>>>>
>>>>> Please see the updated webrev:
>>>>> http://cr.openjdk.java.net/~zgu/8181055/webrev.01/
>>>>
>>>> The addition of support for the "v2" API seems okay, though I think
>>>> this comment needs some clarification for the existing code:
>>>>
>>>> 2837 // If we are running with libnuma version > 2, then we should
>>>> 2838 // be trying to use symbols with versions 1.1
>>>> 2839 // If we are running with earlier version, which did not have symbol versions,
>>>> 2840 // we should use the base version.
>>>> 2841 void* os::Linux::libnuma_dlsym(void* handle, const char *name) {
>>>>
>>>> given that we now explicitly load the v1.2 symbol if present.
>>>>
>>>> Gustavo: can you vouch for the suitability of using the v2 API in
>>>> all cases, if it exists?
>>>
>>> My understanding is that in the transition to API v2 only the usage
>>> of numa_node_to_cpus() by the JVM will have to be adapted, in
>>> os::Linux::rebuild_cpu_to_node_map(). The remaining functions
>>> (excluding numa_interleave_memory(), which Zhengyu already addressed)
>>> preserve the same functionality and signatures [1].
>>>
>>> Currently the JVM NUMA API requires the following libnuma functions:
>>>
>>> 1. numa_node_to_cpus          v1 != v2 (using v1, JVM has to adapt)
>>> 2. numa_max_node              v1 == v2 (using v1, transition is straightforward)
>>> 3. numa_num_configured_nodes  v2       (added by gromero: 8175813)
>>> 4. numa_available             v1 == v2 (using v1, transition is straightforward)
>>> 5. numa_tonode_memory         v1 == v2 (using v1, transition is straightforward)
>>> 6. numa_interleave_memory     v1 != v2 (updated by zhengyu: 8181055; default use of v2, fallback to v1)
>>> 7. numa_set_bind_policy       v1 == v2 (using v1, transition is straightforward)
>>> 8. numa_bitmask_isbitset      v2       (added by gromero: 8175813)
>>> 9. numa_distance              v1 == v2 (added by gromero: 8175813; using v1, transition is straightforward)
>>>
>>> v1 != v2: function signature in version 1 is different from version 2
>>> v1 == v2: function signature in version 1 is equal to version 2
>>> v2      : function is only present in API v2
>>>
>>> Thus, to the best of my knowledge, except for case 1 (which the JVM
>>> needs to adapt to), all other cases are suitable for the v2 API, and
>>> we could either use a fallback mechanism as proposed by Zhengyu or
>>> update directly to API v2 (risky?), given that I can't see how the
>>> v2 API would not be available on current (not-EOL) Linux distro
>>> releases. A minimal sketch of such a fallback is below.
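>>>
>>> For illustration, a minimal sketch of that fallback dispatch (the
>>> function-pointer and variable names here are hypothetical, not
>>> necessarily the webrev's):
>>>
>>> #include <numa.h>     // nodemask_t, struct bitmask
>>> #include <stddef.h>
>>>
>>> typedef void (*interleave_v1_t)(void*, size_t, nodemask_t*);
>>> typedef void (*interleave_v2_t)(void*, size_t, struct bitmask*);
>>>
>>> // Resolved via dlsym/dlvsym at libnuma load time; any of these may
>>> // be NULL if the corresponding symbol version is absent.
>>> static interleave_v1_t  _numa_interleave_memory    = NULL;
>>> static interleave_v2_t  _numa_interleave_memory_v2 = NULL;
>>> static nodemask_t      *_numa_all_nodes            = NULL;
>>> static struct bitmask  *_numa_all_nodes_ptr        = NULL;
>>>
>>> static void interleave_memory(void *start, size_t size) {
>>>   // Prefer the v2 symbol; fall back to v1 on old libnuma builds.
>>>   if (_numa_interleave_memory_v2 != NULL && _numa_all_nodes_ptr != NULL) {
>>>     _numa_interleave_memory_v2(start, size, _numa_all_nodes_ptr);
>>>   } else if (_numa_interleave_memory != NULL && _numa_all_nodes != NULL) {
>>>     _numa_interleave_memory(start, size, _numa_all_nodes);
>>>   }
>>> }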
>>>
>>> Regarding the comment, I agree, it needs an update since we are no
>>> longer tied to version 1.1 (we are in effect already using v2 for
>>> some functions). We could delete the comment atop libnuma_dlsym()
>>> and add something like:
>>>
>>> "Handle request to load libnuma symbol version 1.1 (API v1). If it
>>> fails, load the symbol from the base version instead."
>>>
>>> and to libnuma_v2_dlsym() add:
>>>
>>> "Handle request to load libnuma symbol version 1.2 (API v2) only. If
>>> it fails no symbol from any other version - even if present - is
>>> loaded."
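>>>
>>> For illustration, a sketch of the pair of loaders those comments
>>> describe, using glibc's dlvsym() (the version strings are the ones
>>> libnuma exports; this is a sketch, not necessarily the exact webrev
>>> code):
>>>
>>> #define _GNU_SOURCE
>>> #include <dlfcn.h>
>>> #include <stddef.h>
>>>
>>> // Try the symbol at version 1.1 (API v1); fall back to the base
>>> // (unversioned) symbol for libnuma builds without symbol versions.
>>> static void* libnuma_dlsym(void* handle, const char* name) {
>>>   void* f = dlvsym(handle, name, "libnuma_1.1");
>>>   if (f == NULL) {
>>>     f = dlsym(handle, name);
>>>   }
>>>   return f;
>>> }
>>>
>>> // Only the symbol at version 1.2 (API v2); a NULL return means
>>> // "v2 not available", letting the caller fall back explicitly.
>>> static void* libnuma_v2_dlsym(void* handle, const char* name) {
>>>   return dlvsym(handle, name, "libnuma_1.2");
>>> }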
>>>
>>> I've opened a bug to track the transition to API v2 (I also
>>> discussed it with Volker):
>>> https://bugs.openjdk.java.net/browse/JDK-8181196
>>>
>>>
>>> Regards,
>>> Gustavo
>>>
>>> [1] API v1 vs API v2:
>>>
>>> API v1
>>> ======
>>>
>>> ('-' marks functions not present in API v1)
>>>
>>> int numa_node_to_cpus(int node, unsigned long *buffer, int bufferlen);
>>> int numa_max_node(void);
>>> - int numa_num_configured_nodes(void);
>>> int numa_available(void);
>>> void numa_tonode_memory(void *start, size_t size, int node);
>>> void numa_interleave_memory(void *start, size_t size, nodemask_t *nodemask);
>>> void numa_set_bind_policy(int strict);
>>> - int numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int n);
>>> int numa_distance(int node1, int node2);
>>>
>>>
>>> API v2
>>> ======
>>>
>>> int numa_node_to_cpus(int node, struct bitmask *mask);
>>> int numa_max_node(void);
>>> int numa_num_configured_nodes(void);
>>> int numa_available(void);
>>> void numa_tonode_memory(void *start, size_t size, int node);
>>> void numa_interleave_memory(void *start, size_t size, struct bitmask *nodemask);
>>> void numa_set_bind_policy(int strict);
>>> int numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int n);
>>> int numa_distance(int node1, int node2);
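>>>
>>> For case 1, a minimal sketch of the kind of adaptation the v2 API
>>> would require in os::Linux::rebuild_cpu_to_node_map() (hypothetical
>>> helper, just to illustrate the signature change):
>>>
>>> #include <numa.h>
>>> #include <stdio.h>
>>>
>>> // v1: int numa_node_to_cpus(int node, unsigned long *buffer, int bufferlen);
>>> // v2: int numa_node_to_cpus(int node, struct bitmask *mask);
>>> static void print_cpus_of_node(int node) {
>>>   struct bitmask *cpus = numa_allocate_cpumask();
>>>   if (numa_node_to_cpus(node, cpus) == 0) {
>>>     for (unsigned int cpu = 0; cpu < cpus->size; cpu++) {
>>>       if (numa_bitmask_isbitset(cpus, cpu))
>>>         printf("node %d: cpu %u\n", node, cpu);
>>>     }
>>>   }
>>>   numa_bitmask_free(cpus);
>>> }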
>>>
>>>
>>>> I'm running this through JPRT now.
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -Zhengyu
>>>>>
>>>>>
>>>>>
>>>>> On 05/26/2017 08:34 PM, Gustavo Romero wrote:
>>>>>> Hi Zhengyu,
>>>>>>
>>>>>> Thanks a lot for taking care of this corner case on PPC64.
>>>>>>
>>>>>> On 26-05-2017 10:41, Zhengyu Gu wrote:
>>>>>>> This is a quick way to kill the symptom (or is it just low
>>>>>>> risk?). I am not sure whether disabling NUMA would be a better
>>>>>>> solution for this circumstance. Does 1 NUMA node = UMA?
>>>>>>
>>>>>> On PPC64, 1 (configured) NUMA node does not necessarily imply UMA.
>>>>>> In the POWER7 machine where you found the corner case (I copy below
>>>>>> the data you provided in the JBS - thanks for the additional
>>>>>> information):
>>>>>>
>>>>>> $ numactl -H
>>>>>> available: 2 nodes (0-1)
>>>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>>>> node 0 size: 0 MB
>>>>>> node 0 free: 0 MB
>>>>>> node 1 cpus:
>>>>>> node 1 size: 7680 MB
>>>>>> node 1 free: 1896 MB
>>>>>> node distances:
>>>>>> node   0   1
>>>>>>   0:  10  40
>>>>>>   1:  40  10
>>>>>>
>>>>>> CPUs in node0 have no alternative besides allocating memory from
>>>>>> node1. In that case CPUs in node0 are always accessing remote
>>>>>> memory from node1 at a constant distance (40), so we could say
>>>>>> that 1 NUMA (configured) node == UMA. Nonetheless, if you add
>>>>>> CPUs in node1 (by filling up the other socket present in the
>>>>>> board) you will end up with CPUs at different distances from the
>>>>>> node that has configured memory (in that case, node1), so it
>>>>>> yields a configuration where 1 NUMA (configured) node != UMA
>>>>>> (i.e. distances are not always equal to a single value).
>>>>>>
>>>>>> On the other hand, the POWER7 machine configuration in question
>>>>>> is bad (and rare). It's indeed impacting the whole system
>>>>>> performance, and it would be reasonable to open the machine and
>>>>>> move the memory module from the bank associated with node1 to the
>>>>>> bank associated with node0, because all CPUs are accessing remote
>>>>>> memory without any apparent necessity. Once you change that, all
>>>>>> CPUs will have local memory (distance = 10).
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> -Zhengyu
>>>>>>>
>>>>>>> On 05/26/2017 09:14 AM, Zhengyu Gu wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> There is a corner case that still failed after JDK-8175813.
>>>>>>>>
>>>>>>>> The system shows that it has multiple NUMA nodes, but only one
>>>>>>>> is configured. Under this scenario, the numa_interleave_memory()
>>>>>>>> call results in an "mbind: Invalid argument" message.
>>>>>>>>
>>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8181055
>>>>>>>> Webrev: http://cr.openjdk.java.net/~zgu/8181055/webrev.00/
>>>>>>
>>>>>> It looks like even for that rare POWER7 NUMA topology
>>>>>> numa_interleave_memory() should succeed without "mbind: Invalid
>>>>>> argument", since the 'mask' argument should already be a mask
>>>>>> containing only nodes from which memory can be allocated, i.e.
>>>>>> only a mask of configured nodes (even if the mask contains only
>>>>>> one configured node, as in
>>>>>> http://cr.openjdk.java.net/~gromero/logs/numa_only_one_node.txt).
>>>>>>
>>>>>> Inspecting a little bit more, it looks like the problem boils
>>>>>> down to the fact that the JVM is passing 'numa_all_nodes' [1] to
>>>>>> numa_interleave_memory() in os::Linux::numa_interleave_memory().
>>>>>>
>>>>>> One would expect that 'numa_all_nodes' (which is API v1) would
>>>>>> track the same information as 'numa_all_nodes_ptr' (API v2) [2];
>>>>>> however, there is a subtle but important difference:
>>>>>>
>>>>>> 'numa_all_nodes' is constructed assuming a consecutive node
>>>>>> distribution [3]:
>>>>>>
>>>>>> 100         max = numa_num_configured_nodes();
>>>>>> 101         for (i = 0; i < max; i++)
>>>>>> 102                 nodemask_set_compat((nodemask_t *)&numa_all_nodes, i);
>>>>>>
>>>>>>
>>>>>> whilst 'numa_all_nodes_ptr' is constructed by parsing
>>>>>> /proc/self/status [4]:
>>>>>>
>>>>>> 499                 if (strncmp(buffer,"Mems_allowed:",13) == 0) {
>>>>>> 500                         numprocnode = read_mask(mask, numa_all_nodes_ptr);
>>>>>>
>>>>>> Thus for a topology like:
>>>>>>
>>>>>> available: 4 nodes (0-1,16-17)
>>>>>> node 0 cpus: 0 8 16 24 32
>>>>>> node 0 size: 130706 MB
>>>>>> node 0 free: 145 MB
>>>>>> node 1 cpus: 40 48 56 64 72
>>>>>> node 1 size: 0 MB
>>>>>> node 1 free: 0 MB
>>>>>> node 16 cpus: 80 88 96 104 112
>>>>>> node 16 size: 130630 MB
>>>>>> node 16 free: 529 MB
>>>>>> node 17 cpus: 120 128 136 144 152
>>>>>> node 17 size: 0 MB
>>>>>> node 17 free: 0 MB
>>>>>> node distances:
>>>>>> node   0   1  16  17
>>>>>>   0:  10  20  40  40
>>>>>>   1:  20  10  40  40
>>>>>>  16:  40  40  10  20
>>>>>>  17:  40  40  20  10
>>>>>>
>>>>>> numa_all_nodes=0x3         => 0b11                (node0 and node1)
>>>>>> numa_all_nodes_ptr=0x10001 => 0b10000000000000001 (node0 and node16)
>>>>>>
>>>>>> (Please see details in the following gdb log:
>>>>>> http://cr.openjdk.java.net/~gromero/logs/numa_api_v1_vs_api_v2.txt)
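>>>>>>
>>>>>> A quick way to compare the two masks on a given box (a sketch,
>>>>>> compile with -lnuma; it prints only the first word of each mask,
>>>>>> which is enough for small node counts):
>>>>>>
>>>>>> #include <numa.h>
>>>>>> #include <stdio.h>
>>>>>>
>>>>>> int main(void) {
>>>>>>   if (numa_available() < 0) return 1;
>>>>>>   // API v1 compat mask: bits 0..numa_num_configured_nodes()-1,
>>>>>>   // i.e. it assumes node ids are consecutive starting at 0.
>>>>>>   printf("numa_all_nodes     = 0x%lx\n", numa_all_nodes.n[0]);
>>>>>>   // API v2 mask: parsed from Mems_allowed in /proc/self/status.
>>>>>>   printf("numa_all_nodes_ptr = 0x%lx\n", numa_all_nodes_ptr->maskp[0]);
>>>>>>   return 0;
>>>>>> }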
>>>>>>
>>>>>> In that case, passing node0 and node1, although suboptimal, does
>>>>>> not bother mbind() since the following is satisfied:
>>>>>>
>>>>>> "[nodemask] must contain at least one node that is on-line,
>>>>>> allowed by the process's current cpuset context, and contains
>>>>>> memory."
>>>>>>
>>>>>> So back to the POWER7 case, I suppose that for:
>>>>>>
>>>>>> available: 2 nodes (0-1)
>>>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>>>> node 0 size: 0 MB
>>>>>> node 0 free: 0 MB
>>>>>> node 1 cpus:
>>>>>> node 1 size: 7680 MB
>>>>>> node 1 free: 1896 MB
>>>>>> node distances:
>>>>>> node   0   1
>>>>>>   0:  10  40
>>>>>>   1:  40  10
>>>>>>
>>>>>> numa_all_nodes=0x1         => 0b01 (node0)
>>>>>> numa_all_nodes_ptr=0x2     => 0b10 (node1)
>>>>>>
>>>>>> and hence numa_interleave_memory() gets nodemask = 0x1 (node0),
>>>>>> which indeed contains no memory. That said, I don't know for sure
>>>>>> whether passing just node1 in the 'nodemask' would satisfy
>>>>>> mbind(), as in that case there are no CPUs available in node1.
>>>>>>
>>>>>> Summing up, it looks like the root cause is not that
>>>>>> numa_interleave_memory() does not accept only one configured
>>>>>> node, but that the configured node being passed is wrong. I could
>>>>>> not find a similar NUMA topology in my pool to test further, but
>>>>>> it might be worth writing a small test using API v2 and
>>>>>> 'numa_all_nodes_ptr' instead of 'numa_all_nodes' to see how
>>>>>> numa_interleave_memory() behaves on that machine :) If it behaves
>>>>>> well, updating to API v2 would be a solution.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Regards,
>>>>>> Gustavo
>>>>>>
>>>>>>
>>>>>> [1]
>>>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/4b93e1b1d5b7/src/os/linux/vm/os_linux.hpp#l274
>>>>>>
>>>>>> [2] from libnuma.c:608 numa_all_nodes_ptr: "it only tracks nodes
>>>>>> with memory from which the calling process can allocate."
>>>>>> [3]
>>>>>> https://github.com/numactl/numactl/blob/master/libnuma.c#L100-L102
>>>>>> [4]
>>>>>> https://github.com/numactl/numactl/blob/master/libnuma.c#L499-L500
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> The system NUMA configuration:
>>>>>>>>
>>>>>>>> Architecture: ppc64
>>>>>>>> CPU op-mode(s): 32-bit, 64-bit
>>>>>>>> Byte Order: Big Endian
>>>>>>>> CPU(s): 8
>>>>>>>> On-line CPU(s) list: 0-7
>>>>>>>> Thread(s) per core: 4
>>>>>>>> Core(s) per socket: 1
>>>>>>>> Socket(s): 2
>>>>>>>> NUMA node(s): 2
>>>>>>>> Model: 2.1 (pvr 003f 0201)
>>>>>>>> Model name: POWER7 (architected), altivec supported
>>>>>>>> L1d cache: 32K
>>>>>>>> L1i cache: 32K
>>>>>>>> NUMA node0 CPU(s): 0-7
>>>>>>>> NUMA node1 CPU(s):
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> -Zhengyu
>>>>>>>
>>>>>>
>>>>
>>>

