RFR(XS) 8181055: "mbind: Invalid argument" still seen after 8175813
David Holmes
david.holmes at oracle.com
Tue May 30 21:30:16 UTC 2017
Looks fine to me.
Thanks,
David
On 30/05/2017 9:59 PM, Zhengyu Gu wrote:
> Hi David and Gustavo,
>
> Thanks for the review.
>
> Webrev is updated according to your comments:
>
> http://cr.openjdk.java.net/~zgu/8181055/webrev.02/
>
> Thanks,
>
> -Zhengyu
>
>
> On 05/29/2017 07:06 PM, Gustavo Romero wrote:
>> Hi David,
>>
>> On 29-05-2017 01:34, David Holmes wrote:
>>> Hi Zhengyu,
>>>
>>> On 29/05/2017 12:08 PM, Zhengyu Gu wrote:
>>>> Hi Gustavo,
>>>>
>>>> Thanks for the detailed analysis and suggestion. I did not realize the
>>>> difference between bitmask and nodemask.
>>>>
>>>> As you suggested, numa_interleave_memory_v2 works under this
>>>> configuration.
>>>>
>>>> Please see the updated webrev:
>>>> http://cr.openjdk.java.net/~zgu/8181055/webrev.01/
>>>
>>> The addition of support for the "v2" API seems okay. Though I think
>>> this comment needs some clarification for the existing code:
>>>
>>> 2837 // If we are running with libnuma version > 2, then we should
>>> 2838 // be trying to use symbols with versions 1.1
>>> 2839 // If we are running with earlier version, which did not have symbol versions,
>>> 2840 // we should use the base version.
>>> 2841 void* os::Linux::libnuma_dlsym(void* handle, const char *name) {
>>>
>>> given that we now explicitly load the v1.2 symbol if present.
>>>
>>> Gustavo: can you vouch for the suitability of using the v2 API in all
>>> cases, if it exists?
>>
>> My understanding is that in the transition to API v2 only the use of
>> numa_node_to_cpus() by the JVM, in os::Linux::rebuild_cpu_to_node_map(),
>> will have to be adapted. The remaining functions (excluding
>> numa_interleave_memory(), which Zhengyu already addressed) keep the same
>> functionality and signatures [1].
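>>
>> Just to illustrate, a minimal sketch of what the v2-style iteration could
>> look like (the helper name print_cpus_of_node is made up; the
>> numa_allocate_cpumask()/numa_node_to_cpus()/numa_bitmask_isbitset() calls
>> are the v2 API):
>>
>> #include <numa.h>
>> #include <stdio.h>
>>
>> // Iterate the CPUs of one node using the v2 bitmask-based signature
>> // instead of the v1 unsigned long buffer.
>> static void print_cpus_of_node(int node) {
>>   struct bitmask* cpus = numa_allocate_cpumask();  // sized for all CPUs
>>   if (numa_node_to_cpus(node, cpus) == 0) {        // v2: (int, struct bitmask*)
>>     for (unsigned long i = 0; i < cpus->size; i++) {
>>       if (numa_bitmask_isbitset(cpus, (unsigned int)i)) {
>>         printf("node %d has cpu %lu\n", node, i);
>>       }
>>     }
>>   }
>>   numa_free_cpumask(cpus);
>> }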
>>
>> Currently the JVM's NUMA support requires the following libnuma functions:
>>
>> 1. numa_node_to_cpus          v1 != v2 (using v1, JVM has to adapt)
>> 2. numa_max_node              v1 == v2 (using v1, transition is straightforward)
>> 3. numa_num_configured_nodes  v2       (added by gromero: 8175813)
>> 4. numa_available             v1 == v2 (using v1, transition is straightforward)
>> 5. numa_tonode_memory         v1 == v2 (using v1, transition is straightforward)
>> 6. numa_interleave_memory     v1 != v2 (updated by zhengyu: 8181055.
>>                                         Default use of v2, fallback to v1)
>> 7. numa_set_bind_policy       v1 == v2 (using v1, transition is straightforward)
>> 8. numa_bitmask_isbitset      v2       (added by gromero: 8175813)
>> 9. numa_distance              v1 == v2 (added by gromero: 8175813.
>>                                         Using v1, transition is straightforward)
>>
>> v1 != v2: function signature in version 1 is different from version 2
>> v1 == v2: function signature in version 1 is equal to version 2
>> v2 : function is only present in API v2
>>
>> Thus, to the best of my knowledge, except for case 1 (which the JVM needs
>> to adapt to), all other cases are suitable for the v2 API, and we could
>> either use a fallback mechanism as proposed by Zhengyu or update directly
>> to API v2 (risky?), given that I can't see how the v2 API would not be
>> available on current (not-EOL) Linux distro releases.
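>>
>> A rough sketch of that fallback idea (illustrative only, not the actual
>> webrev code; the function-pointer and mask names below are assumptions
>> in the style of os_linux.hpp):
>>
>> #include <numa.h>     // struct bitmask
>> #include <stddef.h>   // size_t
>>
>> // Entry points resolved at runtime from libnuma; NULL when not found.
>> typedef void (*interleave_v1_t)(void*, size_t, unsigned long*);
>> typedef void (*interleave_v2_t)(void*, size_t, struct bitmask*);
>> static interleave_v1_t _numa_interleave_memory    = NULL; // "libnuma_1.1"
>> static interleave_v2_t _numa_interleave_memory_v2 = NULL; // "libnuma_1.2"
>> static unsigned long*  _numa_all_nodes     = NULL;        // API v1 mask
>> static struct bitmask* _numa_all_nodes_ptr = NULL;        // API v2 mask
>>
>> static void interleave_memory(void* start, size_t size) {
>>   if (_numa_interleave_memory_v2 != NULL && _numa_all_nodes_ptr != NULL) {
>>     // Default: API v2 with the mask of nodes memory can be allocated from.
>>     _numa_interleave_memory_v2(start, size, _numa_all_nodes_ptr);
>>   } else if (_numa_interleave_memory != NULL && _numa_all_nodes != NULL) {
>>     // Fallback: API v1 with the consecutive-node mask.
>>     _numa_interleave_memory(start, size, _numa_all_nodes);
>>   }
>> }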
>>
>> Regarding the comment, I agree, it needs an update since we are no longer
>> tied to version 1.1 (we are in effect already using v2 for some functions).
>> We could delete the comment atop libnuma_dlsym() and add something like:
>>
>> "Handle request to load libnuma symbol version 1.1 (API v1). If it
>> fails, load the symbol from the base version instead."
>>
>> and to libnuma_v2_dlsym() add:
>>
>> "Handle request to load libnuma symbol version 1.2 (API v2) only. If
>> it fails, no symbol from any other version, even if present, is loaded."
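>>
>> Put together, the two lookup helpers with those comments could look
>> roughly like the sketch below (dlvsym() is the glibc extension for
>> versioned lookups; "libnuma_1.1"/"libnuma_1.2" are the symbol versions
>> exported by libnuma):
>>
>> #define _GNU_SOURCE
>> #include <dlfcn.h>
>>
>> // Handle request to load libnuma symbol version 1.1 (API v1). If it
>> // fails, load the symbol from the base version instead.
>> static void* libnuma_dlsym(void* handle, const char* name) {
>>   void* f = dlvsym(handle, name, "libnuma_1.1");
>>   if (f == NULL) {
>>     f = dlsym(handle, name);
>>   }
>>   return f;
>> }
>>
>> // Handle request to load libnuma symbol version 1.2 (API v2) only. If
>> // it fails, no symbol from any other version, even if present, is loaded.
>> static void* libnuma_v2_dlsym(void* handle, const char* name) {
>>   return dlvsym(handle, name, "libnuma_1.2");
>> }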
>>
>> I've opened a bug to track the transition to API v2 (I also discussed
>> that with Volker):
>> https://bugs.openjdk.java.net/browse/JDK-8181196
>>
>>
>> Regards,
>> Gustavo
>>
>> [1] API v1 vs API v2:
>>
>> API v1
>> ======
>>
>> int numa_node_to_cpus(int node, unsigned long *buffer, int bufferlen);
>> int numa_max_node(void);
>> - int numa_num_configured_nodes(void);
>> int numa_available(void);
>> void numa_tonode_memory(void *start, size_t size, int node);
>> void numa_interleave_memory(void *start, size_t size, nodemask_t *nodemask);
>> void numa_set_bind_policy(int strict);
>> - int numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int n);
>> int numa_distance(int node1, int node2);
>>
>>
>> API v2
>> ======
>>
>> int numa_node_to_cpus(int node, struct bitmask *mask);
>> int numa_max_node(void);
>> int numa_num_configured_nodes(void);
>> int numa_available(void);
>> void numa_tonode_memory(void *start, size_t size, int node);
>> void numa_interleave_memory(void *start, size_t size, struct bitmask *nodemask);
>> void numa_set_bind_policy(int strict);
>> int numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int n);
>> int numa_distance(int node1, int node2);
>>
>>
>>> I'm running this through JPRT now.
>>>
>>> Thanks,
>>> David
>>>
>>>>
>>>> Thanks,
>>>>
>>>> -Zhengyu
>>>>
>>>>
>>>>
>>>> On 05/26/2017 08:34 PM, Gustavo Romero wrote:
>>>>> Hi Zhengyu,
>>>>>
>>>>> Thanks a lot for taking care of this corner case on PPC64.
>>>>>
>>>>> On 26-05-2017 10:41, Zhengyu Gu wrote:
>>>>>> This is a quick (or low-risk?) way to kill the symptom. I am not
>>>>>> sure whether disabling NUMA would be a better solution in this
>>>>>> circumstance. Does 1 NUMA node = UMA?
>>>>>
>>>>> On PPC64, 1 (configured) NUMA node does not necessarily imply UMA.
>>>>> In the POWER7 machine where you found the corner case (I copy below
>>>>> the data you provided in the JBS; thanks for the additional
>>>>> information):
>>>>>
>>>>> $ numactl -H
>>>>> available: 2 nodes (0-1)
>>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>>> node 0 size: 0 MB
>>>>> node 0 free: 0 MB
>>>>> node 1 cpus:
>>>>> node 1 size: 7680 MB
>>>>> node 1 free: 1896 MB
>>>>> node distances:
>>>>> node   0   1
>>>>>   0:  10  40
>>>>>   1:  40  10
>>>>>
>>>>> CPUs in node0 have no alternative besides allocating memory from
>>>>> node1. In that case CPUs in node0 are always accessing remote memory
>>>>> from node1 at a constant distance (40), so we could say that
>>>>> 1 NUMA (configured) node == UMA. Nonetheless, if you add CPUs to
>>>>> node1 (by filling up the other socket present in the board) you will
>>>>> end up with CPUs at different distances from the node that has
>>>>> configured memory (in that case, node1), so it yields a configuration
>>>>> where 1 NUMA (configured) node != UMA (i.e. distances are not always
>>>>> equal to a single value).
>>>>>
>>>>> On the other hand, the POWER7 machine configuration in question is
>>>>> bad (and rare). It is indeed hurting whole-system performance, and it
>>>>> would be reasonable to open the machine and move the memory module
>>>>> from the bank associated with node1 to the bank associated with
>>>>> node0, because all CPUs are accessing remote memory without any
>>>>> apparent necessity. Once you change that, all CPUs will have local
>>>>> memory (distance = 10).
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -Zhengyu
>>>>>>
>>>>>> On 05/26/2017 09:14 AM, Zhengyu Gu wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> There is a corner case that still failed after JDK-8175813.
>>>>>>>
>>>>>>> The system shows that it has multiple NUMA nodes, but only one is
>>>>>>> configured. Under this scenario, the numa_interleave_memory() call
>>>>>>> results in an "mbind: Invalid argument" message.
>>>>>>>
>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8181055
>>>>>>> Webrev: http://cr.openjdk.java.net/~zgu/8181055/webrev.00/
>>>>>
>>>>> It looks like even for that rare POWER7 NUMA topology
>>>>> numa_interleave_memory() should succeed without "mbind: Invalid
>>>>> argument", since the 'mask' argument should already be a mask with
>>>>> only nodes from which memory can be allocated, i.e. only a mask of
>>>>> configured nodes (even if the mask contains only one configured
>>>>> node, as in
>>>>> http://cr.openjdk.java.net/~gromero/logs/numa_only_one_node.txt).
>>>>>
>>>>> Inspecting a little bit more, it looks like the problem boils down
>>>>> to the fact that the JVM is passing 'numa_all_nodes' [1] to
>>>>> numa_interleave_memory() in Linux::numa_interleave_memory().
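>>>>>
>>>>> Paraphrasing, the wrapper at [1] boils down to something like the
>>>>> sketch below (names follow the os_linux.hpp style, but this is a
>>>>> sketch, not the exact source):
>>>>>
>>>>> #include <stddef.h>
>>>>>
>>>>> // API v1 entry point and v1 "all nodes" mask, resolved from libnuma.
>>>>> typedef void (*numa_interleave_memory_func_t)(void* start, size_t size,
>>>>>                                               unsigned long* nodemask);
>>>>> static numa_interleave_memory_func_t _numa_interleave_memory = NULL;
>>>>> static unsigned long* _numa_all_nodes = NULL;
>>>>>
>>>>> static void numa_interleave_memory(void* start, size_t size) {
>>>>>   if (_numa_interleave_memory != NULL && _numa_all_nodes != NULL) {
>>>>>     // On the POWER7 box this mask ends up containing only node0,
>>>>>     // which has no memory, so the underlying mbind() fails with
>>>>>     // "Invalid argument" (details below).
>>>>>     _numa_interleave_memory(start, size, _numa_all_nodes);
>>>>>   }
>>>>> }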
>>>>>
>>>>> One would expect that 'numa_all_nodes' (which is API v1) would track
>>>>> the same information as 'numa_all_nodes_ptr' (API v2) [2]; however,
>>>>> there is a subtle but important difference:
>>>>>
>>>>> 'numa_all_nodes' is constructed assuming a consecutive node
>>>>> distribution [3]:
>>>>>
>>>>> 100 max = numa_num_configured_nodes();
>>>>> 101 for (i = 0; i < max; i++)
>>>>> 102 nodemask_set_compat((nodemask_t
>>>>> *)&numa_all_nodes, i);
>>>>>
>>>>>
>>>>> whilst 'numa_all_nodes_ptr' is constructed by parsing
>>>>> /proc/self/status [4]:
>>>>>
>>>>> 499 if (strncmp(buffer,"Mems_allowed:",13) == 0) {
>>>>> 500 numprocnode = read_mask(mask,
>>>>> numa_all_nodes_ptr);
>>>>>
>>>>> Thus for a topology like:
>>>>>
>>>>> available: 4 nodes (0-1,16-17)
>>>>> node 0 cpus: 0 8 16 24 32
>>>>> node 0 size: 130706 MB
>>>>> node 0 free: 145 MB
>>>>> node 1 cpus: 40 48 56 64 72
>>>>> node 1 size: 0 MB
>>>>> node 1 free: 0 MB
>>>>> node 16 cpus: 80 88 96 104 112
>>>>> node 16 size: 130630 MB
>>>>> node 16 free: 529 MB
>>>>> node 17 cpus: 120 128 136 144 152
>>>>> node 17 size: 0 MB
>>>>> node 17 free: 0 MB
>>>>> node distances:
>>>>> node   0   1  16  17
>>>>>   0:  10  20  40  40
>>>>>   1:  20  10  40  40
>>>>>  16:  40  40  10  20
>>>>>  17:  40  40  20  10
>>>>>
>>>>> numa_all_nodes=0x3 => 0b11 (node0 and node1)
>>>>> numa_all_nodes_ptr=0x10001 => 0b10000000000000001 (node0 and node16)
>>>>>
>>>>> (Please, see details in the following gdb log:
>>>>> http://cr.openjdk.java.net/~gromero/logs/numa_api_v1_vs_api_v2.txt)
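>>>>>
>>>>> For completeness, the same comparison can be done without gdb; a small
>>>>> illustrative sketch (it rebuilds the v1-style consecutive mask by hand,
>>>>> mirroring [3], and only looks at the first word of the v2 mask):
>>>>>
>>>>> #include <numa.h>
>>>>> #include <stdio.h>
>>>>>
>>>>> int main() {
>>>>>   if (numa_available() == -1) return 1;
>>>>>   // How numa_all_nodes is built: consecutive ids
>>>>>   // 0..numa_num_configured_nodes()-1, regardless of the real node ids.
>>>>>   unsigned long v1_style = 0;
>>>>>   for (int i = 0; i < numa_num_configured_nodes(); i++)
>>>>>     v1_style |= 1UL << i;
>>>>>   printf("v1-style mask (consecutive) = 0x%lx\n", v1_style);
>>>>>   // numa_all_nodes_ptr: nodes with memory the process may allocate from.
>>>>>   printf("numa_all_nodes_ptr          = 0x%lx\n",
>>>>>          numa_all_nodes_ptr->maskp[0]);
>>>>>   return 0;
>>>>> }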
>>>>>
>>>>> In that case passing node0 and node1, although suboptimal, does not
>>>>> bother mbind() since the following is satisfied:
>>>>>
>>>>> "[nodemask] must contain at least one node that is on-line, allowed
>>>>> by the
>>>>> process's current cpuset context, and contains memory."
>>>>>
>>>>> So back to the POWER7 case, I suppose that for:
>>>>>
>>>>> available: 2 nodes (0-1)
>>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>>> node 0 size: 0 MB
>>>>> node 0 free: 0 MB
>>>>> node 1 cpus:
>>>>> node 1 size: 7680 MB
>>>>> node 1 free: 1896 MB
>>>>> node distances:
>>>>> node   0   1
>>>>>   0:  10  40
>>>>>   1:  40  10
>>>>>
>>>>> numa_all_nodes=0x1 => 0b01 (node0)
>>>>> numa_all_nodes_ptr=0x2 => 0b10 (node1)
>>>>>
>>>>> and hence numa_interleave_memory() gets nodemask = 0x1 (node0),
>>>>> which indeed contains no memory. That said, I don't know for sure
>>>>> whether passing just node1 in the 'nodemask' will satisfy mbind(),
>>>>> as in that case there are no cpus available in node1.
>>>>>
>>>>> Summing up, it looks like the root cause is not that
>>>>> numa_interleave_memory() does not accept only one configured node,
>>>>> but that the configured node being passed is wrong. I could not find
>>>>> a similar NUMA topology in my pool of machines to test further, but
>>>>> it might be worth writing a small test using API v2 and
>>>>> 'numa_all_nodes_ptr' instead of 'numa_all_nodes' to see how
>>>>> numa_interleave_memory() behaves on that machine :) If it behaves
>>>>> well, updating to API v2 would be a solution.
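>>>>>
>>>>> An untested sketch of such a test (just a guess at what it could look
>>>>> like; build e.g. with "g++ test.cpp -lnuma" against libnuma v2):
>>>>>
>>>>> #include <numa.h>
>>>>> #include <sys/mman.h>
>>>>> #include <stdio.h>
>>>>>
>>>>> int main() {
>>>>>   if (numa_available() == -1) {
>>>>>     fprintf(stderr, "libnuma not available\n");
>>>>>     return 1;
>>>>>   }
>>>>>   size_t size = 64 * 1024 * 1024;  // 64 MB anonymous region
>>>>>   void* p = mmap(NULL, size, PROT_READ | PROT_WRITE,
>>>>>                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>>>   if (p == MAP_FAILED) {
>>>>>     perror("mmap");
>>>>>     return 1;
>>>>>   }
>>>>>   // API v2: numa_all_nodes_ptr only tracks nodes with memory the
>>>>>   // process may allocate from. libnuma calls mbind() here, so an
>>>>>   // "mbind: Invalid argument" complaint would show up on stderr.
>>>>>   numa_interleave_memory(p, size, numa_all_nodes_ptr);
>>>>>   // Touch the pages so they are actually allocated under the policy.
>>>>>   for (size_t i = 0; i < size; i += 4096) {
>>>>>     ((char*)p)[i] = 1;
>>>>>   }
>>>>>   printf("done\n");
>>>>>   munmap(p, size);
>>>>>   return 0;
>>>>> }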
>>>>>
>>>>> HTH
>>>>>
>>>>> Regards,
>>>>> Gustavo
>>>>>
>>>>>
>>>>> [1]
>>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/4b93e1b1d5b7/src/os/linux/vm/os_linux.hpp#l274
>>>>>
>>>>> [2] from libnuma.c:608 numa_all_nodes_ptr: "it only tracks nodes
>>>>> with memory from which the calling process can allocate."
>>>>> [3] https://github.com/numactl/numactl/blob/master/libnuma.c#L100-L102
>>>>> [4] https://github.com/numactl/numactl/blob/master/libnuma.c#L499-L500
>>>>>
>>>>>
>>>>>>>
>>>>>>> The system NUMA configuration:
>>>>>>>
>>>>>>> Architecture: ppc64
>>>>>>> CPU op-mode(s): 32-bit, 64-bit
>>>>>>> Byte Order: Big Endian
>>>>>>> CPU(s): 8
>>>>>>> On-line CPU(s) list: 0-7
>>>>>>> Thread(s) per core: 4
>>>>>>> Core(s) per socket: 1
>>>>>>> Socket(s): 2
>>>>>>> NUMA node(s): 2
>>>>>>> Model: 2.1 (pvr 003f 0201)
>>>>>>> Model name: POWER7 (architected), altivec supported
>>>>>>> L1d cache: 32K
>>>>>>> L1i cache: 32K
>>>>>>> NUMA node0 CPU(s): 0-7
>>>>>>> NUMA node1 CPU(s):
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> -Zhengyu
>>>>>>
>>>>>
>>>
>>