RFR(XS) 8181055: "mbind: Invalid argument" still seen after 8175813
David Holmes
david.holmes at oracle.com
Thu Jun 1 01:48:31 UTC 2017
Hi Gustavo,
On 31/05/2017 10:47 PM, Gustavo Romero wrote:
> Hi Zhengyu,
>
> On 30-05-2017 21:37, Zhengyu Gu wrote:
>> Hi David,
>>
>> Thanks for the review.
>>
>> Gustavo, might I count you as a reviewer?
>
> Formally speaking (according to the community Bylaws) I'm not a reviewer, so
> I guess no.
You are not a Reviewer (capital 'R') but you can certainly review and be
listed as a reviewer.
Cheers,
David
>
> Kind regards,
> Gustavo
>
>> Thanks,
>>
>> -Zhengyu
>>
>>
>>
>> On 05/30/2017 05:30 PM, David Holmes wrote:
>>> Looks fine to me.
>>>
>>> Thanks,
>>> David
>>>
>>> On 30/05/2017 9:59 PM, Zhengyu Gu wrote:
>>>> Hi David and Gustavo,
>>>>
>>>> Thanks for the review.
>>>>
>>>> Webrev is updated according to your comments:
>>>>
>>>> http://cr.openjdk.java.net/~zgu/8181055/webrev.02/
>>>>
>>>> Thanks,
>>>>
>>>> -Zhengyu
>>>>
>>>>
>>>> On 05/29/2017 07:06 PM, Gustavo Romero wrote:
>>>>> Hi David,
>>>>>
>>>>> On 29-05-2017 01:34, David Holmes wrote:
>>>>>> Hi Zhengyu,
>>>>>>
>>>>>> On 29/05/2017 12:08 PM, Zhengyu Gu wrote:
>>>>>>> Hi Gustavo,
>>>>>>>
>>>>>>> Thanks for the detailed analysis and suggestion. I did not realize
>>>>>>> the difference between bitmask and nodemask.
>>>>>>>
>>>>>>> As you suggested, numa_interleave_memory_v2 works under this
>>>>>>> configuration.
>>>>>>>
>>>>>>> Please see the updated webrev:
>>>>>>> http://cr.openjdk.java.net/~zgu/8181055/webrev.01/
>>>>>>
>>>>>> The addition of support for the "v2" API seems okay. Though I think
>>>>>> this comment needs some clarification for the existing code:
>>>>>>
>>>>>> 2837 // If we are running with libnuma version > 2, then we should
>>>>>> 2838 // be trying to use symbols with versions 1.1
>>>>>> 2839 // If we are running with earlier version, which did not have symbol versions,
>>>>>> 2840 // we should use the base version.
>>>>>> 2841 void* os::Linux::libnuma_dlsym(void* handle, const char *name) {
>>>>>>
>>>>>> given that we now explicitly load the v1.2 symbol if present.
>>>>>>
>>>>>> Gustavo: can you vouch for the suitability of using the v2 API in
>>>>>> all cases, if it exists?
>>>>>
>>>>> My understanding is that in the transition to API v2 only the usage
>>>>> of numa_node_to_cpus() by the JVM will have to be adapted, in
>>>>> os::Linux::rebuild_cpu_to_node_map(). The remaining functions
>>>>> (excluding numa_interleave_memory(), which Zhengyu has already
>>>>> addressed) preserve the same functionality and signatures [1].
>>>>>
>>>>> Currently the JVM NUMA API requires the following libnuma functions:
>>>>>
>>>>> 1. numa_node_to_cpus          v1 != v2 (using v1, JVM has to adapt)
>>>>> 2. numa_max_node              v1 == v2 (using v1, transition is straightforward)
>>>>> 3. numa_num_configured_nodes  v2       (added by gromero: 8175813)
>>>>> 4. numa_available             v1 == v2 (using v1, transition is straightforward)
>>>>> 5. numa_tonode_memory         v1 == v2 (using v1, transition is straightforward)
>>>>> 6. numa_interleave_memory     v1 != v2 (updated by zhengyu: 8181055; uses v2 by default, falls back to v1)
>>>>> 7. numa_set_bind_policy       v1 == v2 (using v1, transition is straightforward)
>>>>> 8. numa_bitmask_isbitset      v2       (added by gromero: 8175813)
>>>>> 9. numa_distance              v1 == v2 (added by gromero: 8175813; using v1, transition is straightforward)
>>>>>
>>>>> v1 != v2: function signature in version 1 differs from version 2
>>>>> v1 == v2: function signature in version 1 equals that in version 2
>>>>> v2      : function is only present in API v2
>>>>>
>>>>> Thus, to the best of my knowledge, except for case 1. (which the JVM
>>>>> needs to adapt to), all other cases are suitable for the v2 API, and
>>>>> we could either use a fallback mechanism as proposed by Zhengyu or
>>>>> update directly to API v2 (risky?), given that I can't see how the
>>>>> v2 API would not be available on current (not-EOL) Linux distro
>>>>> releases.
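>>>>>
>>>>> For illustration, the fallback idea could look roughly like the
>>>>> sketch below (names such as _numa_interleave_memory_v2 are
>>>>> hypothetical placeholders, not the actual webrev code):
>>>>>
>>>>>   // Prefer the v2 function (loaded from symbol version 1.2) and
>>>>>   // its bitmask of nodes with memory; otherwise fall back to v1.
>>>>>   static void interleave_memory(void *start, size_t size) {
>>>>>     if (_numa_interleave_memory_v2 != NULL && _numa_all_nodes_ptr != NULL) {
>>>>>       _numa_interleave_memory_v2(start, size, _numa_all_nodes_ptr);
>>>>>     } else if (_numa_interleave_memory != NULL) {
>>>>>       // v1 takes a nodemask_t* built assuming consecutive nodes
>>>>>       _numa_interleave_memory(start, size, _numa_all_nodes);
>>>>>     }
>>>>>   }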
>>>>>
>>>>> Regarding the comment, I agree, it needs an update, since we are no
>>>>> longer tied to version 1.1 (we are in effect already using v2 for
>>>>> some functions). We could delete the comment atop libnuma_dlsym()
>>>>> and add something like:
>>>>>
>>>>> "Handle a request to load libnuma symbol version 1.1 (API v1). If it
>>>>> fails, load the symbol from the base version instead."
>>>>>
>>>>> and to libnuma_v2_dlsym() add:
>>>>>
>>>>> "Handle a request to load libnuma symbol version 1.2 (API v2) only.
>>>>> If it fails, no symbol from any other version - even if present - is
>>>>> loaded."
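>>>>>
>>>>> For the record, a minimal sketch of what such a v2-only loader can
>>>>> look like, assuming glibc's dlvsym() (illustrative only):
>>>>>
>>>>>   #define _GNU_SOURCE
>>>>>   #include <dlfcn.h>
>>>>>
>>>>>   // Load only the symbol tagged with version "libnuma_1.2" (API v2);
>>>>>   // return NULL if the versioned symbol is absent - no fallback.
>>>>>   static void *libnuma_v2_dlsym(void *handle, const char *name) {
>>>>>     return dlvsym(handle, name, "libnuma_1.2");
>>>>>   }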
>>>>>
>>>>> I've opened a bug to track the transition to API v2 (I also
>>>>> discussed it with Volker):
>>>>> https://bugs.openjdk.java.net/browse/JDK-8181196
>>>>>
>>>>>
>>>>> Regards,
>>>>> Gustavo
>>>>>
>>>>> [1] API v1 vs API v2:
>>>>>
>>>>> API v1
>>>>> ======
>>>>>
>>>>> int numa_node_to_cpus(int node, unsigned long *buffer, int bufferlen);
>>>>> int numa_max_node(void);
>>>>> - int numa_num_configured_nodes(void);
>>>>> int numa_available(void);
>>>>> void numa_tonode_memory(void *start, size_t size, int node);
>>>>> void numa_interleave_memory(void *start, size_t size, nodemask_t *nodemask);
>>>>> void numa_set_bind_policy(int strict);
>>>>> - int numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int n);
>>>>> int numa_distance(int node1, int node2);
>>>>>
>>>>>
>>>>> API v2
>>>>> ======
>>>>>
>>>>> int numa_node_to_cpus(int node, struct bitmask *mask);
>>>>> int numa_max_node(void);
>>>>> int numa_num_configured_nodes(void);
>>>>> int numa_available(void);
>>>>> void numa_tonode_memory(void *start, size_t size, int node);
>>>>> void numa_interleave_memory(void *start, size_t size, struct bitmask *nodemask);
>>>>> void numa_set_bind_policy(int strict);
>>>>> int numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int n);
>>>>> int numa_distance(int node1, int node2);
>>>>>
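>>>>> To illustrate case 1., the adaptation for the v2 signature of
>>>>> numa_node_to_cpus() could go along these lines (an untested sketch,
>>>>> not the actual os::Linux::rebuild_cpu_to_node_map() code):
>>>>>
>>>>>   #include <stdio.h>
>>>>>   #include <numa.h>
>>>>>
>>>>>   // v2: ask libnuma for a properly sized cpumask instead of
>>>>>   // passing a raw unsigned long buffer as in v1.
>>>>>   static void print_node_cpus(int node) {
>>>>>     struct bitmask *cpus = numa_allocate_cpumask();
>>>>>     if (numa_node_to_cpus(node, cpus) == 0) {
>>>>>       for (unsigned int i = 0; i < cpus->size; i++)
>>>>>         if (numa_bitmask_isbitset(cpus, i))
>>>>>           printf("node %d has cpu %u\n", node, i);
>>>>>     }
>>>>>     numa_bitmask_free(cpus);
>>>>>   }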
>>>>>
>>>>>> I'm running this through JPRT now.
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> -Zhengyu
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 05/26/2017 08:34 PM, Gustavo Romero wrote:
>>>>>>>> Hi Zhengyu,
>>>>>>>>
>>>>>>>> Thanks a lot for taking care of this corner case on PPC64.
>>>>>>>>
>>>>>>>> On 26-05-2017 10:41, Zhengyu Gu wrote:
>>>>>>>>> This is a quick way to kill the symptom (or is it low risk?). I
>>>>>>>>> am not sure whether disabling NUMA would be a better solution in
>>>>>>>>> this circumstance? Does 1 NUMA node = UMA?
>>>>>>>>
>>>>>>>> On PPC64, one (configured) NUMA node does not necessarily imply
>>>>>>>> UMA. In the POWER7 machine where you found the corner case (I
>>>>>>>> copy below the data you provided in the JBS - thanks for the
>>>>>>>> additional information):
>>>>>>>>
>>>>>>>> $ numactl -H
>>>>>>>> available: 2 nodes (0-1)
>>>>>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>>>>>> node 0 size: 0 MB
>>>>>>>> node 0 free: 0 MB
>>>>>>>> node 1 cpus:
>>>>>>>> node 1 size: 7680 MB
>>>>>>>> node 1 free: 1896 MB
>>>>>>>> node distances:
>>>>>>>> node 0 1
>>>>>>>> 0: 10 40
>>>>>>>> 1: 40 10
>>>>>>>>
>>>>>>>> CPUs in node0 have no alternative besides allocating memory from
>>>>>>>> node1. In that case CPUs in node0 are always accessing remote
>>>>>>>> memory on node1 at a constant distance (40), so we could say that
>>>>>>>> 1 NUMA (configured) node == UMA. Nonetheless, if you add CPUs to
>>>>>>>> node1 (by filling up the other socket present on the board) you
>>>>>>>> will end up with CPUs at different distances from the node that
>>>>>>>> has configured memory (in that case, node1), which yields a
>>>>>>>> configuration where 1 NUMA (configured) node != UMA (i.e.
>>>>>>>> distances are not always equal to a single value).
>>>>>>>>
>>>>>>>> On the other hand, the POWER7 machine configuration in question
>>>>>>>> is bad (and rare). It is indeed impacting the whole system's
>>>>>>>> performance, and it would be reasonable to open the machine and
>>>>>>>> move the memory module from the bank belonging to node1 to the
>>>>>>>> bank belonging to node0, because all CPUs are accessing remote
>>>>>>>> memory without any apparent necessity. Once you change that, all
>>>>>>>> CPUs will have local memory (distance = 10).
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> -Zhengyu
>>>>>>>>>
>>>>>>>>> On 05/26/2017 09:14 AM, Zhengyu Gu wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> There is a corner case that still fails after JDK-8175813.
>>>>>>>>>>
>>>>>>>>>> The system shows that it has multiple NUMA nodes, but only one
>>>>>>>>>> is configured. Under this scenario, the numa_interleave_memory()
>>>>>>>>>> call results in an "mbind: Invalid argument" message.
>>>>>>>>>>
>>>>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8181055
>>>>>>>>>> Webrev: http://cr.openjdk.java.net/~zgu/8181055/webrev.00/
>>>>>>>>
>>>>>>>> It looks like even for that rare POWER7 NUMA topology
>>>>>>>> numa_interleave_memory() should succeed without "mbind: Invalid
>>>>>>>> argument", since the 'mask' argument should already be a mask
>>>>>>>> containing only nodes from which memory can be allocated, i.e.
>>>>>>>> only a mask of configured nodes (even if the mask contains only
>>>>>>>> one configured node, as in
>>>>>>>> http://cr.openjdk.java.net/~gromero/logs/numa_only_one_node.txt).
>>>>>>>>
>>>>>>>> Inspecting a little bit more, it looks like the problem boils
>>>>>>>> down to the fact that the JVM is passing 'numa_all_nodes' [1] to
>>>>>>>> numa_interleave_memory() in Linux::numa_interleave_memory().
>>>>>>>>
>>>>>>>> One would expect that 'numa_all_nodes' (which is API v1) would
>>>>>>>> track the same information as 'numa_all_nodes_ptr' (API v2) [2];
>>>>>>>> however, there is a subtle but important difference:
>>>>>>>>
>>>>>>>> 'numa_all_nodes' is constructed assuming a consecutive node
>>>>>>>> distribution [3]:
>>>>>>>>
>>>>>>>> 100         max = numa_num_configured_nodes();
>>>>>>>> 101         for (i = 0; i < max; i++)
>>>>>>>> 102                 nodemask_set_compat((nodemask_t *)&numa_all_nodes, i);
>>>>>>>>
>>>>>>>>
>>>>>>>> whilst 'numa_all_nodes_ptr' is constructed by parsing
>>>>>>>> /proc/self/status [4]:
>>>>>>>>
>>>>>>>> 499                 if (strncmp(buffer,"Mems_allowed:",13) == 0) {
>>>>>>>> 500                         numprocnode = read_mask(mask, numa_all_nodes_ptr);
>>>>>>>>
>>>>>>>> Thus for a topology like:
>>>>>>>>
>>>>>>>> available: 4 nodes (0-1,16-17)
>>>>>>>> node 0 cpus: 0 8 16 24 32
>>>>>>>> node 0 size: 130706 MB
>>>>>>>> node 0 free: 145 MB
>>>>>>>> node 1 cpus: 40 48 56 64 72
>>>>>>>> node 1 size: 0 MB
>>>>>>>> node 1 free: 0 MB
>>>>>>>> node 16 cpus: 80 88 96 104 112
>>>>>>>> node 16 size: 130630 MB
>>>>>>>> node 16 free: 529 MB
>>>>>>>> node 17 cpus: 120 128 136 144 152
>>>>>>>> node 17 size: 0 MB
>>>>>>>> node 17 free: 0 MB
>>>>>>>> node distances:
>>>>>>>> node 0 1 16 17
>>>>>>>> 0: 10 20 40 40
>>>>>>>> 1: 20 10 40 40
>>>>>>>> 16: 40 40 10 20
>>>>>>>> 17: 40 40 20 10
>>>>>>>>
>>>>>>>> numa_all_nodes=0x3 => 0b11 (node0 and node1)
>>>>>>>> numa_all_nodes_ptr=0x10001 => 0b10000000000000001 (node0 and node16)
>>>>>>>>
>>>>>>>> (Please, see details in the following gdb log:
>>>>>>>> http://cr.openjdk.java.net/~gromero/logs/numa_api_v1_vs_api_v2.txt)
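>>>>>>>>
>>>>>>>> (A tiny probe, compiled with -lnuma, shows the same difference
>>>>>>>> without gdb; just a sketch, assuming numa.h still exposes the v1
>>>>>>>> compat global 'numa_all_nodes' and the topology fits in a
>>>>>>>> single-word mask:)
>>>>>>>>
>>>>>>>>   #include <stdio.h>
>>>>>>>>   #include <numa.h>
>>>>>>>>
>>>>>>>>   int main(void) {
>>>>>>>>     if (numa_available() < 0) return 1;
>>>>>>>>     // v1 mask, built assuming consecutive node numbers
>>>>>>>>     printf("numa_all_nodes     = 0x%lx\n", numa_all_nodes.n[0]);
>>>>>>>>     // v2 mask, parsed from Mems_allowed in /proc/self/status
>>>>>>>>     printf("numa_all_nodes_ptr = 0x%lx\n",
>>>>>>>>            numa_all_nodes_ptr->maskp[0]);
>>>>>>>>     return 0;
>>>>>>>>   }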
>>>>>>>>
>>>>>>>> In that case passing node0 and node1, although suboptimal, does
>>>>>>>> not bother mbind(), since the following is satisfied:
>>>>>>>>
>>>>>>>> "[nodemask] must contain at least one node that is on-line,
>>>>>>>> allowed by the process's current cpuset context, and contains
>>>>>>>> memory."
>>>>>>>>
>>>>>>>> So back to the POWER7 case, I suppose that for:
>>>>>>>>
>>>>>>>> available: 2 nodes (0-1)
>>>>>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>>>>>> node 0 size: 0 MB
>>>>>>>> node 0 free: 0 MB
>>>>>>>> node 1 cpus:
>>>>>>>> node 1 size: 7680 MB
>>>>>>>> node 1 free: 1896 MB
>>>>>>>> node distances:
>>>>>>>> node 0 1
>>>>>>>> 0: 10 40
>>>>>>>> 1: 40 10
>>>>>>>>
>>>>>>>> numa_all_nodes=0x1 => 0b01 (node0)
>>>>>>>> numa_all_nodes_ptr=0x2 => 0b10 (node1)
>>>>>>>>
>>>>>>>> and hence numa_interleave_memory() gets nodemask = 0x1 (node0),
>>>>>>>> which indeed contains no memory. That said, I don't know for sure
>>>>>>>> whether passing just node1 in the 'nodemask' would satisfy
>>>>>>>> mbind(), as in that case there are no CPUs available in node1.
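>>>>>>>>
>>>>>>>> That can be checked directly against the kernel with a raw
>>>>>>>> mbind() call; a sketch (assuming node0 is the memoryless node,
>>>>>>>> as above):
>>>>>>>>
>>>>>>>>   #include <stdio.h>
>>>>>>>>   #include <string.h>
>>>>>>>>   #include <errno.h>
>>>>>>>>   #include <sys/mman.h>
>>>>>>>>   #include <numaif.h>
>>>>>>>>
>>>>>>>>   int main(void) {
>>>>>>>>     size_t sz = 1UL << 20;
>>>>>>>>     void *mem = mmap(NULL, sz, PROT_READ | PROT_WRITE,
>>>>>>>>                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>>>>>>     if (mem == MAP_FAILED) return 1;
>>>>>>>>     unsigned long mask = 0x1; // only node0: no memory there
>>>>>>>>     if (mbind(mem, sz, MPOL_INTERLEAVE, &mask,
>>>>>>>>               8 * sizeof(mask), 0) != 0)
>>>>>>>>       printf("mbind: %s\n", strerror(errno)); // EINVAL expected
>>>>>>>>     return 0;
>>>>>>>>   }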
>>>>>>>>
>>>>>>>> Summing up, it looks like the root cause is not that
>>>>>>>> numa_interleave_memory() does not accept only one configured
>>>>>>>> node, but that the configured node being passed is wrong. I could
>>>>>>>> not find a similar NUMA topology in my pool to test further, but
>>>>>>>> it might be worth writing a small test using API v2 and
>>>>>>>> 'numa_all_nodes_ptr' instead of 'numa_all_nodes' to see how
>>>>>>>> numa_interleave_memory() fares on that machine (a sketch follows
>>>>>>>> below) :) If it behaves well, updating to API v2 would be a
>>>>>>>> solution.
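>>>>>>>>
>>>>>>>> Something along these lines, maybe (an untested sketch; compiling
>>>>>>>> against a recent numa.h binds numa_interleave_memory() to v2):
>>>>>>>>
>>>>>>>>   #include <stdio.h>
>>>>>>>>   #include <string.h>
>>>>>>>>   #include <sys/mman.h>
>>>>>>>>   #include <numa.h>
>>>>>>>>
>>>>>>>>   int main(void) {
>>>>>>>>     if (numa_available() < 0) return 1;
>>>>>>>>     size_t sz = 64UL << 20;
>>>>>>>>     void *mem = mmap(NULL, sz, PROT_READ | PROT_WRITE,
>>>>>>>>                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>>>>>>     if (mem == MAP_FAILED) return 1;
>>>>>>>>     // v2 call: numa_all_nodes_ptr tracks only nodes with memory,
>>>>>>>>     // so no "mbind: Invalid argument" is expected here.
>>>>>>>>     numa_interleave_memory(mem, sz, numa_all_nodes_ptr);
>>>>>>>>     memset(mem, 1, sz); // touch pages to force allocation
>>>>>>>>     puts("done");
>>>>>>>>     return 0;
>>>>>>>>   }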
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Gustavo
>>>>>>>>
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/4b93e1b1d5b7/src/os/linux/vm/os_linux.hpp#l274
>>>>>>>>
>>>>>>>> [2] from libnuma.c:608 numa_all_nodes_ptr: "it only tracks nodes
>>>>>>>> with memory from which the calling process can allocate."
>>>>>>>> [3]
>>>>>>>> https://github.com/numactl/numactl/blob/master/libnuma.c#L100-L102
>>>>>>>> [4]
>>>>>>>> https://github.com/numactl/numactl/blob/master/libnuma.c#L499-L500
>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The system NUMA configuration:
>>>>>>>>>>
>>>>>>>>>> Architecture: ppc64
>>>>>>>>>> CPU op-mode(s): 32-bit, 64-bit
>>>>>>>>>> Byte Order: Big Endian
>>>>>>>>>> CPU(s): 8
>>>>>>>>>> On-line CPU(s) list: 0-7
>>>>>>>>>> Thread(s) per core: 4
>>>>>>>>>> Core(s) per socket: 1
>>>>>>>>>> Socket(s): 2
>>>>>>>>>> NUMA node(s): 2
>>>>>>>>>> Model: 2.1 (pvr 003f 0201)
>>>>>>>>>> Model name: POWER7 (architected), altivec supported
>>>>>>>>>> L1d cache: 32K
>>>>>>>>>> L1i cache: 32K
>>>>>>>>>> NUMA node0 CPU(s): 0-7
>>>>>>>>>> NUMA node1 CPU(s):
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> -Zhengyu
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>
>