RFR(XS) 8181055: "mbind: Invalid argument" still seen after 8175813
Gustavo Romero
gromero at linux.vnet.ibm.com
Thu Jun 1 14:49:40 UTC 2017
Hi David,
On 31-05-2017 22:48, David Holmes wrote:
> Hi Gustavo,
>
> On 31/05/2017 10:47 PM, Gustavo Romero wrote:
>> Hi Zhengyu,
>>
>> On 30-05-2017 21:37, Zhengyu Gu wrote:
>>> Hi David,
>>>
>>> Thanks for the review.
>>>
>>> Gustavo, might I count you as a reviewer?
>>
>> Formally speaking (according to the community Bylaws) I'm not a reviewer, so
>> I guess not.
>
> You are not a Reviewer (capital 'R') but you can certainly review and be listed as a reviewer.
Got it! Thanks for clarifying.
Cheers,
Gustavo
> Cheers,
> David
>
>
>>
>> Kind regards,
>> Gustavo
>>
>>> Thanks,
>>>
>>> -Zhengyu
>>>
>>>
>>>
>>> On 05/30/2017 05:30 PM, David Holmes wrote:
>>>> Looks fine to me.
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>> On 30/05/2017 9:59 PM, Zhengyu Gu wrote:
>>>>> Hi David and Gustavo,
>>>>>
>>>>> Thanks for the review.
>>>>>
>>>>> Webrev is updated according to your comments:
>>>>>
>>>>> http://cr.openjdk.java.net/~zgu/8181055/webrev.02/
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -Zhengyu
>>>>>
>>>>>
>>>>> On 05/29/2017 07:06 PM, Gustavo Romero wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> On 29-05-2017 01:34, David Holmes wrote:
>>>>>>> Hi Zhengyu,
>>>>>>>
>>>>>>> On 29/05/2017 12:08 PM, Zhengyu Gu wrote:
>>>>>>>> Hi Gustavo,
>>>>>>>>
>>>>>>>> Thanks for the detailed analysis and suggestion. I did not realize
>>>>>>>> the difference between bitmask and nodemask.
>>>>>>>>
>>>>>>>> As you suggested, numa_interleave_memory_v2 works under this
>>>>>>>> configuration.
>>>>>>>>
>>>>>>>> Please see the updated webrev:
>>>>>>>> http://cr.openjdk.java.net/~zgu/8181055/webrev.01/
>>>>>>>
>>>>>>> The addition of support for the "v2" API seems okay. Though I think
>>>>>>> this comment needs some clarification for the existing code:
>>>>>>>
>>>>>>> 2837 // If we are running with libnuma version > 2, then we should
>>>>>>> 2838 // be trying to use symbols with versions 1.1
>>>>>>> 2839 // If we are running with earlier version, which did not have symbol versions,
>>>>>>> 2840 // we should use the base version.
>>>>>>> 2841 void* os::Linux::libnuma_dlsym(void* handle, const char *name) {
>>>>>>>
>>>>>>> given that we now explicitly load the v1.2 symbol if present.
>>>>>>>
>>>>>>> Gustavo: can you vouch for the suitability of using the v2 API in
>>>>>>> all cases, if it exists?
>>>>>>
>>>>>> My understanding is that in the transition to API v2 only the usage
>>>>>> of numa_node_to_cpus() by the JVM will have to be adapted, in
>>>>>> os::Linux::rebuild_cpu_to_node_map(). The remaining functions
>>>>>> (excluding numa_interleave_memory(), which Zhengyu has already
>>>>>> addressed) preserve the same functionality and signatures [1].
>>>>>>
>>>>>> Currently the JVM NUMA API requires the following libnuma functions:
>>>>>>
>>>>>> 1. numa_node_to_cpus          v1 != v2 (using v1; JVM has to adapt)
>>>>>> 2. numa_max_node              v1 == v2 (using v1; transition is straightforward)
>>>>>> 3. numa_num_configured_nodes  v2 only  (added by gromero: 8175813)
>>>>>> 4. numa_available             v1 == v2 (using v1; transition is straightforward)
>>>>>> 5. numa_tonode_memory         v1 == v2 (using v1; transition is straightforward)
>>>>>> 6. numa_interleave_memory     v1 != v2 (updated by zhengyu: 8181055; defaults to v2, falls back to v1)
>>>>>> 7. numa_set_bind_policy       v1 == v2 (using v1; transition is straightforward)
>>>>>> 8. numa_bitmask_isbitset      v2 only  (added by gromero: 8175813)
>>>>>> 9. numa_distance              v1 == v2 (added by gromero: 8175813; using v1, transition is straightforward)
>>>>>>
>>>>>> v1 != v2: function signature in version 1 differs from version 2
>>>>>> v1 == v2: function signature in version 1 equals version 2
>>>>>> v2 only : function is only present in API v2
>>>>>>
>>>>>> Thus, to the best of my knowledge, except for case 1 (which the JVM
>>>>>> needs to adapt to), all other cases are suitable for the v2 API, and
>>>>>> we could either use a fallback mechanism as proposed by Zhengyu or
>>>>>> update directly to API v2 (risky?), given that I can't see how the v2
>>>>>> API would not be available on current (non-EOL) Linux distro releases.
>>>>>>
>>>>>> Regarding the comment, I agree that it needs an update, since we are
>>>>>> no longer tied to version 1.1 (we are in effect already using v2 for
>>>>>> some functions). We could delete the comment atop libnuma_dlsym() and
>>>>>> add something like:
>>>>>>
>>>>>> "Handle a request to load libnuma symbol version 1.1 (API v1). If it
>>>>>> fails, load the symbol from the base version instead."
>>>>>>
>>>>>> and to libnuma_v2_dlsym() add:
>>>>>>
>>>>>> "Handle a request to load libnuma symbol version 1.2 (API v2) only.
>>>>>> If it fails, no symbol from any other version - even if present - is
>>>>>> loaded."
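>>>>>>
>>>>>> For illustration, a minimal sketch of the two lookup flavors those
>>>>>> comments describe (my own example, not the webrev verbatim; dlvsym()
>>>>>> is the GNU extension for versioned symbol lookup, and "libnuma_1.1" /
>>>>>> "libnuma_1.2" are the symbol versions libnuma 2.x exports):
>>>>>>
>>>>>> #define _GNU_SOURCE
>>>>>> #include <dlfcn.h>
>>>>>>
>>>>>> /* API v1 lookup: try the versioned 1.1 symbol first, then fall back
>>>>>>    to the unversioned base symbol from pre-versioning libnuma. */
>>>>>> static void* libnuma_dlsym(void* handle, const char* name) {
>>>>>>   void* f = dlvsym(handle, name, "libnuma_1.1");
>>>>>>   return f != NULL ? f : dlsym(handle, name);
>>>>>> }
>>>>>>
>>>>>> /* API v2 lookup: version 1.2 only, no fallback to other versions. */
>>>>>> static void* libnuma_v2_dlsym(void* handle, const char* name) {
>>>>>>   return dlvsym(handle, name, "libnuma_1.2");
>>>>>> }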
>>>>>>
>>>>>> I've opened a bug to track the transition to API v2 (I also
>>>>>> discussed that with Volker):
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8181196
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Gustavo
>>>>>>
>>>>>> [1] API v1 vs API v2:
>>>>>>
>>>>>> API v1
>>>>>> ======
>>>>>>
>>>>>> int numa_node_to_cpus(int node, unsigned long *buffer, int bufferlen);
>>>>>> int numa_max_node(void);
>>>>>> - int numa_num_configured_nodes(void);
>>>>>> int numa_available(void);
>>>>>> void numa_tonode_memory(void *start, size_t size, int node);
>>>>>> void numa_interleave_memory(void *start, size_t size, nodemask_t *nodemask);
>>>>>> void numa_set_bind_policy(int strict);
>>>>>> - int numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int n);
>>>>>> int numa_distance(int node1, int node2);
>>>>>>
>>>>>>
>>>>>> API v2
>>>>>> ======
>>>>>>
>>>>>> int numa_node_to_cpus(int node, struct bitmask *mask);
>>>>>> int numa_max_node(void);
>>>>>> int numa_num_configured_nodes(void);
>>>>>> int numa_available(void);
>>>>>> void numa_tonode_memory(void *start, size_t size, int node);
>>>>>> void numa_interleave_memory(void *start, size_t size, struct bitmask *nodemask);
>>>>>> void numa_set_bind_policy(int strict);
>>>>>> int numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int n);
>>>>>> int numa_distance(int node1, int node2);
>>>>>>
>>>>>>
>>>>>>> I'm running this through JPRT now.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> -Zhengyu
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/26/2017 08:34 PM, Gustavo Romero wrote:
>>>>>>>>> Hi Zhengyu,
>>>>>>>>>
>>>>>>>>> Thanks a lot for taking care of this corner case on PPC64.
>>>>>>>>>
>>>>>>>>> On 26-05-2017 10:41, Zhengyu Gu wrote:
>>>>>>>>>> This is a quick way to kill the symptom (or is it just low risk?).
>>>>>>>>>> I am not sure whether disabling NUMA would be a better solution in
>>>>>>>>>> this circumstance. Does 1 NUMA node = UMA?
>>>>>>>>>
>>>>>>>>> On PPC64, 1 (configured) NUMA node does not necessarily imply UMA.
>>>>>>>>> In the POWER7 machine where you found the corner case (I copy below
>>>>>>>>> the data you provided in the JBS - thanks for the additional
>>>>>>>>> information):
>>>>>>>>>
>>>>>>>>> $ numactl -H
>>>>>>>>> available: 2 nodes (0-1)
>>>>>>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>>>>>>> node 0 size: 0 MB
>>>>>>>>> node 0 free: 0 MB
>>>>>>>>> node 1 cpus:
>>>>>>>>> node 1 size: 7680 MB
>>>>>>>>> node 1 free: 1896 MB
>>>>>>>>> node distances:
>>>>>>>>> node 0 1
>>>>>>>>> 0: 10 40
>>>>>>>>> 1: 40 10
>>>>>>>>>
>>>>>>>>> CPUs in node0 have no alternative besides allocating memory from
>>>>>>>>> node1. In that case CPUs in node0 are always accessing remote
>>>>>>>>> memory from node1 at a constant distance (40), so we could say
>>>>>>>>> that 1 NUMA (configured) node == UMA. Nonetheless, if you add CPUs
>>>>>>>>> to node1 (by filling up the other socket present in the board) you
>>>>>>>>> will end up with CPUs at different distances from the node that
>>>>>>>>> has configured memory (in that case, node1), yielding a
>>>>>>>>> configuration where 1 NUMA (configured) node != UMA (i.e.
>>>>>>>>> distances are not always equal to a single value).
>>>>>>>>>
>>>>>>>>> On the other hand, the POWER7 machine configuration in question is
>>>>>>>>> bad (and rare). It's indeed impacting the whole system performance,
>>>>>>>>> and it would be reasonable to open the machine and move the memory
>>>>>>>>> module from the bank associated with node1 to the bank associated
>>>>>>>>> with node0, because all CPUs are accessing remote memory without
>>>>>>>>> any apparent necessity. Once you change that, all CPUs will have
>>>>>>>>> local memory (distance = 10).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> -Zhengyu
>>>>>>>>>>
>>>>>>>>>> On 05/26/2017 09:14 AM, Zhengyu Gu wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> There is a corner case that still fails after JDK-8175813.
>>>>>>>>>>>
>>>>>>>>>>> The system reports that it has multiple NUMA nodes, but only one
>>>>>>>>>>> is configured. Under this scenario, the numa_interleave_memory()
>>>>>>>>>>> call results in an "mbind: Invalid argument" message.
>>>>>>>>>>>
>>>>>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8181055
>>>>>>>>>>> Webrev: http://cr.openjdk.java.net/~zgu/8181055/webrev.00/
>>>>>>>>>
>>>>>>>>> It looks like even for that rare POWER7 NUMA topology
>>>>>>>>> numa_interleave_memory() should succeed without "mbind: Invalid
>>>>>>>>> argument", since the 'mask' argument should already be a mask
>>>>>>>>> containing only nodes from which memory can be allocated, i.e.
>>>>>>>>> only a mask of configured nodes (even if the mask contains just
>>>>>>>>> one configured node, as in
>>>>>>>>> http://cr.openjdk.java.net/~gromero/logs/numa_only_one_node.txt).
>>>>>>>>>
>>>>>>>>> Inspecting a little bit more, it looks like the problem boils down
>>>>>>>>> to the fact that the JVM is passing 'numa_all_nodes' [1] to
>>>>>>>>> numa_interleave_memory() in Linux::numa_interleave_memory().
>>>>>>>>>
>>>>>>>>> One would expect that 'numa_all_nodes' (which is API v1) would
>>>>>>>>> track the same information as 'numa_all_nodes_ptr' (API v2) [2];
>>>>>>>>> however, there is a subtle but important difference:
>>>>>>>>>
>>>>>>>>> 'numa_all_nodes' is constructed assuming a consecutive node
>>>>>>>>> distribution [3]:
>>>>>>>>>
>>>>>>>>> 100         max = numa_num_configured_nodes();
>>>>>>>>> 101         for (i = 0; i < max; i++)
>>>>>>>>> 102                 nodemask_set_compat((nodemask_t *)&numa_all_nodes, i);
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> whilst 'numa_all_nodes_ptr' is constructed by parsing
>>>>>>>>> /proc/self/status [4]:
>>>>>>>>>
>>>>>>>>> 499         if (strncmp(buffer,"Mems_allowed:",13) == 0) {
>>>>>>>>> 500                 numprocnode = read_mask(mask, numa_all_nodes_ptr);
>>>>>>>>>
>>>>>>>>> Thus for a topology like:
>>>>>>>>>
>>>>>>>>> available: 4 nodes (0-1,16-17)
>>>>>>>>> node 0 cpus: 0 8 16 24 32
>>>>>>>>> node 0 size: 130706 MB
>>>>>>>>> node 0 free: 145 MB
>>>>>>>>> node 1 cpus: 40 48 56 64 72
>>>>>>>>> node 1 size: 0 MB
>>>>>>>>> node 1 free: 0 MB
>>>>>>>>> node 16 cpus: 80 88 96 104 112
>>>>>>>>> node 16 size: 130630 MB
>>>>>>>>> node 16 free: 529 MB
>>>>>>>>> node 17 cpus: 120 128 136 144 152
>>>>>>>>> node 17 size: 0 MB
>>>>>>>>> node 17 free: 0 MB
>>>>>>>>> node distances:
>>>>>>>>> node 0 1 16 17
>>>>>>>>> 0: 10 20 40 40
>>>>>>>>> 1: 20 10 40 40
>>>>>>>>> 16: 40 40 10 20
>>>>>>>>> 17: 40 40 20 10
>>>>>>>>>
>>>>>>>>> numa_all_nodes=0x3 => 0b11 (node0 and node1)
>>>>>>>>> numa_all_nodes_ptr=0x10001 => 0b10000000000000001 (node0 and node16)
>>>>>>>>>
>>>>>>>>> (Please see details in the following gdb log:
>>>>>>>>> http://cr.openjdk.java.net/~gromero/logs/numa_api_v1_vs_api_v2.txt)
>>>>>>>>>
>>>>>>>>> In that case passing node0 and node1, although suboptimal, does
>>>>>>>>> not bother mbind(), since the following is satisfied:
>>>>>>>>>
>>>>>>>>> "[nodemask] must contain at least one node that is on-line,
>>>>>>>>> allowed by the process's current cpuset context, and contains
>>>>>>>>> memory."
>>>>>>>>>
>>>>>>>>> So back to the POWER7 case, I suppose that for:
>>>>>>>>>
>>>>>>>>> available: 2 nodes (0-1)
>>>>>>>>> node 0 cpus: 0 1 2 3 4 5 6 7
>>>>>>>>> node 0 size: 0 MB
>>>>>>>>> node 0 free: 0 MB
>>>>>>>>> node 1 cpus:
>>>>>>>>> node 1 size: 7680 MB
>>>>>>>>> node 1 free: 1896 MB
>>>>>>>>> node distances:
>>>>>>>>> node 0 1
>>>>>>>>> 0: 10 40
>>>>>>>>> 1: 40 10
>>>>>>>>>
>>>>>>>>> numa_all_nodes=0x1 => 0b01 (node0)
>>>>>>>>> numa_all_nodes_ptr=0x2 => 0b10 (node1)
>>>>>>>>>
>>>>>>>>> and hence numa_interleave_memory() gets nodemask = 0x1 (node0),
>>>>>>>>> which indeed contains no memory. That said, I don't know for sure
>>>>>>>>> whether passing just node1 in the 'nodemask' will satisfy mbind(),
>>>>>>>>> as in that case there are no CPUs available in node1.
>>>>>>>>>
>>>>>>>>> Summing up, it looks like the root cause is not that
>>>>>>>>> numa_interleave_memory() does not accept only one configured node,
>>>>>>>>> but that the wrong configured node is being passed. I could not
>>>>>>>>> find a similar NUMA topology in my pool to test further, but it
>>>>>>>>> might be worth writing a small test using API v2 and
>>>>>>>>> 'numa_all_nodes_ptr' instead of 'numa_all_nodes' to see how
>>>>>>>>> numa_interleave_memory() behaves on that machine :) If it behaves
>>>>>>>>> well, updating to API v2 would be a solution. A minimal sketch of
>>>>>>>>> such a test follows below.
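>>>>>>>>>
>>>>>>>>> Something like this (my own sketch, assuming libnuma 2.x headers
>>>>>>>>> so that numa_interleave_memory() resolves to the v2 symbol taking
>>>>>>>>> a struct bitmask*; compile with -lnuma; on failure libnuma itself
>>>>>>>>> prints the "mbind: ..." message via numa_error()):
>>>>>>>>>
>>>>>>>>> #define _GNU_SOURCE
>>>>>>>>> #include <stdio.h>
>>>>>>>>> #include <sys/mman.h>
>>>>>>>>> #include <numa.h>
>>>>>>>>>
>>>>>>>>> int main(void) {
>>>>>>>>>   if (numa_available() == -1) {
>>>>>>>>>     fprintf(stderr, "libnuma not available\n");
>>>>>>>>>     return 1;
>>>>>>>>>   }
>>>>>>>>>   size_t size = 64UL * 1024 * 1024;
>>>>>>>>>   void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
>>>>>>>>>                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>>>>>>>   if (p == MAP_FAILED) {
>>>>>>>>>     perror("mmap");
>>>>>>>>>     return 1;
>>>>>>>>>   }
>>>>>>>>>   /* numa_all_nodes_ptr tracks only nodes with memory from which
>>>>>>>>>      the calling process can allocate, unlike v1 numa_all_nodes. */
>>>>>>>>>   numa_interleave_memory(p, size, numa_all_nodes_ptr);
>>>>>>>>>   ((char *)p)[0] = 1;  /* touch a page to exercise the policy */
>>>>>>>>>   printf("done; no \"mbind: Invalid argument\" expected\n");
>>>>>>>>>   munmap(p, size);
>>>>>>>>>   return 0;
>>>>>>>>> }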
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Gustavo
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/4b93e1b1d5b7/src/os/linux/vm/os_linux.hpp#l274
>>>>>>>>>
>>>>>>>>> [2] from libnuma.c:608 numa_all_nodes_ptr: "it only tracks nodes
>>>>>>>>> with memory from which the calling process can allocate."
>>>>>>>>> [3]
>>>>>>>>> https://github.com/numactl/numactl/blob/master/libnuma.c#L100-L102
>>>>>>>>> [4]
>>>>>>>>> https://github.com/numactl/numactl/blob/master/libnuma.c#L499-L500
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The system NUMA configuration:
>>>>>>>>>>>
>>>>>>>>>>> Architecture: ppc64
>>>>>>>>>>> CPU op-mode(s): 32-bit, 64-bit
>>>>>>>>>>> Byte Order: Big Endian
>>>>>>>>>>> CPU(s): 8
>>>>>>>>>>> On-line CPU(s) list: 0-7
>>>>>>>>>>> Thread(s) per core: 4
>>>>>>>>>>> Core(s) per socket: 1
>>>>>>>>>>> Socket(s): 2
>>>>>>>>>>> NUMA node(s): 2
>>>>>>>>>>> Model: 2.1 (pvr 003f 0201)
>>>>>>>>>>> Model name: POWER7 (architected), altivec supported
>>>>>>>>>>> L1d cache: 32K
>>>>>>>>>>> L1i cache: 32K
>>>>>>>>>>> NUMA node0 CPU(s): 0-7
>>>>>>>>>>> NUMA node1 CPU(s):
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> -Zhengyu
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>
>>
>