RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used
Volker Simonis
volker.simonis at gmail.com
Wed May 3 14:34:16 UTC 2017
Hi,
I've reviewed Gustavo's change and I'm fine with the latest version at:
http://cr.openjdk.java.net/~gromero/8175813/v3/
Can somebody please sponsor the change?
Thank you and best regards,
Volker
On Wed, May 3, 2017 at 3:27 PM, Gustavo Romero
<gromero at linux.vnet.ibm.com> wrote:
> Hi community,
>
> I understand that, at this point, there is nothing additional that can be done
> regarding this issue on the PPC64 side.
>
> It's a change in the shared code - though one that in effect does not change anything
> in the numa detection mechanism for other platforms - and hence a joint community
> effort is needed to review the change, plus a sponsor to run it through JPRT.
>
> I know it's a stabilization period for OpenJDK 9, but since that issue is of
> great concern on PPC64 (especially on POWER8 machines) I would be very glad if
> the community could point out directions on how this change could move forward.
>
> Thank you!
>
> Best regards,
> Gustavo
>
> On 25-04-2017 23:49, Gustavo Romero wrote:
>> Dear Volker,
>>
>> On 24-04-2017 14:08, Volker Simonis wrote:
>>> Hi Gustavo,
>>>
>>> thanks for addressing this problem and sorry for my late reply. I
>>> think this is a good change which definitely improves the situation
>>> for uncommon NUMA configurations without changing the handling for
>>> common topologies.
>>
>> Thanks a lot for reviewing the change!
>>
>>
>>> It would be great if somebody could run this through JPRT, but as
>>> Gustavo mentioned, I don't expect any regressions.
>>>
>>> @Igor: I think you've been the original author of the NUMA-aware
>>> allocator port to Linux (i.e. "6684395: Port NUMA-aware allocator to
>>> linux"). If you could find some spare minutes to take a look at this
>>> change, your comment would be very much appreciated :)
>>>
>>> Following some minor comments from me:
>>>
>>> - in os::numa_get_groups_num() you now use numa_num_configured_nodes()
>>> to get the actual number of configured nodes. This is good and
>>> certainly an improvement over the previous implementation. However,
>>> the man page for numa_num_configured_nodes() mentions that the
>>> returned count may contain currently disabled nodes. Do we currently
>>> handle disabled nodes? What would be the consequence if we used
>>> such a disabled node (e.g. mbind() warnings)?
>>
>> In [1] 'numa_memnode_ptr' is set to keep a list of just the nodes with memory
>> found in /sys/devices/system/node/*. Hence numa_num_configured_nodes() just
>> returns the number of nodes in 'numa_memnode_ptr' [2], i.e. just the number of
>> nodes with memory in the system. To the best of my knowledge there is
>> no system configuration on Linux/PPC64 that could match such a notion of
>> "disabled nodes" as it appears in libnuma's manual. If a node is enabled, it's in
>> that dir and just the ones with memory will be taken into account. If it's
>> disabled (somehow), it's not in the dir, so it won't be taken into account (i.e. no
>> mbind() tried against it).
>>
>> On Power it's possible to have a numa node without memory (a memory-less node, a
>> case covered in this change) and a numa node without cpus at all but with memory
>> (a configured node anyway, so a case already covered), but disabling a specific
>> numa node so it does not appear in /sys/devices/system/node/* is only possible
>> from the internals of the control module, or by some other rare condition not
>> visible / adjustable from the OS. Also I'm not aware of a case where a node is in
>> this dir but is at the same time flagged as something like "disabled". There are
>> cpu/memory hotplugs, but that does not change the numa node status AFAIK.
>>
>> [1] https://github.com/numactl/numactl/blob/master/libnuma.c#L334-L347
>> [2] https://github.com/numactl/numactl/blob/master/libnuma.c#L614-L618
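>>
>> Just to make that concrete, a small standalone sketch (not part of the
>> webrev; it assumes libnuma >= 2.0 headers and -lnuma) shows the difference
>> between the configured-node count and the highest node id:
>>
>>   #include <numa.h>
>>   #include <cstdio>
>>
>>   int main() {
>>     if (numa_available() == -1) {
>>       printf("libnuma not available\n");
>>       return 1;
>>     }
>>     // Nodes with memory, i.e. the contents of numa_memnode_ptr.
>>     int configured = numa_num_configured_nodes();
>>     // Highest node id + 1; can be larger than 'configured' when node ids
>>     // are non-consecutive or some nodes are memory-less.
>>     int highest_plus_one = numa_max_node() + 1;
>>     printf("configured nodes: %d, highest node id + 1: %d\n",
>>            configured, highest_plus_one);
>>     return 0;
>>   }
>>
>> On the 4-node POWER8 topology quoted below this should print 2 and 18,
>> which is exactly the mismatch the change has to cope with.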
>>
>>
>>> - the same question applies to the usage of
>>> Linux::isnode_in_configured_nodes() within os::numa_get_leaf_groups().
>>> Does isnode_in_configured_nodes() (i.e. the node set defined by
>>> 'numa_all_nodes_ptr') take into account the disabled nodes or not? Can
>>> this be a potential problem (i.e. if we use a disabled node)?
>>
>> Regarding the meaning of "disabled nodes", it's the same case as above, so to the
>> best of my knowledge it's not a potential problem.
>>
>> Anyway 'numa_all_nodes_ptr' just includes the configured nodes (with memory),
>> i.e. "all nodes on which the calling task may allocate memory". It's exactly
>> the same pointer returned by numa_get_membind() v2 [3] which:
>>
>> "returns the mask of nodes from which memory can currently be allocated"
>>
>> and that is used, for example, in "numactl --show" to show nodes from where
>> memory can be allocated [4, 5].
>>
>> [3] https://github.com/numactl/numactl/blob/master/libnuma.c#L1147
>> [4] https://github.com/numactl/numactl/blob/master/numactl.c#L144
>> [5] https://github.com/numactl/numactl/blob/master/numactl.c#L177
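>>
>> A quick way to check that equivalence in practice (again just an
>> illustration, assuming libnuma v2 and the default memory policy):
>>
>>   #include <numa.h>
>>   #include <cstdio>
>>
>>   int main() {
>>     if (numa_available() == -1) return 1;
>>     // Mask of nodes memory can currently be allocated from...
>>     struct bitmask* membind = numa_get_membind();
>>     // ...which should match the configured-node set discussed above.
>>     printf("masks %s\n",
>>            numa_bitmask_equal(membind, numa_all_nodes_ptr) ? "match" : "differ");
>>     numa_bitmask_free(membind);
>>     return 0;
>>   }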
>>
>>
>>> - I'd like to suggest renaming the 'index' part of the following
>>> variables and functions to 'nindex' ('node_index' is probably too long)
>>> in the following code, to emphasize that we have node indexes pointing
>>> to actual, not always consecutive node numbers:
>>>
>>> 2879 // Create an index -> node mapping, since nodes are not always consecutive
>>> 2880 _index_to_node = new (ResourceObj::C_HEAP, mtInternal) GrowableArray<int>(0, true);
>>> 2881 rebuild_index_to_node_map();
>>
>> Simple change but much better to read indeed. Done.
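>>
>> For clarity, the rebuild of that mapping now reads roughly like the outline
>> below (just a sketch; the helper names are the ones discussed in this
>> thread):
>>
>>   // Map a compact nindex (0..n-1) to the actual node ids, which may be
>>   // non-consecutive (e.g. 0, 1, 16, 17 on POWER8).
>>   void os::Linux::rebuild_nindex_to_node_map() {
>>     int highest_node_number = Linux::numa_max_node();
>>     nindex_to_node()->clear();
>>     for (int node = 0; node <= highest_node_number; node++) {
>>       if (Linux::isnode_in_configured_nodes(node)) {
>>         nindex_to_node()->append(node);
>>       }
>>     }
>>   }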
>>
>>
>>> - can you please wrap the following one-line else statement into curly
>>> braces (it's more readable and we usually do it that way in HotSpot
>>> although there are no formal style guidelines :)
>>>
>>> 2953 } else
>>> 2954 // Current node is already a configured node.
>>> 2955 closest_node = index_to_node()->at(i);
>>
>> Done.
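>>
>> i.e. with the renaming applied the hunk now reads along the lines of:
>>
>>   } else {
>>     // Current node is already a configured node.
>>     closest_node = nindex_to_node()->at(i);
>>   }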
>>
>>
>>> - in os::Linux::rebuild_cpu_to_node_map(), if you set
>>> 'closest_distance' to INT_MAX at the beginning of the loop, you can
>>> later avoid the check for '|| !closest_distance'. Also, according to
>>> the man page, numa_distance() returns 0 if it can not determine the
>>> distance. So with the above change, the condition on line 2974 should
>>> read:
>>>
>>> 2947 if (distance && distance < closest_distance) {
>>>
>>
>> Sure, it's much better to set the initial distance as far as possible and then
>> adjust it to a closer one step by step, improving the if condition. Done.
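>>
>> For reference, the adjusted search now looks roughly like this (a sketch,
>> assuming the Linux:: wrappers used elsewhere in the patch):
>>
>>   int closest_distance = INT_MAX;
>>   int closest_node = 0;
>>   for (int i = 0; i < nindex_to_node()->length(); i++) {
>>     // numa_distance() returns 0 when it cannot determine the distance.
>>     int distance = Linux::numa_distance(node, nindex_to_node()->at(i));
>>     if (distance && distance < closest_distance) {
>>       closest_distance = distance;
>>       closest_node = nindex_to_node()->at(i);
>>     }
>>   }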
>>
>>
>>> Finally, and not directly related to your change, I'd suggest the
>>> following clean-ups:
>>>
>>> - remove the usage of 'NCPUS = 32768' in
>>> os::Linux::rebuild_cpu_to_node_map(). The comment on that line is
>>> unclear to me and probably related to an older version/problem of
>>> libnuma? I think we should simply use
>>> numa_allocate_cpumask()/numa_free_cpumask() instead.
>>>
>>> - we still use the NUMA version 1 function prototypes (e.g.
>>> "numa_node_to_cpus(int node, unsigned long *buffer, int buffer_len)"
>>> instead of "numa_node_to_cpus(int node, struct bitmask *mask)", but
>>> also "numa_interleave_memory()" and maybe others). I think we should
>>> switch all prototypes to the new NUMA version 2 interface which you've
>>> already used for the new functions which you've added.
>>
>> I agree. Could I open a new bug to address these clean-ups?
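>>
>> For the record, the v2-style clean-up would go roughly in this direction
>> (just a sketch, assuming the symbols keep being resolved dynamically via
>> libnuma_dlsym):
>>
>>   // Replace the fixed NCPUS buffer with a libnuma-sized cpumask and use
>>   // the v2 prototype of numa_node_to_cpus().
>>   struct bitmask* cpus = numa_allocate_cpumask();
>>   if (numa_node_to_cpus(node, cpus) == 0) {
>>     for (unsigned int i = 0; i < cpus->size; i++) {
>>       if (numa_bitmask_isbitset(cpus, i)) {
>>         // cpu i belongs to 'node'; update the cpu -> node map here.
>>       }
>>     }
>>   }
>>   numa_free_cpumask(cpus);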
>>
>>
>>> That said, I think these changes all require libnuma 2.0 (see
>>> os::Linux::libnuma_dlsym). So before starting this, you should make
>>> sure that libnuma 2.0 is available on all platforms to which you'd
>>> like to down-port this change. For jdk10 we could definitely do it,
>>> for jdk9 probably also, for jdk8 I'm not so sure.
>>
>> libnuma v1's last release dates back to 2008, but do you have any idea how I could
>> check that for sure, since it's in shared code?
>>
>> new webrev: http://cr.openjdk.java.net/~gromero/8175813/v3/
>>
>> Thank you!
>>
>> Best regards,
>> Gustavo
>>
>>
>>> Regards,
>>> Volker
>>>
>>> On Thu, Apr 13, 2017 at 12:51 AM, Gustavo Romero
>>> <gromero at linux.vnet.ibm.com> wrote:
>>>> Hi,
>>>>
>>>> Any update on it?
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>> Gustavo
>>>>
>>>> On 09-03-2017 16:33, Gustavo Romero wrote:
>>>>> Hi,
>>>>>
>>>>> Could the following webrev be reviewed please?
>>>>>
>>>>> It improves the numa node detection when non-consecutive or memory-less nodes
>>>>> exist in the system.
>>>>>
>>>>> webrev: http://cr.openjdk.java.net/~gromero/8175813/v2/
>>>>> bug : https://bugs.openjdk.java.net/browse/JDK-8175813
>>>>>
>>>>> Currently, although no problem exists when the JVM detects numa nodes that are
>>>>> consecutive and have memory, for example in a numa topology like:
>>>>>
>>>>> available: 2 nodes (0-1)
>>>>> node 0 cpus: 0 8 16 24 32
>>>>> node 0 size: 65258 MB
>>>>> node 0 free: 34 MB
>>>>> node 1 cpus: 40 48 56 64 72
>>>>> node 1 size: 65320 MB
>>>>> node 1 free: 150 MB
>>>>> node distances:
>>>>> node   0   1
>>>>>   0:  10  20
>>>>>   1:  20  10,
>>>>>
>>>>> it fails to detect the numa nodes to be used by the Parallel GC in a numa
>>>>> topology like:
>>>>>
>>>>> available: 4 nodes (0-1,16-17)
>>>>> node 0 cpus: 0 8 16 24 32
>>>>> node 0 size: 130706 MB
>>>>> node 0 free: 7729 MB
>>>>> node 1 cpus: 40 48 56 64 72
>>>>> node 1 size: 0 MB
>>>>> node 1 free: 0 MB
>>>>> node 16 cpus: 80 88 96 104 112
>>>>> node 16 size: 130630 MB
>>>>> node 16 free: 5282 MB
>>>>> node 17 cpus: 120 128 136 144 152
>>>>> node 17 size: 0 MB
>>>>> node 17 free: 0 MB
>>>>> node distances:
>>>>> node   0   1  16  17
>>>>>   0:  10  20  40  40
>>>>>   1:  20  10  40  40
>>>>>  16:  40  40  10  20
>>>>>  17:  40  40  20  10,
>>>>>
>>>>> where node 16 is not consecutive in relation to node 1, and nodes 1 and 17 have
>>>>> no memory.
>>>>>
>>>>> If a topology like that exists, os::numa_make_local() will receive as a hint a
>>>>> local group id that is not available in the system for binding (it will receive
>>>>> all node ids from 0 to 17), causing a proliferation of "mbind: Invalid argument"
>>>>> messages:
>>>>>
>>>>> http://cr.openjdk.java.net/~gromero/logs/jdk10_pristine.log
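>>>>>
>>>>> The warning itself is easy to reproduce outside the JVM; for instance, a
>>>>> tiny program (just an illustration, not part of the change) that asks
>>>>> libnuma to bind memory to a memory-less node such as node 1 above hits
>>>>> the same message:
>>>>>
>>>>>   #include <numa.h>
>>>>>
>>>>>   int main() {
>>>>>     size_t sz = 64 * 1024;
>>>>>     void* mem = numa_alloc(sz);
>>>>>     // Node 1 exists but has no memory, so mbind() fails with EINVAL
>>>>>     // and libnuma prints "mbind: Invalid argument".
>>>>>     numa_tonode_memory(mem, sz, 1);
>>>>>     numa_free(mem, sz);
>>>>>     return 0;
>>>>>   }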
>>>>>
>>>>> That change improves the detection by making the JVM numa API aware of the
>>>>> existence of numa nodes that are non-consecutive from 0 to the highest node
>>>>> number, and also of nodes that might be memory-less, i.e. that might not be,
>>>>> in libnuma terms, configured nodes. Hence just the configured nodes will be
>>>>> made available:
>>>>>
>>>>> http://cr.openjdk.java.net/~gromero/logs/jdk10_numa_patched.log
>>>>>
>>>>> The change has no effect on numa topologies where the problem does not occur,
>>>>> i.e. no change in the number of nodes and no change in the cpu to node map. On
>>>>> numa topologies where memory-less nodes exist (like in the last example above),
>>>>> cpus from a memory-less node won't be able to bind locally, so they are mapped
>>>>> to the closest node; otherwise they would not be associated to any node and
>>>>> MutableNUMASpace::cas_allocate() would pick a node randomly, compromising
>>>>> performance.
>>>>>
>>>>> I found no regressions on x64 for the following numa topology:
>>>>>
>>>>> available: 2 nodes (0-1)
>>>>> node 0 cpus: 0 1 2 3 8 9 10 11
>>>>> node 0 size: 24102 MB
>>>>> node 0 free: 19806 MB
>>>>> node 1 cpus: 4 5 6 7 12 13 14 15
>>>>> node 1 size: 24190 MB
>>>>> node 1 free: 21951 MB
>>>>> node distances:
>>>>> node   0   1
>>>>>   0:  10  21
>>>>>   1:  21  10
>>>>>
>>>>> I understand that fixing the current numa detection is a prerequisite to enabling
>>>>> UseNUMA by default [1] and to extending the numa-aware allocation to the G1 GC [2].
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>> Best regards,
>>>>> Gustavo
>>>>>
>>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8046153 (JEP 163: Enable NUMA Mode by Default When Appropriate)
>>>>> [2] https://bugs.openjdk.java.net/browse/JDK-8046147 (JEP 157: G1 GC: NUMA-Aware Allocation)
>>>>>
>>>>
>>>
>>
>