[10] RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

David Holmes david.holmes at oracle.com
Thu May 4 01:50:27 UTC 2017


Hi Volker, Gustavo,

I will try to take a look at this again, but it may be a day or two.

David

On 4/05/2017 12:34 AM, Volker Simonis wrote:
> Hi,
>
> I've reviewed Gustavo's change and I'm fine with the latest version at:
>
> http://cr.openjdk.java.net/~gromero/8175813/v3/
>
> Can somebody please sponsor the change?
>
> Thank you and best regards,
> Volker
>
>
> On Wed, May 3, 2017 at 3:27 PM, Gustavo Romero
> <gromero at linux.vnet.ibm.com> wrote:
>> Hi community,
>>
>> I understand that there is nothing more that can be done on the PPC64 side
>> regarding this issue at this point.
>>
>> It's a change in shared code - although in effect it does not change the numa
>> detection mechanism for other platforms - and hence it requires a joint
>> community effort to review the change and a sponsor to run it through JPRT.
>>
>> I know it's a stabilizing moment for OpenJDK 9, but since this issue is of
>> great concern on PPC64 (especially on POWER8 machines) I would be very glad if
>> the community could point out how this change can move forward.
>>
>> Thank you!
>>
>> Best regards,
>> Gustavo
>>
>> On 25-04-2017 23:49, Gustavo Romero wrote:
>>> Dear Volker,
>>>
>>> On 24-04-2017 14:08, Volker Simonis wrote:
>>>> Hi Gustavo,
>>>>
>>>> thanks for addressing this problem and sorry for my late reply. I
>>>> think this is a good change which definitely improves the situation
>>>> for uncommon NUMA configurations without changing the handling for
>>>> common topologies.
>>>
>>> Thanks a lot for reviewing the change!
>>>
>>>
>>>> It would be great if somebody could run this through JPRT, but as
>>>> Gustavo mentioned, I don't expect any regressions.
>>>>
>>>> @Igor: I think you've been the original author of the NUMA-aware
>>>> allocator port to Linux (i.e. "6684395: Port NUMA-aware allocator to
>>>> linux"). If you could find some spare minutes to take a look at this
>>>> change, your comment would be very much appreciated :)
>>>>
>>>> Following some minor comments from me:
>>>>
>>>> - in os::numa_get_groups_num() you now use numa_num_configured_nodes()
>>>> to get the actual number of configured nodes. This is good and
>>>> certainly an improvement over the previous implementation. However,
>>>> the man page for numa_num_configured_nodes() mentions that the
>>>> returned count may contain currently disabled nodes. Do we currently
>>>> handle disabled nodes? What would be the consequence if we used such
>>>> a disabled node (e.g. mbind() warnings)?
>>>
>>> In [1] 'numa_memnode_ptr' is set to keep a list of just the nodes with memory
>>> found in /sys/devices/system/node/*. Hence numa_num_configured_nodes() simply
>>> returns the number of nodes in 'numa_memnode_ptr' [2], i.e. the number of
>>> nodes with memory in the system. To the best of my knowledge there is no
>>> system configuration on Linux/PPC64 that could match such a notion of
>>> "disabled nodes" as it appears in libnuma's manual. If a node is enabled,
>>> it's in that dir, and only the ones with memory will be taken into account.
>>> If it's disabled (somehow), it's not in the dir, so it won't be taken into
>>> account (i.e. no mbind() will be tried against it).
>>>
>>> On Power it's possible to have a numa node without memory (a memory-less
>>> node, a case covered in this change) and a numa node without any cpus but
>>> with memory (a configured node anyway, so a case already covered), but
>>> disabling a specific numa node so that it does not appear in
>>> /sys/devices/system/node/* is only possible from inside the control module,
>>> or by some other rare condition not visible / adjustable from the OS. Also,
>>> I'm not aware of a case where a node is in this dir but is at the same time
>>> flagged as something like "disabled". There are cpu/memory hotplugs, but
>>> that does not change numa node status AFAIK.
>>>
>>> [1] https://github.com/numactl/numactl/blob/master/libnuma.c#L334-L347
>>> [2] https://github.com/numactl/numactl/blob/master/libnuma.c#L614-L618
>>>
>>>
>>>> - the same question applies to the usage of
>>>> Linux::isnode_in_configured_nodes() within os::numa_get_leaf_groups().
>>>> Does isnode_in_configured_nodes() (i.e. the node set defined by
>>>> 'numa_all_nodes_ptr') take the disabled nodes into account or not? Could
>>>> this be a potential problem (i.e. if we used a disabled node)?
>>>
>>> Regarding the meaning of "disabled nodes", it's the same case as above, so
>>> to the best of my knowledge it's not a potential problem.
>>>
>>> Anyway 'numa_all_nodes_ptr' just includes the configured nodes (with memory),
>>> i.e. "all nodes on which the calling task may allocate memory". It's exactly
>>> the same pointer returned by numa_get_membind() v2 [3] which:
>>>
>>> "returns the mask of nodes from which memory can currently be allocated"
>>>
>>> and that is used, for example, in "numactl --show" to show nodes from where
>>> memory can be allocated [4, 5].
>>>
>>> [3] https://github.com/numactl/numactl/blob/master/libnuma.c#L1147
>>> [4] https://github.com/numactl/numactl/blob/master/numactl.c#L144
>>> [5] https://github.com/numactl/numactl/blob/master/numactl.c#L177
>>>
>>>
>>>> - I'd like to suggest renaming the 'index' part of the following
>>>> variables and functions to 'nindex' ('node_index' is probably too long)
>>>> in the following code, to emphasize that we have node indexes pointing
>>>> to actual, not always consecutive, node numbers:
>>>>
>>>> 2879         // Create an index -> node mapping, since nodes are not always consecutive
>>>> 2880         _index_to_node = new (ResourceObj::C_HEAP, mtInternal) GrowableArray<int>(0, true);
>>>> 2881         rebuild_index_to_node_map();
>>>
>>> A simple change, but indeed much better to read. Done.
>>>
>>>
>>>> - can you please wrap the following one-line else statement into curly
>>>> braces (it's more readable and we usually do it that way in HotSpot
>>>> although there are no formal style guidelines :)
>>>>
>>>> 2953      } else
>>>> 2954        // Current node is already a configured node.
>>>> 2955        closest_node = index_to_node()->at(i);
>>>
>>> Done.
>>>
>>>
>>>> - in os::Linux::rebuild_cpu_to_node_map(), if you set
>>>> 'closest_distance' to INT_MAX at the beginning of the loop, you can
>>>> later avoid the check for '|| !closest_distance'. Also, according to
>>>> the man page, numa_distance() returns 0 if it can not determine the
>>>> distance. So with the above change, the condition on line 2974 should
>>>> read:
>>>>
>>>> 2947           if (distance && distance < closest_distance) {
>>>>
>>>
>>> Sure, it's much better to start with the initial distance as large as
>>> possible and then adjust to a closer node step by step, improving the if
>>> condition as well. Done.
>>>
>>>
>>>> Finally, and not directly related to your change, I'd suggest the
>>>> following clean-ups:
>>>>
>>>> - remove the usage of 'NCPUS = 32768' in
>>>> os::Linux::rebuild_cpu_to_node_map(). The comment on that line is
>>>> unclear to me and probably related to an older version/problem of
>>>> libnuma? I think we should simply use
>>>> numa_allocate_cpumask()/numa_free_cpumask() instead.
>>>>
>>>> - we still use the NUMA version 1 function prototypes (e.g.
>>>> "numa_node_to_cpus(int node, unsigned long *buffer, int buffer_len)"
>>>> instead of "numa_node_to_cpus(int node, struct bitmask *mask)", but
>>>> also "numa_interleave_memory()" and maybe others). I think we should
>>>> switch all prototypes to the new NUMA version 2 interface which you've
>>>> already used for the new functions which you've added.
>>>
>>> I agree. Could I open a new bug to address these clean-ups?
>>>
>>>
>>>> That said, I think these changes all require libnuma 2.0 (see
>>>> os::Linux::libnuma_dlsym). So before starting this, you should make
>>>> sure that libnuma 2.0 is available on all platforms to which you'd
>>>> like to down-port this change. For jdk10 we could definitely do it,
>>>> for jdk9 probably also, for jdk8 I'm not so sure.
>>>
>>> libnuma v1's last release dates back to 2008, but do you have any idea how I
>>> could check that for sure, since this is shared code?
>>>
>>> new webrev: http://cr.openjdk.java.net/~gromero/8175813/v3/
>>>
>>> Thank you!
>>>
>>> Best regards,
>>> Gustavo
>>>
>>>
>>>> Regards,
>>>> Volker
>>>>
>>>> On Thu, Apr 13, 2017 at 12:51 AM, Gustavo Romero
>>>> <gromero at linux.vnet.ibm.com> wrote:
>>>>> Hi,
>>>>>
>>>>> Any update on it?
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Regards,
>>>>> Gustavo
>>>>>
>>>>> On 09-03-2017 16:33, Gustavo Romero wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Could the following webrev be reviewed please?
>>>>>>
>>>>>> It improves the numa node detection when non-consecutive or memory-less nodes
>>>>>> exist in the system.
>>>>>>
>>>>>> webrev: http://cr.openjdk.java.net/~gromero/8175813/v2/
>>>>>> bug   : https://bugs.openjdk.java.net/browse/JDK-8175813
>>>>>>
>>>>>> Currently, although no problem exists when the JVM detects numa nodes that are
>>>>>> consecutive and have memory, for example in a numa topology like:
>>>>>>
>>>>>> available: 2 nodes (0-1)
>>>>>> node 0 cpus: 0 8 16 24 32
>>>>>> node 0 size: 65258 MB
>>>>>> node 0 free: 34 MB
>>>>>> node 1 cpus: 40 48 56 64 72
>>>>>> node 1 size: 65320 MB
>>>>>> node 1 free: 150 MB
>>>>>> node distances:
>>>>>> node   0   1
>>>>>>   0:  10  20
>>>>>>   1:  20  10,
>>>>>>
>>>>>> it fails to detect the numa nodes to be used by the Parallel GC in a numa
>>>>>> topology like:
>>>>>>
>>>>>> available: 4 nodes (0-1,16-17)
>>>>>> node 0 cpus: 0 8 16 24 32
>>>>>> node 0 size: 130706 MB
>>>>>> node 0 free: 7729 MB
>>>>>> node 1 cpus: 40 48 56 64 72
>>>>>> node 1 size: 0 MB
>>>>>> node 1 free: 0 MB
>>>>>> node 16 cpus: 80 88 96 104 112
>>>>>> node 16 size: 130630 MB
>>>>>> node 16 free: 5282 MB
>>>>>> node 17 cpus: 120 128 136 144 152
>>>>>> node 17 size: 0 MB
>>>>>> node 17 free: 0 MB
>>>>>> node distances:
>>>>>> node   0   1  16  17
>>>>>>   0:  10  20  40  40
>>>>>>   1:  20  10  40  40
>>>>>>  16:  40  40  10  20
>>>>>>  17:  40  40  20  10,
>>>>>>
>>>>>> where node 16 is not consecutive with node 1, and nodes 1 and 17 have no
>>>>>> memory.
>>>>>>
>>>>>> If a topology like that exists, os::numa_make_local() will receive, as a
>>>>>> hint, a local group id that is not available in the system for binding (it
>>>>>> can receive any node id from 0 to 17), causing a proliferation of "mbind:
>>>>>> Invalid argument" messages:
>>>>>>
>>>>>> http://cr.openjdk.java.net/~gromero/logs/jdk10_pristine.log
>>>>>>
>>>>>> This change improves the detection by making the JVM numa API aware of the
>>>>>> existence of numa nodes that are non-consecutive from 0 up to the highest
>>>>>> node number, and also of nodes that might be memory-less, i.e. that might
>>>>>> not be, in libnuma terms, configured nodes. Hence only the configured
>>>>>> nodes will be available:
>>>>>>
>>>>>> http://cr.openjdk.java.net/~gromero/logs/jdk10_numa_patched.log
>>>>>>
>>>>>> The change has no effect on numa topologies where the problem does not
>>>>>> occur, i.e. there is no change in the number of nodes and no change in the
>>>>>> cpu to node map. On numa topologies where memory-less nodes exist (like in
>>>>>> the last example above), cpus from a memory-less node can't be bound
>>>>>> locally, so they are mapped to the closest node; otherwise they would not
>>>>>> be associated with any node and MutableNUMASpace::cas_allocate() would
>>>>>> pick a node at random, compromising performance.
>>>>>>
>>>>>> I found no regressions on x64 for the following numa topology:
>>>>>>
>>>>>> available: 2 nodes (0-1)
>>>>>> node 0 cpus: 0 1 2 3 8 9 10 11
>>>>>> node 0 size: 24102 MB
>>>>>> node 0 free: 19806 MB
>>>>>> node 1 cpus: 4 5 6 7 12 13 14 15
>>>>>> node 1 size: 24190 MB
>>>>>> node 1 free: 21951 MB
>>>>>> node distances:
>>>>>> node   0   1
>>>>>>   0:  10  21
>>>>>>   1:  21  10
>>>>>>
>>>>>> I understand that fixing the current numa detection is a prerequisite to
>>>>>> enabling UseNUMA by default [1] and to extending numa-aware allocation to
>>>>>> the G1 GC [2].
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Gustavo
>>>>>>
>>>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8046153 (JEP 163: Enable NUMA Mode by Default When Appropriate)
>>>>>> [2] https://bugs.openjdk.java.net/browse/JDK-8046147 (JEP 157: G1 GC: NUMA-Aware Allocation)
>>>>>>
>>>>>
>>>>
>>>
>>


More information about the ppc-aix-port-dev mailing list