RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

Gustavo Romero gromero at linux.vnet.ibm.com
Wed May 3 13:27:08 UTC 2017


Hi community,

I understand that, at this point, nothing more can be done on the PPC64 side
regarding this issue.

It's a change in shared code - although in effect it does not change the NUMA
detection mechanism on other platforms - and hence it needs a joint community
effort to review the change and a sponsor to run it through JPRT.

I know this is a stabilization phase for OpenJDK 9, but since this issue is of
great concern on PPC64 (especially on POWER8 machines) I would be very glad if
the community could point out directions on how this change could move forward.

Thank you!

Best regards,
Gustavo

On 25-04-2017 23:49, Gustavo Romero wrote:
> Dear Volker,
> 
> On 24-04-2017 14:08, Volker Simonis wrote:
>> Hi Gustavo,
>>
>> thanks for addressing this problem and sorry for my late reply. I
>> think this is a good change which definitely improves the situation
>> for uncommon NUMA configurations without changing the handling for
>> common topologies.
> 
> Thanks a lot for reviewing the change!
> 
> 
>> It would be great if somebody could run this through JPRT but, as
>> Gustavo mentioned, I don't expect any regressions.
>>
>> @Igor: I think you've been the original author of the NUMA-aware
>> allocator port to Linux (i.e. "6684395: Port NUMA-aware allocator to
>> linux"). If you could find some spare minutes to take a look at this
>> change, your comment would be very much appreciated :)
>>
>> Here are some minor comments from me:
>>
>> - in os::numa_get_groups_num() you now use numa_num_configured_nodes()
>> to get the actual number of configured nodes. This is good and
>> certainly an improvement over the previous implementation. However,
>> the man page for numa_num_configured_nodes() mentions that the
>> returned count may contain currently disabled nodes. Do we currently
>> handle disabled nodes? What would be the consequence if we used
>> such a disabled node (e.g. mbind() warnings)?
> 
> In [1] 'numa_memnode_ptr' is set to keep a list of just the nodes with memory
> found in /sys/devices/system/node/*. Hence numa_num_configured_nodes() simply
> returns the number of nodes in 'numa_memnode_ptr' [2], i.e. the number of
> nodes with memory in the system. To the best of my knowledge there is no
> system configuration on Linux/PPC64 that could match the notion of "disabled
> nodes" as it appears in libnuma's manual. If a node is enabled, it's in that
> dir, and only the ones with memory are taken into account. If it's disabled
> (somehow), it's not in the dir, so it won't be taken into account (i.e. no
> mbind() is tried against it).
> 
> On Power it's possible to have a NUMA node without memory (a memory-less
> node, a case covered in this change) and a NUMA node without any cpus but
> with memory (a configured node anyway, so a case already covered). But
> disabling a specific NUMA node so that it does not appear in
> /sys/devices/system/node/* is only possible from the inners of the control
> module, or by some other rare condition not visible / adjustable from the OS.
> Also, I'm not aware of a case where a node is in that dir but is at the same
> time flagged as something like "disabled". There are cpu/memory hotplugs, but
> they don't change NUMA node status AFAIK.
> 
> [1] https://github.com/numactl/numactl/blob/master/libnuma.c#L334-L347
> [2] https://github.com/numactl/numactl/blob/master/libnuma.c#L614-L618
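> 
> For illustration, a minimal standalone sketch (not part of the webrev;
> assumes libnuma 2.x) showing the difference between the configured node
> count and the highest node number on such a topology:
> 
>   /* build: cc nodes.c -lnuma */
>   #include <numa.h>
>   #include <stdio.h>
> 
>   int main(void) {
>     if (numa_available() == -1)
>       return 1;
>     /* Nodes with memory vs. highest node id: on the 4-node POWER8
>        example quoted below these would be 2 and 17, respectively. */
>     printf("configured nodes: %d, highest node id: %d\n",
>            numa_num_configured_nodes(), numa_max_node());
>     return 0;
>   }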
> 
> 
>> - the same question applies to the usage of
>> Linux::isnode_in_configured_nodes() within os::numa_get_leaf_groups().
>> Does isnode_in_configured_nodes() (i.e. the node set defined by
>> 'numa_all_nodes_ptr') take into account disabled nodes or not? Could
>> this be a potential problem (i.e. if we used a disabled node)?
> 
> On the meaning of "disabled nodes", it's the same case as above, so to the
> best of my knowledge it's not a potential problem.
> 
> Anyway 'numa_all_nodes_ptr' just includes the configured nodes (with memory),
> i.e. "all nodes on which the calling task may allocate memory". It's exactly
> the same pointer returned by numa_get_membind() v2 [3], which:
> 
> "returns the mask of nodes from which memory can currently be allocated"
> 
> and that is used, for example, in "numactl --show" to show nodes from where
> memory can be allocated [4, 5].
> 
> [3] https://github.com/numactl/numactl/blob/master/libnuma.c#L1147
> [4] https://github.com/numactl/numactl/blob/master/numactl.c#L144
> [5] https://github.com/numactl/numactl/blob/master/numactl.c#L177
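> 
> A rough sketch of that equivalence (for illustration only; assumes
> libnuma 2.x and that numa_available() was already called):
> 
>   #include <numa.h>
> 
>   /* numa_get_membind() (v2) returns the mask of nodes from which
>      memory can currently be allocated - the same set libnuma exposes
>      through 'numa_all_nodes_ptr'. */
>   struct bitmask *mb = numa_get_membind();
>   int same = numa_bitmask_equal(mb, numa_all_nodes_ptr);  /* 1 here */
>   numa_bitmask_free(mb);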
> 
> 
>> - I'd like to suggest renaming the 'index' part of the following
>> variables and functions to 'nindex' ('node_index' is probably too long)
>> in the following code, to emphasize that we have node indexes pointing
>> to actual, not always consecutive, node numbers:
>>
>> 2879         // Create an index -> node mapping, since nodes are not always consecutive
>> 2880         _index_to_node = new (ResourceObj::C_HEAP, mtInternal) GrowableArray<int>(0, true);
>> 2881         rebuild_index_to_node_map();
> 
> A simple change, but indeed much better to read. Done.
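> 
> For reference, a rough sketch of what the renamed mapping looks like
> (simplified; not the literal webrev code):
> 
>   // nindex -> node mapping, since node numbers are not always consecutive.
>   static GrowableArray<int>* _nindex_to_node;
> 
>   static void rebuild_nindex_to_node_map() {
>     int highest_node_number = Linux::numa_max_node();
>     _nindex_to_node->clear();
>     for (int node = 0; node <= highest_node_number; node++) {
>       // Only configured nodes (i.e. nodes with memory) get an index.
>       if (Linux::isnode_in_configured_nodes(node)) {
>         _nindex_to_node->append(node);
>       }
>     }
>   }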
> 
> 
>> - can you please wrap the following one-line else statement into curly
>> braces (it's more readable and we usually do it that way in HotSpot
>> although there are no formal style guidelines :)
>>
>> 2953      } else
>> 2954        // Current node is already a configured node.
>> 2955        closest_node = index_to_node()->at(i);
> 
> Done.
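> 
> The snippet now reads (roughly):
> 
>   } else {
>     // Current node is already a configured node.
>     closest_node = nindex_to_node()->at(i);
>   }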
> 
> 
>> - in os::Linux::rebuild_cpu_to_node_map(), if you set
>> 'closest_distance' to INT_MAX at the beginning of the loop, you can
>> later avoid the check for '|| !closest_distance'. Also, according to
>> the man page, numa_distance() returns 0 if it can not determine the
>> distance. So with the above change, the condition on line 2974 should
>> read:
>>
>> 2947           if (distance && distance < closest_distance) {
>>
> 
> Sure, it's much better to start with the initial distance as large as
> possible and step by step adjust it to a closer one, which also improves the
> if condition. Done.
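> 
> The search now looks roughly like this (a sketch; 'node_num' stands for
> the number of configured nodes, not the literal webrev code):
> 
>   int closest_distance = INT_MAX;
>   int closest_node = 0;
>   for (int m = 0; m < node_num; m++) {
>     // numa_distance() returns 0 when it cannot determine the distance,
>     // so a zero distance can never win against closest_distance.
>     int distance = Linux::numa_distance(node, nindex_to_node()->at(m));
>     if (distance && distance < closest_distance) {
>       closest_distance = distance;
>       closest_node = nindex_to_node()->at(m);
>     }
>   }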
> 
> 
>> Finally, and not directly related to your change, I'd suggest the
>> following clean-ups:
>>
>> - remove the usage of 'NCPUS = 32768' in
>> os::Linux::rebuild_cpu_to_node_map(). The comment on that line is
>> unclear to me and probably related to an older version/problem of
>> libnuma? I think we should simply use
>> numa_allocate_cpumask()/numa_free_cpumask() instead.
>>
>> - we still use the NUMA version 1 function prototypes (e.g.
>> "numa_node_to_cpus(int node, unsigned long *buffer, int buffer_len)"
>> instead of "numa_node_to_cpus(int node, struct bitmask *mask)", but
>> also "numa_interleave_memory()" and maybe others). I think we should
>> switch all prototypes to the new NUMA version 2 interface which you've
>> already used for the new functions which you've added.
> 
> I agree. Could I open a new bug to address these clean-ups?
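> 
> For reference, the v2-style cpumask handling for that clean-up would look
> roughly like this (a sketch, assuming libnuma 2.0 is present):
> 
>   // Let libnuma size the mask instead of a fixed NCPUS-sized buffer.
>   struct bitmask* cpus = numa_allocate_cpumask();
>   if (numa_node_to_cpus(node, cpus) == 0) {  // v2 prototype
>     // ... walk the mask with numa_bitmask_isbitset(cpus, cpu) ...
>   }
>   numa_free_cpumask(cpus);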
> 
> 
>> That said, I think these changes all require libnuma 2.0 (see
>> os::Linux::libnuma_dlsym). So before starting this, you should make
>> sure that libnuma 2.0 is available on all platforms to which you'd
>> like to down-port this change. For jdk10 we could definitely do it,
>> for jdk9 probably also, for jdk8 I'm not so sure.
> 
> The last libnuma v1 release dates back to 2008, but do you have any idea how
> I could check that for sure, given that it's shared code?
> 
> new webrev: http://cr.openjdk.java.net/~gromero/8175813/v3/
> 
> Thank you!
> 
> Best regards,
> Gustavo
> 
> 
>> Regards,
>> Volker
>>
>> On Thu, Apr 13, 2017 at 12:51 AM, Gustavo Romero
>> <gromero at linux.vnet.ibm.com> wrote:
>>> Hi,
>>>
>>> Any update on it?
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Gustavo
>>>
>>> On 09-03-2017 16:33, Gustavo Romero wrote:
>>>> Hi,
>>>>
>>>> Could the following webrev be reviewed please?
>>>>
>>>> It improves NUMA node detection when non-consecutive or memory-less nodes
>>>> exist in the system.
>>>>
>>>> webrev: http://cr.openjdk.java.net/~gromero/8175813/v2/
>>>> bug   : https://bugs.openjdk.java.net/browse/JDK-8175813
>>>>
>>>> Currently, although no problem exists when the JVM detects numa nodes that are
>>>> consecutive and have memory, for example in a numa topology like:
>>>>
>>>> available: 2 nodes (0-1)
>>>> node 0 cpus: 0 8 16 24 32
>>>> node 0 size: 65258 MB
>>>> node 0 free: 34 MB
>>>> node 1 cpus: 40 48 56 64 72
>>>> node 1 size: 65320 MB
>>>> node 1 free: 150 MB
>>>> node distances:
>>>> node   0   1
>>>>   0:  10  20
>>>>   1:  20  10,
>>>>
>>>> the JVM fails to detect the NUMA nodes to be used by the Parallel GC in a
>>>> topology like:
>>>>
>>>> available: 4 nodes (0-1,16-17)
>>>> node 0 cpus: 0 8 16 24 32
>>>> node 0 size: 130706 MB
>>>> node 0 free: 7729 MB
>>>> node 1 cpus: 40 48 56 64 72
>>>> node 1 size: 0 MB
>>>> node 1 free: 0 MB
>>>> node 16 cpus: 80 88 96 104 112
>>>> node 16 size: 130630 MB
>>>> node 16 free: 5282 MB
>>>> node 17 cpus: 120 128 136 144 152
>>>> node 17 size: 0 MB
>>>> node 17 free: 0 MB
>>>> node distances:
>>>> node   0   1  16  17
>>>>   0:  10  20  40  40
>>>>   1:  20  10  40  40
>>>>  16:  40  40  10  20
>>>>  17:  40  40  20  10,
>>>>
>>>> where node 16 is not consecutive with node 1, and nodes 1 and 17 have no
>>>> memory.
>>>>
>>>> If a topology like that exists, os::numa_make_local() will receive as a hint
>>>> a local group id that is not available in the system for binding (it can
>>>> receive any node id from 0 to 17), causing a proliferation of "mbind:
>>>> Invalid argument" messages:
>>>>
>>>> http://cr.openjdk.java.net/~gromero/logs/jdk10_pristine.log
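>>>> 
>>>> The failure can be reproduced in isolation with something like the
>>>> following (hypothetical sketch; os::numa_make_local() boils down to a
>>>> numa_tonode_memory()/mbind() call):
>>>> 
>>>>   // On the topology above, node 17 exists but has no memory, so the
>>>>   // underlying mbind() fails and libnuma prints "mbind: Invalid argument".
>>>>   void* p = numa_alloc_onnode(4096, 0);  // node 0 has memory: works
>>>>   numa_tonode_memory(p, 4096, 17);       // node 17: mbind() fails
>>>>   numa_free(p, 4096);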
>>>>
>>>> This change improves the detection by making the JVM's NUMA API aware of the
>>>> existence of NUMA nodes that are non-consecutive between 0 and the highest
>>>> node number, and of nodes that might be memory-less, i.e. that might not be,
>>>> in libnuma terms, configured nodes. Hence only the configured nodes will be
>>>> available:
>>>>
>>>> http://cr.openjdk.java.net/~gromero/logs/jdk10_numa_patched.log
>>>>
>>>> The change has no effect on NUMA topologies where the problem does not occur,
>>>> i.e. there is no change in the number of nodes and no change in the
>>>> cpu-to-node map. On NUMA topologies where memory-less nodes exist (like in
>>>> the last example above), cpus from a memory-less node can't bind locally, so
>>>> they are mapped to the closest node with memory; otherwise they would not be
>>>> associated with any node and MutableNUMASpace::cas_allocate() would pick a
>>>> node randomly, compromising performance.
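>>>> 
>>>> For instance, in the 4-node example above the cpus of memory-less node 1
>>>> end up mapped to node 0, the nearest configured node according to the
>>>> distance table:
>>>> 
>>>>   numa_distance(1, 0)  == 20  // closest configured node -> chosen
>>>>   numa_distance(1, 16) == 40
>>>>   numa_distance(1, 17) == 40  // memory-less, not a configured node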
>>>>
>>>> I found no regressions on x64 for the following numa topology:
>>>>
>>>> available: 2 nodes (0-1)
>>>> node 0 cpus: 0 1 2 3 8 9 10 11
>>>> node 0 size: 24102 MB
>>>> node 0 free: 19806 MB
>>>> node 1 cpus: 4 5 6 7 12 13 14 15
>>>> node 1 size: 24190 MB
>>>> node 1 free: 21951 MB
>>>> node distances:
>>>> node   0   1
>>>>   0:  10  21
>>>>   1:  21  10
>>>>
>>>> I understand that fixing the current NUMA detection is a prerequisite for
>>>> enabling UseNUMA by default [1] and for extending NUMA-aware allocation to
>>>> the G1 GC [2].
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> Best regards,
>>>> Gustavo
>>>>
>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8046153 (JEP 163: Enable NUMA Mode by Default When Appropriate)
>>>> [2] https://bugs.openjdk.java.net/browse/JDK-8046147 (JEP 157: G1 GC: NUMA-Aware Allocation)
>>>>
>>>
>>
> 


