[10] RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

Volker Simonis volker.simonis at gmail.com
Mon Apr 24 17:08:41 UTC 2017


Hi Gustavo,

thanks for addressing this problem and sorry for my late reply. I
think this is a good change which definitely improves the situation
for uncommon NUMA configurations without changing the handling for
common topologies.

It would be great if somebody could run this through JPRT, but as
Gustavo mentioned, I don't expect any regressions.

@Igor: I think you've been the original author of the NUMA-aware
allocator port to Linux (i.e. "6684395: Port NUMA-aware allocator to
linux"). If you could find some spare minutes to take a look at this
change, your comment would be very much appreciated :)

Following some minor comments from me:

- in os::numa_get_groups_num() you now use numa_num_configured_nodes()
to get the actual number of configured nodes. This is good and
certainly an improvement over the previous implementation. However,
the man page for numa_num_configured_nodes() mentions that the
returned count may include currently disabled nodes. Do we currently
handle disabled nodes? What would be the consequence of using such a
disabled node (e.g. mbind() warnings)?

- the same question applies to the usage of
Linux::isnode_in_configured_nodes() within os::numa_get_leaf_groups().
Does isnode_in_configured_nodes() (i.e. the node set defined by
'numa_all_nodes_ptr') take the disabled nodes into account or not? Can
this be a potential problem (i.e. if we use a disabled node)?

- I'd like to suggest renaming the 'index' part of the following
variables and functions to 'nindex' ('node_index' is probably too
long) in the following code, to emphasize that we have node indexes
pointing to actual, not always consecutive, node numbers:

2879         // Create an index -> node mapping, since nodes are not always consecutive
2880         _index_to_node = new (ResourceObj::C_HEAP, mtInternal) GrowableArray<int>(0, true);
2881         rebuild_index_to_node_map();

- can you please wrap the following one-line else statement in curly
braces? It's more readable and we usually do it that way in HotSpot,
although there are no formal style guidelines :)

2953      } else
2954        // Current node is already a configured node.
2955        closest_node = index_to_node()->at(i);

- in os::Linux::rebuild_cpu_to_node_map(), if you set
'closest_distance' to INT_MAX at the beginning of the loop, you can
later avoid the check for '|| !closest_distance'. Also, according to
the man page, numa_distance() returns 0 if it cannot determine the
distance. So with the above change, the condition on line 2974 should
read:

2947           if (distance && distance < closest_distance) {


Finally, and not directly related to your change, I'd suggest the
following clean-ups:

- remove the usage of 'NCPUS = 32768' in
os::Linux::rebuild_cpu_to_node_map(). The comment on that line is
unclear to me and is probably related to an older version/problem of
libnuma. I think we should simply use
numa_allocate_cpumask()/numa_free_cpumask() instead.

- we still use the NUMA version 1 function prototypes (e.g.
"numa_node_to_cpus(int node, unsigned long *buffer, int buffer_len)"
instead of "numa_node_to_cpus(int node, struct bitmask *mask)", but
also "numa_interleave_memory()" and maybe others). I think we should
switch all prototypes to the new NUMA version 2 interface, which
you've already used for the new functions you added.

That said, I think these changes all require libnuma 2.0 (see
os::Linux::libnuma_dlsym). So before starting this, you should make
sure that libnuma 2.0 is available on all platforms to which you'd
like to down-port this change. For jdk10 we could definitely do it,
for jdk9 probably also, for jdk8 I'm not so sure.

Regards,
Volker

On Thu, Apr 13, 2017 at 12:51 AM, Gustavo Romero
<gromero at linux.vnet.ibm.com> wrote:
> Hi,
>
> Any update on it?
>
> Thank you.
>
> Regards,
> Gustavo
>
> On 09-03-2017 16:33, Gustavo Romero wrote:
>> Hi,
>>
>> Could the following webrev be reviewed please?
>>
>> It improves the numa node detection when non-consecutive or memory-less nodes
>> exist in the system.
>>
>> webrev: http://cr.openjdk.java.net/~gromero/8175813/v2/
>> bug   : https://bugs.openjdk.java.net/browse/JDK-8175813
>>
>> Currently, although no problem exists when the JVM detects numa nodes that are
>> consecutive and have memory, for example in a numa topology like:
>>
>> available: 2 nodes (0-1)
>> node 0 cpus: 0 8 16 24 32
>> node 0 size: 65258 MB
>> node 0 free: 34 MB
>> node 1 cpus: 40 48 56 64 72
>> node 1 size: 65320 MB
>> node 1 free: 150 MB
>> node distances:
>> node   0   1
>>   0:  10  20
>>   1:  20  10,
>>
>> it fails on detecting numa nodes to be used in the Parallel GC in a numa
>> topology like:
>>
>> available: 4 nodes (0-1,16-17)
>> node 0 cpus: 0 8 16 24 32
>> node 0 size: 130706 MB
>> node 0 free: 7729 MB
>> node 1 cpus: 40 48 56 64 72
>> node 1 size: 0 MB
>> node 1 free: 0 MB
>> node 16 cpus: 80 88 96 104 112
>> node 16 size: 130630 MB
>> node 16 free: 5282 MB
>> node 17 cpus: 120 128 136 144 152
>> node 17 size: 0 MB
>> node 17 free: 0 MB
>> node distances:
>> node   0   1  16  17
>>   0:  10  20  40  40
>>   1:  20  10  40  40
>>  16:  40  40  10  20
>>  17:  40  40  20  10,
>>
>> where node 16 is not consecutive in relation to 1 and also nodes 1 and 17 have
>> no memory.
>>
>> If a topology like that exists, os::numa_make_local() will receive, as a
>> hint, a local group id that is not available in the system to be bound (it
>> will receive all nodes from 0 to 17), causing a proliferation of "mbind:
>> Invalid argument" messages:
>>
>> http://cr.openjdk.java.net/~gromero/logs/jdk10_pristine.log
>>
>> This change improves the detection by making the JVM numa API aware of the
>> existence of numa nodes that are non-consecutive from 0 to the highest node
>> number and also of nodes that might be memory-less, i.e. that might not be,
>> in libnuma terms, configured nodes. Hence only the configured nodes will
>> be available:
>>
>> http://cr.openjdk.java.net/~gromero/logs/jdk10_numa_patched.log
>>
>> The change has no effect on numa topologies where the problem does not occur,
>> i.e. no change in the number of nodes and no change in the cpu to node map. On
>> numa topologies where memory-less nodes exist (like in the last example above),
>> cpus from a memory-less node won't be able to bind locally, so they are mapped
>> to the closest node; otherwise they would not be associated with any node and
>> MutableNUMASpace::cas_allocate() would pick a node randomly, compromising
>> performance.
>>
>> I found no regressions on x64 for the following numa topology:
>>
>> available: 2 nodes (0-1)
>> node 0 cpus: 0 1 2 3 8 9 10 11
>> node 0 size: 24102 MB
>> node 0 free: 19806 MB
>> node 1 cpus: 4 5 6 7 12 13 14 15
>> node 1 size: 24190 MB
>> node 1 free: 21951 MB
>> node distances:
>> node   0   1
>>   0:  10  21
>>   1:  21  10
>>
>> I understand that fixing the current numa detection is a prerequisite to
>> enable UseNUMA by default [1] and to extend the numa-aware allocation to the
>> G1 GC [2].
>>
>> Thank you.
>>
>>
>> Best regards,
>> Gustavo
>>
>> [1] https://bugs.openjdk.java.net/browse/JDK-8046153 (JEP 163: Enable NUMA Mode by Default When Appropriate)
>> [2] https://bugs.openjdk.java.net/browse/JDK-8046147 (JEP 157: G1 GC: NUMA-Aware Allocation)
>>
>
