Linux/PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used
Gustavo Romero
gromero at linux.vnet.ibm.com
Fri Feb 24 12:02:31 UTC 2017
Hi Sangheon,
Please find my comments inline.
On 06-02-2017 20:23, sangheon wrote:
> Hi Gustavo,
>
> On 02/06/2017 01:50 PM, Gustavo Romero wrote:
>> Hi,
>>
>> On Linux/PPC64 I'm getting a series of "mbind: Invalid argument" messages that
>> seem exactly the same as those reported for x64 [1]:
>>
>> [root at spocfire3 ~]# java -XX:+UseNUMA -version
>> mbind: Invalid argument
>> mbind: Invalid argument
>> mbind: Invalid argument
>> mbind: Invalid argument
>> mbind: Invalid argument
>> mbind: Invalid argument
>> mbind: Invalid argument
>> openjdk version "1.8.0_121"
>> OpenJDK Runtime Environment (build 1.8.0_121-b13)
>> OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
>>
>> [root at spocfire3 ~]# uname -a
>> Linux spocfire3.aus.stglabs.ibm.com 3.10.0-327.el7.ppc64le #1 SMP Thu Oct 29 17:31:13 EDT 2015 ppc64le ppc64le ppc64le GNU/Linux
>>
>> [root at spocfire3 ~]# lscpu
>> Architecture:          ppc64le
>> Byte Order:            Little Endian
>> CPU(s):                160
>> On-line CPU(s) list:   0-159
>> Thread(s) per core:    8
>> Core(s) per socket:    10
>> Socket(s):             2
>> NUMA node(s):          2
>> Model:                 2.0 (pvr 004d 0200)
>> Model name:            POWER8 (raw), altivec supported
>> L1d cache:             64K
>> L1i cache:             32K
>> L2 cache:              512K
>> L3 cache:              8192K
>> NUMA node0 CPU(s):     0-79
>> NUMA node8 CPU(s):     80-159
>>
>> On chasing it down, it looks like the message comes from PSYoungGen::initialize() in
>> src/share/vm/gc_implementation/parallelScavenge/psYoungGen.cpp, which calls
>> initialize_work(), which in turn calls the MutableNUMASpace() constructor if
>> UseNUMA is set:
>> http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/567e410935e5/src/share/vm/gc_implementation/parallelScavenge/psYoungGen.cpp#l77
>>
>> MutableNUMASpace() then calls os::numa_make_local(), which in the end calls
>> numa_set_bind_policy() in libnuma.so.1 [2].
>>
>> I've traced some values for which the mbind() syscall fails:
>> http://termbin.com/ztfs (search for "Invalid argument" in the log).
>>
>> Assuming it's the same bug as reported in [1], and hence also unfixed in 9 and 10:
>>
>> - Is there any WIP or known workaround?
> There's no progress on JDK-8163796 and no workaround found yet.
> And unfortunately, I'm not planning to fix it soon.
Hive, a critical component of the Hadoop ecosystem, comes with a shell and uses Java
(with the UseNUMA flag) in the background to run MySQL-like queries. On PPC64 the
mbind() messages in question make the shell pretty cumbersome to use. For instance:
hive> show databases;
mbind: Invalid argument
mbind: Invalid argument
mbind: Invalid argument (message repeats 28 more times...)
...
OK
mbind: Invalid argument
mbind: Invalid argument
mbind: Invalid argument
default
tpcds_bin_partitioned_orc_10
tpcds_text_10
Time taken: 1.036 seconds, Fetched: 3 row(s)
hive> mbind: Invalid argument
mbind: Invalid argument
mbind: Invalid argument
Also, on PPC64 a simple "java -XX:+UseParallelGC -XX:+UseNUMA -version" is enough
to trigger the problem, without any additional flags. So I'd like to correct that
behavior (please see my next comment on that).
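For what it's worth, the messages are reproducible outside the JVM as well. Here
is a minimal sketch of mine (not JDK code) that provokes the same output; it
assumes libnuma is installed and that some node, here node 1, is memoryless, as
in the topology shown further below:

  // repro.cpp -- build with: g++ repro.cpp -lnuma -o repro
  // Sketch: trigger "mbind: Invalid argument" by binding anonymous
  // memory to a memoryless node. The node id (1) is an assumption;
  // pick any node that `numactl --hardware` reports with "size: 0 MB".
  #include <numa.h>
  #include <sys/mman.h>
  #include <cstdio>

  int main() {
    if (numa_available() == -1) {
      printf("NUMA not available\n");
      return 1;
    }
    size_t len = 1024 * 1024;
    void* p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    // libnuma's default error handler perror()s the failing syscall,
    // which is where the "mbind: Invalid argument" line comes from.
    numa_tonode_memory(p, len, 1 /* memoryless node: assumption */);
    munmap(p, len);
    return 0;
  }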
>> - Should I append this output to [1]'s description, or open a new one and make it
>> related to [1]?
> I think your problem is the same as JDK-8163796, so adding your output to the CR seems good.
> And please add logs as well. I recommend enabling something like "-Xlog:gc*,gc+heap*=trace".
> IIRC, the problem only occurred when -Xmx was small in my case.
The JVM code used to discover which NUMA nodes it can bind to assumes that the
nodes are consecutive and tries to bind from 0 to numa_max_node() [1, 2, 3, 4],
i.e. from 0 to the highest node number available on the system. However, at
least on PPC64, that assumption is not always true. For instance, consider the
following NUMA topology:
available: 4 nodes (0-1,16-17)
node 0 cpus: 0 8 16 24 32
node 0 size: 130706 MB
node 0 free: 145 MB
node 1 cpus: 40 48 56 64 72
node 1 size: 0 MB
node 1 free: 0 MB
node 16 cpus: 80 88 96 104 112
node 16 size: 130630 MB
node 16 free: 529 MB
node 17 cpus: 120 128 136 144 152
node 17 size: 0 MB
node 17 free: 0 MB
node distances:
node   0   1  16  17
  0:  10  20  40  40
  1:  20  10  40  40
 16:  40  40  10  20
 17:  40  40  20  10
In that case we have four nodes, two of them without memory (1 and 17), and the
highest node id is 17. Hence if the JVM tries to bind from 0 to 17, mbind() will
fail for every node except 0 and 16, which are configured and have memory. Each
mbind() failure generates one "mbind: Invalid argument" message.
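In other words, the id range the JVM iterates over and the set of bindable nodes
disagree. A quick check (again a sketch of mine, not HotSpot code) makes the
mismatch visible:

  // nodes.cpp -- build with: g++ nodes.cpp -lnuma -o nodes
  #include <numa.h>
  #include <cstdio>

  int main() {
    if (numa_available() == -1) return 1;
    // Highest node id: 17 on the topology above.
    printf("numa_max_node()             = %d\n", numa_max_node());
    // Nodes configured with memory: 2 (nodes 0 and 16) above.
    printf("numa_num_configured_nodes() = %d\n", numa_num_configured_nodes());
    return 0;
  }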
A solution would be to use, in os::numa_get_group_num(), not numa_max_node() but
numa_num_configured_nodes(), which returns the total number of nodes with memory
in the system (so in our example above it returns exactly 2 nodes), and then to
inspect numa_all_nodes_ptr in os::numa_get_leaf_groups() to find the correct
node ids to append (in our case, ids[0] = 0 [node 0] and ids[1] = 16 [node 16]).
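The enumeration I have in mind looks roughly like this (an illustration of the
idea, not the webrev code; the fixed ids[] bound is an assumption):

  // groups.cpp -- build with: g++ groups.cpp -lnuma -o groups
  #include <numa.h>
  #include <cstdio>

  int main() {
    if (numa_available() == -1) return 1;
    int groups = numa_num_configured_nodes();  // 2 on the machine above
    int ids[64];                               // assumption: enough slots
    int idx = 0;
    // Walk all node ids, keeping only those with memory, i.e. the ones
    // set in numa_all_nodes_ptr.
    for (int i = 0; i <= numa_max_node() && idx < groups; i++) {
      if (numa_bitmask_isbitset(numa_all_nodes_ptr, i)) {
        ids[idx++] = i;                        // ids[0]=0, ids[1]=16 here
      }
    }
    for (int j = 0; j < idx; j++) {
      printf("ids[%d] = node %d\n", j, ids[j]);
    }
    return 0;
  }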
One thing is that the os::numa_get_leaf_groups() argument "size" will no longer
be required and will end up unused, so the interface will have to be adapted on
OSes other than Linux too, I guess [5].
It would also be necessary to adapt os::Linux::rebuild_cpu_to_node_map(), since
not all NUMA nodes are suitable to be returned by a call to
os::numa_get_group_id(): some CPUs sit on a node without memory. In that case we
can return the closest NUMA node instead, as sketched below. A new way to
translate indices to node ids is also useful, since nodes are not always
consecutive.
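Something along these lines could do the closest-node lookup (a sketch; the
helper name is hypothetical):

  // Fragment -- for a cpu on a memoryless node, return the closest
  // node that has memory, using the SLIT distances (e.g. above, a cpu
  // on node 1 maps to node 0, distance 20).
  #include <numa.h>

  static int closest_node_with_memory(int cpu) {
    int from = numa_node_of_cpu(cpu);  // may be a memoryless node
    int best = -1;
    int best_dist = 0;
    for (int n = 0; n <= numa_max_node(); n++) {
      if (!numa_bitmask_isbitset(numa_all_nodes_ptr, n))
        continue;                      // skip nodes without memory
      int d = numa_distance(from, n);
      if (best == -1 || d < best_dist) {
        best = n;
        best_dist = d;
      }
    }
    return best;
  }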
Finally, although "numa_nodes_ptr" is not present in libnuma's manual, it is what
numactl uses to find out the total number of nodes in the system [6]. I could not
find a function that readily returns that number, so I asked on the libnuma ML
whether a better solution exists [7].
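For reference, the numactl-style count boils down to the bit weight of that mask
(a sketch; numa_nodes_ptr is exported by libnuma even though it is undocumented):

  // count.cpp -- build with: g++ count.cpp -lnuma -o count
  #include <numa.h>
  #include <cstdio>

  int main() {
    if (numa_available() == -1) return 1;
    // numa_nodes_ptr marks the nodes the system has; its bit count is
    // the total numactl reports (4 above), versus
    // numa_num_configured_nodes() (2 above, memory nodes only).
    printf("total nodes = %u\n", numa_bitmask_weight(numa_nodes_ptr));
    return 0;
  }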
The following webrev implements the proposed changes on jdk9 (backport to 8 is
simple):
webrev: http://cr.openjdk.java.net/~gromero/8175813/
bug: https://bugs.openjdk.java.net/browse/JDK-8175813
Here are the logs with "-Xlog:gc*,gc+heap*=trace":
http://cr.openjdk.java.net/~gromero/logs/pristine.log (current state)
http://cr.openjdk.java.net/~gromero/logs/numa_patched.log (proposed change)
I've tested on 8 against SPECjvm2008 on the aforementioned machine and
performance improved by ~5% in comparison to the same version as packaged by
the distro, but I don't expect any difference on machines where nodes are
always consecutive and always have memory.
After due community review, could you sponsor that change?
Thank you.
Best regards,
Gustavo
[1] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/share/vm/gc/parallel/mutableNUMASpace.cpp#l241
[2] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/os/linux/vm/os_linux.cpp#l2745
[3] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/share/vm/gc/parallel/mutableNUMASpace.cpp#l243
[4] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/os/linux/vm/os_linux.cpp#l2761
[5] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/share/vm/runtime/os.hpp#l356
[6] https://github.com/numactl/numactl/blob/master/numactl.c#L251
[7] http://www.spinics.net/lists/linux-numa/msg01173.html
>
> Thanks,
> Sangheon
>
>
>>
>> Thank you.
>>
>>
>> Best regards,
>> Gustavo
>>
>> [1] https://bugs.openjdk.java.net/browse/JDK-8163796
>> [2] https://da.gd/4vXF
>>
>