Linux/PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

sangheon.kim at oracle.com sangheon.kim at oracle.com
Mon Feb 27 23:17:37 UTC 2017


Hi Gustavo,

I am not on the PPC64 mailing list, so I am replying late.

> 
>> After a due community review, could you sponsor that change?
Sure, I can sponsor this patch after the review. Please initiate the review against the JDK 10 codebase.

Thanks,
Sangheon


> On Feb 27, 2017, at 8:10 AM, David Holmes <david.holmes at oracle.com> wrote:
> 
> Hi Gustavo,
> 
> I am not a NUMA expert, but it seems to me that our NUMA support is both incomplete and bit-rotting. It seems evident that UseNUMA only works in the limited contexts that match our testing environment. There were a couple of JEPs proposed to enhance NUMA support back in 2012:
> 
> JDK-8046147    JEP 157: G1 GC: NUMA-Aware Allocation
> JDK-8046153    JEP 163: Enable NUMA Mode by Default When Appropriate
> 
> but they have not progressed. If they were to progress then it seems our overall approach to NUMA would need serious review and update - as per your patch.
> 
> I'm also unclear about the distinction between memory and non-memory nodes with respect to the existing os::Linux NUMA API. It isn't at all clear to me which functions should deal only with memory nodes and which should deal with any kind; e.g. I expect the cpu-to-node map to map a cpu to its node, not to the nearest node with memory configured. If the latter is what is needed, then the APIs need to be changed and their usage checked, as that distinction does not presently exist in the code AFAICS.
> 
> It is too late to take this patch into 9 IMHO as we don't have the ability to test it effectively, nor is there time for NUMA users to put it through its paces. I think this would have to be part of a bigger NUMA project for 10 that addresses the NUMA API and how it is used.
> 
> Thanks,
> David
> 
>> On 24/02/2017 10:02 PM, Gustavo Romero wrote:
>> Hi Sangheon,
>> 
>> Please find my comments inline.
>> 
>>> On 06-02-2017 20:23, sangheon wrote:
>>> Hi Gustavo,
>>> 
>>>> On 02/06/2017 01:50 PM, Gustavo Romero wrote:
>>>> Hi,
>>>> 
>>>> On Linux/PPC64 I'm getting a series of "mbind: Invalid argument" that seems
>>>>  exactly the same as reported for x64 [1]:
>>>> 
>>>> [root at spocfire3 ~]# java -XX:+UseNUMA -version
>>>> mbind: Invalid argument
>>>> mbind: Invalid argument
>>>> mbind: Invalid argument
>>>> mbind: Invalid argument
>>>> mbind: Invalid argument
>>>> mbind: Invalid argument
>>>> mbind: Invalid argument
>>>> openjdk version "1.8.0_121"
>>>> OpenJDK Runtime Environment (build 1.8.0_121-b13)
>>>> OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
>>>> 
>>>> [root at spocfire3 ~]# uname -a
>>>> Linux spocfire3.aus.stglabs.ibm.com 3.10.0-327.el7.ppc64le #1 SMP Thu Oct 29 17:31:13 EDT 2015 ppc64le ppc64le ppc64le GNU/Linux
>>>> 
>>>> [root at spocfire3 ~]# lscpu
>>>> Architecture:          ppc64le
>>>> Byte Order:            Little Endian
>>>> CPU(s):                160
>>>> On-line CPU(s) list:   0-159
>>>> Thread(s) per core:    8
>>>> Core(s) per socket:    10
>>>> Socket(s):             2
>>>> NUMA node(s):          2
>>>> Model:                 2.0 (pvr 004d 0200)
>>>> Model name:            POWER8 (raw), altivec supported
>>>> L1d cache:             64K
>>>> L1i cache:             32K
>>>> L2 cache:              512K
>>>> L3 cache:              8192K
>>>> NUMA node0 CPU(s):     0-79
>>>> NUMA node8 CPU(s):     80-159
>>>> 
>>>> On chasing it down, it looks like it comes from PSYoungGen::initialize() in
>>>> src/share/vm/gc_implementation/parallelScavenge/psYoungGen.cpp that calls
>>>> initialize_work(), that calls the MutableNUMASpace() constructor if
>>>> UseNUMA is set:
>>>> http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/567e410935e5/src/share/vm/gc_implementation/parallelScavenge/psYoungGen.cpp#l77
>>>> 
>>>> MutableNUMASpace() then calls os::numa_make_local(), that in the end calls
>>>> numa_set_bind_policy() in libnuma.so.1 [2].
>>>> 
>>>> I've traced some values for which mbind() syscall fails:
>>>> http://termbin.com/ztfs  (search for "Invalid argument" in the log).
>>>> 
>>>> Assuming it's the same bug as reported in [1] and so it's not fixed on 9 and 10:
>>>> 
>>>> - Is there any WIP or known workaround?
>>> There's no progress on JDK-8163796 and no workaround found yet.
>>> And unfortunately, I'm not planning to fix it soon.
>> 
>> Hive, a critical component of the Hadoop ecosystem, comes with a shell and uses Java
>> (with the UseNUMA flag) in the background to run MySQL-like queries. On PPC64 the
>> mbind() messages in question make the shell pretty cumbersome. For instance:
>> 
>> hive> show databases;
>> mbind: Invalid argument
>> mbind: Invalid argument
>> mbind: Invalid argument (message repeated 28 more times...)
>> ...
>> OK
>> mbind: Invalid argument
>> mbind: Invalid argument
>> mbind: Invalid argument
>> default
>> tpcds_bin_partitioned_orc_10
>> tpcds_text_10
>> Time taken: 1.036 seconds, Fetched: 3 row(s)
>> hive> mbind: Invalid argument
>> mbind: Invalid argument
>> mbind: Invalid argument
>> 
>> Also, on PPC64 a simple "java -XX:+UseParallelGC -XX:+UseNUMA -version" will
>> trigger the problem, without any additional flags. So I'd like to correct that
>> behavior (please see my next comment on that).
>> 
>> 
>>>> - Should I append this output to the description of [1], or open a new one and make it
>>>>   related to [1]?
>>> I think your problem is the same as JDK-8163796, so adding your output to that CR seems good.
>>> And please add logs as well. I recommend enabling something like "-Xlog:gc*,gc+heap*=trace".
>>> IIRC, in my case the problem only occurred when -Xmx was small.
>> 
>> The JVM code used to discover which NUMA nodes it can bind to assumes that node ids
>> are consecutive and tries to bind from 0 to numa_max_node() [1, 2, 3, 4], i.e. from
>> 0 to the highest node number available on the system. However, at least on PPC64
>> that assumption is not always true. For instance, consider the following NUMA
>> topology:
>> 
>> available: 4 nodes (0-1,16-17)
>> node 0 cpus: 0 8 16 24 32
>> node 0 size: 130706 MB
>> node 0 free: 145 MB
>> node 1 cpus: 40 48 56 64 72
>> node 1 size: 0 MB
>> node 1 free: 0 MB
>> node 16 cpus: 80 88 96 104 112
>> node 16 size: 130630 MB
>> node 16 free: 529 MB
>> node 17 cpus: 120 128 136 144 152
>> node 17 size: 0 MB
>> node 17 free: 0 MB
>> node distances:
>> node   0   1  16  17
>>  0:  10  20  40  40
>>  1:  20  10  40  40
>> 16:  40  40  10  20
>> 17:  40  40  20  10
>> 
>> In that case we have four nodes, two of them without memory (1 and 17), and the
>> highest node id is 17. Hence if the JVM tries to bind from 0 to 17, mbind() will
>> fail for every node except 0 and 16, which are configured and have memory. The
>> mbind() failures generate the "mbind: Invalid argument" messages.
>> 
>> A solution would be for os::numa_get_group_num() to use not numa_max_node() but
>> numa_num_configured_nodes(), which returns the total number of nodes with memory
>> in the system (so in the example above it returns exactly 2), and then to inspect
>> numa_all_nodes_ptr in os::numa_get_leaf_groups() to find the correct node ids to
>> append (in our case, ids[0] = 0 [node 0] and ids[1] = 16 [node 16]).
>> 
>> One consequence is that the "size" argument of os::numa_get_leaf_groups() will no
>> longer be required and will become unused, so I guess the interface will have to
>> be adapted on the other OSes besides Linux as well [5].
>> 
>> It would also be necessary to adapt os::Linux::rebuild_cpu_to_node_map(), since
>> not all NUMA nodes are suitable to be returned by a call to
>> os::numa_get_group_id(): some cpus sit in a node without memory. In that case we
>> can return the closest NUMA node with memory instead. A new way to translate
>> indices to node ids is also needed, since node ids are not always consecutive.
>> 
>> Finally, although "numa_nodes_ptr" is not documented in libnuma's manual, it is
>> what numactl uses to find out the total number of nodes in the system [6]. I
>> could not find a function that readily returns that number, so I asked on the
>> libnuma ML whether a better solution exists [7].
>> 
>> The following webrev implements the proposed changes on jdk9 (backport to 8 is
>> simple):
>> 
>> webrev: http://cr.openjdk.java.net/~gromero/8175813/
>> bug:    https://bugs.openjdk.java.net/browse/JDK-8175813
>> 
>> Here are the logs with "-Xlog:gc*,gc+heap*=trace":
>> 
>> http://cr.openjdk.java.net/~gromero/logs/pristine.log     (current state)
>> http://cr.openjdk.java.net/~gromero/logs/numa_patched.log (proposed change)
>> 
>> I've tested on 8 against SPECjvm2008 on the aforementioned machine and
>> performance improved by ~5% in comparison to the same version packaged by the
>> distro, but I don't expect any difference on machines where node ids are
>> consecutive and every node has memory.
>> 
>> After a due community review, could you sponsor that change?
>> 
>> Thank you.
>> 
>> 
>> Best regards,
>> Gustavo
>> 
>> [1] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/share/vm/gc/parallel/mutableNUMASpace.cpp#l241
>> [2] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/os/linux/vm/os_linux.cpp#l2745
>> [3] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/share/vm/gc/parallel/mutableNUMASpace.cpp#l243
>> [4] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/os/linux/vm/os_linux.cpp#l2761
>> [5] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/share/vm/runtime/os.hpp#l356
>> [6] https://github.com/numactl/numactl/blob/master/numactl.c#L251
>> [7] http://www.spinics.net/lists/linux-numa/msg01173.html
>> 
>>> 
>>> Thanks,
>>> Sangheon
>>> 
>>> 
>>>> 
>>>> Thank you.
>>>> 
>>>> 
>>>> Best regards,
>>>> Gustavo
>>>> 
>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8163796
>>>> [2] https://da.gd/4vXF
>>>> 
>>> 
>> 


More information about the ppc-aix-port-dev mailing list