[ping] Re: [11] RFR(M): 8189922: UseNUMA memory interleaving vs membind

Gustavo Romero gromero at linux.vnet.ibm.com
Wed Jul 18 22:46:02 UTC 2018

Hi Community,

In light of the additional information brought by Derek and Swati on this
issue (which I summarize below), would it be possible to get a second assessment
of its priority? Would it be sound to qualify it as P3 instead of P4? I have no
experience with this kind of assessment, so I am asking.

- If the JVM isn't allowed to use memory on all of the NUMA nodes, for instance
   because of numactl, cgroups, or a Docker container, then a significant fraction
   of the JVM heap will be unusable, causing early GC;

- The waste can be up to (N-1)/N of the young generation in some cases, where N
   is the total number of nodes available on the system (unpinned). So on an EPYC
   machine with 8 NUMA nodes, for instance, a JVM bound to a single node can waste
   up to 7/8 of its young generation;

- With the patch for this issue applied, SPECjbb2015 on an EPYC machine (8 NUMA
   nodes) shows a significant performance improvement:

   Case 1: Performance for 1 NUMA node bound, MultiJVM 1 Group Run (numactl --cpunodebind=0 --membind=0)
     Max-jOPS      : +27.59%
     Critical-jOPS : +260% (As base memory without patch is 1/8 of total available memory, heap size impacts Critical-jOPS)

   Case 2: Performance for 2 NUMA nodes bound, Composite Run (numactl --cpunodebind=0,7 --membind=0,7)
     Max-jOPS      : +10.35%
     Critical-jOPS : +9.89%

- It affects AARCH64, PPC64, and Intel/AMD.
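
To make the (N-1)/N figure concrete, here is a small back-of-the-envelope sketch
in Python. It is not part of the patch: the function name wasted_fraction is made
up for illustration, and it assumes eden is split evenly across all nodes while
only the bound nodes can actually back memory. The sizes are taken from the
Case 1 "Before Patch" gc.log quoted further down in this thread.

```python
def wasted_fraction(total_nodes: int, bound_nodes: int) -> float:
    """Fraction of an evenly split eden that sits on nodes the JVM cannot use."""
    if not 0 < bound_nodes <= total_nodes:
        raise ValueError("bound_nodes must be in 1..total_nodes")
    return (total_nodes - bound_nodes) / total_nodes


# Sizes from the Case 1 "Before Patch" gc.log (8 lgrps, bound to node 0 only).
eden_kb = 22511616
lgrp_kb = 2813952
total_nodes = 8

print(wasted_fraction(total_nodes, 1))  # 7/8 of eden is unusable
print(lgrp_kb / eden_kb)                # share of eden that lgrp 0 provides
```

Under these assumptions the single bound lgrp covers exactly 1/8 of eden, which
is consistent with the "1/8 of total available memory" remark in Case 1.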

Thank you.

Best regards,
Gustavo

On 07/18/2018 03:32 AM, Swati Sharma wrote:
> Hi All,
> We have observed a significant performance improvement with the patch on SPECjbb2015.
> We tested SPECjbb2015 with the patch on EPYC (8 NUMA nodes configuration).
> Case 1: Performance on 1 NUMA node, MultiJVM 1 Group Run (numactl --cpunodebind=0 --membind=0)
>     Max-jOPS      : +27.59%
>     Critical-jOPS : +260% (as base memory without the patch is 1/8 of the total, heap size impacts Critical-jOPS)
> Case 2: Performance on 2 NUMA nodes, Composite Run (numactl --cpunodebind=0,7 --membind=0,7)
>     Max-jOPS      : +10.35%
>     Critical-jOPS : +9.89%
> If more information is required for different configurations on EPYC, please let me know.
> Thanks,
> Swati Sharma
> Software Engineer -2 at AMD
> On Thu, Jul 12, 2018 at 3:36 AM, Gustavo Romero <gromero at linux.vnet.ibm.com> wrote:
>     Hi,
>     I concur with Derek about the impact on the GC.
>     I can't comment about the impact on the container context.
>     I can try to take additional performance numbers on that, if that helps to
>     define the priority (is there any available?).
>     Prakash, Swati,
>     It's not necessary to stop improving the JVM NUMA support. jdk/jdk (JDK 12)
>     repository is open and the next changes can go there normally, i.e. even if
>     8189922 is not able to get into JDK 11 (jdk/jdk11), it can go into jdk/jdk and
>     so we move on. Also I don't think it's possible to get the additional changes
>     into JDK 11 anyway.
>     On the other hand, since 11 is LTS, I _think_ that OpenJDK 11 LTS will be open
>     for updates after the GA and so it's possible to request the backports, as
>     Dalibor [1] says: "[After September 2018] Oracle plans to initiate and
>     contribute to the development JDK 11 Updates with this OpenJDK Project in due
>     course."
>     Best regards,
>     Gustavo
>     [1] http://mail.openjdk.java.net/pipermail/jdk-updates-dev/2018-May/000128.html
>     On 07/11/2018 01:46 AM, Raghavendra, Prakash wrote:
>         Apart from the impact on EPYC (with large memory), we have also identified a few fixes on top of this, which are waiting for this one to go in. The most essential one covers the case where GC is triggered prematurely because one of the lgrps is full; the fix is to try to allocate on the closest other lgrp (in terms of latency) instead.
>         Thanks
>         Regards
>         Prakash
>         -----Original Message-----
>         From: White, Derek [mailto:Derek.White at cavium.com]
>         Sent: Wednesday, July 11, 2018 5:25 AM
>         To: David Holmes <david.holmes at oracle.com>; Gustavo Romero <gromero at linux.vnet.ibm.com>; Swati Sharma <swatibits14 at gmail.com>; Alan.Bateman at oracle.com
>         Cc: Vishwanath, Prasad <Prasad.Vishwanath at amd.com>; hotspot-dev at openjdk.java.net; Raghavendra, Prakash <Prakash.Raghavendra at amd.com>
>         Subject: RE: [ping] Re: [11] RFR(M): 8189922: UseNUMA memory interleaving vs membind
>         Hi David,
>         [Gustavo and others, please correct me if I'm wrong].
>         I think as far as impact goes, this may waste up to (N-1)/N of the young generation in some cases, where N is number of nodes. For example, on 2-socket Epyc, this could waste 7/8ths of the young gen. This has non-trivial impact on GC behavior.
>         This occurs when there's a mismatch between memory binding and JVM arguments (or rather can be avoided by carefully setting JVM arguments), and I could see this happening easily when using containers.
>         - Derek
>             -----Original Message-----
>             From: hotspot-dev [mailto:hotspot-dev-bounces at openjdk.java.net] On
>             Behalf Of David Holmes
>             Sent: Tuesday, July 10, 2018 5:39 PM
>             To: Gustavo Romero <gromero at linux.vnet.ibm.com>; Swati Sharma
>             <swatibits14 at gmail.com>; Alan.Bateman at oracle.com
>             Cc: Prasad.Vishwanath at amd.com; hotspot-dev at openjdk.java.net;
>             Prakash.Raghavendra at amd.com
>             Subject: Re: [ping] Re: [11] RFR(M): 8189922: UseNUMA memory
>             interleaving vs membind
>             Hi Gustavo,
>             On 11/07/2018 6:14 AM, Gustavo Romero wrote:
>                 Hi Swati,
>                 As David pointed out, it's necessary to determine if that bug
>                 qualifies as P3 in order to get it into JDK 11 RDP1.
>                 AFAICS, that bug was never triaged explicitly and got its current
>                 priority (P4) from the default.
>             Actually no, the P4 was from the (Oracle internal) ILW prioritization scheme.
>             For this to be a P3 it needs to be shown either that the impact is
>             quite significant (IIUC it's only a mild performance issue based on
>             the bug report); or that the likelihood of this being encountered is
>             very high (again it seems not that likely based on the info in the bug report).
>             HTH.
>             David
>             -----
>                 Once the correct integration version is defined, I can sponsor
>                 that change for you. I think there won't be any updates for JDK 11
>                 (contrary to what happened for JDK 10), but I think we can
>                 understand how distros are handling it and so find out if there is a
>                 possibility to get the change into the distros once it's pushed to JDK 12.
>                 David, Alan,
>                 I could not find any documentation on how to formally triage a bug.
>                 For instance, on [1] I see Alan used some markers such as "ILW =" and "MLH ="
>                 but I don't know if these markers are only for Oracle internal
>                 control. Do you know how I could triage that bug? I understand its
>                 risk of integration is small, but even so I think it's necessary to
>                 bring up additional information on that to arrive at a final bug
>                 priority.
>                 Thanks.
>                 Best regards,
>                 Gustavo
>                 [1] https://bugs.openjdk.java.net/browse/JDK-8206953
>                 On 07/03/2018 03:06 AM, David Holmes wrote:
>                     Looks fine.
>                     Thanks,
>                     David
>                     On 3/07/2018 3:08 PM, Swati Sharma wrote:
>                         Hi David,
>                         I have added NULL check for _numa_bitmask_isbitset in
>                         isbound_to_single_node() method.
>                         Hosted: http://cr.openjdk.java.net/~gromero/8189922/v2/
>                         Swati
>                         On Mon, Jul 2, 2018 at 5:54 AM, David Holmes
>                         <david.holmes at oracle.com> wrote:
>                               Hi Swati,
>                               I took a look at this though I'm not familiar with the functional
>                               operation of the NUMA API's - I'm relying on Gustavo and Derek to
>                               spot any actual usage errors there.
>                               In isbound_to_single_node() there is no NULL check for
>                               _numa_bitmask_isbitset (which seems to be the normal pattern for
>                               using all of these function pointers).
>                               Otherwise this seems fine.
>                               Thanks,
>                               David
>                               On 30/06/2018 2:46 AM, Swati Sharma wrote:
>                                   Hi,
>                                   Could I get a review for this change that affects the JVM when
>                                   there are pinned memory nodes please?
>                                   It's already reviewed and tested on PPC64 and on AARCH64 by
>                                   Gustavo and Derek, however both are not Reviewers so I need
>                                   additional reviews for that change.
>                                   Thanks in advance.
>                                   Swati
>                                   On Tue, Jun 19, 2018 at 5:58 PM, Swati Sharma
>                                   <swatibits14 at gmail.com> wrote:
>                                       Hi All,
>                                       Here is the numa information of the system :
>                                       swati at java-diesel1:~$ numactl -H
>                                       available: 8 nodes (0-7)
>                                       node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
>                                       node 0 size: 64386 MB
>                                       node 0 free: 64134 MB
>                                       node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
>                                       node 1 size: 64509 MB
>                                       node 1 free: 64232 MB
>                                       node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
>                                       node 2 size: 64509 MB
>                                       node 2 free: 64215 MB
>                                       node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
>                                       node 3 size: 64509 MB
>                                       node 3 free: 64157 MB
>                                       node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
>                                       node 4 size: 64509 MB
>                                       node 4 free: 64336 MB
>                                       node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
>                                       node 5 size: 64509 MB
>                                       node 5 free: 64352 MB
>                                       node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
>                                       node 6 size: 64509 MB
>                                       node 6 free: 64359 MB
>                                       node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
>                                       node 7 size: 64508 MB
>                                       node 7 free: 64350 MB
>                                       node distances:
>                                       node   0   1   2   3   4   5   6   7
>                                           0:  10  16  16  16  32  32  32  32
>                                           1:  16  10  16  16  32  32  32  32
>                                           2:  16  16  10  16  32  32  32  32
>                                           3:  16  16  16  10  32  32  32  32
>                                           4:  32  32  32  32  10  16  16  16
>                                           5:  32  32  32  32  16  10  16  16
>                                           6:  32  32  32  32  16  16  10  16
>                                           7:  32  32  32  32  16  16  16  10
>                                       Thanks,
>                                       Swati
>                                       On Tue, Jun 19, 2018 at 12:00 AM, Gustavo Romero
>                                       <gromero at linux.vnet.ibm.com> wrote:
>                                           Hi Swati,
>                                           On 06/16/2018 02:52 PM, Swati Sharma wrote:
>                                               Hi All,
>                                               This is my first patch; I would appreciate it if
>                                               anyone could review the fix:
>                                               Bug: https://bugs.openjdk.java.net/browse/JDK-8189922
>                                               Webrev: http://cr.openjdk.java.net/~gromero/8189922/v1
>                                               The bug is about the JVM flag UseNUMA, which bypasses
>                                               the user-specified numactl --membind option and
>                                               divides the whole heap into lgrps according to the
>                                               available NUMA nodes.
>                                               The proposed solution is to disable UseNUMA if the
>                                               JVM is bound to a single NUMA node. If it is bound to
>                                               more than one NUMA node, the lgrps are created
>                                               according to the bound nodes. If there is no binding,
>                                               the JVM divides the whole heap based on the number of
>                                               NUMA nodes available on the system.
>                                               I appreciate Gustavo's help in fixing the thread
>                                               allocation based on NUMA distance for membind, which
>                                               was a dangling issue associated with the main patch.
>                                           Thanks. I have no further comments on it. LGTM.
>                                           Best regards,
>                                           Gustavo
>                                           PS: Please provide numactl -H information when
>                                           possible. It helps to promptly grasp the actual
>                                           NUMA topology in question :)
>                                               Tested the fix by running the specjbb2015 composite
>                                               workload on an 8 NUMA node system.
>                                               Case 1 : Single NUMA node bind
>                                               numactl --cpunodebind=0 --membind=0 java -Xmx24g
>                                               -Xms24g -Xmn22g
>                                               -XX:+UseNUMA
>                                               -Xlog:gc*=debug:file=gc.log:time,uptimemillis
>                                               <composite_application>
>                                               Before Patch: gc.log
>                                               eden space 22511616K(22GB), 12% used
>                                                      lgrp 0 space 2813952K, 100% used
>                                                      lgrp 1 space 2813952K, 0% used
>                                                      lgrp 2 space 2813952K, 0% used
>                                                      lgrp 3 space 2813952K, 0% used
>                                                      lgrp 4 space 2813952K, 0% used
>                                                      lgrp 5 space 2813952K, 0% used
>                                                      lgrp 6 space 2813952K, 0% used
>                                                      lgrp 7 space 2813952K, 0% used
>                                               After Patch : gc.log
>                                               eden space 46718976K(45GB), 99% used(NUMA disabled)
>                                               Case 2 : Multiple NUMA node bind
>                                               numactl --cpunodebind=0,7 --membind=0,7 java -Xms50g
>                                               -Xmx50g -Xmn45g
>                                               -XX:+UseNUMA
>                                               -Xlog:gc*=debug:file=gc.log:time,uptimemillis
>                                               <composite_application>
>                                               Before Patch :gc.log
>                                               eden space 46718976K, 6% used
>                                                      lgrp 0 space 5838848K, 14% used
>                                                      lgrp 1 space 5838848K, 0% used
>                                                      lgrp 2 space 5838848K, 0% used
>                                                      lgrp 3 space 5838848K, 0% used
>                                                      lgrp 4 space 5838848K, 0% used
>                                                      lgrp 5 space 5838848K, 0% used
>                                                      lgrp 6 space 5838848K, 0% used
>                                                      lgrp 7 space 5847040K, 35% used
>                                               After Patch : gc.log
>                                               eden space 46718976K(45GB), 99% used
>                                                       lgrp 0 space 23359488K(23.5GB), 100% used
>                                                       lgrp 7 space 23359488K(23.5GB), 99% used
>                                               Note: The proposed solution is only for the numactl
>                                               membind option. The fix is not for --cpunodebind and
>                                               localalloc, which is a separate bug
>                                               (https://bugs.openjdk.java.net/browse/JDK-8205051);
>                                               a fix for it is in progress.
>                                               Thanks,
>                                               Swati Sharma
>                                               Software Engineer -2 at AMD
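
For reference, the policy described in the original proposal above (disable
UseNUMA when bound to a single node, otherwise create lgrps only for the bound
nodes) can be sketched roughly as below. This is an illustrative Python model,
not the HotSpot code: plan_lgrps and split_eden are hypothetical names, and the
even split ignores the alignment the real collector applies.

```python
def plan_lgrps(system_nodes, membind_nodes):
    """Return (use_numa, lgrp_nodes) for a given memory binding."""
    bound = sorted(set(system_nodes) & set(membind_nodes))
    if not bound:
        raise ValueError("process is not allowed to allocate on any node")
    if len(bound) == 1:
        # Bound to a single node: NUMA-aware allocation is disabled.
        return False, []
    # Bound to several nodes (or unbound, i.e. all nodes): one lgrp per usable node.
    return True, bound


def split_eden(eden_kb, lgrp_nodes):
    """Evenly divide eden among the lgrps (alignment ignored in this sketch)."""
    share = eden_kb // len(lgrp_nodes)
    return {node: share for node in lgrp_nodes}


all_nodes = range(8)  # the 8-node system from the numactl -H output above

print(plan_lgrps(all_nodes, [0]))               # Case 1: single-node bind
use_numa, lgrps = plan_lgrps(all_nodes, [0, 7])  # Case 2: two-node bind
print(use_numa, split_eden(46718976, lgrps))    # eden size from the Case 2 gc.log
```

With the Case 2 eden size, the even split gives 23359488K per lgrp, which
matches the "After Patch" log for lgrp 0 and lgrp 7.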
