[PATCH] JDK NUMA Interleaving issue

amith pawar amith.pawar at gmail.com
Thu Nov 29 11:08:14 UTC 2018


Hi Thomas,

Sorry for the late reply and please check my inline comments below.

On Thu, Nov 22, 2018 at 5:18 PM Thomas Schatzl <thomas.schatzl at oracle.com>
wrote:

> Hi Amith,
>
>   welcome to the OpenJDK community! :)
>
Thanks

>
>
On Fri, 2018-11-09 at 17:53 +0530, amith pawar wrote:
> > Hi all,
> >
> > The flag UseNUMA (or UseNUMAInterleaving), has to interleave old gen,
> > S1 and S2 region (if any other ) memory areas on requested Numa nodes
> > and it should not configure itself to access other Numa nodes. This
> > issue is observed only when Java is allowed to run on fewer NUMA
> > nodes than available on the system with numactl membind and
> > interleave options. Running on all the nodes does not have any
> > effect. This will cause some applications (objects residing in old
> > gen and survivor region) to run slower on system with large Numa
> > nodes.
> >
> [... long explanation...]
>
> Is it possible to summarize the problem and the changes with the
> following few lines:
>
> "NUMA interleaving of memory of old gen and survivor spaces (for
> parallel GC) tells the OS to interleave memory across all nodes of a
> NUMA system. However the VM process may be configured to be limited to
> run only on a few nodes, which means that large parts of the heap will
> be located on foreign nodes. This can incur a large performance
> penalty.
>
> The proposed solution is to tell the OS to interleave memory only
> across available nodes when enabling NUMA."
>
OK. Since I don't have write access to the defect link, can I request that
you do the same?

>
> We have had trouble understanding the problem statement and purpose of
> this patch when triaging (making sure the issue is understandable and
> can be worked on) as the text is rather technical. Having an easily
> understandable text also helps reviewers a lot.
>

> Assuming my summary is appropriate, I have several other unrelated
> questions:
>
> - could you propose a better subject for this work? "JDK NUMA
> Interleaving issue" seems very generic. Something like "NUMA heap
> allocation does not respect VM membind/interleave settings" maybe?
>
This is also OK.

>
> - there have been other NUMA related patches from AMD recently, in
> particular JDK-[...]: what is the relation to JDK-8205051? The text there
> reads awfully similar to this one, but I *think* JDK-8205051 is
> actually about making sure that the parallel gc eden is not put on
> inactive nodes.
> Can you confirm this (talk to your colleague) so that we can fix the
> description too?
>
JDK-8205051 is related to early GC.
JDK-8189922 is specific to the membind case, where the heap was allocated on
non-requested NUMA nodes.
The proposed patch fixes the following two issues (issue 2 is illustrated by
the standalone sketch below):
1. Similar to JDK-8189922, but for the interleave case.
2. Old gen, S1 and S2 memory interleaving is incorrect when run on fewer
NUMA nodes than are available on the system.
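
To make issue 2 concrete outside the VM, here is a minimal standalone libnuma
sketch (illustration only, not part of the attached patch; the mapping size is
an arbitrary placeholder). It contrasts interleaving a mapping across all
nodes with interleaving only across the nodes in the process's configured
interleave mask, which is what the patch switches the VM to:

// Standalone illustration, not HotSpot code. Build with: g++ demo.cpp -lnuma
#include <numa.h>
#include <sys/mman.h>
#include <cstdio>

int main() {
  if (numa_available() == -1) {
    std::fprintf(stderr, "libnuma is not available\n");
    return 1;
  }

  const size_t len = 64 * 1024 * 1024;   // arbitrary demo size
  void* mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) {
    return 1;
  }

  // Mask set by e.g. "numactl --interleave=0,2,3 java ..."; it is empty when
  // the process was not started in interleave mode.
  struct bitmask* interleave_mask = numa_get_interleave_mask();
  bool interleaved = false;
  for (int node = 0; node <= numa_max_node(); node++) {
    if (numa_bitmask_isbitset(interleave_mask, node)) {
      interleaved = true;
      break;
    }
  }

  if (interleaved) {
    // What the patch does: interleave only across the requested nodes.
    numa_interleave_memory(mem, len, interleave_mask);
  } else {
    // What the unpatched VM effectively does for old gen / survivor spaces:
    // interleave across all nodes, including nodes the process may not run on.
    numa_interleave_memory(mem, len, numa_all_nodes_ptr);
  }

  numa_bitmask_free(interleave_mask);
  munmap(mem, len);
  return 0;
}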


> - fyi, we are currently working on NUMA aware memory allocation support
> for G1 in JEP 345 (JDK-8210473). It may be useful to sync up a bit to
> not step on each other's toes (see below).
>
We are not working on anything related to G1. It may affect G1 if the
numa_make_global function is called.

>
> [... broken patch...]
>
> I tried to apply the patch to the jdk/jdk tree, which failed; I then
> started to manually apply it but stopped after not being able to find
> the context of some hunks. I do not think this change applies to the
> latest source tree.
>
> Please make sure that the patch applies to the latest jdk/jdk tree without
> errors. All changes generally must first go into the latest dev tree
> before you can apply for backporting.
>
> Could you please send the patch as attachment (not copy&pasted) to this
> list and cc me? Then I can create a webrev out of it.
>
> I did look a bit over the patch as much as I could (it's very hard
> trying to review a diff), some comments:
>
Sorry. Please check the attached patch.

>
>   - the _numa_interleave_memory_v2 function pointer is never assigned
> to in the patch in the CR, so it will not be used. Please make sure the
> patch is complete.
> Actually it is never defined anywhere, i.e. the patch most likely does not
> compile even if I could apply it somewhere.
>
> Please avoid frustrating reviewers by sending incomplete patches.
>
Sorry again. I faced the same issue when I copied the patch from the defect
link, but it worked for me from the mail. I have now attached the patch. Can
you please update the defect link with this patch?

>
>   - I am not sure set_numa_interleave, i.e. the interleaving, should be
> done without UseNUMAInterleaving enabled. Some collectors may do their
> own interleaving in the future (JDK-8210473 / JEP-345) so this would
> massively interfere in how they work. (That issue may be because I am
> looking at a potentially incomplete diff, so forgive me if the patch
> already handles this).
>
Testing SPECjbb with UseNUMAInterleaving enabled or disabled shows no effect
when Java is invoked externally in interleave mode. It helps in the membind
case.

>
>   - if some of the actions (interleaving/membind) fail, and it had been
> requested, it would be nice to print a log message.
>
The following two NUMA APIs are used, and they return nothing, so there is no
error value to check. Right now I am not sure which cases to handle; please
suggest (a possible pre-call check is sketched below).
void numa_tonode_memory(void *start, size_t size, int node);
void numa_interleave_memory(void *start, size_t size, struct bitmask *nodemask);
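
One possibility (just a suggestion, not in the attached patch) is to validate
and log the nodemask before these void-returning calls, since there is no
return value to check afterwards. A rough sketch, with a hypothetical helper
name:

// Rough sketch only; check_and_log_nodemask is a hypothetical helper, not an
// existing HotSpot or libnuma function. Build with -lnuma.
#include <numa.h>
#include <cstdio>

// Returns true if the mask contains at least one node, and prints the node
// list so the user can verify what was picked up from the command line.
static bool check_and_log_nodemask(struct bitmask* mask, const char* what) {
  if (mask == NULL) {
    std::fprintf(stderr, "%s: no nodemask configured\n", what);
    return false;
  }
  bool any = false;
  std::printf("%s: nodes", what);
  for (int node = 0; node <= numa_max_node(); node++) {
    if (numa_bitmask_isbitset(mask, node)) {
      std::printf(" %d", node);
      any = true;
    }
  }
  std::printf("\n");
  if (!any) {
    std::fprintf(stderr, "%s: nodemask is empty\n", what);
  }
  return any;
}

// Usage before the void-returning libnuma call:
//   if (check_and_log_nodemask(mask, "interleave")) {
//     numa_interleave_memory(start, size, mask);
//   }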

>
> Actually it would be nice to print information about e.g. the bitmask
> anyway in the log so that the user can easily verify that what he
> specified on the command line has been properly picked up by the VM -
> instead of looking through the proc filesystem.
>
As per the suggestion, the patch is updated to print the following information:
[0.001s][info][os] UseNUMA is enabled
[0.001s][info][os]   Java is configured to run in interleave mode
[0.001s][info][os]   Heap will be configured using NUMA memory nodes: 0, 2, 3

>
> Thanks,
>   Thomas
>
>
>
>
Thanks
Amit Pawar
-------------- next part --------------
diff -r d537553ed639 src/hotspot/os/linux/os_linux.cpp
--- a/src/hotspot/os/linux/os_linux.cpp	Wed Nov 28 22:29:35 2018 -0500
+++ b/src/hotspot/os/linux/os_linux.cpp	Thu Nov 29 16:56:15 2018 +0530
@@ -2724,6 +2724,8 @@
 }
 
 void os::numa_make_global(char *addr, size_t bytes) {
+  if (!UseNUMAInterleaving)
+    return;
   Linux::numa_interleave_memory(addr, bytes);
 }
 
@@ -2789,6 +2791,16 @@
       ids[i++] = node;
     }
   }
+
+  // If externally invoked in interleave mode then get node bitmasks from interleave mode pointer.
+  if (Linux::_numa_interleave_ptr != NULL) {
+    i = 0;
+    for (int node = 0; node <= highest_node_number; node++) {
+      if (Linux::_numa_bitmask_isbitset(Linux::_numa_interleave_ptr, node)) {
+        ids[i++] = node;
+      }
+    }
+  }
   return i;
 }
 
@@ -2888,11 +2900,18 @@
                                        libnuma_dlsym(handle, "numa_distance")));
       set_numa_get_membind(CAST_TO_FN_PTR(numa_get_membind_func_t,
                                           libnuma_v2_dlsym(handle, "numa_get_membind")));
+      set_numa_get_interleave_mask(CAST_TO_FN_PTR(numa_get_interleave_mask_func_t,
+                                          libnuma_v2_dlsym(handle, "numa_get_interleave_mask")));
 
       if (numa_available() != -1) {
+        struct bitmask *bmp;
         set_numa_all_nodes((unsigned long*)libnuma_dlsym(handle, "numa_all_nodes"));
         set_numa_all_nodes_ptr((struct bitmask **)libnuma_dlsym(handle, "numa_all_nodes_ptr"));
         set_numa_nodes_ptr((struct bitmask **)libnuma_dlsym(handle, "numa_nodes_ptr"));
+        bmp = _numa_get_interleave_mask();
+        set_numa_interleave_ptr(&bmp);
+        bmp = _numa_get_membind();
+        set_numa_membind_ptr(&bmp);
         // Create an index -> node mapping, since nodes are not always consecutive
         _nindex_to_node = new (ResourceObj::C_HEAP, mtInternal) GrowableArray<int>(0, true);
         rebuild_nindex_to_node_map();
@@ -3019,9 +3038,12 @@
 os::Linux::numa_bitmask_isbitset_func_t os::Linux::_numa_bitmask_isbitset;
 os::Linux::numa_distance_func_t os::Linux::_numa_distance;
 os::Linux::numa_get_membind_func_t os::Linux::_numa_get_membind;
+os::Linux::numa_get_interleave_mask_func_t os::Linux::_numa_get_interleave_mask;
 unsigned long* os::Linux::_numa_all_nodes;
 struct bitmask* os::Linux::_numa_all_nodes_ptr;
 struct bitmask* os::Linux::_numa_nodes_ptr;
+struct bitmask* os::Linux::_numa_interleave_ptr;
+struct bitmask* os::Linux::_numa_membind_ptr;
 
 bool os::pd_uncommit_memory(char* addr, size_t size) {
   uintptr_t res = (uintptr_t) ::mmap(addr, size, PROT_NONE,
@@ -5005,13 +5027,55 @@
     if (!Linux::libnuma_init()) {
       UseNUMA = false;
     } else {
-      if ((Linux::numa_max_node() < 1) || Linux::isbound_to_single_node()) {
-        // If there's only one node (they start from 0) or if the process
-        // is bound explicitly to a single node using membind, disable NUMA.
-        UseNUMA = false;
+
+      // Identify whether running in membind or interleave mode.
+      struct bitmask *bmp;
+      bool _is_membind = false;
+      bool _is_interleaved = false;
+      char _buf[BUFSIZ] = {'\0'};
+      char *_bufptr = _buf;
+
+      log_info(os)("UseNUMA is enabled");
+      // Check for membind mode.
+      bmp = Linux::_numa_membind_ptr;
+      for (int node = 0; node <= Linux::numa_max_node() ; node++) {
+        if (Linux::_numa_bitmask_isbitset(bmp, node)) {
+          _is_membind = true;
+        }
       }
+
+      // Check for interleave mode.
+      bmp = Linux::_numa_interleave_ptr;
+      for (int node = 0; node <= Linux::numa_max_node() ; node++) {
+        if (Linux::_numa_bitmask_isbitset(bmp, node)) {
+          _is_interleaved = true;
+          // Set membind to false as interleave mode allows all nodes to be used.
+          _is_membind = false;
+        }
+      }
+
+      if (_is_membind) {
+        bmp = Linux::_numa_membind_ptr;
+        Linux::set_numa_interleave_ptr (NULL);
+        log_info(os) ("  Java is configured to run in membind mode");
+      }
+
+      if (_is_interleaved) {
+        bmp = Linux::_numa_interleave_ptr;
+        Linux::set_numa_membind_ptr (NULL);
+        log_info(os) ("  Java is configured to run in interleave mode");
+      }
+
+      for (int node = 0; node <= Linux::numa_max_node() ; node++) {
+        if (Linux::_numa_bitmask_isbitset(bmp, node)) {
+          _bufptr += sprintf (_bufptr, "%d, ", node);
+        }
+      }
+      _bufptr[-2] = '\0';
+      log_info(os) ("  Heap will be configured using NUMA memory nodes: %s", _buf);
     }
 
+
     if (UseParallelGC && UseNUMA && UseLargePages && !can_commit_large_page_memory()) {
       // With SHM and HugeTLBFS large pages we cannot uncommit a page, so there's no way
       // we can make the adaptive lgrp chunk resizing work. If the user specified both
diff -r d537553ed639 src/hotspot/os/linux/os_linux.hpp
--- a/src/hotspot/os/linux/os_linux.hpp	Wed Nov 28 22:29:35 2018 -0500
+++ b/src/hotspot/os/linux/os_linux.hpp	Thu Nov 29 16:56:15 2018 +0530
@@ -222,6 +222,7 @@
   typedef void (*numa_interleave_memory_func_t)(void *start, size_t size, unsigned long *nodemask);
   typedef void (*numa_interleave_memory_v2_func_t)(void *start, size_t size, struct bitmask* mask);
   typedef struct bitmask* (*numa_get_membind_func_t)(void);
+  typedef struct bitmask* (*numa_get_interleave_mask_func_t)(void);
 
   typedef void (*numa_set_bind_policy_func_t)(int policy);
   typedef int (*numa_bitmask_isbitset_func_t)(struct bitmask *bmp, unsigned int n);
@@ -239,9 +240,12 @@
   static numa_bitmask_isbitset_func_t _numa_bitmask_isbitset;
   static numa_distance_func_t _numa_distance;
   static numa_get_membind_func_t _numa_get_membind;
+  static numa_get_interleave_mask_func_t _numa_get_interleave_mask;
   static unsigned long* _numa_all_nodes;
   static struct bitmask* _numa_all_nodes_ptr;
   static struct bitmask* _numa_nodes_ptr;
+  static struct bitmask* _numa_interleave_ptr;
+  static struct bitmask* _numa_membind_ptr;
 
   static void set_sched_getcpu(sched_getcpu_func_t func) { _sched_getcpu = func; }
   static void set_numa_node_to_cpus(numa_node_to_cpus_func_t func) { _numa_node_to_cpus = func; }
@@ -255,9 +259,12 @@
   static void set_numa_bitmask_isbitset(numa_bitmask_isbitset_func_t func) { _numa_bitmask_isbitset = func; }
   static void set_numa_distance(numa_distance_func_t func) { _numa_distance = func; }
   static void set_numa_get_membind(numa_get_membind_func_t func) { _numa_get_membind = func; }
+  static void set_numa_get_interleave_mask(numa_get_interleave_mask_func_t func) { _numa_get_interleave_mask = func; }
   static void set_numa_all_nodes(unsigned long* ptr) { _numa_all_nodes = ptr; }
   static void set_numa_all_nodes_ptr(struct bitmask **ptr) { _numa_all_nodes_ptr = (ptr == NULL ? NULL : *ptr); }
   static void set_numa_nodes_ptr(struct bitmask **ptr) { _numa_nodes_ptr = (ptr == NULL ? NULL : *ptr); }
+  static void set_numa_interleave_ptr(struct bitmask **ptr) { _numa_interleave_ptr = (ptr == NULL ? NULL : *ptr); }
+  static void set_numa_membind_ptr(struct bitmask **ptr) { _numa_membind_ptr = (ptr == NULL ? NULL : *ptr); }
   static int sched_getcpu_syscall(void);
  public:
   static int sched_getcpu()  { return _sched_getcpu != NULL ? _sched_getcpu() : -1; }
@@ -275,7 +282,11 @@
   static void numa_interleave_memory(void *start, size_t size) {
     // Use v2 api if available
     if (_numa_interleave_memory_v2 != NULL && _numa_all_nodes_ptr != NULL) {
-      _numa_interleave_memory_v2(start, size, _numa_all_nodes_ptr);
+      if (_numa_interleave_ptr != NULL)
+        // If externally invoked in interleave mode then use interleave bitmask.
+        _numa_interleave_memory_v2(start, size, _numa_interleave_ptr);
+      else
+        _numa_interleave_memory_v2(start, size, _numa_membind_ptr);
     } else if (_numa_interleave_memory != NULL && _numa_all_nodes != NULL) {
       _numa_interleave_memory(start, size, _numa_all_nodes);
     }

