RFR (S/M): 8213827: NUMA heap allocation does not respect process membind/interleave settings [Was: Re: [PATCH] JDK NUMA Interleaving issue]

amith pawar amith.pawar at gmail.com
Fri Dec 7 23:20:58 UTC 2018


Hi Thomas,

Please check my inline comments; the attached patch has been updated as per your
suggestions.

On Mon, Dec 3, 2018 at 6:12 PM Thomas Schatzl <thomas.schatzl at oracle.com>
wrote:

> Hi,
>
> On Fri, 2018-11-30 at 11:11 +0100, Thomas Schatzl wrote:
> > Hi Amit,
> >
> > [...]
> > >
> > > Sorry. Please check the attached patch.
> >
> > I created a webrev at
> > http://cr.openjdk.java.net/~tschatzl/8213827/webrev/ .
> >
> > Let the reviews begin :)
> >
>
> style issues:
>
>   - random whitespaces in many places.
>   - all if's or if-then-else require braces
>   - only class members have a leading underscore
>   - do not declare local variables long before they are used. That
> makes the code hard to read.
>   - do not introduce local variables for no reason. E.g.
>
> 2907         struct bitmask *bmp;
>
> [...unrelated code...]
>
> 2911         bmp = _numa_get_interleave_mask();
> 2912         set_numa_interleave_ptr(&bmp);
> 2913         bmp = _numa_get_membind();
> 2914         set_numa_membind_ptr(&bmp);
>
> is much less readable than
>
> set_numa_interleave_ptr(_numa_get_interleave_mask());
> set_numa_membind_ptr(_numa_get_membind_mask());
>
>   - in the type declaration, please put the * next to the type. While
> this is kind-of wrong, the majority of existing code does so (not only
> in these files).
>
> code issues:
>
>   - interleave/membind mode are mutually exclusive, which the code
> kind-of tries to establish (and check) using special values for eg. the
> value of Linux::_numa_interleave_ptr.
>
> I think the code gets a lot more readable if you introduce an enum and
> an explicit global variable holding the current NUMA node. If you do
> not want to do that, move the condition "Linux::_numa_interleave_ptr !=
> 0" into a new method.
>
>  - move the new NUMA initialization code in os::init_2() into a new
> method please. It's gotten very long now.
>
>  - please avoid duplicating code; ie.
>
> 5039       // Check for membind mode.
> 5040       bmp = Linux::_numa_membind_ptr;
> 5041       for (int node = 0; node <= Linux::numa_max_node() ; node++)
> {
> 5042         if (Linux::_numa_bitmask_isbitset(bmp, node)) {
> 5043           _is_membind = true;
> 5044         }
> 5045       }
>
> 5048       bmp = Linux::_numa_interleave_ptr;
> 5049       for (int node = 0; node <= Linux::numa_max_node() ; node++)
> {
> 5050         if (Linux::_numa_bitmask_isbitset(bmp, node)) {
> 5051           _is_interleaved = true;
> 5052           // Set membind to false as interleave mode allows all
> nodes to be used.
> 5053           _is_membind = false;
> 5054         }
> 5055       }
>
> and right after:
>
> Something like:
>
> is_interleave = is_any_bit_in_bitmask_set(Linux::_numa_interleave_ptr);
> is_membind = !is_interleave &&
> is_any_bit_in_bitmask_set(Linux::_numa_membind_ptr);
>
> with an appropriate is_any_bit_in_bitmask_set() helper method is much
> more clear.
>
> (Even better, as suggested above use a global "NUMA mode" enum that is
> filled in somewhere instead of multiple flags)
>
>   - Use a LogStream to print the log messages instead of rolling your
> own code to create a string containing the node number. The code that
> is there exhibits a potential buffer overflow (with many NUMA nodes),
> and a potential  out-of-bounds write.
>
>   - the log message for selected NUMA mode and nodes should be a single
> log message.
>
>   - partially pre-existing: the numa_set_xxx_ptr methods take a
> "struct bitmask**". Could you explain why "**"? All of those store the
> dereferenced result anyway, so why not pass the dereferenced pointer
> right away?
>
The following bitmask pointers are defined as exported variables in the libnuma
library. 'dlvsym' therefore returns the address of the pointer variable itself,
which needs to be de-referenced once. The other methods have been updated as
suggested.
struct bitmask *numa_all_nodes_ptr;
struct bitmask *numa_no_nodes_ptr;
struct bitmask *numa_all_cpus_ptr;

>
> This also simplifies the code at the callers.
>
>   - partially pre-existing: I would prefer if the numa_set_xx_ptr
> methods were all renamed to numa_set_xx_bitmask. The "ptr" suffix
> seems redundant, and most importantly very unspecific.
>
The newly introduced membind and interleave bitmask pointer names have been
changed. The three pointers mentioned above are also imported; if you prefer,
I will rename those as well.

>
>   - in numa_interleave_memory(), why does the code not make the
> membind/interleave mode distinction when the v1 API is used?
>
The v2 APIs work as expected for the membind/interleave cases, but the
numa_get_membind_v1 API always returns null. The v1 and v2 prototypes also
differ. Please check below.
nodemask_t numa_get_membind_v1(void)
struct bitmask * numa_get_membind_v2(void)

void numa_interleave_memory_v1(void *mem, size_t size, const nodemask_t *mask)
void numa_interleave_memory_v2(void *mem, size_t size, struct bitmask *bmp)

>
> That method could probably cleaned up a bit, i.e. extracting common
> code.
>
>   - Pre-existing: numa_interleave_memory() could be cleaned up; why
> do we have two pointers to _numa_interleave_memory_v2 and
> _numa_interleave_memory around if their parameters are the same and
> we only ever use one of them?
>
As explained above, the API prototypes are not the same, so both pointers are required.

>
> That might be an additional cleanup that needs to be done.
>
>   - please split out the change that disables UseNUMA if there is
> a single node only. It seems relatively unrelated, and so should imho
> be looked at (evaluated) separately.
>
> While reviewing I already fixed a few of the above issues, see
>
> http://cr.openjdk.java.net/~tschatzl/8213827/webrev.0_to_1 (diff)
> http://cr.openjdk.java.net/~tschatzl/8213827/webrev.1 (full)
>
> webrevs. You can use it as starting point.
>
> Thanks,
>   Thomas
>
>

-- 
With best regards,
amit pawar
-------------- next part --------------
diff -r d537553ed639 src/hotspot/os/linux/os_linux.cpp
--- a/src/hotspot/os/linux/os_linux.cpp	Wed Nov 28 22:29:35 2018 -0500
+++ b/src/hotspot/os/linux/os_linux.cpp	Sat Dec 08 05:20:06 2018 +0530
@@ -33,6 +33,7 @@
 #include "compiler/disassembler.hpp"
 #include "interpreter/interpreter.hpp"
 #include "logging/log.hpp"
+#include "logging/logStream.hpp"
 #include "memory/allocation.inline.hpp"
 #include "memory/filemap.hpp"
 #include "oops/oop.inline.hpp"
@@ -2888,11 +2889,15 @@
                                        libnuma_dlsym(handle, "numa_distance")));
       set_numa_get_membind(CAST_TO_FN_PTR(numa_get_membind_func_t,
                                           libnuma_v2_dlsym(handle, "numa_get_membind")));
+      set_numa_get_interleave_mask(CAST_TO_FN_PTR(numa_get_interleave_mask_func_t,
+                                          libnuma_v2_dlsym(handle, "numa_get_interleave_mask")));
 
       if (numa_available() != -1) {
         set_numa_all_nodes((unsigned long*)libnuma_dlsym(handle, "numa_all_nodes"));
         set_numa_all_nodes_ptr((struct bitmask **)libnuma_dlsym(handle, "numa_all_nodes_ptr"));
         set_numa_nodes_ptr((struct bitmask **)libnuma_dlsym(handle, "numa_nodes_ptr"));
+        set_numa_interleave_bitmask(_numa_get_interleave_mask());
+        set_numa_membind_bitmask(_numa_get_membind());
         // Create an index -> node mapping, since nodes are not always consecutive
         _nindex_to_node = new (ResourceObj::C_HEAP, mtInternal) GrowableArray<int>(0, true);
         rebuild_nindex_to_node_map();
@@ -3019,9 +3024,13 @@
 os::Linux::numa_bitmask_isbitset_func_t os::Linux::_numa_bitmask_isbitset;
 os::Linux::numa_distance_func_t os::Linux::_numa_distance;
 os::Linux::numa_get_membind_func_t os::Linux::_numa_get_membind;
+os::Linux::numa_get_interleave_mask_func_t os::Linux::_numa_get_interleave_mask;
+os::Linux::Numa_allocation_policy os::Linux::_current_numa_policy;
 unsigned long* os::Linux::_numa_all_nodes;
 struct bitmask* os::Linux::_numa_all_nodes_ptr;
 struct bitmask* os::Linux::_numa_nodes_ptr;
+struct bitmask* os::Linux::_numa_interleave_bitmask;
+struct bitmask* os::Linux::_numa_membind_bitmask;
 
 bool os::pd_uncommit_memory(char* addr, size_t size) {
   uintptr_t res = (uintptr_t) ::mmap(addr, size, PROT_NONE,
@@ -4958,6 +4967,75 @@
   OSContainer::init();
 }
 
+void os::Linux::numa_init() {
+
+  // Java can be invoked as follows:
+  // 1. Without numactl: the heap is allocated/configured on all nodes as
+  //    per the system policy.
+  // 2. With numactl --interleave:
+  //    Use the numa_get_interleave_mask (v2) API to get the nodes bitmask.
+  //    In the membind case the same API returns an empty (reset) bitmask.
+  //    Interleave is only a hint, and the kernel can fall back to other
+  //    nodes if no memory is available on the target nodes.
+  // 3. With numactl --membind:
+  //    Use the numa_get_membind (v2) API to get the nodes bitmask. In the
+  //    interleave case the same API returns a bitmask of all nodes.
+  // numa_all_nodes_ptr holds the bitmask of all nodes.
+  // numa_get_interleave_mask (v2) and numa_get_membind (v2) return the
+  // correct bitmask when externally configured to run on all or fewer nodes.
+
+  if (!Linux::libnuma_init()) {
+    UseNUMA = false;
+  } else {
+    if ((Linux::numa_max_node() < 1) || Linux::isbound_to_single_node()) {
+      // If there's only one node (they start from 0) or if the process
+      // is bound explicitly to a single node using membind, disable NUMA.
+      UseNUMA = false;
+    } else {
+
+      LogTarget(Info, os) log;
+      LogStream ls(log);
+      outputStream* st = &ls;
+
+      Linux::set_configured_numa_policy(Linux::identify_numa_policy());
+
+      struct bitmask* bmp = Linux::_numa_membind_bitmask;
+      const char* numa_mode = "membind";
+
+      if (Linux::isrunning_in_interleave_mode()) {
+        bmp = Linux::_numa_interleave_bitmask;
+        numa_mode = "interleave";
+      }
+
+      st->print("UseNUMA is enabled and invoked in '%s' mode."
+                " Heap will be configured using NUMA memory nodes:", numa_mode);
+
+      for (int node = 0; node <= Linux::numa_max_node(); node++) {
+        if (Linux::_numa_bitmask_isbitset(bmp, node)) {
+          st->print(" %d", node);
+        }
+      }
+    }
+  }
+
+  if (UseParallelGC && UseNUMA && UseLargePages && !can_commit_large_page_memory()) {
+    // With SHM and HugeTLBFS large pages we cannot uncommit a page, so there's no way
+    // we can make the adaptive lgrp chunk resizing work. If the user specified both
+    // UseNUMA and UseLargePages (or UseSHM/UseHugeTLBFS) on the command line - warn
+    // and disable adaptive resizing.
+    if (UseAdaptiveSizePolicy || UseAdaptiveNUMAChunkSizing) {
+      warning("UseNUMA is not fully compatible with SHM/HugeTLBFS large pages, "
+              "disabling adaptive resizing (-XX:-UseAdaptiveSizePolicy -XX:-UseAdaptiveNUMAChunkSizing)");
+      UseAdaptiveSizePolicy = false;
+      UseAdaptiveNUMAChunkSizing = false;
+    }
+  }
+
+  if (!UseNUMA && ForceNUMA) {
+    UseNUMA = true;
+  }
+}
+
 // this is called _after_ the global arguments have been parsed
 jint os::init_2(void) {
 
@@ -5002,32 +5080,7 @@
                Linux::glibc_version(), Linux::libpthread_version());
 
   if (UseNUMA) {
-    if (!Linux::libnuma_init()) {
-      UseNUMA = false;
-    } else {
-      if ((Linux::numa_max_node() < 1) || Linux::isbound_to_single_node()) {
-        // If there's only one node (they start from 0) or if the process
-        // is bound explicitly to a single node using membind, disable NUMA.
-        UseNUMA = false;
-      }
-    }
-
-    if (UseParallelGC && UseNUMA && UseLargePages && !can_commit_large_page_memory()) {
-      // With SHM and HugeTLBFS large pages we cannot uncommit a page, so there's no way
-      // we can make the adaptive lgrp chunk resizing work. If the user specified both
-      // UseNUMA and UseLargePages (or UseSHM/UseHugeTLBFS) on the command line - warn
-      // and disable adaptive resizing.
-      if (UseAdaptiveSizePolicy || UseAdaptiveNUMAChunkSizing) {
-        warning("UseNUMA is not fully compatible with SHM/HugeTLBFS large pages, "
-                "disabling adaptive resizing (-XX:-UseAdaptiveSizePolicy -XX:-UseAdaptiveNUMAChunkSizing)");
-        UseAdaptiveSizePolicy = false;
-        UseAdaptiveNUMAChunkSizing = false;
-      }
-    }
-
-    if (!UseNUMA && ForceNUMA) {
-      UseNUMA = true;
-    }
+    Linux::numa_init();
   }
 
   if (MaxFDLimit) {
diff -r d537553ed639 src/hotspot/os/linux/os_linux.hpp
--- a/src/hotspot/os/linux/os_linux.hpp	Wed Nov 28 22:29:35 2018 -0500
+++ b/src/hotspot/os/linux/os_linux.hpp	Sat Dec 08 05:20:06 2018 +0530
@@ -211,6 +211,7 @@
   // none present
 
  private:
+  static void numa_init();
   static void expand_stack_to(address bottom);
 
   typedef int (*sched_getcpu_func_t)(void);
@@ -222,6 +223,7 @@
   typedef void (*numa_interleave_memory_func_t)(void *start, size_t size, unsigned long *nodemask);
   typedef void (*numa_interleave_memory_v2_func_t)(void *start, size_t size, struct bitmask* mask);
   typedef struct bitmask* (*numa_get_membind_func_t)(void);
+  typedef struct bitmask* (*numa_get_interleave_mask_func_t)(void);
 
   typedef void (*numa_set_bind_policy_func_t)(int policy);
   typedef int (*numa_bitmask_isbitset_func_t)(struct bitmask *bmp, unsigned int n);
@@ -239,9 +241,12 @@
   static numa_bitmask_isbitset_func_t _numa_bitmask_isbitset;
   static numa_distance_func_t _numa_distance;
   static numa_get_membind_func_t _numa_get_membind;
+  static numa_get_interleave_mask_func_t _numa_get_interleave_mask;
   static unsigned long* _numa_all_nodes;
   static struct bitmask* _numa_all_nodes_ptr;
   static struct bitmask* _numa_nodes_ptr;
+  static struct bitmask* _numa_interleave_bitmask;
+  static struct bitmask* _numa_membind_bitmask;
 
   static void set_sched_getcpu(sched_getcpu_func_t func) { _sched_getcpu = func; }
   static void set_numa_node_to_cpus(numa_node_to_cpus_func_t func) { _numa_node_to_cpus = func; }
@@ -255,10 +260,21 @@
   static void set_numa_bitmask_isbitset(numa_bitmask_isbitset_func_t func) { _numa_bitmask_isbitset = func; }
   static void set_numa_distance(numa_distance_func_t func) { _numa_distance = func; }
   static void set_numa_get_membind(numa_get_membind_func_t func) { _numa_get_membind = func; }
+  static void set_numa_get_interleave_mask(numa_get_interleave_mask_func_t func) { _numa_get_interleave_mask = func; }
   static void set_numa_all_nodes(unsigned long* ptr) { _numa_all_nodes = ptr; }
   static void set_numa_all_nodes_ptr(struct bitmask **ptr) { _numa_all_nodes_ptr = (ptr == NULL ? NULL : *ptr); }
   static void set_numa_nodes_ptr(struct bitmask **ptr) { _numa_nodes_ptr = (ptr == NULL ? NULL : *ptr); }
+  static void set_numa_interleave_bitmask(struct bitmask* ptr) { _numa_interleave_bitmask = ptr; }
+  static void set_numa_membind_bitmask(struct bitmask* ptr)    { _numa_membind_bitmask = ptr; }
   static int sched_getcpu_syscall(void);
+
+  enum Numa_allocation_policy {
+    not_initialized,
+    membind,
+    interleave
+  };
+  static Numa_allocation_policy _current_numa_policy;
+
  public:
   static int sched_getcpu()  { return _sched_getcpu != NULL ? _sched_getcpu() : -1; }
   static int numa_node_to_cpus(int node, unsigned long *buffer, int bufferlen) {
@@ -272,11 +288,35 @@
   static int numa_tonode_memory(void *start, size_t size, int node) {
     return _numa_tonode_memory != NULL ? _numa_tonode_memory(start, size, node) : -1;
   }
+
+  static bool isrunning_in_interleave_mode() {
+    return _current_numa_policy == interleave;
+  }
+
+  static void set_configured_numa_policy(Numa_allocation_policy numa_policy) {
+    _current_numa_policy = numa_policy;
+  }
+
+  static Numa_allocation_policy identify_numa_policy() {
+    Numa_allocation_policy current_policy = membind;
+    for (int node = 0; node <= Linux::numa_max_node(); node++) {
+      if (Linux::_numa_bitmask_isbitset(Linux::_numa_interleave_bitmask, node)) {
+        current_policy = interleave;
+      }
+    }
+    return current_policy;
+  }
+
   static void numa_interleave_memory(void *start, size_t size) {
     // Use v2 api if available
-    if (_numa_interleave_memory_v2 != NULL && _numa_all_nodes_ptr != NULL) {
-      _numa_interleave_memory_v2(start, size, _numa_all_nodes_ptr);
-    } else if (_numa_interleave_memory != NULL && _numa_all_nodes != NULL) {
+    if (_numa_interleave_memory_v2 != NULL) {
+      // Use the interleave bitmask in interleave mode, otherwise the membind bitmask.
+      if (isrunning_in_interleave_mode() && _numa_interleave_bitmask != NULL) {
+        _numa_interleave_memory_v2(start, size, _numa_interleave_bitmask);
+      } else if (_numa_membind_bitmask != NULL) {
+        _numa_interleave_memory_v2(start, size, _numa_membind_bitmask);
+      }
+    } else if (_numa_interleave_memory != NULL && _numa_all_nodes != NULL) {
       _numa_interleave_memory(start, size, _numa_all_nodes);
     }
   }
@@ -318,11 +358,14 @@
   }
   // Check if node is in bound node set.
   static bool isnode_in_bound_nodes(int node) {
-    if (_numa_get_membind != NULL && _numa_bitmask_isbitset != NULL) {
-      return _numa_bitmask_isbitset(_numa_get_membind(), node);
-    } else {
-      return false;
+    if (_numa_bitmask_isbitset != NULL) {
+      if (isrunning_in_interleave_mode()) {
+        return _numa_bitmask_isbitset(_numa_interleave_bitmask, node);
+      } else {
+        return _numa_membind_bitmask != NULL ? _numa_bitmask_isbitset(_numa_membind_bitmask, node) : false;
+      }
     }
+    return false;
   }
   // Check if bound to only one numa node.
   // Returns true if bound to a single numa node, otherwise returns false.


More information about the hotspot-dev mailing list