RFR (S/M): 8213827: NUMA heap allocation does not respect process membind/interleave settings [Was: Re: [PATCH] JDK NUMA Interleaving issue]
amith pawar
amith.pawar at gmail.com
Thu Dec 13 09:41:49 UTC 2018
Hi Thomas,
Please find the attached patch updated as per your suggestion.
If everything OK then can you please commit this to repo ?
Thanks,
Amit Pawar
On Wed, Dec 12, 2018 at 6:36 PM Thomas Schatzl <thomas.schatzl at oracle.com>
wrote:
> Hi Amit,
>
> On Sat, 2018-12-08 at 04:50 +0530, amith pawar wrote:
> > Hi Thomas,
> >
> > Please check my inline comments and attached patch is updated as per
> > your suggestion.
>
> thanks a lot! Looks very good now.
>
> Fixed issues:
>
> - you can use the LogStream directly for writing, you do not need to
> create an outputStream from it.
>
> - "isrunning_..." should be renamed "is_running_..."
>
> There are still some "is" prefixes glued to the predicate left.
>
> Here are webrevs:
>
> http://cr.openjdk.java.net/~tschatzl/8213827/webrev.3/ (latest webrev)
> http://cr.openjdk.java.net/~tschatzl/8213827/webrev.2_to_3/ (above
> suggested changes)
> http://cr.openjdk.java.net/~tschatzl/8213827/webrev.2/ (original
> changes)
>
> Some comments further down.
>
> >
> > On Mon, Dec 3, 2018 at 6:12 PM Thomas Schatzl <
> > thomas.schatzl at oracle.com> wrote:
> > > Hi,
> > >
> > > On Fri, 2018-11-30 at 11:11 +0100, Thomas Schatzl wrote:
> > > > Hi Amit,
> > > >
> > > > [...]
> > > > >
> > > > > Sorry. Please check the attached patch.
> > > >
> > > > I created a webrev at
> > > > http://cr.openjdk.java.net/~tschatzl/8213827/webrev/ .
> > > >
> > > > Let the reviews begin :)
> > > >
> > >
> > > style issues:
> > >
> > > - random whitespaces in many places.
> > > - all if's or if-then-else require braces
> > > - only class members have a leading underscore
> > > - do not declare local variables long before they are used. That
> > > makes the code hard to read.
> > > - do not introduce local variables for no reason. E.g.
> > >
> > > 2907 struct bitmask *bmp;
> > >
> > > [...unrelated code...]
> > >
> > > 2911 bmp = _numa_get_interleave_mask();
> > > 2912 set_numa_interleave_ptr(&bmp);
> > > 2913 bmp = _numa_get_membind();
> > > 2914 set_numa_membind_ptr(&bmp);
> > >
> > > is much less readable than
> > >
> > > set_numa_interleave_ptr(_numa_get_interleave_mask());
> > > set_numa_membind_ptr(_numa_get_membind_mask());
> > >
> > > - in the type declaration, please put the * next to the type.
> > > While
> > > this is kind-of wrong, the majority of existing code does so (not
> > > only
> > > in these files).
> > >
> > > code issues:
> > >
> > > - interleave/membind mode are mutually exclusive, which the code
> > > kind-of tries to establish (and check) using special values for eg.
> > > the
> > > value of Linux::_numa_interleave_ptr.
> > >
> > > I think the code gets a lot more readable if you introduce an enum
> > > and
> > > an explicit global variable holding the current NUMA node. If you
> > > do
> > > not want to do that, move the condition
> > > "Linux::_numa_interleave_ptr !=
> > > 0" into a new method.
> > >
> > > - move the new NUMA initialization code in os::init_2() into a new
> > > method please. It's gotten very long now.
> > >
> > > - please avoid duplicating code; ie.
> > >
> > > 5039 // Check for membind mode.
> > > 5040 bmp = Linux::_numa_membind_ptr;
> > > 5041 for (int node = 0; node <= Linux::numa_max_node() ;
> > > node++)
> > > {
> > > 5042 if (Linux::_numa_bitmask_isbitset(bmp, node)) {
> > > 5043 _is_membind = true;
> > > 5044 }
> > > 5045 }
> > >
> > > 5048 bmp = Linux::_numa_interleave_ptr;
> > > 5049 for (int node = 0; node <= Linux::numa_max_node() ;
> > > node++)
> > > {
> > > 5050 if (Linux::_numa_bitmask_isbitset(bmp, node)) {
> > > 5051 _is_interleaved = true;
> > > 5052 // Set membind to false as interleave mode allows
> > > all
> > > nodes to be used.
> > > 5053 _is_membind = false;
> > > 5054 }
> > > 5055 }
> > >
> > > and right after:
> > >
> > > Something like:
> > >
> > > is_interleave =
> > > is_any_bit_in_bitmask_set(Linux::_numa_interleave_ptr);
> > > is_membind = !is_interleave &&
> > > is_any_bit_in_bitmask_set(Linux::_numa_membind_ptr);
> > >
> > > with an appropriate is_any_bit_in_bitmask_set() helper method is
> > > much
> > > more clear.
> > >
> > > (Even better, as suggested above use a global "NUMA mode" enum that
> > > is
> > > filled in somewhere instead of multiple flags)
> > >
> > > - Use a LogStream to print the log messages instead of rolling
> > > your
> > > own code to create a string containing the node number. The code
> > > that
> > > is there exhibits a potential buffer overflow (with many NUMA
> > > nodes),
> > > and a potential out-of-bounds write.
> > >
> > > - the log message for selected NUMA mode and nodes should be a
> > > single
> > > log message.
> > >
> > > - partially pre-existing: the numa_set_xxx_ptr methods take a
> > > "struct bitmask**". Could you explain why "**"? All of those store
> > > the dereferenced result anyway, so why not pass the dereferenced
> > > pointer right away?
> >
> > Following bitmask pointers are defined in libnuma library. 'dlvsym'
> > will return the address of pointer which needs to be de-referenced.
> > Updated other methods as suggested.
> > struct bitmask *numa_all_nodes_ptr;
> > struct bitmask *numa_no_nodes_ptr;
> > struct bitmask *numa_all_cpus_ptr;
> > > This also simplifies the code at the callers.
> > >
> > > - partially pre-existing: I would prefer if the numa_set_xx_ptr
> > > methods were all renamed to numa_set_xx_bitmask. The "ptr" suffix
> > > seems redundant, and most importantly very unspecific.
> >
> > Newly introduced membind and interleave bitmask pointer names are
> > changed. Also above mentioned three pointers are imported. If you
> > suggest will rename these also.
>
> If you want we can do this in a separate changeset. Otherwise, please
> do right away.
> Please also fix the glued "is" in the isnode_in_bound_nodes,
> isbound_to_single_node, isnode_in_existing_nodes,
> isnode_in_configured_nodes methods.
>
> > > - in numa_interleave_memory(), why does the code not make the
> > > membind/interleave mode distinction when the v1 API is used?
> >
> > v2 API's works as expected for membind/interleave case but
> > numa_get_membind_v1 API's always returns null . v1 and v2 prototypes
> > are also not same. Please check below.
> > nodemask_t numa_get_membind_v1(void)
> > struct bitmask * numa_get_membind_v2(void)
> >
> > void numa_interleave_memory_v1(void *mem, size_t size, const
> > nodemask_t *mask)
> > void numa_interleave_memory_v2(void *mem, size_t size, struct bitmask
> > *bmp)
>
> Okay, understood. Thanks!
>
> Thanks,
> Thomas
>
>
>
--
With best regards,
amit pawar
-------------- next part --------------
diff -r b9d34a97a4be src/hotspot/os/linux/os_linux.cpp
--- a/src/hotspot/os/linux/os_linux.cpp Thu Dec 13 13:03:26 2018 +0530
+++ b/src/hotspot/os/linux/os_linux.cpp Thu Dec 13 15:21:58 2018 +0530
@@ -33,6 +33,7 @@
#include "compiler/disassembler.hpp"
#include "interpreter/interpreter.hpp"
#include "logging/log.hpp"
+#include "logging/logStream.hpp"
#include "memory/allocation.inline.hpp"
#include "memory/filemap.hpp"
#include "oops/oop.inline.hpp"
@@ -2776,7 +2777,7 @@
// Get the total number of nodes in the system including nodes without memory.
for (node = 0; node <= highest_node_number; node++) {
- if (isnode_in_existing_nodes(node)) {
+ if (is_node_in_existing_nodes(node)) {
num_nodes++;
}
}
@@ -2792,7 +2793,7 @@
// node number. If the nodes have been bound explicitly using numactl membind,
// then allocate memory from those nodes only.
for (int node = 0; node <= highest_node_number; node++) {
- if (Linux::isnode_in_bound_nodes((unsigned int)node)) {
+ if (Linux::is_node_in_bound_nodes((unsigned int)node)) {
ids[i++] = node;
}
}
@@ -2895,11 +2896,15 @@
libnuma_dlsym(handle, "numa_distance")));
set_numa_get_membind(CAST_TO_FN_PTR(numa_get_membind_func_t,
libnuma_v2_dlsym(handle, "numa_get_membind")));
+ set_numa_get_interleave_mask(CAST_TO_FN_PTR(numa_get_interleave_mask_func_t,
+ libnuma_v2_dlsym(handle, "numa_get_interleave_mask")));
if (numa_available() != -1) {
set_numa_all_nodes((unsigned long*)libnuma_dlsym(handle, "numa_all_nodes"));
set_numa_all_nodes_ptr((struct bitmask **)libnuma_dlsym(handle, "numa_all_nodes_ptr"));
set_numa_nodes_ptr((struct bitmask **)libnuma_dlsym(handle, "numa_nodes_ptr"));
+ set_numa_interleave_bitmask(_numa_get_interleave_mask());
+ set_numa_membind_bitmask(_numa_get_membind());
// Create an index -> node mapping, since nodes are not always consecutive
_nindex_to_node = new (ResourceObj::C_HEAP, mtInternal) GrowableArray<int>(0, true);
rebuild_nindex_to_node_map();
@@ -2925,7 +2930,7 @@
nindex_to_node()->clear();
for (int node = 0; node <= highest_node_number; node++) {
- if (Linux::isnode_in_existing_nodes(node)) {
+ if (Linux::is_node_in_existing_nodes(node)) {
nindex_to_node()->append(node);
}
}
@@ -2962,16 +2967,16 @@
// the closest configured node. Check also if node is bound, i.e. it's allowed
// to allocate memory from the node. If it's not allowed, map cpus in that node
// to the closest node from which memory allocation is allowed.
- if (!isnode_in_configured_nodes(nindex_to_node()->at(i)) ||
- !isnode_in_bound_nodes(nindex_to_node()->at(i))) {
+ if (!is_node_in_configured_nodes(nindex_to_node()->at(i)) ||
+ !is_node_in_bound_nodes(nindex_to_node()->at(i))) {
closest_distance = INT_MAX;
// Check distance from all remaining nodes in the system. Ignore distance
// from itself, from another non-configured node, and from another non-bound
// node.
for (size_t m = 0; m < node_num; m++) {
if (m != i &&
- isnode_in_configured_nodes(nindex_to_node()->at(m)) &&
- isnode_in_bound_nodes(nindex_to_node()->at(m))) {
+ is_node_in_configured_nodes(nindex_to_node()->at(m)) &&
+ is_node_in_bound_nodes(nindex_to_node()->at(m))) {
distance = numa_distance(nindex_to_node()->at(i), nindex_to_node()->at(m));
// If a closest node is found, update. There is always at least one
// configured and bound node in the system so there is always at least
@@ -3026,9 +3031,13 @@
os::Linux::numa_bitmask_isbitset_func_t os::Linux::_numa_bitmask_isbitset;
os::Linux::numa_distance_func_t os::Linux::_numa_distance;
os::Linux::numa_get_membind_func_t os::Linux::_numa_get_membind;
+os::Linux::numa_get_interleave_mask_func_t os::Linux::_numa_get_interleave_mask;
+os::Linux::Numa_allocation_policy os::Linux::_current_numa_policy;
unsigned long* os::Linux::_numa_all_nodes;
struct bitmask* os::Linux::_numa_all_nodes_ptr;
struct bitmask* os::Linux::_numa_nodes_ptr;
+struct bitmask* os::Linux::_numa_interleave_bitmask;
+struct bitmask* os::Linux::_numa_membind_bitmask;
bool os::pd_uncommit_memory(char* addr, size_t size) {
uintptr_t res = (uintptr_t) ::mmap(addr, size, PROT_NONE,
@@ -4958,6 +4967,74 @@
OSContainer::init();
}
+void os::Linux::numa_init() {
+
+ // Java can be invoked as
+ // 1. Without numactl and heap will be allocated/configured on all nodes as
+ // per the system policy.
+ // 2. With numactl --interleave:
+ // Use numa_get_interleave_mask(v2) API to get nodes bitmask. The same
+ // API for membind case bitmask is reset.
+ // Interleave is only hint and Kernel can fallback to other nodes if
+ // no memory is available on the target nodes.
+ // 3. With numactl --membind:
+ // Use numa_get_membind(v2) API to get nodes bitmask. The same API for
+ // interleave case returns bitmask of all nodes.
+ // numa_all_nodes_ptr holds bitmask of all nodes.
+ // numa_get_interleave_mask(v2) and numa_get_membind(v2) APIs returns correct
+ // bitmask when externally configured to run on all or fewer nodes.
+
+ if (!Linux::libnuma_init()) {
+ UseNUMA = false;
+ } else {
+ if ((Linux::numa_max_node() < 1) || Linux::is_bound_to_single_node()) {
+ // If there's only one node (they start from 0) or if the process
+ // is bound explicitly to a single node using membind, disable NUMA.
+ UseNUMA = false;
+ } else {
+
+ LogTarget(Info,os) log;
+ LogStream ls(log);
+
+ Linux::set_configured_numa_policy (Linux::identify_numa_policy());
+
+ struct bitmask* bmp = Linux::_numa_membind_bitmask;
+ const char* numa_mode = "membind";
+
+ if (Linux::is_running_in_interleave_mode()) {
+ bmp = Linux::_numa_interleave_bitmask;
+ numa_mode = "interleave";
+ }
+
+ ls.print("UseNUMA is enabled and invoked in '%s' mode."
+ " Heap will be configured using NUMA memory nodes:", numa_mode);
+
+ for (int node = 0; node <= Linux::numa_max_node(); node++) {
+ if (Linux::_numa_bitmask_isbitset(bmp, node)) {
+ ls.print(" %d", node);
+ }
+ }
+ }
+ }
+
+ if (UseParallelGC && UseNUMA && UseLargePages && !can_commit_large_page_memory()) {
+ // With SHM and HugeTLBFS large pages we cannot uncommit a page, so there's no way
+ // we can make the adaptive lgrp chunk resizing work. If the user specified both
+ // UseNUMA and UseLargePages (or UseSHM/UseHugeTLBFS) on the command line - warn
+ // and disable adaptive resizing.
+ if (UseAdaptiveSizePolicy || UseAdaptiveNUMAChunkSizing) {
+ warning("UseNUMA is not fully compatible with SHM/HugeTLBFS large pages, "
+ "disabling adaptive resizing (-XX:-UseAdaptiveSizePolicy -XX:-UseAdaptiveNUMAChunkSizing)");
+ UseAdaptiveSizePolicy = false;
+ UseAdaptiveNUMAChunkSizing = false;
+ }
+ }
+
+ if (!UseNUMA && ForceNUMA) {
+ UseNUMA = true;
+ }
+}
+
// this is called _after_ the global arguments have been parsed
jint os::init_2(void) {
@@ -5002,32 +5079,7 @@
Linux::glibc_version(), Linux::libpthread_version());
if (UseNUMA) {
- if (!Linux::libnuma_init()) {
- UseNUMA = false;
- } else {
- if ((Linux::numa_max_node() < 1) || Linux::isbound_to_single_node()) {
- // If there's only one node (they start from 0) or if the process
- // is bound explicitly to a single node using membind, disable NUMA.
- UseNUMA = false;
- }
- }
-
- if (UseParallelGC && UseNUMA && UseLargePages && !can_commit_large_page_memory()) {
- // With SHM and HugeTLBFS large pages we cannot uncommit a page, so there's no way
- // we can make the adaptive lgrp chunk resizing work. If the user specified both
- // UseNUMA and UseLargePages (or UseSHM/UseHugeTLBFS) on the command line - warn
- // and disable adaptive resizing.
- if (UseAdaptiveSizePolicy || UseAdaptiveNUMAChunkSizing) {
- warning("UseNUMA is not fully compatible with SHM/HugeTLBFS large pages, "
- "disabling adaptive resizing (-XX:-UseAdaptiveSizePolicy -XX:-UseAdaptiveNUMAChunkSizing)");
- UseAdaptiveSizePolicy = false;
- UseAdaptiveNUMAChunkSizing = false;
- }
- }
-
- if (!UseNUMA && ForceNUMA) {
- UseNUMA = true;
- }
+ Linux::numa_init();
}
if (MaxFDLimit) {
diff -r b9d34a97a4be src/hotspot/os/linux/os_linux.hpp
--- a/src/hotspot/os/linux/os_linux.hpp Thu Dec 13 13:03:26 2018 +0530
+++ b/src/hotspot/os/linux/os_linux.hpp Thu Dec 13 15:21:58 2018 +0530
@@ -211,6 +211,7 @@
// none present
private:
+ static void numa_init();
static void expand_stack_to(address bottom);
typedef int (*sched_getcpu_func_t)(void);
@@ -222,6 +223,7 @@
typedef void (*numa_interleave_memory_func_t)(void *start, size_t size, unsigned long *nodemask);
typedef void (*numa_interleave_memory_v2_func_t)(void *start, size_t size, struct bitmask* mask);
typedef struct bitmask* (*numa_get_membind_func_t)(void);
+ typedef struct bitmask* (*numa_get_interleave_mask_func_t)(void);
typedef void (*numa_set_bind_policy_func_t)(int policy);
typedef int (*numa_bitmask_isbitset_func_t)(struct bitmask *bmp, unsigned int n);
@@ -239,9 +241,12 @@
static numa_bitmask_isbitset_func_t _numa_bitmask_isbitset;
static numa_distance_func_t _numa_distance;
static numa_get_membind_func_t _numa_get_membind;
+ static numa_get_interleave_mask_func_t _numa_get_interleave_mask;
static unsigned long* _numa_all_nodes;
static struct bitmask* _numa_all_nodes_ptr;
static struct bitmask* _numa_nodes_ptr;
+ static struct bitmask* _numa_interleave_bitmask;
+ static struct bitmask* _numa_membind_bitmask;
static void set_sched_getcpu(sched_getcpu_func_t func) { _sched_getcpu = func; }
static void set_numa_node_to_cpus(numa_node_to_cpus_func_t func) { _numa_node_to_cpus = func; }
@@ -255,10 +260,21 @@
static void set_numa_bitmask_isbitset(numa_bitmask_isbitset_func_t func) { _numa_bitmask_isbitset = func; }
static void set_numa_distance(numa_distance_func_t func) { _numa_distance = func; }
static void set_numa_get_membind(numa_get_membind_func_t func) { _numa_get_membind = func; }
+ static void set_numa_get_interleave_mask(numa_get_interleave_mask_func_t func) { _numa_get_interleave_mask = func; }
static void set_numa_all_nodes(unsigned long* ptr) { _numa_all_nodes = ptr; }
static void set_numa_all_nodes_ptr(struct bitmask **ptr) { _numa_all_nodes_ptr = (ptr == NULL ? NULL : *ptr); }
static void set_numa_nodes_ptr(struct bitmask **ptr) { _numa_nodes_ptr = (ptr == NULL ? NULL : *ptr); }
+ static void set_numa_interleave_bitmask(struct bitmask* ptr) { _numa_interleave_bitmask = ptr ; }
+ static void set_numa_membind_bitmask(struct bitmask* ptr) { _numa_membind_bitmask = ptr ; }
static int sched_getcpu_syscall(void);
+
+ enum Numa_allocation_policy{
+ not_initialized,
+ membind,
+ interleave
+ };
+ static Numa_allocation_policy _current_numa_policy;
+
public:
static int sched_getcpu() { return _sched_getcpu != NULL ? _sched_getcpu() : -1; }
static int numa_node_to_cpus(int node, unsigned long *buffer, int bufferlen) {
@@ -272,11 +288,35 @@
static int numa_tonode_memory(void *start, size_t size, int node) {
return _numa_tonode_memory != NULL ? _numa_tonode_memory(start, size, node) : -1;
}
+
+ static bool is_running_in_interleave_mode() {
+ return _current_numa_policy == interleave ? true : false;
+ }
+
+ static void set_configured_numa_policy(Numa_allocation_policy numa_policy) {
+ _current_numa_policy = numa_policy ;
+ }
+
+ static Numa_allocation_policy identify_numa_policy() {
+ Numa_allocation_policy current_policy = membind;
+ for (int node = 0; node <= Linux::numa_max_node() ; node++) {
+ if (Linux::_numa_bitmask_isbitset(Linux::_numa_interleave_bitmask, node)) {
+ current_policy = interleave;
+ }
+ }
+ return current_policy ;
+ }
+
static void numa_interleave_memory(void *start, size_t size) {
// Use v2 api if available
- if (_numa_interleave_memory_v2 != NULL && _numa_all_nodes_ptr != NULL) {
- _numa_interleave_memory_v2(start, size, _numa_all_nodes_ptr);
- } else if (_numa_interleave_memory != NULL && _numa_all_nodes != NULL) {
+ if (_numa_interleave_memory_v2 != NULL && _numa_membind_bitmask != NULL) {
+ // Use interleave bitmask while running interleave mode.
+ if (is_running_in_interleave_mode()) {
+ _numa_interleave_memory_v2(start, size, _numa_interleave_bitmask);
+ } else if (_numa_membind_bitmask != NULL) {
+ _numa_interleave_memory_v2(start, size, _numa_membind_bitmask);
+ }
+ } else {
_numa_interleave_memory(start, size, _numa_all_nodes);
}
}
@@ -291,14 +331,14 @@
static int get_node_by_cpu(int cpu_id);
static int get_existing_num_nodes();
// Check if numa node is configured (non-zero memory node).
- static bool isnode_in_configured_nodes(unsigned int n) {
+ static bool is_node_in_configured_nodes(unsigned int n) {
if (_numa_bitmask_isbitset != NULL && _numa_all_nodes_ptr != NULL) {
return _numa_bitmask_isbitset(_numa_all_nodes_ptr, n);
} else
return false;
}
// Check if numa node exists in the system (including zero memory nodes).
- static bool isnode_in_existing_nodes(unsigned int n) {
+ static bool is_node_in_existing_nodes(unsigned int n) {
if (_numa_bitmask_isbitset != NULL && _numa_nodes_ptr != NULL) {
return _numa_bitmask_isbitset(_numa_nodes_ptr, n);
} else if (_numa_bitmask_isbitset != NULL && _numa_all_nodes_ptr != NULL) {
@@ -317,16 +357,19 @@
return false;
}
// Check if node is in bound node set.
- static bool isnode_in_bound_nodes(int node) {
- if (_numa_get_membind != NULL && _numa_bitmask_isbitset != NULL) {
- return _numa_bitmask_isbitset(_numa_get_membind(), node);
- } else {
- return false;
+ static bool is_node_in_bound_nodes(int node) {
+ if (_numa_bitmask_isbitset != NULL) {
+ if (is_running_in_interleave_mode()) {
+ return _numa_bitmask_isbitset(_numa_interleave_bitmask, node);
+ } else {
+ return _numa_membind_bitmask != NULL ? _numa_bitmask_isbitset(_numa_membind_bitmask, node) : false;
+ }
}
+ return false;
}
// Check if bound to only one numa node.
// Returns true if bound to a single numa node, otherwise returns false.
- static bool isbound_to_single_node() {
+ static bool is_bound_to_single_node() {
int nodes = 0;
struct bitmask* bmp = NULL;
unsigned int node = 0;
More information about the hotspot-dev
mailing list