RFR(XL): 8220310: Implementation: NUMA-Aware Memory Allocation for G1, Mutator (1/3)
Kim Barrett
kim.barrett at oracle.com
Mon Sep 16 23:12:46 UTC 2019
> On Sep 4, 2019, at 3:16 AM, sangheon.kim at oracle.com wrote:
>
> Hi all,
>
> Please review this patch, making G1 NUMA aware.
> This is the first part of G1 NUMA implementation.
>
> - At G1 CollectedHeap initialization time, the given heap regions will be split and touched.
> - Internally node index(uint) is used instead of node id(int type) for easier access to arrays.
> - Only Linux is supported.
> - New logs are under gc+heap+numa
> - Current memory touch implementation will be replaced with WorkGang by JDK-8230411: NUMA aware work gang.
>
> CR: https://bugs.openjdk.java.net/browse/JDK-8220310
> Webrev: http://cr.openjdk.java.net/~sangheki/8220310/webrev.0
> Testing: hs-tier 1 ~ 5 with +- UseNUMA.
>
> Thanks,
> Sangheon
Here is an initial set of comments. Some of these might be nonsense; I'm
not all that familiar with NUMA support.
------------------------------------------------------------------------------
Just in general, it's kind of disappointing that NUMA support for G1
apparently involves this much code, and is completely different from
ParallelGC NUMA support, and both are completely different from ZGC
NUMA support. (And Shenandoah is probably yet another still.)
Oh well...
------------------------------------------------------------------------------
According to off-line discussion, the intent is that a HeapRegion only
contains memory from one NUMA node. That isn't obvious though, and there
are API and data structure choices that obscure that. It seems like some
things could be simplified with that constraint in mind. I think some of
that is a result of trying to leave open support for NUMA aware off-heap
data structures, but that isn't part of this change, and might be better
kept separate from NUMA aware Java heap support.
I'll try to point out places where I think there are opportunities for
simplification.
------------------------------------------------------------------------------
According to off-line discussion, the touching mechanism might not be
necessary. If that's true, that would be a significant simplification. For
now I won't comment on the code for the touch threads, though I think
I found some problems there before the discussion that suggested we might
not need them.
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1NUMA.hpp
64 // Manages NUMA node related information and memory touch threads.
...
I found the comment describing G1NUMA pretty hard to parse. Even
after looking at the code I'm having trouble figuring out what's being
said.
This part I was able to figure out, but think it could be worded better.
70 // Also provides a way to convert NUMA id and NUMA index(starting from zero and sequatial).
71 // Actual NUMA ids would not be sequential nor not start from zero, so a conversion is necessary
72 // between id and index.
Something like
Also provides a mapping between NUMA indices (sequential, starting
from zero) and NUMA ids (which may not start at zero and may not be
sequential). Indices are generally used by clients, and mapped to
ids when dealing with the OS/hardware layer.
(I *think* that's true.)
But then, why expose node_ids at all in G1NUMA? It seems like node_ids
could be completely relegated to an implementation detail that clients
never need to deal with.
As evidence for not needing to expose ids to clients, there are no
callers of numa_id_of_index in any of the series of 3 patches. And
G1NUMA::numa_ids() and G1MemoryMultiNodeManager::node_ids() are only
used in a couple of places for printing/logging and by
WB_G1MemoryNodeIds. I'm not sure what the last is for, but the
printing and logging code could be isolated into G1NUMA.
Also, it looks like the index to node_id mapping provided here is
duplicating the information provided by os::Linux::nindex_to_node.
Not that this necessarily helps with anything in generic code...
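For concreteness, an index-only interface might look roughly like this (the
method names here are hypothetical, not necessarily the patch's):

  class G1NUMA {
  public:
    uint num_active_node_indices() const;         // how many indices clients may use
    uint index_of_address(HeapWord* addr) const;  // map a heap address to an index
    void print_statistics() const;                // id-based printing/logging stays inside
  private:
    int numa_id_of_index(uint node_index) const;  // ids remain an implementation detail
  };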
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1MemoryNodeManager.hpp
57 virtual int* node_ids() const { static int dummy_id = 0; return &dummy_id; }
Can the return type here be "const int*" ? That would be better if possible.
[This is probably moot if clients only deal with numa indexes and not
node_ids, as suggested earlier.]
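If it can, the declaration would just become something like:

  virtual const int* node_ids() const { static const int dummy_id = 0; return &dummy_id; }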
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1MemoryNodeManager.hpp
63 virtual uint guarantee_valid_node_index(uint node_index) const { return 0; }
Given naming conventions elsewhere, the name of this function makes me
expect it to be a verification function.
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1NUMA.inline.hpp
29 #include "gc/shared/taskqueue.inline.hpp"
This #include seems to be just to obtain randomParkAndMiller. It would be
better to refactor that function into its own file(s). Or make it part of
some component providing various random number generators.
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1NUMA.inline.hpp
45 int seed = randomParkAndMiller(_seed);
46 Atomic::store(seed, &_seed);
This sequence may lose updates. I don't know if that's important.
And why isn't this just using os::random(), or maybe refactoring that
slightly to gain access to the external seed version. I don't want this
change to be made:
http://cr.openjdk.java.net/~sangheki/8220310/webrev.0/src/hotspot/share/gc/shared/taskqueue.inline.hpp.udiff.html
And do we *really* need randomness here at all? Is there a reason not
to simply cycle through the nodes?
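Something along these lines would avoid the seed entirely (the field and
method names are hypothetical, and any synchronization needed by the callers
is elided):

  // Hand out node indices round-robin instead of pseudo-randomly.
  uint G1NUMA::next_node_index() {
    uint index = _next_node_index;                 // hypothetical cursor field
    _next_node_index = (index + 1) % _num_active_node_indices;
    return index;
  }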
------------------------------------------------------------------------------
src/hotspot/os/linux/os_linux.cpp
3016 size_t num_pages_left = size_in_bytes / page_size;
Is size_in_bytes required to be a multiple of page_size? I believe it is,
and an assert of that would be useful.
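For example:

  assert(is_aligned(size_in_bytes, page_size),
         "size " SIZE_FORMAT " must be a multiple of page size " SIZE_FORMAT,
         size_in_bytes, page_size);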
------------------------------------------------------------------------------
src/hotspot/os/linux/os_linux.cpp
3016 size_t num_pages_left = size_in_bytes / page_size;
3017 const size_t num_pages_total = num_pages_left;
3018 size_t num_buffer_entries = MIN2(num_pages_left, max_pages);
I think clearer would be
3017 const size_t num_pages_total = size_in_bytes / page_size;
3018 size_t num_buffer_entries = MIN2(num_pages_total, max_pages);
and move this down to the loop setup:
3016 size_t num_pages_left = num_pages_total;
------------------------------------------------------------------------------
src/hotspot/os/linux/os_linux.cpp
3020 void** pages = (void**)alloca(sizeof(void*) * num_buffer_entries);
3021 int* status = (int*)alloca(sizeof(int) * num_buffer_entries);
I think a ResourceMark and NEW_RESOURCE_ARRAYs would be preferable to
using alloca() here. alloca() has some dangers and I think is currently
only used in some very specialized places where it really is needed.
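E.g. (assuming a ResourceMark is safe at this point):

  ResourceMark rm;
  void** pages = NEW_RESOURCE_ARRAY(void*, num_buffer_entries);
  int* status = NEW_RESOURCE_ARRAY(int, num_buffer_entries);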
------------------------------------------------------------------------------
src/hotspot/os/linux/os_linux.cpp
3061 }
3062 // Moves current thread on a specific node and it will not migrate to
Missing blank line between functions.
------------------------------------------------------------------------------
src/hotspot/os/linux/os_linux.cpp
3031 pages[i] = (void*)base;
Unnecessary cast.
------------------------------------------------------------------------------
src/hotspot/os/linux/os_linux.cpp
3072 case AFFINITY_NONE:
3073 // Do nothing.
3074 return true;
3075 break;
Shouldn't this be calling numa_run_on_node with -1 node id? The
documentation for numa_run_on_node() says "Passing -1 permits the
kernel to schedule on all nodes again."
But see also next comment about numa_affinity_set.
------------------------------------------------------------------------------
src/hotspot/share/runtime/os.hpp
401 // Available affinities.
402 enum NumaAffinity {
403 AFFINITY_STRONG,
404 AFFINITY_NONE
405 };
406 static bool numa_affinity_set(Thread* thread, int numa_id, NumaAffinity affinity);
Could there ever be other affinities? This API doesn't seem right.
It seems to me it should be something like
static bool numa_set_affinity(Thread* thread, int numa_id);
static bool numa_clear_affinity(Thread* thread);
No need for NumaAffinity enum.
(I think numa_set/clear_affinity is better than numa_affinity_set/clear;
numa is a "namespace" and we usually set/clear/get_something rather
than something_set/clear/get.)
Or maybe this should just be
static bool numa_set_affinity(Thread* thread, int numa_id);
with "clearing" provided by passing AnyId as the numa_id? Still don't
need the NumaAffinity enum for this.
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1NUMA.cpp
124 guarantee(rs.size() % page_size == 0, "Reserved space size (" SIZE_FORMAT ") should be aligned "
141 guarantee(size_in_bytes % page_size == 0, "The given range (" SIZE_FORMAT ") should be aligned "
209 guarantee((uintptr_t)G1HeapRegionSize % page_size == 0, "G1HeapRegionSize (" SIZE_FORMAT ") and page size ("
234 guarantee((uintptr_t)_page_size % _heap_region_size == 0, "G1HeapRegionSize (" SIZE_FORMAT ") and page size ("
Use is_aligned. I think casts shouldn't be needed.
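For example, the first of these could become something like:

  guarantee(is_aligned(rs.size(), page_size),
            "Reserved space size (" SIZE_FORMAT ") should be aligned to "
            "page size (" SIZE_FORMAT ")", rs.size(), page_size);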
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1NUMA.cpp
43 _numa_id_to_index_map = NEW_C_HEAP_ARRAY_RETURN_NULL(uint, _len_numa_id_to_index_map, mtGC);
44 memset(_numa_id_to_index_map, 0, sizeof(uint) * _len_numa_id_to_index_map);
Using NEW_xxx_RETURN_NULL and then ignoring the possibility of a NULL
result. It would be better to use the non-_RETURN_NULL variant and
get a decent error message rather than the segfault from attempting to
write to NULL.
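That is, just:

  _numa_id_to_index_map = NEW_C_HEAP_ARRAY(uint, _len_numa_id_to_index_map, mtGC);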
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1NUMA.cpp
34 void G1NUMA::init_numa_id_to_index_map(int* numa_ids, uint num_numa_ids) {
int* numa_ids => const int* numa_ids.
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1NUMA.cpp
34 void G1NUMA::init_numa_id_to_index_map(int* numa_ids, uint num_numa_ids) {
...
44 memset(_numa_id_to_index_map, 0, sizeof(uint) * _len_numa_id_to_index_map);
Rather than this memset and then writing to all of them later, instead
initialize all entries to InvalidNodeIndex and then assign from numa_ids.
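Something like:

  for (uint i = 0; i < _len_numa_id_to_index_map; i++) {
    _numa_id_to_index_map[i] = InvalidNodeIndex;
  }

followed by the existing assignments from numa_ids.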
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1MemoryTouchThread.cpp
261 _wakeup_monitor = new Monitor(Mutex::nonleaf, "Wakeup Monitor for Touch Control", true, Monitor::_safepoint_check_never);
262 _sync_monitor = new Monitor(Mutex::nonleaf, "Sync Monitor for Touch Control", true, Monitor::_safepoint_check_never);
For singleton mutex/monitor, we generally prefer to put them in mutexLocker.
Probably passed as constructor arguments to G1NUMAThreadsControl. And
I know there are counter-examples.
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1MemoryNodeManager.hpp
30 extern const uint AnyNodeIndex;
31 extern const uint InvalidNodeIndex;
Why are these at namespace scope? They also seem misplaced here.
Seems like they should belong to the G1NUMA class, which is
responsible for indexing nodes and mapping between indexes and ids.
30 const uint AnyNodeIndex = (uint)os::AnyId;
31 const uint InvalidNodeIndex = (uint)os::InvalidId;
And why these values? That seems like a units mismatch. Just use
UINT_MAX and UINT_MAX - 1.
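I.e., in G1NUMA, something like (which constant gets which value doesn't
much matter):

  static const uint AnyNodeIndex     = UINT_MAX - 1;
  static const uint InvalidNodeIndex = UINT_MAX;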
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1NUMA.hpp
80 // For invalid numa id, return InvalidNodeIndex. So the caller need exception handling.
"need exception handling"?
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/heapRegionManager.cpp
114 uint allocated_node_index;
This variable gets assigned via an out parameter here:
116 hr = _free_list.remove_region_with_node_index(from_head, valid_node_index, &allocated_node_index);
and by assignment here:
122 allocated_node_index = mgr->node_index_of_address(hr->bottom());
but is never used. If the lack of use is not itself a mistake, then the out
parameter of remove_region_with_node_index appears to be unnecessary.
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/heapRegionSet.inline.hpp
151 inline HeapRegion* FreeRegionList::remove_region_with_node_index(bool from_head,
As written, I think this might always return NULL if requested_node_index is
not a valid index. I think callers ensure that it's valid, but an assert of
the argument would help.
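For example (num_available_nodes() is just a stand-in for whatever accessor
is appropriate here):

  assert(requested_node_index < num_available_nodes(),
         "requested node index %u is out of range", requested_node_index);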
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/heapRegionSet.inline.hpp
175 if (cur == NULL || cur_depth >= max_search_depth) { // not found
There's an off-by-one issue here. The max_search_depth'th pass through the
while-control-expression may have found a matching region, checked via the
numa_index_of_address lookup.
Either the test should be "current_numa_index != requested_numa_index" or
the loop should be changed.
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/heapRegionSet.inline.hpp
151 inline HeapRegion* FreeRegionList::remove_region_with_node_index(bool from_head,
I think the list manipulation could be simplified quite a bit. I think we
don't need the "last" variable. And the splicing out of cur can be unified
between the from_head and !from_head cases, e.g. something like this:
  // Find the region to use, searching from _head or _tail as requested.
  if (from_head) {
    for (cur = _head;
         cur != NULL && cur_depth < max_search_depth;
         cur = cur->next(), ++cur_depth) {
      if (requested_numa_index == numa->numa_index_of_address(cur->bottom())) {
        break;
      }
    }
  } else {
    for (cur = _tail;
         cur != NULL && cur_depth < max_search_depth;
         cur = cur->prev(), ++cur_depth) {
      if (requested_numa_index == numa->numa_index_of_address(cur->bottom())) {
        break;
      }
    }
  }
  if (cur == NULL || cur_depth >= max_search_depth) {
    return NULL; // Didn't find a region to use
  }
  // Splice the region out of the list.
  HeapRegion* prev = cur->prev();
  HeapRegion* next = cur->next();
  if (prev == NULL) {
    _head = next;
  } else {
    prev->set_next(next);
  }
  if (next == NULL) {
    _tail = prev;
  } else {
    next->set_prev(prev);
  }
  cur->set_prev(NULL);
  cur->set_next(NULL);
  ...
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1NUMA.cpp
166 class DumpIterator : public StackObj {
Polluting global namespace. Suggest making DumpIterator and the classes
derived from it (G1HeapRegion{Larger,Smaller}Iterator) private nested
classes of G1NUMA, making the names unimportant outside of that context.
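I.e. something like:

  class G1NUMA {
    // Nested (and private), so the iterator names are scoped to G1NUMA.
    class DumpIterator;
    class G1HeapRegionLargerIterator;
    class G1HeapRegionSmallerIterator;
    ...
  };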
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/heterogeneousHeapRegionManager.hpp
122 virtual HeapRegion* allocate_free_region(HeapRegionType type);
123 virtual HeapRegion* allocate_free_region(HeapRegionType type, uint node_index);
The base class (HeapRegionManager) no longer has the one-arg function. So
this one-arg function is no longer overriding the base class definition.
(Too bad it wasn't declared as C++11 "override".)
I don't see any one-arg callers, so seems like it can just be removed.
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1PageBasedNUMAVirtualSpace.cpp
I dislike the new structure here, with G1PageBasedVirtualSpace now being
both a base class and a concrete class (which is often a problematic
design). And the new derived class doesn't really do all that much.
It seems like it would be much simpler to just have G1PageBasedVirtualSpace
check UseNUMA to conditionalize its commit and uncommit functions,
especially since its commit already needs to have some knowledge of NUMA
because of the new numa_index argument.
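Roughly (only a sketch of the shape; the body comment stands in for whatever
the derived class currently does):

  // inside G1PageBasedVirtualSpace::commit(...):
  //   ... existing commit work ...
  if (UseNUMA) {
    // NUMA placement / bookkeeping currently done in
    // G1PageBasedNUMAVirtualSpace::commit goes here instead.
  }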
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1PageBasedNUMAVirtualSpace.cpp
52 void G1PageBasedNUMAVirtualSpace::uncommit(size_t start_page, size_t size_in_pages) {
Do we actually need to do anything here? We're resetting the
G1NUMA::_numa_id_of_pages_table (to AnyId), but it's not clear we really
need to do that. When we later recommit any of those pages, that table will
be updated appropriately.
Maybe rather than having G1NUMA expose quite so much detail it should
instead have post_commit and pre_uncommit functions.
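I.e. something like (names are just suggestions):

  class G1NUMA {
  public:
    // Called by the virtual space code after committing / before uncommitting
    // pages; the page table bookkeeping stays hidden inside G1NUMA.
    void post_commit(size_t start_page, size_t size_in_pages, uint numa_index);
    void pre_uncommit(size_t start_page, size_t size_in_pages);
    ...
  };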
------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1NUMA.hpp
75 G1NUMAPageIdTable _numa_id_of_pages_table;
I was surprised by the need for this, but I can't find anything in the NUMA
API that provides this information. That seems strange, because one would
think that information must exist somewhere in the NUMA implementation.
------------------------------------------------------------------------------