RFR(L): 8220312: Implementation: NUMA-Aware Memory Allocation for G1, Logging (3/3)

Stefan Johansson stefan.johansson at oracle.com
Thu Oct 31 15:31:39 UTC 2019


Hi Sangheon,

On 2019-10-23 08:39, sangheon.kim at oracle.com wrote:
> Hi Thomas,
> 
> I am posting the next webrev as Kim is waiting for it.
> 
> Webrev:
> http://cr.openjdk.java.net/~sangheki/8220312/webrev.3
> http://cr.openjdk.java.net/~sangheki/8220312/webrev.3.inc

Here are my comments:
src/hotspot/share/gc/g1/g1CollectedHeap.hpp
---
2397     st->print("  remaining free region(s) from each node id: ");

What do you think about changing this to "... region(s) on each NUMA 
node: "? I think we should be clear about the logging being for NUMA.
---

src/hotspot/share/gc/g1/g1EdenRegions.hpp
---
33 class G1EdenRegions : public G1RegionCounts {

I don’t think G1EdenRegions *is a* G1RegionCounts; rather, it should 
*have* one. So instead of using inheritance here, I think G1EdenRegions 
should hold a G1RegionCounts. Instead of overloading length() I would then 
suggest adding a region_count(uint node_index) to get the per-node count.

Same goes for G1SurvivorRegions.
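To illustrate the composition idea, here is a minimal stand-alone sketch; 
the simplified G1RegionCounts below is illustrative only (not the actual 
HotSpot class), and region_count() is just the accessor name suggested above:

```cpp
#include <cassert>
#include <vector>

// Illustrative stand-in for G1RegionCounts: per-node region counters.
class G1RegionCounts {
  std::vector<unsigned> _counts;
public:
  explicit G1RegionCounts(unsigned num_nodes) : _counts(num_nodes, 0) {}
  void increment(unsigned node_index) { _counts[node_index]++; }
  unsigned count(unsigned node_index) const { return _counts[node_index]; }
};

// G1EdenRegions *has a* G1RegionCounts instead of inheriting from it.
class G1EdenRegions {
  unsigned _length;        // total eden length, as before
  G1RegionCounts _counts;  // per-node counts held by composition
public:
  explicit G1EdenRegions(unsigned num_nodes) : _length(0), _counts(num_nodes) {}
  void add(unsigned node_index) {
    _length++;
    _counts.increment(node_index);
  }
  unsigned length() const { return _length; }
  // Separate accessor instead of overloading length(uint).
  unsigned region_count(unsigned node_index) const {
    return _counts.count(node_index);
  }
};
```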
---

src/hotspot/share/gc/g1/g1NUMA.cpp
---
  279 bool NodeIndexCheckClosure::do_heap_region(HeapRegion* hr) {
  280   uint preferred_node_index = _numa->preferred_node_index_for_index(hr->hrm_index());
  281   uint active_node_index = _numa->index_of_address(hr->bottom());
  282
  283   if (preferred_node_index == active_node_index) {
  284     _matched[preferred_node_index]++;
  285   } else if (active_node_index == G1NUMA::UnknownNodeIndex) {
  286     _unknown++;
  287   }
  288   _total++;
  289
  290   return false;
  291 }

As we discussed offline, I would like to know the mismatches as well. I 
think the easiest approach would be to make the total count per node as 
well; that way we can see if there were any regions that didn't match. 
What do you think about printing the info like this:
[3,009s][trace][gc,heap,numa ] GC(6) NUMA region verification (actual/expected): 0: 1024/1024, 1: 270/1024, Unknown: 0
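A rough sketch of what the per-node accounting could look like (using 
plain std::vector instead of the HotSpot allocation macros, and a 
simplified record() standing in for do_heap_region()):

```cpp
#include <cassert>
#include <vector>

// Sketch: count both matched and total regions per preferred node, so a
// mismatch shows up as matched[i] < total[i] in the actual/expected output.
struct NodeVerification {
  static constexpr unsigned UnknownNodeIndex = ~0u;

  std::vector<unsigned> matched;  // "actual" per node
  std::vector<unsigned> total;    // "expected" per node
  unsigned unknown = 0;

  explicit NodeVerification(unsigned num_nodes)
    : matched(num_nodes, 0), total(num_nodes, 0) {}

  void record(unsigned preferred_node, unsigned actual_node) {
    total[preferred_node]++;
    if (actual_node == UnknownNodeIndex) {
      unknown++;
    } else if (preferred_node == actual_node) {
      matched[preferred_node]++;
    }
  }
};
```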

When testing this I also realized that this output is problematic in the 
case where we have committed regions that have not yet been used. The 
manual for get_mempolicy (the way we get the NUMA id for an address) says:
"If no page has yet been allocated for the specified address, 
get_mempolicy() will allocate a page as if the thread had performed a 
read (load) access to that address, and return the ID of the node where 
that page was allocated."

Such a read access seems to always get a page on NUMA node 0, so the 
accounting will not be correct in this case.

One way to fix this would be to only do the accounting for regions 
currently in use (!hr->is_free()), but I'm not sure that is exactly what 
we want, at least not if we only do this after the GC; then only the 
survivor and old regions would be checked. We could solve this by also 
doing the verification before the GC. I think this might be the way to 
go, what do you think? If my proposal was hard to follow, here's a patch:
http://cr.openjdk.java.net/~sjohanss/numa/verify-alternative/

The output from this patch would be:
[9,233s][trace][gc,heap,numa   ] GC(18) GC Start: NUMA region verification (actual/expected): 0: 358/358, 1: 361/361, Unknown: 0
[9,306s][trace][gc,heap,numa   ] GC(18) GC End: NUMA region verification (actual/expected): 0: 348/348, 1: 347/347, Unknown: 0

One can also see that this verification takes some time, so maybe it 
would make sense to have this logging under gc+numa+verify.
---

  234   uint converted_req_index = requested_node_index;
  235   if(converted_req_index == AnyNodeIndex) {
  236     converted_req_index = _num_active_node_ids;
  237   }
  238   if (converted_req_index <= _num_active_node_ids) {
  239     _times->update(phase, converted_req_index, allocated_node_index);
  240   }

I had to read this more than once to understand what it really does, and 
I think we can simplify it a bit by just doing an if-else that checks 
for AnyNodeIndex and, if so, passes _num_active_node_ids to update(). 
This should be ok since requested_node_index can never be larger than 
_num_active_node_ids.
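Roughly, the suggested shape would be something like the following 
(with stand-in constants, since the real values live in G1NUMA):

```cpp
#include <cassert>

// Stand-in values; in the real code these come from G1NUMA.
const unsigned AnyNodeIndex = ~0u;
const unsigned num_active_node_ids = 2;

// Suggested simplification: a plain if-else instead of converting the
// index first and then range-checking it. This is safe because a
// concrete requested index can never exceed num_active_node_ids.
unsigned column_for_request(unsigned requested_node_index) {
  if (requested_node_index == AnyNodeIndex) {
    return num_active_node_ids;  // the extra "any node" column
  }
  return requested_node_index;
}
```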
---

src/hotspot/share/gc/g1/g1ParScanThreadState.cpp
---
I would prefer if we hid all the accounting in helper functions; it 
might also be good to declare them inline.

   85   if (_numa->is_enabled()) {
   86     LogTarget(Info, gc, heap, numa) lt;
   87
   88     if (lt.is_enabled()) {
   89       uint num_nodes = _numa->num_active_nodes();
   90       // Record only if there are multiple active nodes.
   91       _obj_alloc_stat = NEW_C_HEAP_ARRAY(size_t, num_nodes, mtGC);
   92       memset((void*)_obj_alloc_stat, 0, sizeof(size_t) * num_nodes);
   93     }
   94   }

Move to something like initialize_numa_stats().

  108   if (_obj_alloc_stat != NULL) {
  109     uint node_index = _numa->index_of_current_thread();
  110     _numa->copy_statistics(G1NodeTimes::LocalObjProcessAtCopyToSurv, node_index, _obj_alloc_stat);
  111   }

This could be called flush_numa_stats().

  268     if (_obj_alloc_stat != NULL) {
  269       _obj_alloc_stat[node_index]++;
  270     }

And this could be called something like update_numa_stats(uint).
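Put together, the helpers could look roughly like this (a simplified 
stand-alone sketch using new/delete instead of NEW_C_HEAP_ARRAY, with a 
plain getter standing in for the copy_statistics() call that 
flush_numa_stats() would make):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Sketch of hiding the per-node allocation accounting behind helpers.
class ScanThreadStateSketch {
  size_t* _obj_alloc_stat;
public:
  ScanThreadStateSketch() : _obj_alloc_stat(nullptr) {}
  ~ScanThreadStateSketch() { delete[] _obj_alloc_stat; }

  // Lazily set up the per-node counters (only when NUMA logging is on).
  inline void initialize_numa_stats(unsigned num_nodes) {
    _obj_alloc_stat = new size_t[num_nodes];
    std::memset(_obj_alloc_stat, 0, sizeof(size_t) * num_nodes);
  }

  // Bump the counter for the node an object was allocated on.
  inline void update_numa_stats(unsigned node_index) {
    if (_obj_alloc_stat != nullptr) {
      _obj_alloc_stat[node_index]++;
    }
  }

  // flush_numa_stats() would hand the counts to the global statistics;
  // here a plain getter stands in for that call.
  inline size_t numa_stat(unsigned node_index) const {
    return _obj_alloc_stat != nullptr ? _obj_alloc_stat[node_index] : 0;
  }
};
```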
---

heapRegionSet.hpp
---
159   inline void update_length(HeapRegion* hr, bool increase);
254   inline void update_length(HeapRegion* hr, bool increase);

Is there any reason for having an update_length that takes a bool rather 
than having one function for increments and one for decrements? To me it 
looks like all uses are pretty well defined, and separate functions would 
make the code easier to read. I also think we could pass in the node 
index rather than the HeapRegion, since the getter length() does this.
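Something along these lines (a simplified stand-alone sketch; the real 
sets track more than a plain per-node count):

```cpp
#include <cassert>
#include <vector>

// Sketch of splitting update_length(HeapRegion*, bool) into explicit
// increment/decrement functions keyed by node index, mirroring how the
// getter length(uint node_index) is already addressed.
class NodeLengths {
  std::vector<unsigned> _lengths;
public:
  explicit NodeLengths(unsigned num_nodes) : _lengths(num_nodes, 0) {}
  void increment_length(unsigned node_index) { _lengths[node_index]++; }
  void decrement_length(unsigned node_index) { _lengths[node_index]--; }
  unsigned length(unsigned node_index) const { return _lengths[node_index]; }
};
```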
---

src/hotspot/share/gc/g1/g1NodeTimes.cpp
---
First, a question about the names: G1NodeTimes signals that it has to do 
with timing, but currently we don't really record any timings. The same 
goes for NodeStatPhases; these are not really the same type of phases 
that we have for the rest of the GC logging. What do you think about 
renaming the class to G1NUMAStats and the enum to NodeDataItems?

  166 void G1NodeTimes::print_phase_info(G1NodeTimes::NodeStatPhases phase) {
  167   LogTarget(Info, gc, heap, numa) lt;

I think this should be on debug level, but if you don't agree leave it 
as is.
---

  191 void G1NodeTimes::print_mutator_alloc_stat_debug() {
  192   LogTarget(Debug, gc, heap, numa) lt;

And if you agree on moving the above to debug I think this should be on 
trace level.
---

This is it for now. Thanks,
Stefan


> Testing: hs-tier 1 ~ 4 with/without UseNUMA. hs-tier5 is almost finished 
> without new failures.
> 
> Thanks,
> Sangheon
> 
> 


