Discussion for 8226197: Reducing G1’s CPU cost with simplified write post-barrier and disabling concurrent refinement

Sat Jun 15 18:16:39 UTC 2019

Hi Man,

  very initial comments from me:

On Fri, 2019-06-14 at 18:41 -0700, Man Cao wrote:
> Hi all,
> 
> I'd like to discuss the feasibility of supporting a new mode of G1
> that uses a simplified write post-barrier. The idea is basically
> trading off some pause time with CPU time, more details are in:
> https://bugs.openjdk.java.net/browse/JDK-8226197
> 
> A prototype implementation is here:
> https://cr.openjdk.java.net/~manc/8226197/webrev.00/
> 
> At a high level, other than the maintenance issue of supporting two
> different types of write barrier, is there anything inherently wrong
> about this approach? I have run fastdebug build with various GC 

No, it's not a wrong approach. We did a similar prototype in the 8u20
timeframe (2015?) with partially good results. There should be some CRs
in the bug tracker referencing a "Throughput mode/Throughput
remembered set" (not sure right now whether they are public). However,
we haven't had the time to look into it in more detail since then.

There is the question of maintenance too; this mode may require a
significant additional amount of (continuous) testing. This resource
problem may be a bigger issue than the implementation itself...

> verification options turned on to stress test the prototype, and so
> far I have not found any failures due to the prototype.
> 
> For the patch itself, besides the changes to files related to the
> write barrier, the majority of the change is in g1RemSet.cpp, where
> it needs to scan all dirty cards for regions not in the collection
> set. This phase (called process_card_table()) replaces
> the update_rem_set() phase during evacuation pause, and is similar
> to ClearNoncleanCardWrapper::do_MemRegion() for CMS.

Did you look at the latest JDK-8213108 changes? That should be very
similar to what you describe (on a very high level) for the current
remembered set :) As a side effect it also improves pause time
performance quite a bit :P

I have not looked at your code yet, but I believe this change could
simply reuse that code with some "minor" evacuation setup changes.
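
To make sure we are talking about the same thing, here is a minimal toy
sketch of how I read the description above: a post-barrier that only
dirties the card, plus a pause-time pass over dirty cards in regions
outside the collection set. This is not HotSpot code and not taken from
the webrev; ToyHeap, the card and region sizes, and the card values are
assumptions for illustration, and process_card_table() only borrows the
name used in the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: every name and constant here is an assumption.
constexpr std::size_t kCardShift = 9;       // 512-byte cards
constexpr std::size_t kCardsPerRegion = 4;  // tiny regions, for illustration
constexpr std::uint8_t kClean = 0xff;
constexpr std::uint8_t kDirty = 0x00;

struct ToyHeap {
  std::vector<std::uint8_t> cards;      // one byte per card
  std::vector<bool> in_collection_set;  // one flag per region

  explicit ToyHeap(std::size_t regions)
      : cards(regions * kCardsPerRegion, kClean),
        in_collection_set(regions, false) {}

  // Simplified post-barrier: unconditionally dirty the card covering the
  // written field. No filtering and no enqueueing for refinement threads.
  void post_write_barrier(std::size_t field_offset) {
    std::size_t card = field_offset >> kCardShift;
    assert(card < cards.size());
    cards[card] = kDirty;
  }

  // Pause-time pass: collect (and clean) dirty cards in regions that are
  // not in the collection set. A real collector would scan the objects on
  // each such card for references into the collection set.
  std::vector<std::size_t> process_card_table() {
    std::vector<std::size_t> to_scan;
    for (std::size_t r = 0; r < in_collection_set.size(); ++r) {
      if (in_collection_set[r]) continue;  // this toy leaves CSet cards alone
      for (std::size_t i = 0; i < kCardsPerRegion; ++i) {
        std::size_t card = r * kCardsPerRegion + i;
        if (cards[card] == kDirty) {
          to_scan.push_back(card);
          cards[card] = kClean;
        }
      }
    }
    return to_scan;
  }
};
```

The point of the sketch is only that all the work moves from the mutator
and the refinement threads into the pause: the barrier is a single
store, and the pause has to walk the card table.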

> There are certainly many things can be improved for the patch,
> e.g., G1Analytics should take into account and adapt to the time
> spent in process_card_table(), and process_card_table() could be 

JDK-8213108 also revamps how work is attributed to several phases. From
my experience with that implementation, changes in that area are
necessary to get the best operation (or at least to avoid falling into
some ugly performance potholes in some cases). I honestly expect this
change to require similar care in this area.

> further optimized to speed up the scanning. We'd like to discuss 
> about this approach in general before further improving it.
> In addition, we will collect more performance results with real
> production workload in a week or two.

Our experience with our prototype is (from memory):

- it significantly improves throughput; in some cases, however, you only
hit other throughput-inhibiting problems, like JDK-8131668 or
JDK-8159429, so there is no impact ;)
You get very close to or exceed Parallel/CMS with that in the "well
working" cases though.

- it worsened latencies, i.e. increased pause times, in some cases, as
refinement can avoid doing lots of work in the pause that is not
necessary for the current evacuation; of course this depends on the
implementation.
Note that that prototype is really old now and did not include
significant changes to ergonomics, which may have caused the longer
pauses, so results might be quite different today.

Such cases may be better served by the current implementation too.

From internal discussions, we found that there are a lot of options
when implementing such a mode though, ranging from basically Parallel
GC to something very close to current G1. E.g. I would still push
"refinement" of cards, i.e. actual addition of remembered set entries
to a concurrent phase.
So no particular comment about this implementation until I have had
time to look at it and have more information about your plans for it :)
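
To illustrate what I mean by "refinement" as a separate step, here is a
rough toy sketch in which the barrier still only dirties cards, while a
separate pass (which could run concurrently) turns dirty cards into
remembered set entries. Again, this is not G1 code: ToyRemSets,
record_store() and refine_dirty_cards() are made-up names, and recording
target regions per card stands in for actually scanning the objects on
the card.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy model only: every name here is an assumption.
struct ToyCard {
  bool dirty = false;
  // Regions referenced from this card; stands in for scanning its objects.
  std::vector<std::size_t> target_regions;
};

struct ToyRemSets {
  std::size_t cards_per_region;
  std::vector<ToyCard> cards;
  // Per region: indices of cards in other regions that point into it.
  std::vector<std::set<std::size_t>> rem_set;

  ToyRemSets(std::size_t regions, std::size_t cards_per_region_)
      : cards_per_region(cards_per_region_),
        cards(regions * cards_per_region_),
        rem_set(regions) {}

  // Mutator side: remember the reference and dirty the card, nothing else.
  void record_store(std::size_t card, std::size_t target_region) {
    assert(card < cards.size() && target_region < rem_set.size());
    cards[card].target_regions.push_back(target_region);
    cards[card].dirty = true;
  }

  // "Refinement": for every dirty card, add a remembered set entry to each
  // region it points into (ignoring same-region references), then clean
  // the card. Returns the number of cards refined.
  std::size_t refine_dirty_cards() {
    std::size_t refined = 0;
    for (std::size_t c = 0; c < cards.size(); ++c) {
      if (!cards[c].dirty) continue;
      std::size_t source_region = c / cards_per_region;
      for (std::size_t target : cards[c].target_regions) {
        if (target != source_region) rem_set[target].insert(c);
      }
      cards[c].dirty = false;
      ++refined;
    }
    return refined;
  }
};
```

Where this pass runs (in the pause, concurrently, or some mix) is
exactly the design space I am referring to.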

Either way I think it is not an "inherently wrong" path to go down if
your goal is improving throughput.

Finally, I believe this is a sufficiently large and significant change
to be worth a JEP: first, to describe its operation somewhere in
detail; second, to get people's attention so that they try it out :)

Thanks,
  Thomas