Re: Discussion for 8226197: Reducing G1’s CPU cost with simplified write post-barrier and disabling concurrent refinement

Tue Jun 18 00:52:48 UTC 2019

> On Jun 14, 2019, at 9:41 PM, Man Cao <manc at google.com> wrote:
> 
> Hi all,
> 
> I'd like to discuss the feasibility of supporting a new mode of G1 that uses a simplified write post-barrier. The idea is basically to trade some pause time for CPU time; more details are in:
> https://bugs.openjdk.java.net/browse/JDK-8226197
> 
> A prototype implementation is here:
> https://cr.openjdk.java.net/~manc/8226197/webrev.00/
> 
> At a high level, other than the maintenance issue of supporting two different types of write barrier, is there anything inherently wrong about this approach? I have run a fastdebug build with various GC verification options turned on to stress test the prototype, and so far I have not found any failures due to the prototype.
> 
> For the patch itself, besides the changes to files related to the write barrier, the majority of the change is in g1RemSet.cpp, where it needs to scan all dirty cards for regions not in the collection set. This phase (called process_card_table()) replaces the update_rem_set() phase during the evacuation pause, and is similar to ClearNoncleanCardWrapper::do_MemRegion() for CMS.
> There are certainly many things that can be improved in the patch, e.g., G1Analytics should take into account and adapt to the time spent in process_card_table(), and process_card_table() could be further optimized to speed up the scanning. We'd like to discuss this approach in general before further improving it.
> In addition, we will collect more performance results with real production workload in a week or two.
> 
> -Man

As Thomas said, we (the Oracle GC team) have been considering
something like this (off and on) for some time; it has come up for
discussion several times. So far, we haven't pursued the idea to
completion / integration, and there are a number of reasons for this.

Fundamentally the idea is to improve G1's throughput performance at
the (potential) expense of its latency behavior. Let's call this
G1-throughput mode below.
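
To make the trade-off concrete, here is a rough toy model of the
proposal as described in the quoted message: the mutator's
post-barrier only dirties a card, and the pause then has to walk the
card table to find the dirty cards. This is a sketch under stated
assumptions, not HotSpot code and not the prototype's implementation.
ToyCardTable, post_write_barrier, scan_and_clean and the constants are
all hypothetical names and values chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: names, sizes and encodings are assumptions for this
// sketch, not taken from the prototype or from g1RemSet.cpp.
constexpr std::size_t kCardShift = 9;      // assume 512-byte cards
constexpr std::uint8_t kClean = 0xff;      // assumed "clean" encoding
constexpr std::uint8_t kDirty = 0x00;      // assumed "dirty" encoding

struct ToyCardTable {
  std::vector<std::uint8_t> cards;

  // One card byte per 512 bytes of (toy) heap, all initially clean.
  explicit ToyCardTable(std::size_t heap_bytes)
      : cards((heap_bytes + (std::size_t{1} << kCardShift) - 1) >> kCardShift,
              kClean) {}

  // Simplified post-barrier: unconditionally dirty the card that covers
  // the written field. No filtering and no enqueueing for refinement,
  // so the mutator-side cost is a shift and a byte store.
  void post_write_barrier(std::size_t field_offset) {
    std::size_t idx = field_offset >> kCardShift;
    assert(idx < cards.size());
    cards[idx] = kDirty;
  }

  // Pause-time work: visit every card in [first_card, last_card), count
  // the dirty ones and reset them to clean. A real collector would scan
  // the objects on each dirty card here; the toy only counts them.
  std::size_t scan_and_clean(std::size_t first_card, std::size_t last_card) {
    assert(first_card <= last_card && last_card <= cards.size());
    std::size_t dirty = 0;
    for (std::size_t i = first_card; i < last_card; ++i) {
      if (cards[i] == kDirty) {
        ++dirty;
        cards[i] = kClean;
      }
    }
    return dirty;
  }
};
```

In this model the barrier is constant-time, while scan_and_clean is
linear in the number of cards covered, whether or not they are dirty.
That is the shape of the trade: less work per reference store, more
work inside the pause.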

One argument against this is that if what one cares about is
throughput, then we already have a throughput-oriented collector
(ParallelGC), and one should just use that. G1-throughput mode isn't
useful for a pure throughput use-case unless it can beat ParallelGC.
There might be some cases where that happens, but it's not clear how
common that is.

One knock against ParallelGC is that the latency can be *really* bad.
Even if G1-throughput mode doesn't beat ParallelGC for throughput,
there may be mostly throughput-oriented applications that are somewhat
sensitive to latency, and such could benefit from G1-throughput mode.
But we're not sure how common such applications really are.

The cost of adding G1-throughput mode should not be discounted. "other
than the maintenance issue of supporting two types of write barrier"
kind of trivializes that cost. From a testing point of view, it's
pretty close to being a whole new collector, likely requiring running
a large number of tests in G1-throughput mode. We think the additional
testing cost and potential bug tail are a substantial downside.

So far, we've ended up not pursuing this course and instead focusing
our efforts on narrowing the space between G1 and ParallelGC, mostly
by improving G1's throughput performance and by better ergonomics and
tuning guidance. Thomas may have some data on what progress has been
made, and there are lots of good ideas left to pursue. (JDK-8220465
would help on the ParallelGC side; unfortunately, nobody from Oracle
has had time to devote to it.) The idea is to narrow the application
space between G1 and Parallel where G1-throughput mode naturally lives.

(I've only given the proposed changeset a cursory skim so far. I'm
waiting for discussion on the high-level question of whether this is a
direction that will be pursued.)