<div dir="ltr"><div>Thanks for both responses!</div><div>I will certainly take a look at JDK-8213108, and will rebase our patch on top of it. Hopefully that will make our patch smaller.<br></div><div><br></div><div>Regarding whether to support G1-throughput mode in the long term, could we set up a video conference to chat about it?</div><div>Below is some reasoning on why we would want to have the throughput mode.<br></div><div><br></div><div>We have been thinking about something like a throughput mode even before the idea of the simplified write barrier.</div><div>For example, for some throughput-oriented workloads with very large heaps, repeatedly moving old-gen objects in mixed collections can be costly.</div><div>We found that setting "-XX:InitiatingHeapOccupancyPercent=100 -XX:-G1UseAdaptiveIHOP" to disable concurrent and mixed collections is quite helpful in that case.</div><div>The simplified write barrier is a different direction for improving throughput, and users would still have the option to keep concurrent and mixed collections enabled.<br></div><div>These different approaches could certainly be presented in one JEP for the throughput mode.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">One argument against this is that if what one cares about is<br>throughput, then we already have a throughput-oriented collector<br>(ParallelGC), and one should just use that. G1-throughput mode isn't<br>useful for a pure throughput use-case unless it can beat ParallelGC.<br>There might be some cases where that happens, but it's not clear how<br>common that is.<br>One knock against ParallelGC is that the latency can be *really* bad.<br>Even if G1-throughput mode doesn't beat ParallelGC for throughput,<br>there may be mostly throughput-oriented applications that are somewhat<br>sensitive to latency, and such could benefit from G1-throughput mode.<br>But we're not sure how common such applications really are.</blockquote><div><br></div><div>It would save us a lot of maintenance work if we only needed to support one garbage collector.</div><div>On the other hand, G1 is supposedly an all-around collector that can be tuned for either pause time or throughput. It would be good to make G1 perform well when it is tuned for throughput.</div><div><br></div><div>Most of our workloads that are highly tuned for CMS trigger young-gen collections frequently, but rarely any concurrent collections.<br></div><div>Arguably ParallelGC might also work for these workloads, but as you mentioned, the tail latency could be really bad.</div><div>For these workloads, the concurrent collections in CMS are more of a safety net that collects the garbage that occasionally spills into the old-gen.</div><div><br></div><div>Migrating these workloads to G1 in JDK 11 does not show much reduction in latency, but a significant increase in CPU usage, which directly translates to a big drop in queries-per-second (because the CPU quota is the same).<br></div><div>The simplified write-barrier approach will certainly help these cases by reducing CPU usage.<br></div>
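<div>To make the trade-off concrete, here is a rough, self-contained C++ sketch contrasting the two post-barriers. It is only a toy card-table model; the constants, card values and queue are made up, and it is neither HotSpot code nor the actual patch:<br></div><div><br></div><pre>
// Toy model of a card table and the two write post-barriers. Illustrative
// only: constants, card values and the queue are made up; this is neither
// HotSpot code nor the actual patch.
typedef unsigned char CardValue;
typedef unsigned long uintptr;   // assume a 64-bit platform for the sketch

enum { kCardShift = 9, kRegionShift = 20 };      // 512-byte cards, 1 MiB regions
enum { kCleanCard = 0, kDirtyCard = 1, kYoungCard = 2 };

enum { kHeapBytes = 16 * 1024 * 1024 };                // model a 16 MiB heap
static CardValue g_cards[kHeapBytes >> kCardShift];    // one byte per card, zero-initialized to clean
static uintptr   g_dirty_card_queue[1024];             // stand-in for G1's dirty card queues
static unsigned  g_dcq_len = 0;

static CardValue* card_for(uintptr addr) { return &g_cards[addr >> kCardShift]; }

// Roughly the fast path of G1's current post-barrier on every reference store.
static void g1_post_barrier(uintptr field, uintptr new_val) {
  if (((field ^ new_val) >> kRegionShift) == 0) return;   // same region: no remset entry needed
  if (new_val == 0) return;                               // null store
  CardValue* card = card_for(field);
  if (*card == kYoungCard) return;                        // stores into young regions are filtered
  // (the real barrier issues a StoreLoad fence here before re-checking the card)
  if (*card == kDirtyCard) return;                        // someone already queued this card
  *card = kDirtyCard;
  g_dirty_card_queue[g_dcq_len++] = field >> kCardShift;  // enqueue for concurrent refinement
}

// The simplified barrier a throughput mode would use: just dirty the card, as
// Parallel/Serial GC do. The next pause then scans every dirty card (the
// process_card_table() phase in the prototype) instead of relying on refinement.
static void simplified_post_barrier(uintptr field) {
  CardValue* card = card_for(field);
  if (*card != kDirtyCard) *card = kDirtyCard;
}

int main() {
  g1_post_barrier(0x100000, 0x300000);   // cross-region store: card dirtied and enqueued
  simplified_post_barrier(0x500000);     // only dirties the card
  return g_dcq_len == 1 ? 0 : 1;         // sanity check
}
</pre><div><br></div><div>The CPU savings would come from dropping the cross-region and young-card filtering, the StoreLoad fence, and the enqueueing for concurrent refinement; the cost moves into the pause, which then has to scan the whole dirty part of the card table.</div>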
<div>Since the old-gen is lightly used, scanning all cards for the used part of the old-gen during a pause would not be prohibitively expensive.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">So far, we've ended up not pursuing this course and instead focusing<br>our efforts on narrowing the space between G1 and ParallelGC, mostly<br>by improving G1's throughput performance and by better ergonomics and<br>tuning guidance. Thomas may have some data on what progress has been<br>made, and there are lots of good ideas left to pursue. (JDK-8220465<br>would help on the ParallelGC side; unfortunately, nobody from Oracle<br>has had time to devote to it.) The idea is to narrow the application<br>space between G1 and Parallel where G1-throughput mode naturally lives.</blockquote><div><br></div><div>Without changing how the write barrier works, I'm skeptical that we can recover most of the increase in CPU usage for those workloads that mostly trigger young-gen collections.</div><div>As mentioned above, the issue is actually more about CPU usage than the typical definition of throughput (i.e. wall time to finish a fixed amount of work), due to how containerized environments and load balancing work. More CPU usage typically means the machine will receive less work.</div><div><br></div><div><div dir="ltr" class="m_-6136172448446788291m_-919424784070192031gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">-Man</div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jun 17, 2019 at 5:52 PM Kim Barrett <<a href="mailto:kim.barrett@oracle.com" target="_blank">kim.barrett@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">> On Jun 14, 2019, at 9:41 PM, Man Cao <<a href="mailto:manc@google.com" target="_blank">manc@google.com</a>> wrote:<br>
> <br>
> Hi all,<br>
> <br>
> I'd like to discuss the feasibility of supporting a new mode of G1 that uses a simplified write post-barrier. The idea is basically to trade off some pause time for CPU time; more details are in:<br>
> <a href="https://bugs.openjdk.java.net/browse/JDK-8226197" rel="noreferrer" target="_blank">https://bugs.openjdk.java.net/browse/JDK-8226197</a><br>
> <br>
> A prototype implementation is here:<br>
> <a href="https://cr.openjdk.java.net/~manc/8226197/webrev.00/" rel="noreferrer" target="_blank">https://cr.openjdk.java.net/~manc/8226197/webrev.00/</a><br>
> <br>
> At a high level, other than the maintenance issue of supporting two different types of write barrier, is there anything inherently wrong with this approach? I have run a fastdebug build with various GC verification options turned on to stress-test the prototype, and so far I have not found any failures due to it.<br>
> <br>
> For the patch itself, besides the changes to files related to the write barrier, the majority of the change is in g1RemSet.cpp, where it needs to scan all dirty cards for regions not in the collection set. This phase (called process_card_table()) replaces the update_rem_set() phase during the evacuation pause, and is similar to ClearNoncleanCardWrapper::do_MemRegion() for CMS.<br>
> There are certainly many things that can be improved in the patch, e.g., G1Analytics should take into account and adapt to the time spent in process_card_table(), and process_card_table() could be further optimized to speed up the scanning. We'd like to discuss this approach in general before improving it further.<br>
> In addition, we will collect more performance results with real production workloads in a week or two.<br>
> <br>
> -Man<br>
<br>
As Thomas said, we (the Oracle GC team) have been considering<br>
something like this (off and on) for some time; it has come up for<br>
discussion several times. So far, we haven't pursued the idea to<br>
completion / integration, and there are a number of reasons for this.<br>
<br>
Fundamentally the idea is to improve G1's throughput performance at<br>
the (potential) expense of its latency behavior. Let's call this<br>
G1-throughput mode below.<br>
<br>
One argument against this is that if what one cares about is<br>
throughput, then we already have a throughput-oriented collector<br>
(ParallelGC), and one should just use that. G1-throughput mode isn't<br>
useful for a pure throughput use-case unless it can beat ParallelGC.<br>
There might be some cases where that happens, but it's not clear how<br>
common that is.<br>
<br>
One knock against ParallelGC is that the latency can be *really* bad.<br>
Even if G1-throughput mode doesn't beat ParallelGC for throughput,<br>
there may be mostly throughput-oriented applications that are somewhat<br>
sensitive to latency, and such could benefit from G1-throughput mode.<br>
But we're not sure how common such applications really are.<br>
<br>
The cost of adding G1-throughput mode should not be discounted. "other<br>
than the maintenance issue of supporting two types of write barrier"<br>
kind of trivializes that cost. From a testing point of view, it's<br>
pretty close to being a whole new collector, likely requiring running<br>
a large number of tests in G1-throughput mode. We think the additional<br>
testing cost and potential bug tail are a substantial downside.<br>
<br>
So far, we've ended up not pursuing this course and instead focusing<br>
our efforts on narrowing the space between G1 and ParallelGC, mostly<br>
by improving G1's throughput performance and by better ergonomics and<br>
tuning guidance. Thomas may have some data on what progress has been<br>
made, and there are lots of good ideas left to pursue. (JDK-8220465<br>
would help on the ParallelGC side; unfortunately, nobody from Oracle<br>
has had time to devote to it.) The idea is to narrow the application<br>
space between G1 and Parallel where G1-throughput mode naturally lives.<br>
<br>
(I've only given the proposed changeset a cursory skim so far. I'm<br>
waiting for discussion on the high-level question of whether this is a<br>
direction that will be pursued.)<br>
<br>
</blockquote></div>