State of "simplified barriers" for G1

Fri Feb 5 08:47:10 UTC 2021

Hi All,

My apologies for the delayed reply. I've been busy rolling out JDK 11 to all
our production servers for the last year.

The current state is that the OpenJDK GC team and we have decided to
implement https://bugs.openjdk.java.net/browse/JDK-8226731 first, before
committing the simplified write barrier. We'd like to get rid of the
storeload fence even with Conc Refine enabled. Note that JDK-8230187
contains the most up-to-date description of the proposed simplified write
barrier; JDK-8226197 is a bit outdated. I aim to get both JDK-8226731
and JDK-8230187 into JDK 17. I'll send a separate email for JDK-8226731, as
there are still some challenges there.

Yude, thanks for sharing the idea and results! I think it is best to open
a new RFE for further improvement after JDK-8230187 is implemented.
If I understand correctly, the proposed approach avoids dirtying the cards
for old-to-old reference stores in young-only phases. That's a nice idea.
Are the results comparing the two types of simplified write barriers? Or are
they comparing the default barrier with the storeload fence against your
simplified write barrier that filters untracked regions?
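
To make sure I am reading the idea correctly, here is a toy C++ sketch of
the filtering post-write barrier as I understand it. It is not HotSpot
code: ToyHeap, filtered_post_barrier, the per-region "tracked" byte, and
the 1 MiB region / 512-byte card sizes are all assumptions made up for
illustration, and the real barrier would be emitted by the JIT rather
than written like this.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: every name and constant here is hypothetical.
constexpr std::size_t kRegionShift = 20;  // assumed 1 MiB regions
constexpr std::size_t kCardShift = 9;     // assumed 512-byte cards
constexpr std::uint8_t kClean = 0;
constexpr std::uint8_t kDirty = 1;

struct ToyHeap {
    std::uintptr_t base;
    std::vector<std::uint8_t> cards;    // one byte per card
    std::vector<std::uint8_t> tracked;  // one byte per region, 1 = tracked

    ToyHeap(std::uintptr_t heap_base, std::size_t regions)
        : base(heap_base),
          cards(regions << (kRegionShift - kCardShift), kClean),
          tracked(regions, 0) {}

    std::size_t region_of(std::uintptr_t addr) const {
        return (addr - base) >> kRegionShift;
    }
    std::size_t card_of(std::uintptr_t addr) const {
        return (addr - base) >> kCardShift;
    }
};

// Post-write barrier for "*field = new_value" with Conc Refine off.
// Returns true if this call dirtied a card.
inline bool filtered_post_barrier(ToyHeap& heap, std::uintptr_t field,
                                  std::uintptr_t new_value) {
    if (new_value == 0) return false;  // null store, nothing to remember
    if (heap.region_of(field) == heap.region_of(new_value)) {
        return false;                  // same-region store
    }
    // The proposed filter: drop the store if the target region is untracked,
    // e.g. an old region during the young-only phase.
    if (!heap.tracked[heap.region_of(new_value)]) return false;

    std::uint8_t& card = heap.cards[heap.card_of(field)];
    if (card == kDirty) return false;  // already dirty
    card = kDirty;                     // plain store, no fence or atomic
    return true;
}
```

With only the young regions flagged as tracked, an OldObj1.foo->OldObj2
store falls out at the flag test and leaves the card clean, while an
old-to-young store still dirties the card for the next pause to scan.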

-Man

On Tue, Dec 22, 2020 at 2:31 AM 林育德 <yude.lyd at alibaba-inc.com> wrote:

> Hi All,
>
> We are also interested in any follow-ups on this topic. If I recall
> correctly, when this was discussed in JDK-8226197, one of the TODOs was
> that the storeload fence can be skipped when Conc Refine is turned off.
> Regarding this, I'd like to share an idea we have been experimenting with
> over the last couple of months. We took "skipping the fence" a little
> further and tried to improve throughput with less harm to pause time.
>
> This is from the observation that many card dirtying operations can go
> away without concurrent refine. More specifically, writes that produce a
> reference OldObj1.foo->OldObj2 need not dirty the card corresponding to
> OldObj1 during young-gc-only phase. Currently, with Conc Refine, this
> operation will dirty that card, then the card will be refined (thrown away)
> by the refinement thread, because it discovers that the reference points to
> an Old region, which is "untracked" during young-gc-only phase.
>
> The refinement thread does this concurrently so that GC doesn't have to do
> it during a pause. But we (~lmao) realized that we can use a flag to
> indicate whether a region is tracked, and discard the card dirtying
> operation immediately in the barrier (after testing against the flag). We
> can do it without any atomics/fences, just ~5 instructions in the barrier.
> This way, we get rid of the storeload mem barrier, with Conc Refine turned
> off, while still getting the same pause time guarantee in young-gc-only
> phase. But as you can see, Mixed GCs still suffer from having no concurrent
> refinement.
>
> We saw improvements on Alibaba JDK11u across the benchmarks we used
> (positive number means better):
> Dacapo: cases vary from -3.3% to +5.1%, on average +0.3%
> specjbb2015 on 96x2.50GHz, 16 GC threads, 24g mem: critical-jOPS +1.9%,
> max-jOPS +2.8%
> specjbb2015 on 8x2.50GHz, 8 GC threads, 16g mem (observed more Mixed GCs):
> critical-jOPS +0.1%, max-jOPS +5.7%
> specjvm2008: cases vary from -0.7% to +23.4%, on average +3.1%
> Extremem: cases vary from -2.1% to +7.8%, on average +1.0%
> I'd love to hear any feedback or comments, what problems you see in
> this approach, conceptually or practically, and, back to the topic, whether
> this idea can be incorporated into your future work on creating a
> simplified barrier.
>
> Yude Lin
>
>
> ------------------------------------------------------------------
> From: Gerhard Hueller <ghueller at outlook.com>
> Sent: Monday, December 21, 2020 03:19
> To: hotspot-gc-dev at openjdk.java.net <hotspot-gc-dev at openjdk.java.net>
> Subject: State of "simplified barriers" for G1
>
> Hi,
>
> I remember a slide deck talking about the improvements to G1 since JDK8/9
> and one bullet point on the todo-list was simplified barriers for G1.
>
> I wonder what happened to this improvement. Has it already been
> implemented? Is this the non-concurrent refinement option implemented by
> Google some time ago?
> Improvements in this area would be really great. CMS still provides better
> throughput for most workloads; the only real advantage G1 offers is
> avoiding those degenerate STW full GCs.
>
> Thanks, Gerhard