Re: State of "simplified barriers" for G1
Yude Lin (林育德)
yude.lyd at alibaba-inc.com
Tue Dec 22 10:31:02 UTC 2020
Hi All,
We are also interested in any follow-ups on this topic. If I recall correctly, when this was discussed in JDK-8226197, one of the TODOs was that the StoreLoad fence could be skipped when concurrent refinement (Conc Refine) is turned off. On that note, I'd like to share an idea we have been experimenting with over the last couple of months: we took "skipping the fence" a little further and tried to improve throughput with less harm to pause time.
The idea comes from the observation that many card dirtying operations can go away without concurrent refinement. More specifically, a write that creates a reference OldObj1.foo -> OldObj2 need not dirty the card covering OldObj1 during the young-gc-only phase. Currently, with Conc Refine on, such a write dirties that card, and the card is then refined (thrown away) by the refinement thread once it discovers that the reference points into an Old region, which is "untracked" during the young-gc-only phase.
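To make that concrete, here is a small standalone model of the current post-write barrier for a store "x.field = y". It is illustrative only: the names, card size and card values are made up, the card table is a masked stand-in array rather than a biased table, and the real code lives in the barrier set / assembler stubs.

  #include <atomic>
  #include <cstddef>
  #include <cstdint>
  #include <cstring>

  static inline std::uintptr_t addr(void* p) {
    return reinterpret_cast<std::uintptr_t>(p);
  }

  static const int          kLogRegionBytes = 21;        // assume 2M heap regions
  static const int          kCardShift      = 9;         // assume 512-byte cards
  static const std::uint8_t kDirtyCard      = 0;
  static const std::uint8_t kCleanCard      = 0xff;
  static const std::uint8_t kYoungCard      = 2;         // cards covering young regions
  static const std::size_t  kCardTableSize  = 1u << 20;  // stand-in size; the VM uses a
                                                         // biased table instead of masking
  static std::uint8_t card_table[kCardTableSize];

  static void init_card_table() {
    std::memset(card_table, kCleanCard, sizeof(card_table));
  }

  static void enqueue_dirty_card(std::uint8_t* /*card*/) {
    // Stub: in the VM this pushes the card onto the dirty card queue that the
    // concurrent refinement threads drain.
  }

  void g1_post_write_barrier(void* x, void* y) {
    // Same-region stores never need a remembered set entry.
    if (((addr(x) ^ addr(y)) >> kLogRegionBytes) == 0) return;
    if (y == nullptr) return;

    std::uint8_t* card = &card_table[(addr(x) >> kCardShift) & (kCardTableSize - 1)];
    if (*card == kYoungCard) return;            // the field lives in a young region

    // StoreLoad fence: the reference store must be visible before a refinement
    // thread can observe the dirty card and scan it.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    if (*card == kDirtyCard) return;            // already dirty, nothing to do
    *card = kDirtyCard;

    // The refinement thread scans the card concurrently and, in the
    // young-gc-only phase, simply discards an OldObj1.foo -> OldObj2 reference
    // because OldObj2's region is untracked.
    enqueue_dirty_card(card);
  }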
The refinement thread does this concurrently so that the GC doesn't have to do it during a pause. But we realized that we can use a per-region flag to indicate whether a region is tracked, and discard the card dirtying operation immediately in the barrier after testing that flag. This needs no atomics or fences, just about 5 extra instructions in the barrier. This way we get rid of the StoreLoad memory barrier, with Conc Refine turned off, while still getting the same pause time guarantee in the young-gc-only phase. But as you can see, Mixed GCs still suffer from having no concurrent refinement.
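Conceptually the filtered barrier looks like the sketch below. Again this is a simplified model, not our actual patch: it reuses the constants and helpers from the model above, and the per-region flag and struct names are made up.

  struct RegionMeta {
    bool remset_is_tracked;   // young / collection-set candidate regions: true
  };

  static const std::size_t kMaxRegions = 1u << 12;        // stand-in region count
  static RegionMeta region_metadata[kMaxRegions];

  static inline RegionMeta* region_of(void* p) {
    return &region_metadata[(addr(p) >> kLogRegionBytes) & (kMaxRegions - 1)];
  }

  void g1_post_write_barrier_no_refinement(void* x, void* y) {
    if (((addr(x) ^ addr(y)) >> kLogRegionBytes) == 0) return;
    if (y == nullptr) return;

    // New filter: a reference into an untracked region needs no remembered set
    // entry, so an OldObj1.foo -> OldObj2 store during the young-gc-only phase
    // falls out right here, at the cost of roughly a load, a test and a branch.
    if (!region_of(y)->remset_is_tracked) return;

    // With refinement off there is no concurrent consumer of the card table,
    // so no StoreLoad fence and no enqueue: the dirty card simply stays on the
    // table and is scanned by the GC threads at the next pause.
    std::uint8_t* card = &card_table[(addr(x) >> kCardShift) & (kCardTableSize - 1)];
    if (*card != kDirtyCard) {
      *card = kDirtyCard;
    }
  }

(The "tracked" state in the sketch corresponds to the per-region remembered set tracking G1 already maintains; the point is only to show how cheaply the barrier can consult it.)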
We saw improvements on Alibaba JDK11u across the benchmarks we used (positive numbers mean improvement):
Dacapo: cases vary from -3.3% to +5.1%, on average +0.3%
specjbb2015 on 96x2.50GHz, 16 GC threads, 24g mem: critical-jOPS +1.9%, max-jOPS +2.8%
specjbb2015 on 8x2.50GHz, 8 GC threads, 16g mem (observed more Mixed GCs): critical-jOPS +0.1%, max-jOPS +5.7%
specjvm2008: cases vary from -0.7% to +23.4%, on average +3.1%
Extremem: cases vary from -2.1% to +7.8%, on average +1.0%
I'd love to hear any feedback or comments, what problems you see in this approach, conceptually or practically, and, back to the original topic, whether this idea could be incorporated into your future work on a simplified barrier.
Yude Lin
------------------------------------------------------------------
From: Gerhard Hueller <ghueller at outlook.com>
Sent: Monday, December 21, 2020, 03:19
To: hotspot-gc-dev at openjdk.java.net <hotspot-gc-dev at openjdk.java.net>
Subject: State of "simplified barriers" for G1
Hi,
I remember a slide deck about the improvements to G1 since JDK 8/9, and one bullet point on the to-do list was simplified barriers for G1.
I wonder what happened to this improvement: has it already been implemented? Is it the non-concurrent refinement option implemented by Google some time ago?
Improvements in this area would be really great. CMS still provides better throughput for most workloads, and the only real advantage G1 offers is avoiding those degenerate STW full GCs.
Thanks, Gerhard