State of "simplified barriers" for G1

Fri Feb 5 12:18:18 UTC 2021

Hi all,

   sorry for chiming in so late; due to holidays and an email server
move, this thread got lost.

On 05.02.21 09:47, Man Cao wrote:
> Hi All,
> 
> My apology for postponing this. I've been busy rolling out JDK 11 to all
> our production servers for the last year.
[...]
> and JDK-8230187 in JDK 17. I'll send a separate email for JDK-8226731, as
> there are still some challenges there.

Great to hear!

> 
> Yude, thanks for sharing the idea and results! I think it is best to open
> a new RFE for further improvement after JDK-8230187 is implemented.
> If I understand correctly, the proposed approach avoids dirtying the cards
> for old-to-old reference stores in young-only phases. That's a nice idea.
> Are the results comparing the two types of simplified write barriers? Or are
> they comparing the default barrier with the storeload fence vs. your
> simplified write barrier that filters untracked regions?
> 
> -Man
> 
> 
> On Tue, Dec 22, 2020 at 2:31 AM 林育德 <yude.lyd at alibaba-inc.com> wrote:
> 
>> Hi All,
>>
>> We are also interested in any follow-ups on this topic. If I recall
>> correctly, when this was discussed in JDK-8226197, one of the TODOs was
>> that the storeload fence can be skipped when Conc Refine is turned off.
>> Regarding this, I'd like to share an idea we have been experimenting with
>> over the last couple of months. We took "skipping the fence" a little
>> further and tried to improve the throughput with less harm to pause time.
>>
>> This is from the observation that many card dirtying operations can go
>> away without concurrent refine. More specifically, writes that produce a
>> reference OldObj1.foo->OldObj2 need not dirty the card corresponding to
>> OldObj1 during young-gc-only phase. Currently, with Conc Refine, this
>> operation will dirty that card, then the card will be refined (thrown away)
>> by the refinement thread, because it discovers that the reference points to
>> an Old region, which is "untracked" during young-gc-only phase.
>>
>> The refinement thread does this concurrently so that GC doesn't have to do
>> it during a pause. But we (~lmao) realized that we can use a flag to
>> indicate whether a region is tracked, and discard the card dirtying
>> operation immediately in the barrier (after testing against the flag). We
>> can do it without any atomics/fences, just ~5 instructions in the barrier.
>> This way, we get rid of the storeload mem barrier, with Conc Refine turned
>> off, while still getting the same pause time guarantee in young-gc-only
>> phase. But as you can see, Mixed GCs still suffer from having no concurrent
>> refinement.
>>
>> We saw improvements on Alibaba JDK11u across the benchmarks we used
>> (positive number means better):
>> Dacapo: cases vary from -3.3% to +5.1%, on average +0.3%
>> specjbb2015 on 96x2.50GHz, 16 GC threads, 24g mem: critical-jOPS +1.9%,
>> max-jOPS +2.8%
>> specjbb2015 on 8x2.50GHz, 8 GC threads, 16g mem (observed more Mixed GCs):
>> critical-jOPS +0.1%, max-jOPS +5.7%
>> specjvm2008: cases vary from -0.7% to +23.4%, on average +3.1%
>> Extremem: cases vary from -2.1% to +7.8%, on average +1.0%
>> I'd love to hear any feedback, comments, what problems you can see in
>> this approach, conceptually or practically, and back to the topic, whether
>> this idea can be incorporated into your future work/plan of creating a
>> simplified barrier.

Fwiw, this sounds like what I was trying when I was working on remembered
sets and barriers for something like G1.

From what I remember, these changes yielded mixed results (for DaCapo
and other small benchmarks on contemporary desktop machines) similar
to yours, so the idea was dropped at that time (the comparison point
you gave is not clear, and I do not remember what exactly I compared).

Basically there was a table containing a word that says whether we track
outgoing (i.e. what the "young" marks on the card table currently do) or
incoming references (i.e. whether the region needs remembered set
updates), which sounds very similar to what you have done.

If concurrent refinement is turned off you do not need the storeload;
you are correct that it can then be advantageous to avoid dirtying cards
as much as possible to decrease work during gc.
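
As a rough illustration of that filtering idea, here is a toy model of a
post-write barrier that tests a "tracked" flag before dirtying a card. It
is Python purely for readability and is not HotSpot code: the Heap class,
the method name, and the region and card sizes are all assumptions made
for this sketch.

```python
# Toy model of a post-write barrier that filters on a "tracked" flag.
# All names and sizes are illustrative assumptions, not HotSpot code.

REGION_SHIFT = 20   # assumed 1 MiB regions
CARD_SHIFT = 9      # assumed 512-byte cards


class Heap:
    def __init__(self, num_regions):
        # tracked[r] is True when region r needs remembered set updates
        # for incoming references.
        self.tracked = [False] * num_regions
        self.dirty_cards = set()

    def post_write_barrier(self, field_addr, new_value_addr):
        """Run after a reference to new_value_addr is stored at field_addr."""
        if new_value_addr is None:
            return  # null stores never create a cross-region reference
        src_region = field_addr >> REGION_SHIFT
        dst_region = new_value_addr >> REGION_SHIFT
        if src_region == dst_region:
            return  # same-region references are not remembered
        if not self.tracked[dst_region]:
            return  # destination is untracked: skip the card mark entirely
        # Plain store, no fence: nothing refines cards concurrently here.
        self.dirty_cards.add(field_addr >> CARD_SHIFT)


heap = Heap(num_regions=4)
heap.tracked[1] = True                 # region 1 stands in for a young region

old_field = (0 << REGION_SHIFT) + 64   # a field inside old region 0
heap.post_write_barrier(old_field, (2 << REGION_SHIFT) + 8)  # old -> untracked old
cards_after_old_to_old = set(heap.dirty_cards)
heap.post_write_barrier(old_field, (1 << REGION_SHIFT) + 8)  # old -> tracked region
```

In this model the old-to-old store leaves the card table untouched, and
only the store into the tracked region dirties the card that covers the
field.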

Also, as you might have noticed from CRs being filed, we are actively
thinking about improving the current barriers wrt code size (e.g.
JDK-8256279, JDK-8256282, ... not sure if everything we thought of has
been filed yet) and general footprint (e.g. refactoring the
PtrQueues, dropping some TLS data to make room for other data to
decrease code size).

>>
>> Yude Lin
>>

Thanks,
   Thomas