Perf: SATB and WB coalescing
Roman Kennke
rkennke at redhat.com
Wed Jan 10 11:12:41 UTC 2018
On 10.01.2018 10:45, Aleksey Shipilev wrote:
> If you do a few back-to-back reference stores, like this:
>
> http://icedtea.classpath.org/hg/gc-bench/file/6ec38e1bea7a/src/main/java/org/openjdk/gcbench/wip/BarriersMultiple.java
>
> Then you will find that WB coalescing breaks because of the SATB barriers in between. See:
>
> *) No WB, no SATB -> back-to-back stores:
> http://cr.openjdk.java.net/~shade/shenandoah/perf-wb-satb/noWB-noSATB.perfasm
>
> *) WB, but no SATB -> initial evac-in-progress check, then back-to-back stores with RBs:
> http://cr.openjdk.java.net/~shade/shenandoah/perf-wb-satb/WB-noSATB.perfasm
>
> *) WB with SATB -> interleaved evac-in-progress and conc-mark-in-progress checks:
> http://cr.openjdk.java.net/~shade/shenandoah/perf-wb-satb/WB-SATB.perfasm
>
> It seems the impact of the non-coalesced SATB barriers alone is the main culprit, and the loss of
> WB coalescing is a second-order effect:
>
> Benchmark                                 Mode  Cnt   Score   Error  Units
>
> # Base
> BarriersMultiple.test                     avgt   15   2.739 ± 0.003  ns/op
> BarriersMultiple.test:L1-dcache-loads     avgt    3  13.128 ± 0.475   #/op
> BarriersMultiple.test:L1-dcache-stores    avgt    3   8.103 ± 0.133   #/op
> BarriersMultiple.test:branches            avgt    3   4.039 ± 0.213   #/op
> BarriersMultiple.test:cycles              avgt    3  10.344 ± 0.413   #/op
> BarriersMultiple.test:instructions        avgt    3  30.273 ± 1.280   #/op
>
> # +WB
> BarriersMultiple.test                     avgt   15   3.459 ± 0.011  ns/op
> BarriersMultiple.test:L1-dcache-loads     avgt    3  19.195 ± 0.638   #/op  // +6
> BarriersMultiple.test:L1-dcache-stores    avgt    3   8.080 ± 0.539   #/op
> BarriersMultiple.test:branches            avgt    3   4.045 ± 0.118   #/op
> BarriersMultiple.test:cycles              avgt    3  13.031 ± 0.324   #/op  // +3
> BarriersMultiple.test:instructions        avgt    3  40.426 ± 1.133   #/op
>
> # +SATB
> BarriersMultiple.test                     avgt   15   3.620 ± 0.005  ns/op
> BarriersMultiple.test:L1-dcache-loads     avgt    3  18.148 ± 0.519   #/op  // +5
> BarriersMultiple.test:L1-dcache-stores    avgt    3   8.065 ± 0.409   #/op
> BarriersMultiple.test:branches            avgt    3  13.115 ± 0.423   #/op
> BarriersMultiple.test:cycles              avgt    3  13.628 ± 0.471   #/op  // +3.5
> BarriersMultiple.test:instructions        avgt    3  49.421 ± 1.880   #/op
>
> # +SATB +WB
> BarriersMultiple.test                     avgt   15   4.923 ± 0.040  ns/op
> BarriersMultiple.test:L1-dcache-loads     avgt    3  28.269 ± 1.519   #/op  // +15 (should be +11)
> BarriersMultiple.test:L1-dcache-stores    avgt    3   8.112 ± 1.161   #/op
> BarriersMultiple.test:branches            avgt    3  13.134 ± 1.134   #/op
> BarriersMultiple.test:cycles              avgt    3  18.561 ± 1.198   #/op  // +8 (should be +6.5)
> BarriersMultiple.test:instructions        avgt    3  56.577 ± 4.024   #/op
>
> I wonder if that means we need to go forward with tracking the GC state in a single flag, polling
> it with different masks, and then coalescing the paths when the masks are similar?
>
> Thanks,
> -Aleksey
>
That confirms what I have suspected for a while. I also sort of hope
that the traversal GC will solve it, because it only ever polls a single
flag. We might even want to wrap RBs in evac-flag checks initially, so
that the optimizer can coalesce them too, and then remove the lone
evac-checks-around-RBs after optimization.
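
Very roughly, I picture the single-flag scheme like the sketch below.
All names are made up; this is not actual code, just the shape of it:

final class GCState {
    // One thread-local byte encodes the current GC phase(s), one bit each.
    static final int MARKING    = 1 << 0;  // conc-mark-in-progress: SATB barrier active
    static final int EVACUATION = 1 << 1;  // evac-in-progress: WB/RB active

    static boolean test(int state, int mask) {
        return (state & mask) != 0;
    }
}

// Conceptual shape of a reference store barrier:
//
//   int state = threadGcState();                 // one load of the flag byte
//   if (GCState.test(state, GCState.MARKING))    satbEnqueue(previousValue);
//   if (GCState.test(state, GCState.EVACUATION)) obj = writeBarrier(obj);
//   storeReference(obj, fieldOffset, value);
//
// Since every barrier polls the same byte, back-to-back stores can share
// a single load of the state, and checks with identical masks become
// candidates for coalescing.
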
Another related issue may be that the GC barriers, along with a bunch of
other stuff, pollute the raw memory slice. This means that an
interleaved allocation (among other things) between barriers may
prevent coalescing and optimization. I wonder if it makes sense to put
all GC barriers on a separate memory slice instead? We basically need a
memory slice that says 'stuff on this slice only ever changes at
safepoints'.
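
To illustrate, here is a hypothetical Java shape (made-up names) where
an allocation between two stores would keep the barrier checks apart:

class Holder { Object field; }

class Example {
    static Holder h1 = new Holder();
    static Holder h2 = new Holder();

    static void twoStores(Object a, Object b) {
        h1.field = a;               // barrier checks + store #1
        byte[] tmp = new byte[16];  // allocation writes on the raw memory slice
        h2.field = b;               // barrier checks #2: the raw-memory writes
                                    // in between keep them from coalescing with
                                    // #1, even though the GC flags can only
                                    // change at a safepoint
    }
}

With a dedicated changes-only-at-safepoints slice for the flags, the
flag loads would not alias the allocation's raw-memory writes, and the
checks could move together.
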
Roman