Perf: SATB and WB coalescing
Roman Kennke
rkennke at redhat.com
Wed Jan 10 11:12:41 UTC 2018
On 10.01.2018 10:45, Aleksey Shipilev wrote:
> If you do a few back-to-back reference stores, like this:
>
> http://icedtea.classpath.org/hg/gc-bench/file/6ec38e1bea7a/src/main/java/org/openjdk/gcbench/wip/BarriersMultiple.java
>
> Then you will find that WB coalescing breaks because of the SATB barriers in between. See:
>
> *) No WB, no SATB -> back-to-back stores:
> http://cr.openjdk.java.net/~shade/shenandoah/perf-wb-satb/noWB-noSATB.perfasm
>
> *) WB, but no SATB -> initial evac-in-progress check, then back-to-back stores with RBs:
> http://cr.openjdk.java.net/~shade/shenandoah/perf-wb-satb/WB-noSATB.perfasm
>
> *) WB with SATB -> interleaved evac-in-progress and conc-mark-in-progress checks:
> http://cr.openjdk.java.net/~shade/shenandoah/perf-wb-satb/WB-SATB.perfasm
>
> It seems the impact of the non-coalesced SATB barriers alone is the main culprit, and the loss of
> WB coalescing is a second-order effect:
>
> Benchmark                                 Mode  Cnt   Score   Error  Units
>
> # Base
> BarriersMultiple.test                     avgt   15   2.739 ± 0.003  ns/op
> BarriersMultiple.test:L1-dcache-loads     avgt    3  13.128 ± 0.475   #/op
> BarriersMultiple.test:L1-dcache-stores    avgt    3   8.103 ± 0.133   #/op
> BarriersMultiple.test:branches            avgt    3   4.039 ± 0.213   #/op
> BarriersMultiple.test:cycles              avgt    3  10.344 ± 0.413   #/op
> BarriersMultiple.test:instructions        avgt    3  30.273 ± 1.280   #/op
>
> # +WB
> BarriersMultiple.test                     avgt   15   3.459 ± 0.011  ns/op
> BarriersMultiple.test:L1-dcache-loads     avgt    3  19.195 ± 0.638   #/op  // +6
> BarriersMultiple.test:L1-dcache-stores    avgt    3   8.080 ± 0.539   #/op
> BarriersMultiple.test:branches            avgt    3   4.045 ± 0.118   #/op
> BarriersMultiple.test:cycles              avgt    3  13.031 ± 0.324   #/op  // +3
> BarriersMultiple.test:instructions        avgt    3  40.426 ± 1.133   #/op
>
> # +SATB
> BarriersMultiple.test                     avgt   15   3.620 ± 0.005  ns/op
> BarriersMultiple.test:L1-dcache-loads     avgt    3  18.148 ± 0.519   #/op  // +5
> BarriersMultiple.test:L1-dcache-stores    avgt    3   8.065 ± 0.409   #/op
> BarriersMultiple.test:branches            avgt    3  13.115 ± 0.423   #/op
> BarriersMultiple.test:cycles              avgt    3  13.628 ± 0.471   #/op  // +3.5
> BarriersMultiple.test:instructions        avgt    3  49.421 ± 1.880   #/op
>
> # +SATB +WB
> BarriersMultiple.test                     avgt   15   4.923 ± 0.040  ns/op
> BarriersMultiple.test:L1-dcache-loads     avgt    3  28.269 ± 1.519   #/op  // +15 (should be +11)
> BarriersMultiple.test:L1-dcache-stores    avgt    3   8.112 ± 1.161   #/op
> BarriersMultiple.test:branches            avgt    3  13.134 ± 1.134   #/op
> BarriersMultiple.test:cycles              avgt    3  18.561 ± 1.198   #/op  // +8 (should be +6.5)
> BarriersMultiple.test:instructions        avgt    3  56.577 ± 4.024   #/op
>
> I wonder if that means we need to go forward with tracking the GC state in a single flag, polling
> it with different masks, and then coalescing the paths when the masks are similar?
>
> Thanks,
> -Aleksey
>
That confirms what I have suspected for a while. I also sort of hope
that the traversal GC will solve it, because it only ever polls a single
flag. We might even want to wrap RBs in evac-flag checks initially, so
that the optimizer can coalesce them too, and then remove the lone
evac-checks-around-RBs after optimization.
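
Very roughly, I picture the single-flag scheme like the sketch below.
All names are made up; this is not actual code, just the shape of it:

final class GCState {
    // One thread-local byte encodes the current GC phase(s), one bit each.
    static final int MARKING    = 1 << 0;  // conc-mark-in-progress: SATB barrier active
    static final int EVACUATION = 1 << 1;  // evac-in-progress: WB/RB active

    static boolean test(int state, int mask) {
        return (state & mask) != 0;
    }
}

// Conceptual shape of a reference store barrier:
//
//   int state = threadGcState();                 // one load of the flag byte
//   if (GCState.test(state, GCState.MARKING))    satbEnqueue(previousValue);
//   if (GCState.test(state, GCState.EVACUATION)) obj = writeBarrier(obj);
//   storeReference(obj, fieldOffset, value);
//
// Since every barrier polls the same byte, back-to-back stores can share
// a single load of the state, and checks with identical masks become
// candidates for coalescing.
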
Another related issue may be that the GC barriers, along with a bunch of
other stuff, pollute the raw memory slice. This means that an
interleaved allocation (among other things) between barriers may
prevent coalescing and optimization. I wonder if it makes sense to put
all GC barriers on a separate memory slice instead? We basically need a
memory slice that says 'stuff on this slice only ever changes at
safepoints'.
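
To illustrate, here is a hypothetical Java shape (made-up names) where
an allocation between two stores would keep the barrier checks apart:

class Holder { Object field; }

class Example {
    static Holder h1 = new Holder();
    static Holder h2 = new Holder();

    static void twoStores(Object a, Object b) {
        h1.field = a;               // barrier checks + store #1
        byte[] tmp = new byte[16];  // allocation writes on the raw memory slice
        h2.field = b;               // barrier checks #2: the raw-memory writes
                                    // in between keep them from coalescing with
                                    // #1, even though the GC flags can only
                                    // change at a safepoint
    }
}

With a dedicated changes-only-at-safepoints slice for the flags, the
flag loads would not alias the allocation's raw-memory writes, and the
checks could move together.
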
Roman