Shenandoah WB fastpath and optimizations
Roman Kennke
rkennke at redhat.com
Tue Dec 19 13:03:25 UTC 2017
Without going deeper, maybe it's worth doing an optimization like the one I've
outlined in the Traversal GC thread? I.e. folding the evac_in_progress checks
within blocks that contain no safepoint, and generating WB-less blocks where possible?
Roman
> Comparing Shenandoah performance on XmlValidation with and without write barriers reveals an odd story.
> Accurate perfnorm profiling, which normalizes the CPU counters to benchmark operations, shows:
>
>
> Benchmark                        Mode  Cnt        Score      Error    Units
>
> # passive
> XV.test                         thrpt   10          236 ±        1  ops/min
> XV.test:CPI                     thrpt   10        0.417 ±        0     #/op
> XV.test:L1-dcache-load-misses   thrpt   10     11605037 ±   191196     #/op
> XV.test:L1-dcache-loads         thrpt   10    520038766 ±  6177479     #/op
> XV.test:L1-dcache-stores        thrpt   10    198131386 ±  2044458     #/op
> XV.test:L1-icache-load-misses   thrpt   10      4058561 ±   157045     #/op
> XV.test:LLC-load-misses         thrpt   10       481808 ±    17320     #/op
> XV.test:LLC-loads               thrpt   10      3478116 ±    78461     #/op
> XV.test:LLC-store-misses        thrpt   10        51686 ±     2262     #/op
> XV.test:LLC-stores              thrpt   10       262209 ±    15420     #/op
> XV.test:branch-misses           thrpt   10       954476 ±    20287     #/op
> XV.test:branches                thrpt   10    320735964 ±  1510799     #/op
> XV.test:cycles                  thrpt   10    691694314 ±  4159603     #/op
> XV.test:dTLB-load-misses        thrpt   10        52266 ±    10707     #/op
> XV.test:dTLB-loads              thrpt   10    515487335 ±  5540964     #/op
> XV.test:dTLB-store-misses       thrpt   10         1692 ±      547     #/op
> XV.test:dTLB-stores             thrpt   10    197639464 ±  2675693     #/op
> XV.test:iTLB-load-misses        thrpt   10        10636 ±     5019     #/op
> XV.test:iTLB-loads              thrpt   10       878417 ±   106475     #/op
> XV.test:instructions            thrpt   10   1659286537 ±  8661844     #/op
>
> # passive, +ShenandoahWriteBarrier
> XV.test                         thrpt   10          206 ±    2.905  ops/min   -14%
> XV.test:CPI                     thrpt   10        0.417 ±    0.004     #/op
> XV.test:L1-dcache-load-misses   thrpt   10     12126323 ±   464131     #/op
> XV.test:L1-dcache-loads         thrpt   10    609183240 ±  5857280     #/op   +77..101M  +17%
> XV.test:L1-dcache-stores        thrpt   10    216852068 ±  2586890     #/op   +14..23M    +9%
> XV.test:L1-icache-load-misses   thrpt   10      4600468 ±   252047     #/op
> XV.test:LLC-load-misses         thrpt   10       504257 ±    28641     #/op
> XV.test:LLC-loads               thrpt   10      3696029 ±   105743     #/op
> XV.test:LLC-store-misses        thrpt   10        52340 ±     2107     #/op
> XV.test:LLC-stores              thrpt   10       245865 ±    15167     #/op
> XV.test:branch-misses           thrpt   10      1080985 ±    29069     #/op
> XV.test:branches                thrpt   10    361296218 ±  2117561     #/op   +36..44M   +12%
> XV.test:cycles                  thrpt   10    790992629 ±  9312064     #/op
> XV.test:dTLB-load-misses        thrpt   10        72138 ±     8381     #/op
> XV.test:dTLB-loads              thrpt   10    606335138 ±  4969218     #/op
> XV.test:dTLB-store-misses       thrpt   10         3452 ±     2327     #/op
> XV.test:dTLB-stores             thrpt   10    216814757 ±  2316964     #/op
> XV.test:iTLB-load-misses        thrpt   10        16967 ±    14388     #/op
> XV.test:iTLB-loads              thrpt   10      1006270 ±   153479     #/op
> XV.test:instructions            thrpt   10   1897746787 ± 10418938     #/op   +220..257M +14%
>
>
> There are a few interesting observations here:
>
> *) Enabling Shenandoah WB on this workload is responsible for a ~14% throughput hit. This is the
> impact of the WB fastpath alone, because the workload runs with "passive", which does not do any
> concurrent cycles and thus never reaches the slowpath.
>
> The Shenandoah WB fastpath is basically four instructions:
>
> movzbl 0x3d8(%rTLS), %rScratch ; read evac-in-progress
> test %rScratch, %rScratch
> jne EVAC-ENABLED-SLOW-PATH
> mov -0x8(%rObj), %rObj ; read barrier
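>
> In C-ish terms, that fastpath does roughly this (a sketch only, not HotSpot code;
> the evac_in_progress flag at TLS offset 0x3d8 and the forwarding word at -8 are
> read off the listing above, wb_slow_path is an illustrative name):
>
>   #include <cstdint>
>
>   struct Thread { uint8_t evac_in_progress; };          // the TLS byte at 0x3d8
>
>   static void* wb_slow_path(void* obj) { return obj; }  // placeholder for the evac path
>
>   static void* write_barrier(Thread* thread, void* obj) {
>     if (thread->evac_in_progress)                // movzbl 0x3d8(%rTLS); test; jne
>       return wb_slow_path(obj);                  // EVAC-ENABLED-SLOW-PATH
>     return *reinterpret_cast<void**>(            // mov -0x8(%rObj), %rObj
>         static_cast<char*>(obj) - 8);            // (read barrier)
>   }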
>
> *) CPI numbers agree in both configurations, and the instruction count has also grown by +14%.
> This means the impact comes from the larger code path, not from some backend effect (like cache misses or such).
>
> *) If we treat the number of additional branches as the number of WBs for the workload, then we
> have around 40M WB fastpaths for each benchmark op. This means we should see around 80M extra
> L1-dcache-loads coming from WBs (one for reading the TLS flag, and another for the RB), and that
> seems to agree with the data, given the quite large error bounds.
>
> *) What is weird is that we have ~18M excess *stores*, which are completely unaccounted for by WBs.
>
> *) ...and to add insult to injury, 4 insns per WB should add up to ~160M excess instructions, but
> instead we have around 240M.
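>
> A quick back-of-envelope check of that arithmetic against the tables above
> (treating every extra branch as one WB fastpath; a sketch, with the numbers
> copied from the perfnorm output):
>
>   #include <cstdio>
>
>   int main() {
>     long long d_branches = 361296218LL - 320735964LL;   // ~40.6M extra branches ~= WBs per op
>     long long d_loads    = 609183240LL - 520038766LL;   // ~89.1M extra L1-dcache-loads per op
>     long long d_insns    = 1897746787LL - 1659286537LL; // ~238.5M extra instructions per op
>     std::printf("expected extra loads (2 per WB): %lld, measured: %lld\n", 2 * d_branches, d_loads);
>     std::printf("expected extra insns (4 per WB): %lld, measured: %lld\n", 4 * d_branches, d_insns);
>     return 0;
>   }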
>
> The profile is too flat to pinpoint the exact code shape where we lose some of these instructions.
> But this circumstantial evidence seems to imply that WBs make some stores more likely (e.g. by breaking
> some optimizations?), and that this is the cause of the inflated instruction and L1 store counts?
>
> Thoughts?
>
> Thanks,
> -Aleksey
>
> P.S. Looking at ShenandoahWriteBarrierNode::test_evacuation_in_progress, I see there is an
> Op_MemBarAcquire node attached to the control projection for both the CmpI and Bool nodes from the WB.
> Are these limiting the optimizations? Why do we need acquire there? This originated from
> Roland's rewrite that introduced shenandoah_pin_and_expand_barriers:
> http://hg.openjdk.java.net/shenandoah/jdk9/hotspot/rev/978d7601df14#l20.1137
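>
> (For intuition, in C++-atomics terms the difference would be roughly the following;
> this is just an analogy, not the C2 IR. A relaxed load of the flag can be hoisted and
> merged with neighbouring loads, while an acquire load keeps later memory accesses from
> moving before it, which is the kind of constraint that could prevent separate checks
> from being commoned:)
>
>   #include <atomic>
>   #include <cstdint>
>
>   std::atomic<uint8_t> evac_in_progress;   // stand-in for the TLS flag
>
>   uint8_t check_relaxed() {
>     // relaxed: the compiler is free to hoist/merge this with neighbouring loads
>     return evac_in_progress.load(std::memory_order_relaxed);
>   }
>
>   uint8_t check_acquire() {
>     // acquire: later loads/stores cannot be reordered before this load,
>     // so it acts as a one-way fence that is harder to optimize across
>     return evac_in_progress.load(std::memory_order_acquire);
>   }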
>