Shenandoah WB fastpath and optimizations
Aleksey Shipilev
shade at redhat.com
Tue Dec 19 12:54:08 UTC 2017
Comparing Shenandoah performance on XmlValidation with the write barrier enabled and disabled reveals an odd story.
Here is the accurate perfnorm profiling, which normalizes the CPU counters to benchmark operations:
Benchmark                          Mode  Cnt        Score      Error    Units

# passive
XV.test                           thrpt   10          236 ±        1  ops/min
XV.test:CPI                       thrpt   10        0.417 ±        0     #/op
XV.test:L1-dcache-load-misses     thrpt   10     11605037 ±   191196     #/op
XV.test:L1-dcache-loads           thrpt   10    520038766 ±  6177479     #/op
XV.test:L1-dcache-stores          thrpt   10    198131386 ±  2044458     #/op
XV.test:L1-icache-load-misses     thrpt   10      4058561 ±   157045     #/op
XV.test:LLC-load-misses           thrpt   10       481808 ±    17320     #/op
XV.test:LLC-loads                 thrpt   10      3478116 ±    78461     #/op
XV.test:LLC-store-misses          thrpt   10        51686 ±     2262     #/op
XV.test:LLC-stores                thrpt   10       262209 ±    15420     #/op
XV.test:branch-misses             thrpt   10       954476 ±    20287     #/op
XV.test:branches                  thrpt   10    320735964 ±  1510799     #/op
XV.test:cycles                    thrpt   10    691694314 ±  4159603     #/op
XV.test:dTLB-load-misses          thrpt   10        52266 ±    10707     #/op
XV.test:dTLB-loads                thrpt   10    515487335 ±  5540964     #/op
XV.test:dTLB-store-misses         thrpt   10         1692 ±      547     #/op
XV.test:dTLB-stores               thrpt   10    197639464 ±  2675693     #/op
XV.test:iTLB-load-misses          thrpt   10        10636 ±     5019     #/op
XV.test:iTLB-loads                thrpt   10       878417 ±   106475     #/op
XV.test:instructions              thrpt   10   1659286537 ±  8661844     #/op

# passive, +ShenandoahWriteBarrier
XV.test                           thrpt   10          206 ±    2.905  ops/min   -14%
XV.test:CPI                       thrpt   10        0.417 ±    0.004     #/op
XV.test:L1-dcache-load-misses     thrpt   10     12126323 ±   464131     #/op
XV.test:L1-dcache-loads           thrpt   10    609183240 ±  5857280     #/op   +77..101M   +17%
XV.test:L1-dcache-stores          thrpt   10    216852068 ±  2586890     #/op   +14..23M     +9%
XV.test:L1-icache-load-misses     thrpt   10      4600468 ±   252047     #/op
XV.test:LLC-load-misses           thrpt   10       504257 ±    28641     #/op
XV.test:LLC-loads                 thrpt   10      3696029 ±   105743     #/op
XV.test:LLC-store-misses          thrpt   10        52340 ±     2107     #/op
XV.test:LLC-stores                thrpt   10       245865 ±    15167     #/op
XV.test:branch-misses             thrpt   10      1080985 ±    29069     #/op
XV.test:branches                  thrpt   10    361296218 ±  2117561     #/op   +36..44M    +12%
XV.test:cycles                    thrpt   10    790992629 ±  9312064     #/op
XV.test:dTLB-load-misses          thrpt   10        72138 ±     8381     #/op
XV.test:dTLB-loads                thrpt   10    606335138 ±  4969218     #/op
XV.test:dTLB-store-misses         thrpt   10         3452 ±     2327     #/op
XV.test:dTLB-stores               thrpt   10    216814757 ±  2316964     #/op
XV.test:iTLB-load-misses          thrpt   10        16967 ±    14388     #/op
XV.test:iTLB-loads                thrpt   10      1006270 ±   153479     #/op
XV.test:instructions              thrpt   10   1897746787 ± 10418938     #/op   +220..257M  +14%
There are a few interesting observations here:
*) Enabling the Shenandoah WB on this workload is responsible for a ~14% throughput hit. This is
the impact of the WB fastpath alone, because the workload runs with the "passive" heuristics, which
does no concurrent cycles and thus never reaches the slowpath.
The Shenandoah WB fastpath is basically four instructions:
movzbl 0x3d8(%rTLS), %rScratch ; read evac-in-progress
test %rScratch, %rScratch
jne EVAC-ENABLED-SLOW-PATH
mov -0x8(%rObj), %rObj ; read barrier
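In plain C++ terms, the fastpath amounts to something like the sketch below. This is an
illustration only, with made-up names and types; the real code is emitted by C2, and the actual
flag offset and forwarding-pointer layout are whatever the VM uses:

  // Illustrative sketch of the WB fastpath logic; names are invented, the real
  // thing is generated by C2. Assumes a Brooks-style forwarding pointer stored
  // one word before the object.
  #include <cstdint>

  typedef void* oop;

  struct ThreadLocalGCState {
    uint8_t evac_in_progress;           // the flag behind the 0x3d8(%rTLS) load
  };

  oop write_barrier_slow(oop obj) {
    return obj;                         // placeholder; would resolve/evacuate the object
  }

  static inline oop write_barrier(ThreadLocalGCState* tls, oop obj) {
    if (tls->evac_in_progress) {        // movzbl + test + jne
      return write_barrier_slow(obj);   // never taken with "passive"
    }
    // Read barrier: chase the forwarding pointer one word before the object,
    // i.e. the mov -0x8(%rObj), %rObj above.
    return *reinterpret_cast<oop*>(reinterpret_cast<uintptr_t>(obj) - sizeof(oop));
  }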
*) CPI numbers agree in both configurations, and the instruction count has also grown by ~14%,
in line with the cycle count. This means the impact comes from the larger code path, not from some
backend effect (like cache misses or such).
*) If we treat the number of additional branches as the number of WBs in the workload, then we
have around 40M WB fastpaths per benchmark op. This means we should see around 80M additional
L1-dcache-loads coming from WBs (one for reading the TLS flag, and another for the RB), and that
seems to agree with the data, given the rather large error bounds.
*) What is weird is that we have ~18M excess *stores* per op, which are completely unaccounted
for by WBs: the fastpath does not store anything.
*) ...and to add insult to injury, at 4 insns per WB the fastpaths should add up to around 160M
excess insns, but instead we have around 240M (see the back-of-envelope check below).
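For reference, this is the back-of-envelope accounting, written out as a tiny standalone check.
The inputs are just the rounded deltas and midpoints from the perfnorm tables above:

  // Back-of-envelope check of the per-op counter deltas quoted above. Assumes
  // every additional branch is one WB fastpath; all inputs are rounded values
  // taken from the perfnorm tables.
  #include <cstdio>

  int main() {
    const double wb_per_op      = 40e6;   // ~ +36..44M extra branches per op
    const double loads_per_wb   = 2;      // TLS flag load + RB load
    const double insns_per_wb   = 4;      // movzbl, test, jne, mov

    const double expected_loads = wb_per_op * loads_per_wb;   // ~80M
    const double expected_insns = wb_per_op * insns_per_wb;   // ~160M

    const double measured_loads  = 89e6;   // midpoint of +77..101M
    const double measured_stores = 18e6;   // ~ +14..23M, unaccounted for
    const double measured_insns  = 238e6;  // midpoint of +220..257M

    printf("loads:  expected ~%.0fM, measured ~%.0fM -> agrees\n",
           expected_loads / 1e6, measured_loads / 1e6);
    printf("insns:  expected ~%.0fM, measured ~%.0fM -> ~%.0fM unexplained\n",
           expected_insns / 1e6, measured_insns / 1e6,
           (measured_insns - expected_insns) / 1e6);
    printf("stores: expected ~0M,   measured ~%.0fM -> unexplained\n",
           measured_stores / 1e6);
    return 0;
  }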
The profile is too flat to pinpoint the exact code shape where we lose some of these instructions.
But this circumstantial evidence seems to imply that WBs make some stores more likely (e.g. by
breaking some optimizations?), and that this is the cause of the inflated insn and L1 store counts?
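To make that concrete, here is a toy C++ illustration of the kind of effect I mean. It is not the
actual XmlValidation code shape, and the compiler fence is only a stand-in for whatever the
WB/membar pins in C2:

  // Toy illustration only: a redundant store that could normally be coalesced
  // or sunk out of the loop has to stay inside it when every iteration carries
  // a barrier that also acts as a fence. GCC/Clang syntax for the fence.
  #include <cstddef>

  struct Node { long visited; };

  static inline Node* toy_write_barrier(Node* n) {
    asm volatile("" ::: "memory");   // stand-in for the membar hanging off the WB
    return n;
  }

  long sum_and_mark(Node* n, const int* data, size_t len) {
    long sum = 0;
    for (size_t i = 0; i < len; i++) {
      sum += data[i];
      // Without the fence, these stores of the same value could be collapsed
      // into a single store after the loop; with the fence, we do len stores.
      toy_write_barrier(n)->visited = 1;
    }
    return sum;
  }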
Thoughts?
Thanks,
-Aleksey
P.S. Looking at ShenandoahWriteBarrierNode::test_evacuation_in_progress, I see there is an
Op_MemBarAcquire node attached to the control projection for both the CmpI and Bool nodes from the
WB. Are these limiting the optimizations? Why do we need an acquire there? This originated in
Roland's rewrite that introduced shenandoah_pin_and_expand_barriers:
http://hg.openjdk.java.net/shenandoah/jdk9/hotspot/rev/978d7601df14#l20.1137