Shenandoah WB fastpath and optimizations
Roman Kennke
rkennke at redhat.com
Tue Dec 19 13:03:25 UTC 2017
Without going deeper, maybe it's worth doing an optimization like the one I've
outlined in the Traversal GC thread? I.e. folding the evac_in_progress checks
within blocks that contain no safepoint, and generating WB-less blocks where possible?
Roman
> Comparing Shenandoah performance on XmlValidation with and without write barriers reveals an odd story.
> Accurate perfnorm profiling, which normalizes the CPU counters to benchmark operations, shows:
>
>
> Benchmark                        Mode  Cnt        Score      Error    Units
>
> # passive
> XV.test                         thrpt   10          236 ±        1  ops/min
> XV.test:CPI                     thrpt   10        0.417 ±        0     #/op
> XV.test:L1-dcache-load-misses   thrpt   10     11605037 ±   191196     #/op
> XV.test:L1-dcache-loads         thrpt   10    520038766 ±  6177479     #/op
> XV.test:L1-dcache-stores        thrpt   10    198131386 ±  2044458     #/op
> XV.test:L1-icache-load-misses   thrpt   10      4058561 ±   157045     #/op
> XV.test:LLC-load-misses         thrpt   10       481808 ±    17320     #/op
> XV.test:LLC-loads               thrpt   10      3478116 ±    78461     #/op
> XV.test:LLC-store-misses        thrpt   10        51686 ±     2262     #/op
> XV.test:LLC-stores              thrpt   10       262209 ±    15420     #/op
> XV.test:branch-misses           thrpt   10       954476 ±    20287     #/op
> XV.test:branches                thrpt   10    320735964 ±  1510799     #/op
> XV.test:cycles                  thrpt   10    691694314 ±  4159603     #/op
> XV.test:dTLB-load-misses        thrpt   10        52266 ±    10707     #/op
> XV.test:dTLB-loads              thrpt   10    515487335 ±  5540964     #/op
> XV.test:dTLB-store-misses       thrpt   10         1692 ±      547     #/op
> XV.test:dTLB-stores             thrpt   10    197639464 ±  2675693     #/op
> XV.test:iTLB-load-misses        thrpt   10        10636 ±     5019     #/op
> XV.test:iTLB-loads              thrpt   10       878417 ±   106475     #/op
> XV.test:instructions            thrpt   10   1659286537 ±  8661844     #/op
>
> # passive, +ShenandoahWriteBarrier
> XV.test                         thrpt   10          206 ±    2.905  ops/min   -14%
> XV.test:CPI                     thrpt   10        0.417 ±    0.004     #/op
> XV.test:L1-dcache-load-misses   thrpt   10     12126323 ±   464131     #/op
> XV.test:L1-dcache-loads         thrpt   10    609183240 ±  5857280     #/op   +77..101M  +17%
> XV.test:L1-dcache-stores        thrpt   10    216852068 ±  2586890     #/op   +14..23M    +9%
> XV.test:L1-icache-load-misses   thrpt   10      4600468 ±   252047     #/op
> XV.test:LLC-load-misses         thrpt   10       504257 ±    28641     #/op
> XV.test:LLC-loads               thrpt   10      3696029 ±   105743     #/op
> XV.test:LLC-store-misses        thrpt   10        52340 ±     2107     #/op
> XV.test:LLC-stores              thrpt   10       245865 ±    15167     #/op
> XV.test:branch-misses           thrpt   10      1080985 ±    29069     #/op
> XV.test:branches                thrpt   10    361296218 ±  2117561     #/op   +36..44M   +12%
> XV.test:cycles                  thrpt   10    790992629 ±  9312064     #/op
> XV.test:dTLB-load-misses        thrpt   10        72138 ±     8381     #/op
> XV.test:dTLB-loads              thrpt   10    606335138 ±  4969218     #/op
> XV.test:dTLB-store-misses       thrpt   10         3452 ±     2327     #/op
> XV.test:dTLB-stores             thrpt   10    216814757 ±  2316964     #/op
> XV.test:iTLB-load-misses        thrpt   10        16967 ±    14388     #/op
> XV.test:iTLB-loads              thrpt   10      1006270 ±   153479     #/op
> XV.test:instructions            thrpt   10   1897746787 ± 10418938     #/op   +220..257M +14%
>
>
> There are a few interesting observations here:
>
> *) Enabling Shenandoah WB on this workload is responsible for a ~14% throughput hit. This is the
> impact of the WB fastpath alone, because the workload runs with "passive", which does not do any
> concurrent cycles and thus never reaches the slowpath.
>
> The Shenandoah WB fastpath is basically four instructions:
>
> movzbl 0x3d8(%rTLS), %rScratch ; read evac-in-progress
> test %rScratch, %rScratch
> jne EVAC-ENABLED-SLOW-PATH
> mov -0x8(%rObj), %rObj ; read barrier
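>
> In C-ish terms, that fastpath does roughly this (a sketch only, not HotSpot code;
> the evac_in_progress flag at TLS offset 0x3d8 and the forwarding word at -8 are
> read off the listing above, wb_slow_path is an illustrative name):
>
>   #include <cstdint>
>
>   struct Thread { uint8_t evac_in_progress; };          // the TLS byte at 0x3d8
>
>   static void* wb_slow_path(void* obj) { return obj; }  // placeholder for the evac path
>
>   static void* write_barrier(Thread* thread, void* obj) {
>     if (thread->evac_in_progress)                // movzbl 0x3d8(%rTLS); test; jne
>       return wb_slow_path(obj);                  // EVAC-ENABLED-SLOW-PATH
>     return *reinterpret_cast<void**>(            // mov -0x8(%rObj), %rObj
>         static_cast<char*>(obj) - 8);            // (read barrier)
>   }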
>
> *) CPI numbers agree in both configurations, and the instruction count has also grown by +14%.
> This means the impact comes from the larger code path, not from some backend effect (like cache misses or such).
>
> *) If we treat the number of additional branches as the number of WBs for the workload, then we
> have around 40M WB fastpaths for each benchmark op. This means we should see around 80M extra
> L1-dcache-loads coming from WBs (one for reading the TLS flag, and another for the RB), and that
> seems to agree with the data, given the quite large error bounds.
>
> *) What is weird is that we have ~18M excess *stores*, which are completely unaccounted for by WBs.
>
> *) ...and to add insult to injury, 4 insns per WB should add up to ~160M excess instructions, but
> instead we have around 240M.
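>
> A quick back-of-envelope check of that arithmetic against the tables above
> (treating every extra branch as one WB fastpath; a sketch, with the numbers
> copied from the perfnorm output):
>
>   #include <cstdio>
>
>   int main() {
>     long long d_branches = 361296218LL - 320735964LL;   // ~40.6M extra branches ~= WBs per op
>     long long d_loads    = 609183240LL - 520038766LL;   // ~89.1M extra L1-dcache-loads per op
>     long long d_insns    = 1897746787LL - 1659286537LL; // ~238.5M extra instructions per op
>     std::printf("expected extra loads (2 per WB): %lld, measured: %lld\n", 2 * d_branches, d_loads);
>     std::printf("expected extra insns (4 per WB): %lld, measured: %lld\n", 4 * d_branches, d_insns);
>     return 0;
>   }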
>
> The profile is too flat to pinpoint the exact code shape where we lose some of these instructions.
> But this circumstantial evidence seems to imply that WBs make some stores more likely (e.g. by breaking
> some optimizations?), and that this is the cause of the inflated instruction and L1 store counts?
>
> Thoughts?
>
> Thanks,
> -Aleksey
>
> P.S. Looking at ShenandoahWriteBarrierNode::test_evacuation_in_progress, I see there is an
> Op_MemBarAcquire node attached to the control projection for both the CmpI and Bool nodes from the WB.
> Are these limiting the optimizations? Why do we need acquire there? This originated from
> Roland's rewrite that introduced shenandoah_pin_and_expand_barriers:
> http://hg.openjdk.java.net/shenandoah/jdk9/hotspot/rev/978d7601df14#l20.1137
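>
> (For intuition, in C++-atomics terms the difference would be roughly the following;
> this is just an analogy, not the C2 IR. A relaxed load of the flag can be hoisted and
> merged with neighbouring loads, while an acquire load keeps later memory accesses from
> moving before it, which is the kind of constraint that could prevent separate checks
> from being commoned:)
>
>   #include <atomic>
>   #include <cstdint>
>
>   std::atomic<uint8_t> evac_in_progress;   // stand-in for the TLS flag
>
>   uint8_t check_relaxed() {
>     // relaxed: the compiler is free to hoist/merge this with neighbouring loads
>     return evac_in_progress.load(std::memory_order_relaxed);
>   }
>
>   uint8_t check_acquire() {
>     // acquire: later loads/stores cannot be reordered before this load,
>     // so it acts as a one-way fence that is harder to optimize across
>     return evac_in_progress.load(std::memory_order_acquire);
>   }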
>