Shenandoah WB fastpath and optimizations
Aleksey Shipilev
shade at redhat.com
Tue Dec 19 12:54:08 UTC 2017
Comparing Shenandoah performance on XmlValidation with the write barrier enabled and disabled reveals an odd story.
Here is the accurate perfnorm profiling, which normalizes the CPU counters to benchmark operations:
Benchmark                          Mode  Cnt        Score      Error    Units

# passive
XV.test                           thrpt   10          236 ±        1  ops/min
XV.test:CPI                       thrpt   10        0.417 ±        0     #/op
XV.test:L1-dcache-load-misses     thrpt   10     11605037 ±   191196     #/op
XV.test:L1-dcache-loads           thrpt   10    520038766 ±  6177479     #/op
XV.test:L1-dcache-stores          thrpt   10    198131386 ±  2044458     #/op
XV.test:L1-icache-load-misses     thrpt   10      4058561 ±   157045     #/op
XV.test:LLC-load-misses           thrpt   10       481808 ±    17320     #/op
XV.test:LLC-loads                 thrpt   10      3478116 ±    78461     #/op
XV.test:LLC-store-misses          thrpt   10        51686 ±     2262     #/op
XV.test:LLC-stores                thrpt   10       262209 ±    15420     #/op
XV.test:branch-misses             thrpt   10       954476 ±    20287     #/op
XV.test:branches                  thrpt   10    320735964 ±  1510799     #/op
XV.test:cycles                    thrpt   10    691694314 ±  4159603     #/op
XV.test:dTLB-load-misses          thrpt   10        52266 ±    10707     #/op
XV.test:dTLB-loads                thrpt   10    515487335 ±  5540964     #/op
XV.test:dTLB-store-misses         thrpt   10         1692 ±      547     #/op
XV.test:dTLB-stores               thrpt   10    197639464 ±  2675693     #/op
XV.test:iTLB-load-misses          thrpt   10        10636 ±     5019     #/op
XV.test:iTLB-loads                thrpt   10       878417 ±   106475     #/op
XV.test:instructions              thrpt   10   1659286537 ±  8661844     #/op

# passive, +ShenandoahWriteBarrier
XV.test                           thrpt   10          206 ±    2.905  ops/min   -14%
XV.test:CPI                       thrpt   10        0.417 ±    0.004     #/op
XV.test:L1-dcache-load-misses     thrpt   10     12126323 ±   464131     #/op
XV.test:L1-dcache-loads           thrpt   10    609183240 ±  5857280     #/op   +77..101M   +17%
XV.test:L1-dcache-stores          thrpt   10    216852068 ±  2586890     #/op   +14..23M     +9%
XV.test:L1-icache-load-misses     thrpt   10      4600468 ±   252047     #/op
XV.test:LLC-load-misses           thrpt   10       504257 ±    28641     #/op
XV.test:LLC-loads                 thrpt   10      3696029 ±   105743     #/op
XV.test:LLC-store-misses          thrpt   10        52340 ±     2107     #/op
XV.test:LLC-stores                thrpt   10       245865 ±    15167     #/op
XV.test:branch-misses             thrpt   10      1080985 ±    29069     #/op
XV.test:branches                  thrpt   10    361296218 ±  2117561     #/op   +36..44M    +12%
XV.test:cycles                    thrpt   10    790992629 ±  9312064     #/op
XV.test:dTLB-load-misses          thrpt   10        72138 ±     8381     #/op
XV.test:dTLB-loads                thrpt   10    606335138 ±  4969218     #/op
XV.test:dTLB-store-misses         thrpt   10         3452 ±     2327     #/op
XV.test:dTLB-stores               thrpt   10    216814757 ±  2316964     #/op
XV.test:iTLB-load-misses          thrpt   10        16967 ±    14388     #/op
XV.test:iTLB-loads                thrpt   10      1006270 ±   153479     #/op
XV.test:instructions              thrpt   10   1897746787 ± 10418938     #/op   +220..257M  +14%
There are a few interesting observations here:
*) Enabling the Shenandoah WB on this workload is responsible for a ~14% throughput hit. This is
the impact of the WB fastpath alone, because the workload runs with the "passive" heuristics, which
does no concurrent cycles and thus never reaches the slowpath.
The Shenandoah WB fastpath is basically four instructions:
movzbl 0x3d8(%rTLS), %rScratch ; read evac-in-progress
test %rScratch, %rScratch
jne EVAC-ENABLED-SLOW-PATH
mov -0x8(%rObj), %rObj ; read barrier
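In plain C++ terms, the fastpath amounts to something like the sketch below. This is an
illustration only, with made-up names and types; the real code is emitted by C2, and the actual
flag offset and forwarding-pointer layout are whatever the VM uses:

  // Illustrative sketch of the WB fastpath logic; names are invented, the real
  // thing is generated by C2. Assumes a Brooks-style forwarding pointer stored
  // one word before the object.
  #include <cstdint>

  typedef void* oop;

  struct ThreadLocalGCState {
    uint8_t evac_in_progress;           // the flag behind the 0x3d8(%rTLS) load
  };

  oop write_barrier_slow(oop obj) {
    return obj;                         // placeholder; would resolve/evacuate the object
  }

  static inline oop write_barrier(ThreadLocalGCState* tls, oop obj) {
    if (tls->evac_in_progress) {        // movzbl + test + jne
      return write_barrier_slow(obj);   // never taken with "passive"
    }
    // Read barrier: chase the forwarding pointer one word before the object,
    // i.e. the mov -0x8(%rObj), %rObj above.
    return *reinterpret_cast<oop*>(reinterpret_cast<uintptr_t>(obj) - sizeof(oop));
  }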
*) CPI numbers agree in both configurations, and the instruction count has also grown by ~14%,
in line with the cycle count. This means the impact comes from the larger code path, not from some
backend effect (like cache misses or such).
*) If we treat the number of additional branches as the number of WBs in the workload, then we
have around 40M WB fastpaths per benchmark op. This means we should see around 80M additional
L1-dcache-loads coming from WBs (one for reading the TLS flag, and another for the RB), and that
seems to agree with the data, given the rather large error bounds.
*) What is weird is that we have ~18M excess *stores* per op, which are completely unaccounted
for by WBs: the fastpath does not store anything.
*) ...and to add insult to injury, at 4 insns per WB the fastpaths should add up to around 160M
excess insns, but instead we have around 240M (see the back-of-envelope check below).
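For reference, this is the back-of-envelope accounting, written out as a tiny standalone check.
The inputs are just the rounded deltas and midpoints from the perfnorm tables above:

  // Back-of-envelope check of the per-op counter deltas quoted above. Assumes
  // every additional branch is one WB fastpath; all inputs are rounded values
  // taken from the perfnorm tables.
  #include <cstdio>

  int main() {
    const double wb_per_op      = 40e6;   // ~ +36..44M extra branches per op
    const double loads_per_wb   = 2;      // TLS flag load + RB load
    const double insns_per_wb   = 4;      // movzbl, test, jne, mov

    const double expected_loads = wb_per_op * loads_per_wb;   // ~80M
    const double expected_insns = wb_per_op * insns_per_wb;   // ~160M

    const double measured_loads  = 89e6;   // midpoint of +77..101M
    const double measured_stores = 18e6;   // ~ +14..23M, unaccounted for
    const double measured_insns  = 238e6;  // midpoint of +220..257M

    printf("loads:  expected ~%.0fM, measured ~%.0fM -> agrees\n",
           expected_loads / 1e6, measured_loads / 1e6);
    printf("insns:  expected ~%.0fM, measured ~%.0fM -> ~%.0fM unexplained\n",
           expected_insns / 1e6, measured_insns / 1e6,
           (measured_insns - expected_insns) / 1e6);
    printf("stores: expected ~0M,   measured ~%.0fM -> unexplained\n",
           measured_stores / 1e6);
    return 0;
  }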
The profile is too flat to pinpoint the exact code shape where we lose some of these instructions.
But this circumstantial evidence seems to imply that WBs make some stores more likely (e.g. by
breaking some optimizations?), and that this is the cause of the inflated insn and L1 store counts?
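To make that concrete, here is a toy C++ illustration of the kind of effect I mean. It is not the
actual XmlValidation code shape, and the compiler fence is only a stand-in for whatever the
WB/membar pins in C2:

  // Toy illustration only: a redundant store that could normally be coalesced
  // or sunk out of the loop has to stay inside it when every iteration carries
  // a barrier that also acts as a fence. GCC/Clang syntax for the fence.
  #include <cstddef>

  struct Node { long visited; };

  static inline Node* toy_write_barrier(Node* n) {
    asm volatile("" ::: "memory");   // stand-in for the membar hanging off the WB
    return n;
  }

  long sum_and_mark(Node* n, const int* data, size_t len) {
    long sum = 0;
    for (size_t i = 0; i < len; i++) {
      sum += data[i];
      // Without the fence, these stores of the same value could be collapsed
      // into a single store after the loop; with the fence, we do len stores.
      toy_write_barrier(n)->visited = 1;
    }
    return sum;
  }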
Thoughts?
Thanks,
-Aleksey
P.S. Looking at ShenandoahWriteBarrierNode::test_evacuation_in_progress, I see there is an
Op_MemBarAcquire node attached to the control projection for both the CmpI and Bool nodes from the
WB. Are these limiting the optimizations? Why do we need an acquire there? This originated in
Roland's rewrite that introduced shenandoah_pin_and_expand_barriers:
http://hg.openjdk.java.net/shenandoah/jdk9/hotspot/rev/978d7601df14#l20.1137