Shenandoah WB and tableswitch
Aleksey Shipilev
shade at redhat.com
Tue Dec 19 18:11:00 UTC 2017
I think I have zeroed in on at least one issue with WBs. Successively dissecting the problematic
workloads first yields a workload like this, derived from the UTF-8 decoders in the JDK:
http://icedtea.classpath.org/hg/gc-bench/file/d04b4bbbc39f/src/main/java/org/openjdk/gcbench/wip/WriteBarrierUTF8Scan.java
...and then a minimal version of the same:
http://icedtea.classpath.org/hg/gc-bench/file/d04b4bbbc39f/src/main/java/org/openjdk/gcbench/wip/WriteBarrierTableSwitch.java
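For readers without the source at hand, here is a rough sketch of the two loop shapes that the
names "common" and "separate" suggest. This is an assumption, not a copy of the linked file: the
class name, the input array, and the switch bodies are hypothetical, and the JMH scaffolding is
left out. Only the cbuf name comes from the discussion below. The dense case labels are chosen
because javac usually compiles such a switch to a tableswitch.

```java
public class TableSwitchSketch {
    // "common" shape (assumed): each arm only picks a value,
    // and a single store to cbuf follows the switch.
    static void common(byte[] in, char[] cbuf) {
        for (int i = 0; i < in.length; i++) {
            char c;
            switch (in[i] & 3) {
                case 0:  c = 'a'; break;
                case 1:  c = 'b'; break;
                case 2:  c = 'c'; break;
                default: c = 'd'; break;
            }
            cbuf[i] = c;
        }
    }

    // "separate" shape (assumed): every arm performs its own store to cbuf.
    static void separate(byte[] in, char[] cbuf) {
        for (int i = 0; i < in.length; i++) {
            switch (in[i] & 3) {
                case 0:  cbuf[i] = 'a'; break;
                case 1:  cbuf[i] = 'b'; break;
                case 2:  cbuf[i] = 'c'; break;
                default: cbuf[i] = 'd'; break;
            }
        }
    }

    public static void main(String[] args) {
        byte[] in = new byte[1000];
        for (int i = 0; i < in.length; i++) {
            in[i] = (byte) i;
        }
        char[] a = new char[in.length];
        char[] b = new char[in.length];
        common(in, a);
        separate(in, b);
        // Both shapes compute the same result; only the store placement differs.
        System.out.println(java.util.Arrays.equals(a, b));
    }
}
```

Both methods produce identical output, so any difference in the scores would come from how the
compiler treats the store sites, not from the work done.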
Now, running it with current sh/jdk10 yields interesting results.
First, running with C1:
------------------------------------------------------------------------------
Benchmark (size) Mode Cnt Score Error Units
# Parallel, -XX:TieredStopAtLevel=1
WriteBarrierTableSwitch.common 1000 avgt 15 2137.543 ± 9.084 ns/op
WriteBarrierTableSwitch.separate 1000 avgt 15 2260.783 ± 6.355 ns/op
# Shenandoah passive, -XX:TieredStopAtLevel=1
WriteBarrierTableSwitch.common 1000 avgt 15 2144.273 ± 7.565 ns/op
WriteBarrierTableSwitch.separate 1000 avgt 15 2270.335 ± 6.433 ns/op
# Shenandoah passive, -XX:TieredStopAtLevel=1, -XX:+ShenandoahWriteBarrier
WriteBarrierTableSwitch.common 1000 avgt 15 2613.767 ± 29.567 ns/op
WriteBarrierTableSwitch.separate 1000 avgt 15 2670.697 ± 8.822 ns/op
------------------------------------------------------------------------------
Everything seems to be in order: passive Shenandoah is as fast as Parallel, and enabling WBs makes
everything consistently slower, because there are writes to the cbuf array all the time.
With C2 the picture gets murkier:
------------------------------------------------------------------------------
Benchmark (size) Mode Cnt Score Error Units
# Parallel, -XX:-TieredCompilation
WriteBarrierTableSwitch.common 1000 avgt 15 1518.773 ± 3.962 ns/op
WriteBarrierTableSwitch.separate 1000 avgt 15 2302.127 ± 49.734 ns/op
# Shenandoah passive, -XX:-TieredCompilation
WriteBarrierTableSwitch.common 1000 avgt 15 1575.086 ± 4.616 ns/op
WriteBarrierTableSwitch.separate 1000 avgt 15 2832.982 ± 70.375 ns/op
# Shenandoah passive, -XX:-TieredCompilation, -XX:+ShenandoahWriteBarrier
WriteBarrierTableSwitch.common 1000 avgt 15 1499.475 ± 38.896 ns/op
WriteBarrierTableSwitch.separate 1000 avgt 15 3135.664 ± 11.811 ns/op
------------------------------------------------------------------------------
First of all, why does Shenandoah passive perform worse than Parallel even without barriers? That
one is explained by interaction with counted loop safepoints / loop strip mining, see:
------------------------------------------------------------------------------
Benchmark (size) Mode Cnt Score Error Units
# Shenandoah passive, -XX:-TieredCompilation, -XX:-UseCountedLoopSafepoints
WriteBarrierTableSwitch.common 1000 avgt 15 1526.821 ± 7.644 ns/op
WriteBarrierTableSwitch.separate 1000 avgt 15 2327.750 ± 73.020 ns/op
------------------------------------------------------------------------------
It is still weird to see CLS/LSM pessimize this case so much.
Then, why does "separate" regress when WB is enabled, while "common" does not? Perfasm suggests
that in the "common" case we are able to hoist the WB out of the loop, and this is why there is
no +WB impact. We fail to do the same with "separate", for some reason. Disabling CLS/LSM helps
just a little:
------------------------------------------------------------------------------
Benchmark (size) Mode Cnt Score Error Units
WriteBarrierTableSwitch.common 1000 avgt 15 1535.884 ± 21.498 ns/op
WriteBarrierTableSwitch.separate 1000 avgt 15 2876.315 ± 43.569 ns/op
------------------------------------------------------------------------------
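To picture what hoisting the WB buys, here is a conceptual model in plain Java. It is not HotSpot
code and not what C2 literally emits: writeBarrier is a hypothetical stand-in that merely counts
how often a barrier would run, and the class and method names are made up for illustration. The
point is only that cbuf is loop-invariant, so a barrier on it can in principle run once before
the loop instead of once per store.

```java
public class HoistSketch {
    // Counts how many times the hypothetical barrier runs.
    static int barrierCalls = 0;

    // Hypothetical stand-in for a write barrier: returns the array to write to.
    // Here it simply returns its argument and bumps the counter.
    static char[] writeBarrier(char[] obj) {
        barrierCalls++;
        return obj;
    }

    // Barrier stays inside the loop: one barrier per store.
    static void barrierInLoop(char[] cbuf) {
        for (int i = 0; i < cbuf.length; i++) {
            char[] target = writeBarrier(cbuf);
            target[i] = 'a';
        }
    }

    // Barrier hoisted: one barrier before the loop, then plain stores.
    static void barrierHoisted(char[] cbuf) {
        char[] target = writeBarrier(cbuf);
        for (int i = 0; i < cbuf.length; i++) {
            target[i] = 'a';
        }
    }

    public static void main(String[] args) {
        char[] buf = new char[1000];

        barrierCalls = 0;
        barrierInLoop(buf);
        System.out.println("in loop: " + barrierCalls);

        barrierCalls = 0;
        barrierHoisted(buf);
        System.out.println("hoisted: " + barrierCalls);
    }
}
```

In this model the first variant runs the barrier once per element and the second runs it once in
total, which matches the shape of the "separate" versus "common" difference reported by perfasm.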
This pinpoints at least one problem with WBs that impacts the Stringy/UTF-8-y code we have in
benchmarks.
Thanks,
-Aleksey