Perf: WB without RB on fastpath

Sat Jan 13 09:51:00 UTC 2018

The single flag change opens up an interesting opportunity for us: we can check whether the GC state
is zero, which means no barriers are required whatsoever. So, instead of doing:

     testb $0x4, 0x3d8(TLS)
     jnz EVAC-IN-PROGRESS
     mov -0x8(%r), %r
DONE:
     ...
(later)
EVAC-IN-PROGRESS:
     <test against cset>
     <jump to slowpath>

...we can do:

     cmpb $0x0, 0x3d8(TLS)
     jne NON-STABLE-HEAP
DONE:
     ...
(later)
NON-STABLE-HEAP:
     testb $0x4, 0x3d8(TLS)
     jz DONE
     <test against cset>
     <jump to slowpath>
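
For readers who prefer control flow over assembly, here is a rough toy model of the two barrier
shapes. It is only a sketch, not HotSpot code: it assumes bit 0x4 of the GC state byte means
"evacuation in progress" (as in the listings), and OTHER_PHASE, old_barrier and new_barrier are
hypothetical names made up for illustration. Each function returns the barrier work done for a
given GC state.

```python
# Toy model of the two write-barrier shapes; not HotSpot code.
# Assumption: bit 0x4 of the thread-local GC state byte means
# "evacuation in progress", matching the `testb $0x4` in the listings.

EVACUATION = 0x4    # the bit tested by `testb $0x4`
OTHER_PHASE = 0x1   # hypothetical: stands in for any other non-zero state bit


def old_barrier(gc_state):
    """Old shape: test the evacuation bit, otherwise fall through to the RB."""
    if gc_state & EVACUATION:
        return ["cset-check"]    # EVAC-IN-PROGRESS: out-of-line path
    return ["read-barrier"]      # fastpath still pays for the RB load


def new_barrier(gc_state):
    """New shape: a single compare against zero guards the fastpath."""
    if gc_state == 0:
        return []                # stable heap: no barrier work at all
    if gc_state & EVACUATION:    # NON-STABLE-HEAP: re-test the bit
        return ["cset-check"]
    return []                    # non-zero state, not evacuating: DONE
```

In this model the stable-heap case does no barrier work in the new shape, while the evacuation case
pays for one extra test before reaching the cset check.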

So the fastpath is the same; we just test against a different value. The slowpath gets a bit slower.
The performance improvement can be estimated with passive, -XX:+ShWB and -XX:(+|-)ShWriteBarrierRB.
Overnight runs translate to:

Compiler.compiler: +1.0%
Compiler.sunflow:  +1.2%
Compress:          +2.6%
CryptoSignVerify:  +0.3%
MpegAudio:         +1.9%
ScimarkLU.large:   +4.8%
ScimarkLU.small:   +9.5%
XmlTransform:      +1.6%
XmlValidation:     +2.5%

...and no regressions!

Roman mentions separately that Traversal GC does not require RB at all on fastpath, which seems to
be a special case of this generic optimization.

Thanks,
-Aleksey