[9] [8u40] RFR (M): 8059877: GWT branch frequencies pollution due to LF sharing

Fri Oct 10 19:08:00 UTC 2014

http://cr.openjdk.java.net/~vlivanov/8059877/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8059877

LambdaForm sharing introduces profile pollution in compiled LambdaForms.
The most serious consequence is inlining distortion, which severely
degrades peak performance. The main victim is the guardWithTest (GWT)
combinator.

Before LambdaForm sharing, inlining in GWT was affected by 2 aspects:
   - branch frequencies: never-taken branch is skipped;
   - target & fallback method handles (corresponding LFs: compiled vs
interpreted): if a method handle has been invoked fewer than
COMPILE_THRESHOLD times, LambdaForm.vmentry points to the LF
interpreter, which is marked w/ @DontInline.

LambdaForm sharing breaks both aspects:
   - sharing of GWT LambdaForm pollutes branch profile;
   - sharing of LambdaForms used in target & fallback pollutes 
invocation counters.
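
For readers less familiar with the combinator, here is a minimal sketch
of how a GWT is assembled through the public java.lang.invoke API. The
class and method names (GwtExample, isPositive, onTarget, onFallback,
call) are made up for illustration; only the lookup calls and
MethodHandles.guardWithTest are real API. The two handles in main() have
the same shape but take opposite branches, which is the situation where
a shared GWT LambdaForm would see a merged branch profile.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class GwtExample {
    // Hypothetical test, target and fallback methods.
    static boolean isPositive(int x) { return x > 0; }
    static String onTarget(int x)    { return "target:" + x; }
    static String onFallback(int x)  { return "fallback:" + x; }

    // Builds: x -> isPositive(x) ? onTarget(x) : onFallback(x)
    static MethodHandle buildGwt() {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodType stringOfInt = MethodType.methodType(String.class, int.class);
            MethodHandle test = lookup.findStatic(GwtExample.class, "isPositive",
                    MethodType.methodType(boolean.class, int.class));
            MethodHandle target = lookup.findStatic(GwtExample.class, "onTarget", stringOfInt);
            MethodHandle fallback = lookup.findStatic(GwtExample.class, "onFallback", stringOfInt);
            return MethodHandles.guardWithTest(test, target, fallback);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Convenience wrapper so callers do not have to handle Throwable.
    static String call(MethodHandle gwt, int x) {
        try {
            return (String) gwt.invokeExact(x);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        MethodHandle a = buildGwt(); // call site A: always takes the target branch
        MethodHandle b = buildGwt(); // call site B: always takes the fallback branch
        for (int i = 1; i <= 3; i++) {
            System.out.println(call(a, i) + " " + call(b, -i));
        }
    }
}
```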

I experimented w/ a VM API to guide the JIT-compiler using profiling
information gathered on the LambdaForm level [1], but decided to take the
safer route for now (8u40). The JIT-compiler control approach looks
promising, but I need more time to get rid of some performance artifacts
it suffers from.

The proposed fix is to mimic the behavior of fully customized LambdaForms.
When a GWT is created, both target & fallback method handles are wrapped
in a special reinvoker, which blocks inlining (@DontInline on the
reinvoker's compiled LambdaForm). Once a wrapper is invoked more than
DONT_INLINE_THRESHOLD times, its LambdaForm is replaced with a regular
reinvoker, which is fully transparent to the JIT and inlines smoothly.
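
The counting idea can be sketched in plain Java, outside of
java.lang.invoke. This is only a model of the behavior described above,
not the code in the webrev: CountingWrapper, isTransparent and the
boolean flag are hypothetical names, and the flag merely stands in for
"the LambdaForm has been replaced with a regular reinvoker". The sketch
is not thread-safe.

```java
import java.util.function.IntUnaryOperator;

public class CountingWrapperSketch {
    // Value mentioned in this mail as a typical threshold.
    static final int DONT_INLINE_THRESHOLD = 30;

    // Hypothetical wrapper: forwards every call to the target and counts
    // invocations until the threshold has been exceeded.
    static final class CountingWrapper implements IntUnaryOperator {
        private final IntUnaryOperator target;
        private int invocations;
        // false: models the @DontInline reinvoker; true: models the regular one.
        private boolean transparent;

        CountingWrapper(IntUnaryOperator target) {
            this.target = target;
        }

        @Override
        public int applyAsInt(int x) {
            // Switch exactly once, on the first call past the threshold.
            if (!transparent && ++invocations > DONT_INLINE_THRESHOLD) {
                transparent = true;
            }
            return target.applyAsInt(x);
        }

        boolean isTransparent() {
            return transparent;
        }
    }

    public static void main(String[] args) {
        CountingWrapper w = new CountingWrapper(x -> x + 1);
        for (int i = 0; i < DONT_INLINE_THRESHOLD; i++) {
            w.applyAsInt(i);
        }
        System.out.println("after 30 calls: transparent=" + w.isTransparent());
        w.applyAsInt(0);
        System.out.println("after 31 calls: transparent=" + w.isTransparent());
    }
}
```

A branch that is invoked at most DONT_INLINE_THRESHOLD times never
leaves the blocking state in this model, so it stays out of the inlining
decision, much like a never-compiled target did before LF sharing.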

The downside of the chosen approach is that LambdaForm.updateForm()
doesn't guarantee that all places where the stale LambdaForm is used see
the update. If it is already part of some nmethod, that nmethod won't be
invalidated and recompiled, but will be kept as is. It shouldn't be a
problem, since DONT_INLINE_THRESHOLD is expected to be pretty low (~30),
so only very rarely executed branches are affected.
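
The stale-form effect can be modeled with an ordinary captured
reference. In this sketch, Handle, updateForm and compileSnapshot are
hypothetical stand-ins (not the real LambdaForm API): the snapshot plays
the role of an nmethod that was compiled while the old form was still
current and is never told about the update.

```java
import java.util.function.IntFunction;

public class StaleFormSketch {
    // Hypothetical stand-in for a method handle whose form can be swapped.
    static final class Handle {
        private volatile IntFunction<String> form;

        Handle(IntFunction<String> form) {
            this.form = form;
        }

        void updateForm(IntFunction<String> newForm) {
            this.form = newForm;
        }

        // Re-reads the current form on every call, so it sees updates.
        String invoke(int x) {
            return form.apply(x);
        }
    }

    // Models already-compiled code: the form is read once and baked in.
    static IntFunction<String> compileSnapshot(Handle h) {
        final IntFunction<String> captured = h.form;
        return x -> captured.apply(x);
    }

    public static void main(String[] args) {
        Handle h = new Handle(x -> "blocking:" + x);
        IntFunction<String> snapshot = compileSnapshot(h);

        h.updateForm(x -> "transparent:" + x);

        System.out.println(h.invoke(1));       // goes through the updated form
        System.out.println(snapshot.apply(1)); // still uses the form captured earlier
    }
}
```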

The fix significantly improves peak performance w/ full LF sharing 
(USE_LF_EDITOR=true).

Octane/nashorn results [2] for:
   (1) USE_LF_EDITOR=false DONT_INLINE_THRESHOLD=0 (default for 8u40&9)
   (2) USE_LF_EDITOR=true  DONT_INLINE_THRESHOLD=0 (default for 8u40&9)
   (3) USE_LF_EDITOR=true  DONT_INLINE_THRESHOLD=30 (fixed version)

(1) & (2) correspond to default configurations (partial & full LF 
sharing respectively). (3) is the fixed version.

The fix recovers peak performance for:
  * Crypto:       ~255ms -> ~12ms;
  * DeltaBlue:     ~40ms ->  ~2ms;
  * Raytracer:     ~62ms ->  ~7ms;
  * EarleyBoyer:  ~160ms -> ~22ms;
  * NavierStokes:  ~17ms -> ~13ms.

2 subbenchmarks (Box2D & Gbemu) still have some regressions, but it's
much better now:
    Box2D: ~48ms -> ~61ms  (w/o the fix: ~155ms)
    Gbemu: ~88ms -> ~116ms (w/o the fix: ~160ms)

Testing:
   tests: jck (api/java_lang/invoke), jdk/java/lang/invoke, 
jdk/java/util/streams, octane
  configurations: -ea -esa -Xverify:all
  + COMPILE_THRESHOLD={0,30} + USE_LF_EDITOR={false,true} + 
DONT_INLINE_THRESHOLD={0,30}
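
To make the configuration matrix explicit, the sketch below enumerates
the flag combinations. It rests on two assumptions that are not stated
above: that the full cross product of the three knobs was meant, and
that the knobs are passed as system properties with the
java.lang.invoke.MethodHandle. prefix. TestMatrix and configurations()
are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class TestMatrix {
    // Assumption: the knobs are system properties with this prefix.
    static final String PREFIX = "-Djava.lang.invoke.MethodHandle.";
    static final String COMMON = "-ea -esa -Xverify:all";

    // Assumption: every combination of the three knobs is tested.
    static List<String> configurations() {
        List<String> result = new ArrayList<>();
        for (int compileThreshold : new int[] {0, 30}) {
            for (boolean useLfEditor : new boolean[] {false, true}) {
                for (int dontInlineThreshold : new int[] {0, 30}) {
                    result.add(COMMON
                            + " " + PREFIX + "COMPILE_THRESHOLD=" + compileThreshold
                            + " " + PREFIX + "USE_LF_EDITOR=" + useLfEditor
                            + " " + PREFIX + "DONT_INLINE_THRESHOLD=" + dontInlineThreshold);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints one JVM flag line per configuration.
        for (String flags : configurations()) {
            System.out.println(flags);
        }
    }
}
```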

Thanks!

Best regards,
Vladimir Ivanov

[1] http://cr.openjdk.java.net/~vlivanov/profiling/
[2] http://cr.openjdk.java.net/~vlivanov/8059877/octane.txt