RFR: 8279888: Local variable independently used by multiple loops can interfere with loop optimizations

Wed Feb 9 07:29:09 UTC 2022

On Fri, 4 Feb 2022 14:41:55 GMT, Roland Westrelin <roland at openjdk.org> wrote:

> The bytecode of the 2 methods of the benchmark is structured
> differently: loopsWithSharedLocal(), the slowest one, has multiple
> backedges with a single head while loopsWithScopedLocal() has a single
> backedge and all the paths in the loop body merge before the
> backedge. loopsWithSharedLocal() has its head cloned, which results in
> a nest of 2 loops.
> 
> loopsWithSharedLocal() is slow when 2 of the backedges are most
> commonly taken with one taken only 3 times as often as the other
> one. So a thread executing that code only runs the inner loop for a
> few iterations before exiting it and executing the outer loop. I think
> what happens is that any time the inner loop is entered, some
> predicates are executed and the overhead of the setup of loop strip
> mining (if it's enabled) has to be paid. Also, if iteration
> splitting/unrolling was applied, the main loop is likely never
> executed and all time is spent in the pre/post loops where potentially
> some range checks remain.
> 
> The fix I propose is that ciTypeFlow, when it clones heads, not only
> rewires the most frequent loop but also all the other frequent loops
> that share the same head. loopsWithSharedLocal() and
> loopsWithScopedLocal() are then fairly similar once C2 parses them.
> 
> Without the patch I measure:
> 
> LoopLocals.loopsWithScopedLocal      mixed  avgt    5  1108.874 ± 250.463  ns/op
> LoopLocals.loopsWithSharedLocal      mixed  avgt    5  1575.665 ±  70.372  ns/op
> 
> with it:
> 
> LoopLocals.loopsWithScopedLocal      mixed  avgt    5  1108.180 ± 245.873  ns/op
> LoopLocals.loopsWithSharedLocal      mixed  avgt    5  1234.665 ± 157.912  ns/op
> 
> But this patch also causes a regression when running one of the
> benchmarks added by 8278518. From:
> 
> SharedLoopHeader.sharedHeader  avgt    5  505.993 ± 44.126  ns/op
> 
> to:
> 
> SharedLoopHeader.sharedHeader  avgt    5  724.253 ± 1.664  ns/op
> 
> The hot method of this benchmark used to be compiled with 2 loops, the
> inner one a counted loop. With the patch, it's now compiled with a
> single one which can't be converted into a counted loop because the
> loop variable is incremented by a different amount along the 2 paths
> in the loop body. What I propose to fix this is to add a new loop
> transformation that detects that, because of a merge point, a loop
> can't be turned into a counted loop and transforms it into 2
> loops. The benchmark performs better with this:
> 
> SharedLoopHeader.sharedHeader  avgt    5  567.150 ± 6.120  ns/op
> 
> Not quite on par with the previous score but AFAICT this is due to
> code generation not being as good (the loop head can't be aligned in
> particular).
> 
> In short, I propose:
> 
> - changing ciTypeFlow so that, when it pays off, a loop with
> multiple backedges is compiled as a single loop with a merge point in
> the loop body
> 
> - adding a new loop transformation so that, when it pays off, a loop
> with a merge point in the loop body is converted into a nest of 2
> loops, essentially the opposite transformation.
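
To make the two bytecode shapes from the quoted description concrete, here is a minimal sketch. The class and method bodies are hypothetical and are not the actual LoopLocals benchmark source; the assumption is only that a `while` loop with several `continue` paths typically compiles to one loop head with several backedges, while a loop whose paths all merge before the update has a single backedge.

```java
public class LoopShapes {
    // Hypothetical shape with multiple backedges: every `continue` (and the
    // fall-through at the end of the body) jumps back to the same loop head.
    static int sharedLocal(int[] data) {
        int i = 0;
        int sum = 0;
        while (i < data.length) {
            if ((data[i] & 1) == 0) {
                sum += data[i];
                i++;
                continue;       // backedge for the "even" path
            }
            if (data[i] > 100) {
                sum -= 1;
                i++;
                continue;       // backedge for the "large odd" path
            }
            sum += 2;
            i++;                // fall-through backedge
        }
        return sum;
    }

    // Hypothetical shape with a single backedge: all paths merge before the
    // increment, so there is only one jump back to the loop head.
    static int mergedPaths(int[] data) {
        int sum = 0;
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 1) == 0) {
                sum += data[i];
            } else if (data[i] > 100) {
                sum -= 1;
            } else {
                sum += 2;
            }
        }
        return sum;
    }
}
```

Both methods compute the same result; only the control-flow shape seen by ciTypeFlow would differ.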

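The second proposed transformation (splitting a loop whose variable is incremented by different amounts on 2 paths) can be pictured at the source level as below. This is only an illustration with hypothetical names; the real transformation would operate on C2's IR, and nothing here claims how C2 classifies either loop.

```java
public class SplitMergePoint {
    // Single loop: `i` advances by 1 on one path and by 2 on the other, so
    // there is no single stride at the merge point before the backedge.
    static int singleLoop(int[] a) {
        int sum = 0;
        int i = 0;
        while (i < a.length) {
            if (a[i] >= 0) {
                sum += a[i];
                i += 1;
            } else {
                i += 2;
            }
        }
        return sum;
    }

    // Hand-written equivalent as a nest of 2 loops: the inner loop only
    // ever advances `i` by 1; the other increment moves to the outer loop.
    static int loopNest(int[] a) {
        int sum = 0;
        int i = 0;
        while (i < a.length) {
            while (i < a.length && a[i] >= 0) {
                sum += a[i];
                i += 1;         // constant stride inside the inner loop
            }
            if (i < a.length) {
                i += 2;         // the other path, now in the outer loop
            }
        }
        return sum;
    }
}
```
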
I ran some tests and I'm seeing a large number (> 700) of failures caused by the `negative trip count?` assert. For example, `compiler/c2/Test6603011.java` with `-server -Xmixed`:

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (.../open/src/hotspot/share/opto/loopPredicate.cpp:1447), pid=11690, tid=11708
#  assert(!follow_branches || loop_trip_cnt >= 0) failed: negative trip count?
#
# JRE version: Java(TM) SE Runtime Environment (19.0) (fastdebug build 19-internal+0-2022-02-08-1039057.tobias.hartmann.jdk2)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 19-internal+0-2022-02-08-1039057.tobias.hartmann.jdk2, compiled mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x135c470]  PhaseIdealLoop::loop_predication_impl(IdealLoopTree*) [clone .part.0]+0xf70

Current CompileTask:
C2:  10125 4309    b  4       jdk.internal.org.objectweb.asm.ClassReader::readUtf (161 bytes)

Stack: [0x00007f83f4c3a000,0x00007f83f4d3b000],  sp=0x00007f83f4d34c60,  free space=1003k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x135c470]  PhaseIdealLoop::loop_predication_impl(IdealLoopTree*) [clone .part.0]+0xf70
V  [libjvm.so+0x135c739]  IdealLoopTree::loop_predication(PhaseIdealLoop*)+0x109
V  [libjvm.so+0x13a4b47]  PhaseIdealLoop::build_and_optimize(LoopOptsMode)+0x1327
V  [libjvm.so+0xa9cbea]  PhaseIdealLoop::optimize(PhaseIterGVN&, LoopOptsMode)+0x28a
V  [libjvm.so+0xa99513]  Compile::Optimize()+0x1193
V  [libjvm.so+0xa9b5f5]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1575
V  [libjvm.so+0x8b1e29]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x669
V  [libjvm.so+0xaab4d8]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xc88
V  [libjvm.so+0xaac2b8]  CompileBroker::compiler_thread_loop()+0x668
V  [libjvm.so+0x19382da]  JavaThread::thread_main_inner()+0x25a
V  [libjvm.so+0x1940620]  Thread::call_run()+0x100
V  [libjvm.so+0x1623cd4]  thread_native_entry(Thread*)+0x104

-------------

PR: https://git.openjdk.java.net/jdk/pull/7352