RFR: 8279888: Local variable independently used by multiple loops can interfere with loop optimizations

Fri Feb 4 14:54:35 UTC 2022

The bytecode of the 2 methods of the benchmark is structured
differently: loopsWithSharedLocal(), the slowest one, has multiple
backedges with a single head while loopsWithScopedLocal() has a single
backedge and all the paths in the loop body merge before the
backedge. loopsWithSharedLocal() has its head cloned which results in
a 2 loops loop nest.

loopsWithSharedLocal() is slow when 2 of the backedges are most
commonly taken with one taken only 3 times as often as the other
one. So a thread executing that code only runs the inner loop for a
few iterations before exiting it and executing the outer loop. I think
what happens is that any time the inner loop is entered, some
predicates are executed and the overhead of the setup of loop strip
mining (if it's enabled) has to be paid. Also, if iteration
splitting/unrolling was applied, the main loop is likely never
executed and all time is spent in the pre/post loops where potentially
some range checks remain.

The fix I propose is that ciTypeFlow, when it clone heads, not only
rewires the most frequent loop but also all this other frequent loops
that share the same head. loopsWithSharedLocal() and
loopsWithScopedLocal() are then fairly similar once c2 parses them.

Without the patch I measure:

LoopLocals.loopsWithScopedLocal      mixed  avgt    5  1108.874 ± 250.463  ns/op
LoopLocals.loopsWithSharedLocal      mixed  avgt    5  1575.665 ±  70.372  ns/op

with it:

LoopLocals.loopsWithScopedLocal      mixed  avgt    5  1108.180 ± 245.873  ns/op
LoopLocals.loopsWithSharedLocal      mixed  avgt    5  1234.665 ± 157.912  ns/op

But this patch also causes a regression when running one of the
benchmarks added by 8278518. From:

SharedLoopHeader.sharedHeader  avgt    5  505.993 ± 44.126  ns/op

to:

SharedLoopHeader.sharedHeader  avgt    5  724.253 ± 1.664  ns/op

The hot method of this benchmark used to be compiled with 2 loops, the
inner one a counted loop. With the patch, it's now compiled with a
single one which can't be converted into a counted loop because the
loop variable is incremented by a different amount along the 2 paths
in the loop body. What I propose to fix this is to add a new loop
transformation that detects that, because of a merge point, a loop
can't be turned into a counted loop and transforms it into 2
loops. The benchmark performs better with this:

SharedLoopHeader.sharedHeader  avgt    5  567.150 ± 6.120  ns/op

Not quite on par with the previous score but AFAICT this is due to
code generation not being as good (the loop head can't be aligned in
particular).

In short, I propose:

- changing ciTypeFlow so that, when it pays off, a loop with
multiple backedges is compiled as a single loop with a merge point in
the loop body

- adding a new loop transformation so that, when it pays off, a loop
with a merge point in the loop body is converted into a 2 loops loop
nest, essentially the opposite transformation.

-------------

Commit messages:
 - whitespaces
 - fix & test

Changes: https://git.openjdk.java.net/jdk/pull/7352/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7352&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8279888
  Stats: 1081 lines in 9 files changed: 781 ins; 132 del; 168 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7352.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7352/head:pull/7352

PR: https://git.openjdk.java.net/jdk/pull/7352