RFR(L): 8186027: C2: loop strip mining

Tue Oct 3 13:19:47 UTC 2017

http://cr.openjdk.java.net/~roland/8186027/webrev.00/

This converts loop:

for (int i = start; i < stop; i += inc) {
  // body
}

to a loop nest:

i = start;
if (i < stop) {
  do {
    int next = MIN(stop, i+LoopStripMiningIter*inc);
    do {
      // body
      i += inc;
    } while (i < next);
    safepoint();
  } while (i < stop);
}

(It's actually:
int next = MIN(stop - i, LoopStripMiningIter*inc) + i;
to protect against overflows)

This should bring the best of running with UseCountedLoopSafepoints on
and running with it off: low time to safepoint with little to no impact
on throughput. That change was first pushed to the shenandoah repo
several months ago and we've been running with it enabled since.

The command line argument LoopStripMiningIter is the number of
iterations between safepoints. In practice, with an arbitrary
LoopStripMiningIter=1000, we observe time to safepoint on par with the
current -XX:+UseCountedLoopSafepoints and most performance regressions
due to -XX:+UseCountedLoopSafepoints gone. The exception is when an
inner counted loop runs for a low number of iterations on average (and
the compiler doesn't have an upper bound on the number of iteration).

This is enabled on the command line with:
-XX:+UseCountedLoopSafepoints -XX:LoopStripMiningIter=1000

In PhaseIdealLoop::is_counted_loop(), when loop strip mining is enabled,
for an inner loop, the compiler builds a skeleton outer loop around the
the counted loop. The outer loop is kept as simple as possible so
required adjustments to the existing loop optimizations are not too
intrusive. The reason the outer loop is inserted early in the
optimization process is so that optimizations are not disrupted: an
alternate implementation could have kept the safepoint in the counted
loop until loop opts are over and then only have added the outer loop
and moved the safepoint to the outer loop. That would have prevented
nodes that are referenced in the safepoint to be sunk out of loop for
instance.

The outer loop is a LoopNode with a backedge to a loop exit test and a
safepoint. The loop exit test is a CmpI with a new Opaque5Node. The
skeleton loop is populated with all required Phis after loop opts are
over during macro expansion. At that point only, the loop exit tests are
adjusted so the inner loop runs for at most LoopStripMiningIter. If the
compiler can prove the inner loop runs for no more than
LoopStripMiningIter then during macro expansion, the outer loop is
removed. The safepoint is removed only if the inner loop executes for
less than LoopStripMiningIterShortLoop so that if there are several
counted loops in a raw, we still poll for safepoints regularly.

Until macro expansion, there can be only a few extra nodes in the outer
loop: nodes that would have sunk out of the inner loop and be kept in
the outer loop by the safepoint.

PhaseIdealLoop::clone_loop() which is used by most loop opts has now
several ways of cloning a counted loop. For loop unswitching, both inner
and outer loops need to be cloned. For unrolling, only the inner loop
needs to be cloned. For pre/post loops insertion, only the inner loop
needs to be cloned but the control flow must connect one of the inner
loop copies to the outer loop of the other copy.

Beyond verifying performance results with the usual benchmarks, when I
implemented that change, I wrote test cases for (hopefully) every loop
optimization and verified by inspection of the generated code that the
loop opt triggers correct with loop strip mining.

Roland.