RFR(L): 8186027: C2: loop strip mining

Tue Oct 24 23:02:46 UTC 2017

Roland,

Did you consider less intrusive approach by adding branch over SafePoint with masking on index variable?

   int mask = LoopStripMiningMask * inc; // simplified
   for (int i = start; i < stop; i += inc) {
      // body
      if (i & mask != 0) continue;
      safepoint;
   }

Or may be doing it inside .ad file in new SafePoint node implementation so that ideal graph is not affected.
I am concern that suggested changes may affect Range Check elimination (you changed limit to variable value/flag) in addition to complexity of changes which may affect stability of C2.

Thanks,
Vladimir

On 10/3/17 6:19 AM, Roland Westrelin wrote:
> 
> http://cr.openjdk.java.net/~roland/8186027/webrev.00/
> 
> This converts loop:
> 
> for (int i = start; i < stop; i += inc) {
>    // body
> }
> 
> to a loop nest:
> 
> i = start;
> if (i < stop) {
>    do {
>      int next = MIN(stop, i+LoopStripMiningIter*inc);
>      do {
>        // body
>        i += inc;
>      } while (i < next);
>      safepoint();
>    } while (i < stop);
> }
> 
> (It's actually:
> int next = MIN(stop - i, LoopStripMiningIter*inc) + i;
> to protect against overflows)
> 
> This should bring the best of running with UseCountedLoopSafepoints on
> and running with it off: low time to safepoint with little to no impact
> on throughput. That change was first pushed to the shenandoah repo
> several months ago and we've been running with it enabled since.
> 
> The command line argument LoopStripMiningIter is the number of
> iterations between safepoints. In practice, with an arbitrary
> LoopStripMiningIter=1000, we observe time to safepoint on par with the
> current -XX:+UseCountedLoopSafepoints and most performance regressions
> due to -XX:+UseCountedLoopSafepoints gone. The exception is when an
> inner counted loop runs for a low number of iterations on average (and
> the compiler doesn't have an upper bound on the number of iteration).
> 
> This is enabled on the command line with:
> -XX:+UseCountedLoopSafepoints -XX:LoopStripMiningIter=1000
> 
> In PhaseIdealLoop::is_counted_loop(), when loop strip mining is enabled,
> for an inner loop, the compiler builds a skeleton outer loop around the
> the counted loop. The outer loop is kept as simple as possible so
> required adjustments to the existing loop optimizations are not too
> intrusive. The reason the outer loop is inserted early in the
> optimization process is so that optimizations are not disrupted: an
> alternate implementation could have kept the safepoint in the counted
> loop until loop opts are over and then only have added the outer loop
> and moved the safepoint to the outer loop. That would have prevented
> nodes that are referenced in the safepoint to be sunk out of loop for
> instance.
> 
> The outer loop is a LoopNode with a backedge to a loop exit test and a
> safepoint. The loop exit test is a CmpI with a new Opaque5Node. The
> skeleton loop is populated with all required Phis after loop opts are
> over during macro expansion. At that point only, the loop exit tests are
> adjusted so the inner loop runs for at most LoopStripMiningIter. If the
> compiler can prove the inner loop runs for no more than
> LoopStripMiningIter then during macro expansion, the outer loop is
> removed. The safepoint is removed only if the inner loop executes for
> less than LoopStripMiningIterShortLoop so that if there are several
> counted loops in a raw, we still poll for safepoints regularly.
> 
> Until macro expansion, there can be only a few extra nodes in the outer
> loop: nodes that would have sunk out of the inner loop and be kept in
> the outer loop by the safepoint.
> 
> PhaseIdealLoop::clone_loop() which is used by most loop opts has now
> several ways of cloning a counted loop. For loop unswitching, both inner
> and outer loops need to be cloned. For unrolling, only the inner loop
> needs to be cloned. For pre/post loops insertion, only the inner loop
> needs to be cloned but the control flow must connect one of the inner
> loop copies to the outer loop of the other copy.
> 
> Beyond verifying performance results with the usual benchmarks, when I
> implemented that change, I wrote test cases for (hopefully) every loop
> optimization and verified by inspection of the generated code that the
> loop opt triggers correct with loop strip mining.
> 
> Roland.
>