RFR(M): 8223051: support loops with long (64b) trip counts

Fri May 1 01:36:53 UTC 2020

A thousand thanks for taking this on!

On Apr 30, 2020, at 12:45 AM, Roland Westrelin <rwestrel at redhat.com> wrote:
> 
> 
> https://bugs.openjdk.java.net/browse/JDK-8223051
> http://cr.openjdk.java.net/~roland/8223051/webrev.00/
> 
> This transforms a long counted loop into a strip mined loop nest. That
> is roughly:
> 
> for (long l = long_start; l < long_stop; l += long_stride) {
> }
> 
> into
> 
> for (long l = long_start; l < long_stop; ) {
>  int int_stride = (int)long_stride;
>  int int_stop = MIN(long_stop - l, max_jint - int_stride);
>  l += int_stop;
>  for (int i = 0; i < int_stop; i += int_stride) {
>  }
> }

It would be good to put this pseudo-code above the definition
of PhaseIdealLoop::is_long_counted_loop, but with names
adjusted to match the code.

Let me sketch it in terms of the names in the code:

L: for (long phi = init; phi < limit; phi += stride) {
   // phi := Phi(L, init, phi + stride)
   … use phi and (phi + stride) …
}

==transform=>

const long inner_iters_limit = INT_MAX - stride;
assert(stride <= inner_iters_limit);  // else deopt
assert(limit + stride <= LONG_MAX);  // else deopt
L1: for (long phi1 = init; phi1 < limit; phi1 += stride) {
   // phi1 := Phi(L1, init, phi1 + stride)
   long inner_iters_max = MAX(0, limit + stride - phi1);
   long inner_iters_actual = MIN(inner_iters_max, inner_iters_limit);
   L2: for (int phi2 = 0; phi2 < inner_iters_actual; phi2 += stride) {
      … use (phi1 + phi2) and (phi1 + phi2 + stride) …
   }
}

> This is implemented as a separate transformation from loop strip mining
> of JDK-8186027. I used the logic from JDK-8186027 as inspiration but
> it's really quite different.

Thank you for pursuing it!

> If JDK-8186027's loop strip mining is enabled the loop nest above can be
> further transformed into:
> 
> for (long l = long_start; l < long_stop; ) {
>  for (int i = 0; i < int_stop; i += int_stride) {
>    for (int j = i; j < LoopStripMiningIter; j+= int_stride) {
>    }
>  }
> }

Nice.  This is worth a stress test; I think my comments below
set up the conditions for allowing all three loop layers to be
non-trivial, in a stress test.

> I refactored the code of PhaseIdealLoop::is_counted_loop() so it was
> straightforward to add a PhaseIdealLoop::is_long_counted_loop() that
> shares some logic with PhaseIdealLoop::is_counted_loop().

Thanks.  I have suggested a few more points of refactoring below.

> is_long_counted_loop() starts by looking at the shape of the loop and if
> its shape is that of a counted loop with a long induction variable, then
> an outer loop is added with a long induction variable. An int induction
> variable is constructed for the inner loop. At this point the loop nest
> is only partially constructed.
> 
> is_long_counted_loop() then attempts the conversion of the inner loop
> into an int counted loop with a call to is_counted_loop(). If that fails
> for some rare corner case, is_long_counted_loop() backs off and
> transforms the loop nest back so it's a single long loop again.

(See below about an extra diagnostic counter for this step.)

> If the inner loop is successfully converted into a counted loop,
> is_long_counted_loop() finishes building the loop nest. This is
> different from JDK-8186027's loop strip mining which builds the loop
> nest in 2 phases: first a skeleton outer loop and after loop opts, the
> fully built loop nest.
> 
> I also added stressing code that turns:
> 
> for (int i = int_start; i < int_stop; i += int_stride) {
> }
> 
> into:
> 
> for (long l = (long)int_start; l < (long)int_stop; l += (long)int_stride) {
> }
> 
> that can then be converted into the loop nest above. The reason for this
> is that I was concerned long loops were too uncommon in the wild for
> this change to be properly tested.

Very good move!

> I had to change the asserts in loopopts.cpp, because all nodes that are
> added when the loop nest is constructed have the same dom depth.

See request below for a small comment.
> 
> This change doesn't handle RCE. I'll work on that next.

Thank you.  The sort of RCE I hope to get to, eventually, is something
which performs range checks on long values (not just int values), and
still allows iteration range splitting.  So:

L: for (long phi = init; phi < limit; phi += stride) {
   // phi := Phi(L, init, phi + stride)
   if (phi >= 0 && phi < max) {
     … use phi …
   }
}

==transform=>

…
L1: for (long phi1 = init; phi1 < limit; phi1 += stride) {
   // phi1 := Phi(L1, init, phi1 + stride)
   …
   int phi2 = 0;
   if (phi1 + phi2 < 0) {
      L2_PRE: …
   }
   assert((phi1 + phi2) >= 0);
   L2: for (; phi2 < MIN(inner_iters_actual, phi1 - max); phi2 += stride) {
      … use phi, without range check …
   }
   if (phi1 + phi2 >= max) {
      L2_POST: …
   }
}

Now for specific comments:

This code, which is the use of clone_loop_predicates with the new
boolean, is adequately documented, but I’d prefer to see the boolean
even more clearly called out:

+  // Keep the outer loop below the existing predicates (some
+  // predicates may have been added already) and clone non-concrete
+  // predicates between the int loop head and the long loop head.
++ const bool not_concrete = false;
++  Node* predicate_proj = clone_loop_predicates(entry_control, outer_head, true, false, 0, NULL, not_concrete);

In this webrev I didn’t see a discussion of “what does concrete
mean”; I assume it is elsewhere, and that I would know it if I were
familiar with the loop optimizations (which very few people are!).
Maybe add:
+ // See 'GraphKit::add_empty_predicates'.

(Is a concrete predicate one which was inserted above an empty
place-holder by a predication transformation?  In that case, I
might wish to call those “predication tests”, or some other term
tied to the name of a specific optimization transform,  rather
than “concrete predicates”.)

I have a similar comment about the new ‘update_body’ flag.
For that, I suggest writing something with an explanatory
comment near the top of blocks of code which need it:

++ // Although this code creates new loop structure, it does
++ // not change the loop nest, because the nest itself is only
++ // partially constructed at this point.
++ const bool same_loop = false;  // don’t update loop bodies
…
++  register_control(outer_le, outer_ilt, iffalse, same_loop);
++  register_control(outer_ift, outer_ilt, outer_le, same_loop);

The point of this trick of using a symbolic name instead of a
boolean constant is (a) to make the use of it a little easier to
follow in the function call, and (b) to allow the declaration
of the constant to carry its own little load of documentation.

(BTW I find it slightly surprising when an optional flag defaults
to true rather than false; looks like you are doing the opposite
which I suppose is a reasonable convention also.)

In loopopts.cpp I suggest adding a very brief comment saying
when dom equality can occur.  (In both places, I guess.)  It might
help an integrator at some point, since the change is so isolated
and textually simple (without a comment).

That’s a nice job refactoring.  C2 has notoriously long methods,
and this makes some of them incrementally more readable.
If you went to some trouble to make the webrev diff smaller,
thank you.  It’s hard to refactor blocks out of mega-functions
without extra diff noise.  That said, ‘is_long_counted_loop’ is
a new mega-function.  I hope it doesn’t have to be refactored
any time soon.  I think you did the right thing here, even
though it’s possible to imagine more refactoring.

I think you inserted the check to LoopStripMiningIter twice.
You should probably pick one copy to remove.

Suggest adding this comment:

+  if (cmp == NULL || cmp->Opcode() != Op_CmpI) {
+    return false;   ++ // Avoid pointer & float & 64-bit compares

And a similar one that says “32-bit compares”.

Suggest adding an assert to loop_iv_stride:

++ assert(incr->Opcode() == Op_AddI || incr->Opcode() == Op_AddL, "caller resp.");

When refactoring mega-functions, I think it helps sometimes to add
such asserts to make it clear who is responsible for checking what.

It would be nice to add an extra filter step to loop_iv_incr, so it always
returns an add, but that would require the refactoring to touch the
truncation detector more than you did, so I think that’s OK.  You could
just add a comment to loop_iv_incr saying, “caller must ensure that
this is an increment of the expected kind”.  Or not; that’s pretty
obviously true.  I’m more sure that the previously suggested assert
is helpful.

Now for some arithmetic.  I think this test is not obviously correct:

+  if (ABS(stride_con) >= max_jint) {

You need an unsigned comparison here, or some other dodge, to avoid
the negative vibes from ABS(min_jlong).  I think the best thing to do is
add an extra overflow check which is easy to read, rather than checking
for that one extra case:

++  if (stride_con != (jint)stride_con || ABS(stride_con) >= max_jint) {

I think you should also check for “stride_con == 0” here.  The 32-bit
code checks for zero strides but I don’t see it here.  Perhaps that would
fold up later after the 32-bit loop is created?  But it seems tidier to keep
the 32-bit and 64-bit code as parallel as possible.

Also, I think this test on stride belongs after you increment _long_loops,
not before.  A long loop with a really big stride is, after all, still a long loop.

BTW, this should be a jlong, not a plain C long:

++  jlong stride_con = stride->get_long();

The following logic is hard for me to prove correct, and I would prefer it
to miss a few valid cases rather in a simpler form:
+  if (phi_incr != NULL && iters_limit <= ABS(stride_con)) {

I suggest either detuning the check in its current place:
++  if (iters_limit <= ABS(stride_con)) {

Or else move the check into the following conditional:
+  // if the loop exit test is on the IV before it is incremented:
+  // i < limit, we transform the exit test so it is performed on the
+  // exit test after it is incremented: i + stride < limit + stride.
+  // We need limit + stride to not overflow.
+  if (phi_incr != NULL) {
++  if (iters_limit <= ABS(stride_con)) { ++ return false;  ++ }

It helped me to associate this block with the later introduction
of “adjusted_limit”; suggest:

++  // We need limit + stride to not overflow.  See adjusted_limit below.

Also, the immediately following range check logic is very odd,
and doesn’t correspond to anything in the 32-bit loop code.
What you appear to be doing here is checking how limit_t
is going to behave with respect to the stride and there are
three possibilities:  It can sometimes cause 64-bit overflow,
or it cannot cause 64-bit overflow, or it *must* cause 64-bit
overflow.  The two parts of this check are widely separately
and not obviously connected, although they are linked by
a common task of predicate insertion.

I suggest making it more readable by writing a range-check
helper function, actually two of them:

// Return 0 if it won't overflow, -1 if it must overflow, and 1 otherwise.
static int check_stride_overflow(jint stride_con, const TypeInt* limit_t) {
  if (stride_con > 0) {
    if (limit_t->_lo > max_jint - stride_con) {
      return -1;
    }
    if (limit_t->_hi > max_jint - stride_con) {
      return 1;
    }
  } else {
    if (limit_t->_hi < min_jint - stride_con) {
      return -1;
    }
    if (limit_t->_lo < min_jint - stride_con) {
      return 1;
    }
  }
  return 0;
}
static int check_stride_overflow(jlong stride_con, const TypeLong* limit_t) {
  if (stride_con > 0) {
    if (limit_t->_lo > max_jlong - stride_con) {
      return -1;
    }
    if (limit_t->_hi > max_jlong - stride_con) {
      return 1;
    }
  } else {
    if (limit_t->_hi < min_jlong - stride_con) {
      return -1;
    }
    if (limit_t->_lo < min_jlong - stride_con) {
      return 1;
    }
  }
  return 0;
}

Then use the functions appropriately, which will tend to make
the case analysis a little easier to understand.  In the case of the
32-bit function, it will simplify (and clarify) the tri-state logic
the begins like this:

  if (limit->is_Con()) {
    int limit_con = limit->get_int();
    if ((stride_con > 0 && limit_con > (max_jint - stride_m)) ||
        (stride_con < 0 && limit_con < (min_jint - stride_m))) {
      // Bailout: it could be integer overflow.

The simplified could look like:

  int sov = check_stride_overflow(stride_m, limit_t);
  // If sov==0, limit's type always satisfies the condition, for example,
  // when it is an array length.
  if (sov != 0) {
     if (sov < 0) {
      return false;  // Bailout: integer overflow is certain.
    }
    … prepare dynamic limit check …
  }

I think I’m on the right track here because in both 32-bit and 64-bit
code these checks adjoin predicate insertion logic, specifically the
call to find_predicate_insertion_point.

The 64-bit version of the code should look similar, I think.  Since
some of the predicate insertion work is delayed, a “todo” flag might
be set.  (You may notice I like to use names for reifying logical relations
between different parts of the code.)

The new node inner_iters_max would read a little better with a
comment:

+  _igvn.register_new_node_with_optimizer(inner_iters_max);
++  // inner_iters_max is MAX(0, adjusted_limit - iv), when stride > 0

The definition of adjusted_limit also deserves comment, for that
matter.  Perhaps this one, lifted from the 32-bit code:

    // If compare points directly to the phi we need to adjust
    // the compare so that it points to the incr.

In fact, the 32-bit code near that comment should, in my opinion,
also be edited to use a different variable adjusted_limit, even though
that variable will have a short span.  There is value to having the
32-bit and 64-bit code look as similar as possible.  The fact that
this comment already occurs twice in the 32-bit code is additional
evidence that there should be an adjusted_limit variable, defined
near the *first* occurrence of that comment, in the 32-bit code.
I think such a cleanup is a desirable way to reduce the cost of
making a new almost-copy of the 32-bit code, in its 64-bit form.

Another milestone in the arithmetic the deserves a comment
is this definition:

+  _igvn.register_new_node_with_optimizer(inner_iters_actual);
++  // inner_iters_actual is unsigned MIN(inner_iters_max, max_jint - ABS(stride))
++  // this is the 32-bit number of iterations to execute in the inner loop

(Why is it unsigned?  I think its operands are never negative.)

You know, after decoding the above MIN and MAX expressions,
it seems to me that it might be time to introduce helper functions
in GraphKit to create such MIN and MAX instructions.  They are
really hard to read as-is.  Something like:

GraphKit::build_min_max(BasicType bt, bool is_max, bool is_unsigned);
GraphKit::signed_max(BasicType bt) { return build_min_max(bt, true, false); }
GraphKit::unsigned_min(BasicType bt) { return build_min_max(bt, false, true); }
etc.

There’s a backtrack at this point:

+    // That fails. Undo graph changes we've done so far.

I think that should collect a count somewhere, to be reported as part
of statistics.  That way, when you run stress tests, you can ensure that
they include this backtrack path.  I see that’s counted, in part, by the
difference of _long_loops_success and _long_loops, but maybe another
counter here wouldn’t be a bad idea, since there is lot of work to undo
at this very late point.

BTW, I like StressLongCountedLoop a lot; it’s a nice test.  I suggest
a second stress mode, in which the pinning against the jint range
(in places like max_jint - ABS(stride)) is replaced by pinning against
a value like max_jint/100.  The point would be to ensure that both
outer and inner layers of the decomposed loops get a chance to run.
As StressLongCountedLoop is currently formulated, it will only
run the outer loop once, right?  The stress mode logic also deserves
a counter, so we can tell how many loops were actually promoted.

In the finale of the code, the IV and its increment IV+S are replaced
by (OUTERIV + INNERIV) and (OUTERIV + INNERIV + S).  The two
loops that handle these chores are very similar.  Would you mind
putting them into a common helper function?  I think that would
make the code easier to maintain and understand.

I think you need another reviewer, one currently working on the
JIT team.  But I like this code, and you can cite me as a reviewer.

— John