RFR(M): 8223051: support loops with long (64b) trip counts

Mon May 4 16:24:17 UTC 2020

Thanks for the careful review.

> It would be good to put this pseudo-code above the definition
> of PhaseIdealLoop::is_long_counted_loop, but with names
> adjusted to match the code.
>
> Let me sketch it in terms of the names in the code:
>
> L: for (long phi = init; phi < limit; phi += stride) {
>    // phi := Phi(L, init, phi + stride)
>    … use phi and (phi + stride) …
> }
>
> ==transform=>
>
> const long inner_iters_limit = INT_MAX - stride;
> assert(stride <= inner_iters_limit);  // else deopt

The assertion above is statically determined to be true, right? It's not
a runtime check that can cause a deopt.

> In this webrev I didn’t see a discussion of “what does concrete
> mean”; I assume it is elsewhere, and that I would know it if I were
> familiar with the loop optimizations (which very few people are!).
> Maybe add:
> + // See 'GraphKit::add_empty_predicates'.
>
> (Is a concrete predicate one which was inserted above an empty
> place-holder by a predication transformation?  In that case, I

Yes, they are. I think Christian is the one that introduced the term
"concrete" in one of his recent changes.

> might wish to call those “predication tests”, or some other term
> tied to the name of a specific optimization transform,  rather
> than “concrete predicates”.)

There are at least 3 kind predicates: the place holder inserted at parse
time, the tests added by predication above the place holder (the
concrete ones), skeleton predicates that are added between main loop and
pre loop so C2 doesn't crash in some rare cases of over unrolling (a
recent addition). Skeleton predicates themselves are expanded and
updated as unrolling proceeds (creating a 4th kind of predicates?). They
don't compile to any code. As you can tell, this has all gotten
complicated and confusing and the naming scheme (or lack of naming
scheme) hasn't helped. That would need to be revisited.

> I think you should also check for “stride_con == 0” here.  The 32-bit
> code checks for zero strides but I don’t see it here.  Perhaps that would
> fold up later after the 32-bit loop is created?  But it seems tidier to keep
> the 32-bit and 64-bit code as parallel as possible.

Isn't stride_con == 0 actually a bug? IGVN should have folded it but it
was missed so rather than silently ignoring it wouldn't it be better to
assert stride_con != 0?

> The following logic is hard for me to prove correct, and I would prefer it
> to miss a few valid cases rather in a simpler form:
> +  if (phi_incr != NULL && iters_limit <= ABS(stride_con)) {
>
> I suggest either detuning the check in its current place:
> ++  if (iters_limit <= ABS(stride_con)) {

I don't think that test is needed actually. If the loop exit test is
performed on the iv before increment, I now transform the exit test from
iv < limit to iv + stride < limit + stride
but before I settled for that I tried having the int counted loop handle
that transformation. That test must be a left over from those attempts.

> Also, the immediately following range check logic is very odd,
> and doesn’t correspond to anything in the 32-bit loop code.
> What you appear to be doing here is checking how limit_t
> is going to behave with respect to the stride and there are
> three possibilities:  It can sometimes cause 64-bit overflow,
> or it cannot cause 64-bit overflow, or it *must* cause 64-bit
> overflow.  The two parts of this check are widely separately
> and not obviously connected, although they are linked by
> a common task of predicate insertion.

It's different because the 32 bit loop code checks that both the iv
doesn't overflow and that the limit adjustment required when the compare
point to the phi in one go. I don't need to the check that iv doesn't
overflow (I set the limit of the inner loop so it doesn't). 

>
> The new node inner_iters_max would read a little better with a
> comment:
>
> +  _igvn.register_new_node_with_optimizer(inner_iters_max);
> ++  // inner_iters_max is MAX(0, adjusted_limit - iv), when stride > 0
>
> The definition of adjusted_limit also deserves comment, for that
> matter.  Perhaps this one, lifted from the 32-bit code:
>
>     // If compare points directly to the phi we need to adjust
>     // the compare so that it points to the incr.
>
> In fact, the 32-bit code near that comment should, in my opinion,
> also be edited to use a different variable adjusted_limit, even though
> that variable will have a short span.  There is value to having the
> 32-bit and 64-bit code look as similar as possible.  The fact that
> this comment already occurs twice in the 32-bit code is additional
> evidence that there should be an adjusted_limit variable, defined
> near the *first* occurrence of that comment, in the 32-bit code.
> I think such a cleanup is a desirable way to reduce the cost of
> making a new almost-copy of the 32-bit code, in its 64-bit form.
>
> Another milestone in the arithmetic the deserves a comment
> is this definition:
>
> +  _igvn.register_new_node_with_optimizer(inner_iters_actual);
> ++  // inner_iters_actual is unsigned MIN(inner_iters_max, max_jint - ABS(stride))
> ++  // this is the 32-bit number of iterations to execute in the inner loop
>
> (Why is it unsigned?  I think its operands are never negative.)

inner_iters_max is an unsigned integer. If the loop is:

for (long l = Long.MIN_VALUE; l < Long.MAX_VALUE; l++) {

then inners_iter_max doesn't fit in the signed long integer range.

> There’s a backtrack at this point:
>
> +    // That fails. Undo graph changes we've done so far.
>
> I think that should collect a count somewhere, to be reported as part
> of statistics.  That way, when you run stress tests, you can ensure that
> they include this backtrack path.  I see that’s counted, in part, by the
> difference of _long_loops_success and _long_loops, but maybe another
> counter here wouldn’t be a bad idea, since there is lot of work to undo
> at this very late point.

One issue with the counters I added is that they are hard to interpret:
if the transformation of long loops fails, it's attempted again at every
pass of loop opts and so _long_loops is incremented every time. So
_long_loops_success < _long_loops tell us there were some failures but
not how many.

> BTW, I like StressLongCountedLoop a lot; it’s a nice test.  I suggest
> a second stress mode, in which the pinning against the jint range
> (in places like max_jint - ABS(stride)) is replaced by pinning against
> a value like max_jint/100.  The point would be to ensure that both
> outer and inner layers of the decomposed loops get a chance to run.
> As StressLongCountedLoop is currently formulated, it will only
> run the outer loop once, right?  The stress mode logic also deserves
> a counter, so we can tell how many loops were actually promoted.

Right, the outer loop should be executed only once.

I'm working on an updated patch.

Roland.