RFR: 8259609: C2: optimize long range checks in long counted loops [v3]

Mon Feb 1 08:57:43 UTC 2021

On Wed, 27 Jan 2021 08:32:42 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> I did a first pass over the changes and added some comments but I need more time to review.
>
> @TobiHartmann thanks for the comments. I pushed a change that should address them.

Partial review:
Yes, good cleanups along with great new functionality for big data!  I
like the refactorings which use the new generic factory methods from
JDK-8255150.

The comment in `PhaseIdealLoop::transform_long_range_checks` is very
good.

PhaseTransform::integercon has a bug left over from JDK-8256655,
the cast in `assert(((long)int_con) == l, "not an int")` is wrong.
Just use `intcon(checked_cast<jint>(l))` as you already do elsewhere.

(I also noticed also that `PhaseIdealLoop::exact_limit` and
`LoopLimitNode::value` have similar asserts which could be replaced
with `checked_cast`.  In fact, `PhaseIdealLoop::exact_limit` is a
duplicated from `LoopLimitNode::value` just to avoid creating a node,
yuck.  That fiddly stuff should have been written as few times as
possible.  I suggest putting it into a static inline in the class
`LoopLimitNode`, where it will also serve as documentation.)

typo:  /had a change to be hoisted yet/s/change/chance/

Also there are unusually long comment lines there and inside
`transform_long_range_checks`. 130 columns plus is longer than my
widest laptop window holds unless I aggravate my presbyopia with
a smaller font size.  Maybe, consider reflowing?  I know some
of those lines are intrinsically long.

This variable is unused:  `Node* long_stride = head->stride();`

The argument `signed_int` is unused in `MulNode::operates_on`.

Well done adapting `is_scaled_iv` and friends to T_LONG!

The logic for turning a shift amount into a scale factor looks
flawed, at least if `shift_amount` is ever out of range for the
corresponding basic type.  I suggest being safe rather than
sorry:   `shift_amount &= (bt == T_INT ? 31 : 63);  //clip`
(There are about 3 possible bugs with out-of-range shifts.)
Even better:  Use both overloadings of `java_shift_left`
from `globalDefinitions.hpp`; I think that would be the
best practice.

> I realize
  there's some code duplication but I didn't see a way to share logic
  between IdealLoopTree::may_have_range_check()
  IdealLoopTree::policy_range_check() that would feel right.

I did a direct `diff -w` of the two function bodies and they are
so similar that I think it is better to maintain them as one copy
of the code.  I suggest adding a boolean flag `provisional`
that activates the changed bits of logic:

bool IdealLoopTree::policy_range_check(PhaseIdealLoop *phase, bool provisional = false) const {
+ if (_head->is_CountedLoop()) {
    CountedLoopNode *cl = _head->as_CountedLoop();
    ...existing stuff about unroll_only etc...
+ }
+ BaseCountedLoopNode *cl = _head->as_BaseCountedLoop();
  Node *trip_counter = cl->phi();
+ BasicType bt = cl->bt();
  for (...) {
      ...
      Node *cmp = bol->in(1);
      Node *rc_exp = cmp->in(1);
      Node *limit = cmp->in(2);
+     if (provisional) {
+       // Try to pattern match with either cmp inputs, do not check whether one of the
+       // inputs is loop independent as it may not have had a chance to be hoisted yet.
+       if (!phase->is_scaled_iv_plus_offset(rc_exp, trip_counter, NULL, NULL, bt) &&
+           !phase->is_scaled_iv_plus_offset(limit,  trip_counter, NULL, NULL, bt)) {
        continue;
+     } else {  // check loop independence if non-provisional
         Node *limit_c = phase->get_ctrl(limit);
         ...existing checks & swap of rc_exp and limit...
         if (!phase->is_scaled_iv_plus_offset(rc_exp, trip_counter, NULL, NULL)) {
           continue;
         }
+     }
      if (is_loop_exit(iff)) {
        // Found valid reason to split iterations (if there is room).
        // NOTE: Usually a gross overestimate.
        return provisional || phase->may_require_nodes(est_loop_clone_sz(2));
      }
  }
}

Now for the biggest part:  You are replacing long range checks with
32-bit range checks which then can be RCE-ed.  You dial back the
strip-mining count as needed to make sure the various scaled index
computations don't suffer overflow.  Very clever!

I suggest placing `extract_long_range_checks` before
`transform_long_range_checks` because that's the order they are
executed.  That way the two can be read more easily in sequence.

The variable `jlong stride_con` should be `int`.  It is never used
for anything bigger, or else an assert would fire.

IMO the expression `scale > 0 != stride_con > 0` needs parens around
each of the two comparisons.  This is a case of relying on obscure
C language precedence rules.  (I had to look it up myself!)

This group of definitions is loop-invariant, and should probably
be set up once outside of the for-loop:

// Compute lower and upper
Node* last = new LoopLimitNode(C, int_zero, inner_iters_actual_int, int_stride);
register_new_node(last, entry_control);
last = new SubINode(last, int_stride);
register_new_node(last, entry_control);

Node* lower = outer_phi;
Node* upper = new ConvI2LNode(last);
register_new_node(upper, entry_control);
upper = new AddLNode(lower, upper);
register_new_node(upper, entry_control);

In `new AddLNode(upper, long_one)`, I think `long_one` should
have the same sign as the stride.  But I'm not sure (yet).

You have an expression for accumulating the maximum scale:
`max_scale = MAX2(max_scale, scale)`.  I think that the
accumulated value should not be negative, so you probably
want `max_scale = MAX2(max_scale, ABS(scale))`.

But, I suggest getting rid of `max_scale` altogether.
The logic will be simpler if it works like this:

jlong original_iters_limit = iters_limit;
jlong reduced_iters_limit = iters_limit;
for (...) {
  ... jlong new_limit = original_iters_limit / ABS(scale * stride_con)
  if (new_limit >= min_iters) { // ??? or > ???
    reduced_iters_limit = MIN2(reduced_iters_limit, new_limit);
    range_checks.push(c);

  }
}
return checked_cast<int>(reduced_iters_limit);

If you do this you won't need that delicate assert at the end.

More later.  I'm still working through the core logic in
`transform_long_range_checks`, and I will propose more changes
there.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2045