RFR: 8259609: C2: optimize long range checks in long counted loops [v3]

Thu Mar 4 13:18:40 UTC 2021

On Mon, 1 Feb 2021 08:54:49 GMT, John R Rose <jrose at openjdk.org> wrote:

>> @TobiHartmann thanks for the comments. I pushed a change that should address them.
>
> Partial review:
> Yes, good cleanups along with great new functionality for big data!  I
> like the refactorings which use the new generic factory methods from
> JDK-8255150.
> 
> The comment in `PhaseIdealLoop::transform_long_range_checks` is very
> good.
> 
> PhaseTransform::integercon has a bug left over from JDK-8256655,
> the cast in `assert(((long)int_con) == l, "not an int")` is wrong.
> Just use `intcon(checked_cast<jint>(l))` as you already do elsewhere.
> 
> (I also noticed also that `PhaseIdealLoop::exact_limit` and
> `LoopLimitNode::value` have similar asserts which could be replaced
> with `checked_cast`.  In fact, `PhaseIdealLoop::exact_limit` is a
> duplicated from `LoopLimitNode::value` just to avoid creating a node,
> yuck.  That fiddly stuff should have been written as few times as
> possible.  I suggest putting it into a static inline in the class
> `LoopLimitNode`, where it will also serve as documentation.)
> 
> typo:  /had a change to be hoisted yet/s/change/chance/
> 
> Also there are unusually long comment lines there and inside
> `transform_long_range_checks`. 130 columns plus is longer than my
> widest laptop window holds unless I aggravate my presbyopia with
> a smaller font size.  Maybe, consider reflowing?  I know some
> of those lines are intrinsically long.
> 
> This variable is unused:  `Node* long_stride = head->stride();`
> 
> The argument `signed_int` is unused in `MulNode::operates_on`.
> 
> Well done adapting `is_scaled_iv` and friends to T_LONG!
> 
> The logic for turning a shift amount into a scale factor looks
> flawed, at least if `shift_amount` is ever out of range for the
> corresponding basic type.  I suggest being safe rather than
> sorry:   `shift_amount &= (bt == T_INT ? 31 : 63);  //clip`
> (There are about 3 possible bugs with out-of-range shifts.)
> Even better:  Use both overloadings of `java_shift_left`
> from `globalDefinitions.hpp`; I think that would be the
> best practice.
> 
>> I realize
>   there's some code duplication but I didn't see a way to share logic
>   between IdealLoopTree::may_have_range_check()
>   IdealLoopTree::policy_range_check() that would feel right.
> 
> I did a direct `diff -w` of the two function bodies and they are
> so similar that I think it is better to maintain them as one copy
> of the code.  I suggest adding a boolean flag `provisional`
> that activates the changed bits of logic:
> 
> bool IdealLoopTree::policy_range_check(PhaseIdealLoop *phase, bool provisional = false) const {
> + if (_head->is_CountedLoop()) {
>     CountedLoopNode *cl = _head->as_CountedLoop();
>     ...existing stuff about unroll_only etc...
> + }
> + BaseCountedLoopNode *cl = _head->as_BaseCountedLoop();
>   Node *trip_counter = cl->phi();
> + BasicType bt = cl->bt();
>   for (...) {
>       ...
>       Node *cmp = bol->in(1);
>       Node *rc_exp = cmp->in(1);
>       Node *limit = cmp->in(2);
> +     if (provisional) {
> +       // Try to pattern match with either cmp inputs, do not check whether one of the
> +       // inputs is loop independent as it may not have had a chance to be hoisted yet.
> +       if (!phase->is_scaled_iv_plus_offset(rc_exp, trip_counter, NULL, NULL, bt) &&
> +           !phase->is_scaled_iv_plus_offset(limit,  trip_counter, NULL, NULL, bt)) {
>         continue;
> +     } else {  // check loop independence if non-provisional
>          Node *limit_c = phase->get_ctrl(limit);
>          ...existing checks & swap of rc_exp and limit...
>          if (!phase->is_scaled_iv_plus_offset(rc_exp, trip_counter, NULL, NULL)) {
>            continue;
>          }
> +     }
>       if (is_loop_exit(iff)) {
>         // Found valid reason to split iterations (if there is room).
>         // NOTE: Usually a gross overestimate.
>         return provisional || phase->may_require_nodes(est_loop_clone_sz(2));
>       }
>   }
> }
> 
> Now for the biggest part:  You are replacing long range checks with
> 32-bit range checks which then can be RCE-ed.  You dial back the
> strip-mining count as needed to make sure the various scaled index
> computations don't suffer overflow.  Very clever!
> 
> I suggest placing `extract_long_range_checks` before
> `transform_long_range_checks` because that's the order they are
> executed.  That way the two can be read more easily in sequence.
> 
> The variable `jlong stride_con` should be `int`.  It is never used
> for anything bigger, or else an assert would fire.
> 
> IMO the expression `scale > 0 != stride_con > 0` needs parens around
> each of the two comparisons.  This is a case of relying on obscure
> C language precedence rules.  (I had to look it up myself!)
> 
> This group of definitions is loop-invariant, and should probably
> be set up once outside of the for-loop:
> 
> // Compute lower and upper
> Node* last = new LoopLimitNode(C, int_zero, inner_iters_actual_int, int_stride);
> register_new_node(last, entry_control);
> last = new SubINode(last, int_stride);
> register_new_node(last, entry_control);
> 
> Node* lower = outer_phi;
> Node* upper = new ConvI2LNode(last);
> register_new_node(upper, entry_control);
> upper = new AddLNode(lower, upper);
> register_new_node(upper, entry_control);
> 
> In `new AddLNode(upper, long_one)`, I think `long_one` should
> have the same sign as the stride.  But I'm not sure (yet).
> 
> You have an expression for accumulating the maximum scale:
> `max_scale = MAX2(max_scale, scale)`.  I think that the
> accumulated value should not be negative, so you probably
> want `max_scale = MAX2(max_scale, ABS(scale))`.
> 
> But, I suggest getting rid of `max_scale` altogether.
> The logic will be simpler if it works like this:
> 
> jlong original_iters_limit = iters_limit;
> jlong reduced_iters_limit = iters_limit;
> for (...) {
>   ... jlong new_limit = original_iters_limit / ABS(scale * stride_con)
>   if (new_limit >= min_iters) { // ??? or > ???
>     reduced_iters_limit = MIN2(reduced_iters_limit, new_limit);
>     range_checks.push(c);
> 
>   }
> }
> return checked_cast<int>(reduced_iters_limit);
> 
> If you do this you won't need that delicate assert at the end.
> 
> More later.  I'm still working through the core logic in
> `transform_long_range_checks`, and I will propose more changes
> there.

comment to keep the PR open

-------------

PR: https://git.openjdk.java.net/jdk/pull/2045