RFR: 8308994: C2: Re-implement experimental post loop vectorization [v2]

Pengfei Li pli at openjdk.org
Fri Jul 7 07:20:18 UTC 2023


On Mon, 3 Jul 2023 14:37:03 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> src/hotspot/share/opto/vmaskloop.cpp line 978:
>> 
>>> 976: 
>>> 977:   // Update loop increment/decrement to the vector mask true count
>>> 978:   Node* true_cnt = new VectorMaskTrueCountNode(root_vmask, TypeInt::INT);
>> 
>> This seems expensive to have to use inside the loop. Is there a way we could move this outside the loop? Because if we do take the backedge then we know that we have to take the full `stride`, right?
>
> I guess you would have to separate out the loop-internal uses and the outside uses of the `incr`. The inside uses would use the `stride` (or is there an exception?) and the outside ones could use the `VectorMaskTrueCountNode`.
> 
> Doing something like that could have better performance.

> This seems expensive to have to use inside the loop. Is there a way we could move this outside the loop? Because if we do take the backedge then we know that we have to take the full stride, right?

That's not completely right. We tried using the full multiplied stride inside the loop and handling only the out-of-loop uses of the `incr` node. Miscompilation happens in rare corner cases where the loop limit is very close to the max value of `int`, as in the case below.

for (int i = 2147483600; i < 2147483645; i++) {
  // ...
}

If we always take the full stride inside the vectorized loop, the induction variable may overflow and wrap around to a negative value before it reaches the loop limit. The backedge is then taken forever, and the finite loop is turned into an infinite loop.
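To make the wrap-around concrete, here is a small plain-Java model of the two increment schemes. It is only a sketch: it does not reflect how C2 represents the loop, and the vector length `VLEN = 16` and both helper names are assumptions chosen for illustration. The first helper always adds the full stride; the second adds only the number of active lanes, which is the role the mask true count plays.

```java
class StrideOverflowDemo {
    // Hypothetical number of lanes handled per masked iteration (assumption).
    static final int VLEN = 16;

    // Models "always take the full stride": returns true if the induction
    // variable wraps below its initial value before reaching the limit.
    // maxIters bounds the model so that the demo itself always terminates.
    static boolean fullStrideWraps(int init, int limit, int maxIters) {
        int i = init;
        for (int n = 0; n < maxIters && i < limit; n++) {
            i += VLEN;              // Java int addition wraps silently
            if (i < init) {
                return true;        // wrapped past Integer.MAX_VALUE
            }
        }
        return false;
    }

    // Models "increment by the mask true count": each iteration advances by
    // the number of lanes still inside the iteration space. Returns the final
    // value of the induction variable.
    static int trueCountFinalValue(int init, int limit) {
        int i = init;
        while (i < limit) {
            // Compute the remaining distance in long to avoid overflow.
            int active = (int) Math.min((long) VLEN, (long) limit - i);
            i += active;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(fullStrideWraps(2147483600, 2147483645, 10));
        System.out.println(trueCountFinalValue(2147483600, 2147483645));
    }
}
```

With the bounds from the case above, the full-stride model steps to 2147483616 and 2147483632, and the next addition exceeds `Integer.MAX_VALUE`, while the true-count model stops exactly at the limit.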

I see that for general counted loops, C2 inserts a limit check predicate during counted loop construction to avoid this issue (it's implemented in `PhaseIdealLoop::insert_loop_limit_check_predicate()`). But I'm not sure whether it is possible (and worthwhile) to add a similar limit check predicate for post loops. It looks like the current C2 post loops have no place to add extra loop predicates. What's your suggestion for this?
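For reference, the kind of condition such a limit check would need to guarantee can be modelled in plain Java as below. This is a sketch of the arithmetic only, not the HotSpot implementation; the helper name is hypothetical, and it assumes a positive stride and a loop test of the form `i < limit`.

```java
class LimitCheckSketch {
    // Returns true if a loop "for (i = init; i < limit; i += stride)" can
    // never overflow i: whenever i < limit holds, i + stride must still fit
    // in an int. Assumes stride > 0 and an "i < limit" exit test.
    static boolean fullStrideIsSafe(int limit, int stride) {
        // The largest i that enters the body is limit - 1, so we need
        // (limit - 1) + stride <= Integer.MAX_VALUE, evaluated in long.
        return (long) limit - 1 + stride <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(fullStrideIsSafe(2147483645, 16));
        System.out.println(fullStrideIsSafe(1000, 16));
    }
}
```

For the loop in the example above, the check fails with a stride of 16, so the full stride could not be used there without some fallback.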

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1255363034


More information about the hotspot-compiler-dev mailing list