RFR: 8330274: C2 SuperWord: VPointer invar: same sum with different addition order should be equal

Mon Apr 22 11:01:27 UTC 2024

On Mon, 22 Apr 2024 07:53:05 GMT, Roberto Castañeda Lozano <rcastanedalo at openjdk.org> wrote:

>> This is an enhancement for AutoVectorization.
>> 
>> I want to improve the detection of `invar`s that are equivalent (guaranteed to compute the same value), but don't have the identical node (the computation is in a different order).
>> 
>> Note: only about 100 lines are real changes, the rest is tests. These are the first tests that check vectorization for MemorySegments.
>> 
>> **Solution Sketch: "canonicalize" the invar**
>> 
>> - Extract all summands of the `invar`: make a list.
>>   - Parse through `AddL`, `SubL`, `AddI`, `SubI`, to get summands.
>>   - Bypass `CastLL` and `CastII`
>>   - Recursively treat `ConvI2L`, `LShiftI` and `LShiftL`: i.e. canonicalize their input.
>> 
>> - Sort all extracted summands by node idx.
>> - Add up all summands in new order.
>> 
>> If two `invar`s use the same summands, then we know that after canonicalization the new nodes representing the `invar`s must be the same.
>> 
>> **Example**
>> 
>> 
>> invar1 = b + c + d + a
>> invar2 = d + b + a + c
>> 
>> -> equivalent but not identical nodes
>> 
>> Sort, and add up again:
>> 
>> invar1 = a + b + c + d
>> invar2 = a + b + c + d
>> 
>> -> now the nodes are identical
>> 
>> **Motivation: MemorySegment with invar**
>> 
>> One might think that this is a big of a special case: why would anybody write indices to an Array or MemorySegment where the invar has a different addition order for its summands?
>> 
>> This example did not vectorize, even though it should:
>> https://github.com/openjdk/jdk/blob/78e42d6e311c33548d16c6c74493388d9850238e/test/hotspot/jtreg/compiler/loopopts/superword/TestEquivalentInvariants.java#L425-L441
>> 
>> Both the `get` and the `set` look like they have the same address, and the address increases by a byte in each iteration.
>> 
>> Upon inspection, I saw that the `invar` that `VPointer` produces for the two operations are not identical: the order of addition of the `invar`'s summands is different, and thus the `invar` nodes are different.
>> 
>> The consequence: Only if we can prove that the two `invar` are identical can we know that the addresses are identical, and that there is no aliasing for loop carried dependencies. Since we have different `invar`, we don't know how the two addresses alias, and that prevents vectorization.
>> 
>> Why does this happen? After parsing, the graph looks like this:
>> ![image](https://github.com/openjdk/jdk/assets/32593061/f768d0b0-0b2f-48f0-bfdc-61e93e62bb4f)
>> 
>> We already see that the two addresses are different only by a `CastLL`, with type `long:>=0`. So...
>
> Nice analysis, Emanuel!
> 
>> An alternative solution would have been to add a corresponding CastLL for both the load and store, and hope that this means that the final address would look identical, though I don't know how difficult that would be.
> 
> I think it would be worth exploring this alternative before committing to the canonicalization solution proposed here. If we could get GVN to find the equivalence instead, that would be a more general solution in the sense that other transformations may also benefit from it.

@robcasloz I discussed it a little with @chhagedorn.

We think it would be tricky to get such a IGVN optimization right, and possibly it would not work for all cases.

Basically, we would have some value `v` with a direct use `use1` and an indirect use `use2` via some `CastLL`.

    v
 +-+-------+
 |         |
use1      CastLL (with dependency, and constrained type)
           |
          use2

The `CastLL` has a `If` dependency, for some `RangeCheck` for example, and therefore it constrains also the type of `v`.

To get `use1` and `use2` to have the same input, we would either have to:
- Remove the `CastLL`: not a good idea, this would lose type info and in some cases the lost pinning would lead some nodes further down to float up.
- Make `use1` use the `CastLL` instead of `v` directly. But then we would need to prove that all uses of `use1` are under the same dependency as the `CastLL`, and I think that would be tricky and error prone. And probably we could not make it work in all cases.

I would really like `SuperWord` to be robust to `CastII/LL`, and this optimization here does that. It also allows to parse more cases, i.e. `a + b + c` and `b + c + a`. I suspect that this could happen to a Java user, that they accidentally swap the addition order of an index (though maybe rare).

@robcasloz What do you think?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18795#issuecomment-2069098559