RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction [v2]
Vladimir Ivanov
vlivanov at openjdk.org
Tue Nov 12 21:53:09 UTC 2024
On Sun, 10 Nov 2024 07:40:30 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> In the latest version you added new Ideal nodes (`MulIVL` and `MulUIVL`). I don't see a compelling reason to do so. IMO matcher functionality is more than enough to cover `VPMULDQ` case. `MulIVL` is equivalent to `MulVL` + `has_int_inputs()` predicate. For `MulUIVL` you additionally do input rewiring (using `forward_masked_input`), but (1) `AndV src (Replicate 0xFFFFFFFF))` operands can be easily detected on matcher side (with an extra AD instruction); and (2) such optimization is limited because it is valid only for `0xFFFFFFFF` case while `has_uint_inputs() == true` for `C <= 0xFFFFFFFF`.
>>
>> So, IMO `MulIVL` and `MulUIVL` nodes just add noise in Ideal graph without improving situation during matching.
>
>> In the latest version you added new Ideal nodes (`MulIVL` and `MulUIVL`). I don't see a compelling reason to do so. IMO matcher functionality is more than enough to cover `VPMULDQ` case. `MulIVL` is equivalent to `MulVL` + `has_int_inputs()` predicate. For `MulUIVL` you additionally do input rewiring (using `forward_masked_input`), but (1) `AndV src (Replicate 0xFFFFFFFF))` operands can be easily detected on matcher side (with an extra AD instruction); and (2) such optimization is limited because it is valid only for `0xFFFFFFFF` case while `has_uint_inputs() == true` for `C <= 0xFFFFFFFF`.
>>
>> So, IMO `MulIVL` and `MulUIVL` nodes just add noise in Ideal graph without improving situation during matching.
>
> Hi @iwanowww ,
> Problem occurs only if AndV gets shared; in such a case, matcher will not be able to identify the constrained multiplication pattern and absorb the masking pattern. Specialized IR overrules such limitations and shields the pattern from downstream optimization passes, thereby removing any non-determinism. In addition, it facilitates forwarding inputs to the multiplier, the new IR is explicit in its semantics of considering only lower doublewords of quadword lanes for multiplication, hence we can safely save emitting redundant input masking instructions. We already have specialized IR nodes like MulAddVS2VINode and I see these new IR nodes similar to it.
@jatin-bhateja in case when `AndV` is shared, it can't be eliminated unless all users absorb it. For such cases, matcher can perform adhoc node cloning, but in this particular case it looks like an overkill either way. IMO the pattern is too niche to focus on it (either to justify input forwarding or adhoc handling on matcher side).
It's good you mentioned `MulAddVS2VI`. On one hand, VNNI operations are more complex (similar to FMA), so such complexity *may* be justified there. On the other hand, it doesn't look like VNNI support in C2 age well. It is tied to auto-vectorizer and, by now, Vector API doesn't benefit from it. So, instead of doubling down on `MulAddVS2VI` path, I'd prefer to leave it aside and reimplement it later in a more maintainable manner.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2471654154
More information about the hotspot-compiler-dev
mailing list