RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction [v2]

Wed Nov 13 02:43:12 UTC 2024

On Tue, 12 Nov 2024 21:49:22 GMT, Vladimir Ivanov <vlivanov at openjdk.org> wrote:

>>> In the latest version you added new Ideal nodes (`MulIVL` and `MulUIVL`). I don't see a compelling reason to do so. IMO matcher functionality is more than enough to cover `VPMULDQ` case. `MulIVL` is equivalent to `MulVL` + `has_int_inputs()` predicate. For `MulUIVL` you additionally do input rewiring (using `forward_masked_input`), but (1) `AndV src (Replicate 0xFFFFFFFF))` operands can be easily detected on matcher side (with an extra AD instruction); and (2) such optimization is limited because it is valid only for `0xFFFFFFFF` case while `has_uint_inputs() == true` for `C <= 0xFFFFFFFF`.
>>> 
>>> So, IMO `MulIVL` and `MulUIVL` nodes just add noise in Ideal graph without improving situation during matching.
>> 
>> Hi @iwanowww ,
>> Problem occurs only if AndV gets shared; in such a case, matcher will not be able to identify the constrained multiplication pattern and absorb the masking pattern. Specialized IR overrules such limitations and shields the pattern from downstream optimization passes, thereby removing any non-determinism. In addition, it facilitates forwarding inputs to the multiplier, the new IR is explicit in its semantics of considering only lower doublewords of quadword lanes for multiplication, hence we can safely save emitting redundant input masking instructions. We already have specialized IR nodes like MulAddVS2VINode and I see these new IR nodes similar to it.
>
> @jatin-bhateja in case when `AndV` is shared, it can't be eliminated unless all users absorb it. For such cases, matcher can perform adhoc node cloning, but in this particular case it looks like an overkill either way. IMO the pattern is too niche to focus on it (either to justify input forwarding or adhoc handling on matcher side). 
> 
> It's good you mentioned `MulAddVS2VI`. On one hand, VNNI operations are more complex (similar to FMA), so such complexity *may* be justified there. On the other hand, it doesn't look like VNNI support in C2 age well. It is tied to auto-vectorizer and, by now, Vector API doesn't benefit from it. So, instead of doubling down on `MulAddVS2VI` path, I'd prefer to leave it aside and reimplement it later in a more maintainable manner.

Thanks @iwanowww , Patch in its current form addresses several common patterns, and focusing on optimizing one niche case looks like an overkill,  given that we are strength reducing 15 cycle multiplier with a lighter 5-cycle version is itself sufficient to offset redundant input masking instructions.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2472244154