RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction

Wed Nov 6 17:39:41 UTC 2024

On Tue, 5 Nov 2024 00:07:51 GMT, Vladimir Ivanov <vlivanov at openjdk.org> wrote:

>> Thanks, @jatin-bhateja. I took a look at the latest version and still think that IGVN is not the best place for it. 
>> 
>> First of all, flags on MulVL feel too ad hoc and irregular. The original IR structure is still there (except the cases when inputs are rewired), so can be easily recomputed on demand.
>> 
>> I noticed that the patterns can be generalized: what matters is whether upper half is filled with zeros/sign bits or not, so small enough masks (and large enough shifts) are amenable to the same optimization. But, in such a case, input rewiring becomes applicable only to particular constant inputs.
>> 
>> (BTW signed right shifts can be optimized in a similar way, since they populate upper half with the sign-bit.)
>> 
>> So, IMO the best way to move this particular enhancement forward is:
>> * perform the transformation during matching;
>> * match a single MulVL node and shape the checks on argument shape as predicates on AD instructions 
>>    * setting lower instruction costs should tell the matcher to prefer new specific instructions over generic ones;
>> * avoid input rewiring for now (VPMULDQ/VPMULUDQ give enough performance improvement on their own).
>
>> So, IMO the best way to move this particular enhancement forward is: ...
> 
> @jatin-bhateja here's a sketch (not tested): https://github.com/openjdk/jdk/compare/master...iwanowww:jdk:pr/21244

Hi @iwanowww ,

Thanks for the refactoring! Your suggestions are included.

Points in favor of the current approach:
- The patch strength-reduces a 15-cycle full quadword multiply to a 5-cycle doubleword multiply with quadword saturation.
- The IR remains target independent. We do not directly forward the pattern inputs to the multiplier; such rewiring is only possible when the upper doubleword of the inputs is masked out. For other cases, such as logically right-shifting the inputs by 32 or upcasting int lanes to long lanes, we still need to emit the input preparation/formatting instruction sequence.
- The patch shows performance improvements on both E-core and P-core Xeons.
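
For readers following along, here is a rough scalar sketch of the three input shapes mentioned above (masked upper doubleword, logical right shift by 32, int-to-long upcast). The class and method names are hypothetical and are not taken from the patch or its microbenchmarks. The two reference helpers model what VPMULUDQ and VPMULDQ compute per 64-bit lane: a 32x32->64 multiply of the low doublewords, unsigned and signed respectively. Whether C2 actually selects those instructions for loops like these depends on the patch and the target CPU.

```java
// Hypothetical illustration only; names are not from the PR.
final class LongMulPatterns {
    // Shape 1: upper doubleword of both inputs masked out (zero-extended).
    static void mulMasked(long[] a, long[] b, long[] r) {
        for (int i = 0; i < r.length; i++) {
            r[i] = (a[i] & 0xFFFFFFFFL) * (b[i] & 0xFFFFFFFFL);
        }
    }

    // Shape 2: inputs logically right-shifted by 32, so the upper half is zero.
    static void mulShifted(long[] a, long[] b, long[] r) {
        for (int i = 0; i < r.length; i++) {
            r[i] = (a[i] >>> 32) * (b[i] >>> 32);
        }
    }

    // Shape 3: int lanes upcast to long, so the upper half holds sign bits.
    static void mulUpcast(int[] a, int[] b, long[] r) {
        for (int i = 0; i < r.length; i++) {
            r[i] = (long) a[i] * (long) b[i];
        }
    }

    // Per-lane model of VPMULUDQ: unsigned multiply of the low doublewords.
    static long mulLowUnsigned(long x, long y) {
        return Integer.toUnsignedLong((int) x) * Integer.toUnsignedLong((int) y);
    }

    // Per-lane model of VPMULDQ: signed multiply of the low doublewords.
    static long mulLowSigned(long x, long y) {
        return (long) (int) x * (long) (int) y;
    }
}
```

In shapes 1 and 2 the full 64-bit product equals the unsigned doubleword product, and in shape 3 it equals the signed one, which is why the narrower multiplier is sufficient. Shapes 2 and 3 still need the shift or the sign extension to be emitted before the multiply.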

Following are the performance numbers for the included microbenchmarks.

![image](https://github.com/user-attachments/assets/6a19181a-7f55-4cd8-9dfb-23dd4c786428)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2459806910