RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction

Wed Nov 6 17:39:34 UTC 2024

On Thu, 17 Oct 2024 19:40:52 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> MulVL (VectorCastI2X src1) (VectorCastI2X src2)

> It looks unsafe to me, since VectorCastI2L sign-extends integer lanes, ...

Hm, I don't see any problems with it if `VPMULDQ` is used. Sign extension becomes redundant when 64-bit multiplication is strength-reduced to 32-bit one (32x32->64). Am I missing something important here?

>> IMO until C2 type system starts to track bitwise constant information ([JDK-8001436](https://bugs.openjdk.org/browse/JDK-8001436) et al), there are not enough benefits to rely on IGVN here. So far, all the discussed patterns are simple enough for matcher to handle them without too much tweaking.
>
> Hi @iwanowww , 
> I have implemented additional pattern you suggested.
> In addition re-wiring pattern inputs to MulVL IR to avoid emitting upper doubleword clearing logic in applicable scenarios. 
> 
> Hi @jaskarth , @merykitty ,
> As discussed, waiting on PhaseLowering skeleton to move some part of this patch to x86 specific lowering pass.

Thanks, @jatin-bhateja. I took a look at the latest version and still think that IGVN is not the best place for it. 

First of all, flags on MulVL feel too adhoc and irregular. The original IR structure is still there (except the cases when inputs are rewired), so can be easily recomputed on demand.

I noticed that the patterns can be generalized: what matters is whether upper half is filled with zeros/sign bits or not, so small enough masks (and large enough shifts) are amenable to the same optimization. But, in such case, input rewiring becomes applicable only to particular constant inputs.

(BTW signed right shifts can be optimized in a similar way, since they populate upper half with the sign-bit.)

So, IMO the best way to move this particular enhancement forward is:
* perform the transformation during matching;
* match a single MulVL node and shape the checks on argument shape as predicates on AD instructions 
   * setting lower instruction costs should tell the matcher to prefer new specific instructions over generic ones;
* avoid input rewiring for now (VPMULDQ/VPMULUDQ give enough performance improvement on its own).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2420668490
PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2436528498