RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction

Wed Nov 6 17:39:30 UTC 2024

On Mon, 14 Oct 2024 12:12:58 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>>> I am having a similar idea that is to group those transformations together into a `Phase` called `PhaseLowering`
>> 
>> I think such a phase could be quite useful in general. Recently I was trying to implement the BMI1 instruction `bextr` for better performance with bit masks, but ran into a problem where it doesn't have an immediate encoding so we'd need to manifest a constant into a temporary register every time. With an (x86-specific) ideal node, we could simply let the register allocator handle placing the constant. It would also be nice to avoid needing to put similar backend-specific lowerings (such as `MacroLogicV`) in shared code.
> 
> Hey @jaskarth, @merykitty, we already have infrastructure where macro nodes created during parsing are lowered / expanded into multiple IR nodes during macro expansion. What we need in this case is a target-specific IR pattern check, since not all targets may support 32x32 multiplication with quadword saturation. The idea is to avoid creating a new IR node and to piggyback the needed information on the existing MulVL IR; we already use such tricks for relaxed unsafe reductions. Going forward, infusing KnownBits into our data flow analysis infrastructure will streamline such optimizations; this patch performs a point optimization for a specific set of constrained multiplication patterns.

@jatin-bhateja That is machine-independent lowering; we are talking about machine-dependent lowering, to which the `MacroLogicV` transformation belongs. You can have `phaselowering_x86` and avoid both adding another method to `Matcher` and adding default implementations to the various architecture files. You can reuse the `MulVL` node for that, but I believe these transformations should be done as late as possible.
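
For readers following along, here is a rough scalar sketch in Java of the kind of constrained multiplication pattern under discussion. It assumes the pattern is a 64-bit multiply whose operands are known to fit in their low 32 bits (zero-extended or sign-extended), so that a 32x32->64 multiply per lane, which is what `VPMULUDQ` / `VPMULDQ` compute, gives the same result as the full multiply. The class and method names are hypothetical and for illustration only; this is not the code in the patch, nor how C2 represents the check.

```java
// Hypothetical illustration only: scalar emulation of one 64-bit lane.
final class MulPatternSketch {
    // Emulates one lane of VPMULUDQ: the low 32 bits of each operand are
    // treated as unsigned and multiplied into a full 64-bit product.
    static long lowUnsignedMul(long a, long b) {
        return (a & 0xFFFFFFFFL) * (b & 0xFFFFFFFFL);
    }

    // Emulates one lane of VPMULDQ: the low 32 bits of each operand are
    // treated as signed and multiplied into a full 64-bit product.
    static long lowSignedMul(long a, long b) {
        return (long) (int) a * (long) (int) b;
    }

    // True when the upper 32 bits are zero (value is a zero-extended 32-bit quantity).
    static boolean upperBitsZero(long v) {
        return (v >>> 32) == 0;
    }

    // True when the value is a sign-extended 32-bit quantity.
    static boolean isSignExtendedInt(long v) {
        return v == (long) (int) v;
    }

    // Multiplies lane by lane, taking the narrow path only when the
    // precondition holds for both operands, else the full 64-bit multiply.
    static long[] mulLanes(long[] a, long[] b) {
        long[] r = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            if (upperBitsZero(a[i]) && upperBitsZero(b[i])) {
                r[i] = lowUnsignedMul(a[i], b[i]);
            } else if (isSignExtendedInt(a[i]) && isSignExtendedInt(b[i])) {
                r[i] = lowSignedMul(a[i], b[i]);
            } else {
                r[i] = a[i] * b[i];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        long[] a = {0xFFFFFFFFL, -3L, (1L << 32) | 2L};
        long[] b = {0xFFFFFFFFL, 7L, 3L};
        long[] r = mulLanes(a, b);
        for (int i = 0; i < r.length; i++) {
            System.out.println(r[i] == a[i] * b[i]);
        }
    }
}
```

In the sketch the precondition is checked per value at run time; in a compiler it would have to be proven from the IR shape (for example a mask or a cast feeding the multiply), which is where the question of a target-specific check versus a separate lowering phase comes in.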

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2411389030