RFR: 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction [v2]

Mon Oct 14 12:15:11 UTC 2024

On Fri, 11 Oct 2024 17:12:49 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:

> > I am having a similar idea that is to group those transformations together into a `Phase` called `PhaseLowering`
> 
> I think such a phase could be quite useful in general. Recently I was trying to implement the BMI1 instruction `bextr` for better performance with bit masks, but ran into a problem where it doesn't have an immediate encoding so we'd need to manifest a constant into a temporary register every time. With an (x86-specific) ideal node, we could simply let the register allocator handle placing the constant. It would also be nice to avoid needing to put similar backend-specific lowerings (such as `MacroLogicV`) in shared code.

Hey @jaskarth, @merykitty, we already have infrastructure where macro nodes created during parsing are lowered / expanded into multiple IR nodes during macro expansion. What we need in this case is a target-specific IR pattern check, since not all targets may support a 32x32 multiplication that produces a full quadword result. The idea is to avoid creating a new IR node and instead piggyback on the existing MulVL IR; we already use such tricks for relaxed unsafe reductions. The patch performs a point optimization for a specific set of constrained multiplication patterns.
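To make the constraint concrete, here is a rough scalar sketch in Java of what one 64-bit lane of VPMULUDQ computes, and why a general long multiply can be replaced by it when the upper 32 bits of both inputs are known to be zero. The class and method names are hypothetical and only for illustration; this models the arithmetic, not the actual pattern check in the patch.

```java
public class MulLowHalves {
    // Scalar model of one 64-bit lane of VPMULUDQ: the low 32 bits of each
    // operand are treated as unsigned and multiplied into a 64-bit product.
    static long mulLow32Unsigned(long a, long b) {
        return (a & 0xFFFFFFFFL) * (b & 0xFFFFFFFFL);
    }

    // Hypothetical stand-in for the kind of property an IR pattern check
    // would have to establish: the upper doubleword of the value is zero.
    static boolean upperHalfIsZero(long x) {
        return (x >>> 32) == 0;
    }

    // Lane-wise multiply that uses the narrow form only when it is safe,
    // and falls back to the full 64-bit multiply otherwise.
    static long[] mulLanes(long[] a, long[] b) {
        long[] r = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            if (upperHalfIsZero(a[i]) && upperHalfIsZero(b[i])) {
                r[i] = mulLow32Unsigned(a[i], b[i]);
            } else {
                r[i] = a[i] * b[i];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        long a = 0xFFFFFFFFL;
        long b = 0x12345678L;
        // Both inputs fit in 32 bits, so the two forms agree bit for bit.
        System.out.println((a * b) == mulLow32Unsigned(a, b));
    }
}
```

The largest possible product of two unsigned 32-bit values is 0xFFFFFFFE00000001, which still fits in 64 bits, so no bits are lost in the constrained case. When either input has nonzero upper bits the two forms can differ, which is why the optimization has to be limited to patterns where that property is provable.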

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2411053693