RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction

Wed Nov 6 17:39:34 UTC 2024

On Tue, 15 Oct 2024 17:26:49 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>> I'm pretty ambivalent, I think implementing it either way would be alright. Especially with unit tests, I think the lowering implementation wouldn't be that difficult. Maybe another reviewer has an opinion?
>> 
>> About PhaseLowering though, I've found some more interesting things we could do with it, especially with improving vectorization support in the backend. @merykitty have you already started to work on it? I was thinking about prototyping it soon. Just wanted to make sure we're not doing the same work twice :)
>
> @jaskarth Please proceed with it, I have a really simple prototype for it but I don't have any plan to proceed further soon. Thanks a lot :)

@merykitty The approach @jatin-bhateja proposes looks well-justified to me. Matching is essentially a lowering step that transforms platform-independent Ideal IR into platform-specific Mach IR, and collapsing non-trivial IR trees into platform-specific instructions is a well-established pattern in the code.

Indeed, matching imposes some constraints, so it may not be flexible enough to cover all use cases. In particular, for `VPTERNLOGD`/`VPTERNLOGQ` it was decided that handling them specially was worth the effort (see `Compile::optimize_logic_cones()`). As implemented now, that logic is part of the shared code, but if a platform-specific custom lowering phase becomes available one day, it can of course be moved there.
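
To illustrate why `VPTERNLOGD`/`VPTERNLOGQ` do not fit simple tree matching, here is a minimal Java sketch of how a three-input bitwise expression maps to the instruction's 8-bit truth-table immediate. It models the instruction semantics only and is not the HotSpot implementation; `TernlogImm`, `Op3`, `imm8`, and `ternlog` are hypothetical names chosen for the example.

```java
public final class TernlogImm {
    /** A three-input bitwise function (hypothetical helper type). */
    interface Op3 {
        int apply(int a, int b, int c);
    }

    /**
     * Derives the truth-table immediate by evaluating the function on the
     * three canonical bit patterns. Bit i of the result is f(a, b, c) for
     * the input combination i = (a << 2) | (b << 1) | c.
     */
    static int imm8(Op3 f) {
        return f.apply(0xF0, 0xCC, 0xAA) & 0xFF;
    }

    /** Scalar model of one 32-bit lane: look up every bit in the truth table. */
    static int ternlog(int imm8, int a, int b, int c) {
        int result = 0;
        for (int bit = 0; bit < 32; bit++) {
            int idx = (((a >>> bit) & 1) << 2)
                    | (((b >>> bit) & 1) << 1)
                    | ((c >>> bit) & 1);
            result |= ((imm8 >>> idx) & 1) << bit;
        }
        return result;
    }

    public static void main(String[] args) {
        int imm = imm8((a, b, c) -> a ^ b ^ c);
        System.out.printf("imm8 for a ^ b ^ c = 0x%02X%n", imm);
        int a = 0x12345678, b = 0x0F0F0F0F, c = 0xDEADBEEF;
        System.out.println(ternlog(imm, a, b, c) == (a ^ b ^ c));
    }
}
```

Any cone of logic operations with at most three distinct inputs collapses to a single immediate this way, whatever the shape of the expression tree, which is hard to express as a fixed set of match rules.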

But speaking of `VPMULDQ`/`VPMULUDQ`, what kind of benefits do you see from custom logic to support them?
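
For reference, a minimal scalar sketch in Java of what the two instructions compute per 64-bit lane, and of the precondition under which a full 64-bit multiply can be replaced by them. It assumes the usual semantics (`VPMULUDQ` multiplies the zero-extended low 32 bits of each lane, `VPMULDQ` the sign-extended low 32 bits); the class and method names are hypothetical.

```java
public final class MulDqSketch {
    /** Model of one VPMULUDQ lane: unsigned 32x32 -> 64-bit multiply of the low halves. */
    static long vpmuludqLane(long a, long b) {
        return (a & 0xFFFFFFFFL) * (b & 0xFFFFFFFFL);
    }

    /** Model of one VPMULDQ lane: signed 32x32 -> 64-bit multiply of the low halves. */
    static long vpmuldqLane(long a, long b) {
        return (long) (int) a * (long) (int) b;
    }

    public static void main(String[] args) {
        // Upper 32 bits of both operands are zero: matches the full multiply.
        long a = 0xFFFFFFFFL, b = 0x12345678L;
        System.out.println(a * b == vpmuludqLane(a, b));

        // Both operands are sign-extended ints: matches the full multiply.
        long c = -5L, d = 7L;
        System.out.println(c * d == vpmuldqLane(c, d));

        // Upper bits carry information: the replacement would be wrong.
        long e = 1L << 32, f = 1L;
        System.out.println(e * f == vpmuludqLane(e, f));
    }
}
```

The replacement is therefore only valid when the compiler can prove that the upper halves of both inputs are zero or sign-extension bits.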

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2420732705