RFR: 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction [v4]

Fri Oct 18 05:41:41 UTC 2024

On Fri, 18 Oct 2024 05:35:28 GMT, Vladimir Ivanov <vlivanov at openjdk.org> wrote:

>> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
>> 
>>  - Review resolutions
>>  - 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction
>
> src/hotspot/share/opto/vectornode.cpp line 2122:
> 
>> 2120:     // MulL (URShift SRC1 , 32) (URShift SRC2, 32)
>> 2121:     // MulL (URShift SRC1 , 32)  ( And  SRC2,  0xFFFFFFFF)
>> 2122:     // MulL ( And  SRC1,  0xFFFFFFFF) (URShift SRC2 , 32)
> 
> I don't understand how it works... According to the documentation, `VPMULDQ`/`VPMULUDQ` consume vectors of doublewords and produce a vector of quadwords. But it looks like `SRC1`/`SRC2` are always vectors of longs (quadwords). And `vmuludq_reg` in `x86.ad` just takes the immediate operands and passes them into `vpmuludq`, which doesn't look right...

`vpmuludq` does a long multiplication in each quadword lane but ignores the upper 32 bits of each operand, so it effectively computes `(x & max_juint) * (y & max_juint)`.
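
To make that concrete, here is a minimal scalar sketch in Java of what one 64-bit lane computes. The class and method names (`VpmuludqModel`, `mulLow32`) are made up for illustration and are not part of the patch. It only models the arithmetic, not the C2 code. It also shows why the patterns quoted above are safe: once both inputs have zero upper halves, the full `MulL` and the low-32-bit multiply agree.

```java
public class VpmuludqModel {
    // Same value as max_juint: the low 32 bits of a long.
    static final long LOW32_MASK = 0xFFFFFFFFL;

    // Hypothetical scalar model of a single quadword lane of vpmuludq:
    // the upper 32 bits of each operand are ignored, and the two low
    // doublewords are multiplied as unsigned values into a 64-bit result.
    static long mulLow32(long x, long y) {
        return (x & LOW32_MASK) * (y & LOW32_MASK);
    }

    public static void main(String[] args) {
        long src1 = 0x1234_5678_0000_0003L;
        long src2 = 0xFFFF_FFFF_0000_0005L;

        // Upper halves are discarded: only 3 * 5 is computed.
        System.out.println(mulLow32(src1, src2));

        // For the matched shapes, e.g. MulL (URShift SRC1, 32) (And SRC2, 0xFFFFFFFF),
        // both inputs already have zero upper bits, so the plain 64-bit
        // multiply and the low-32-bit multiply give the same result.
        long a = src1 >>> 32;
        long b = src2 & LOW32_MASK;
        System.out.println(a * b == mulLow32(a, b));
    }
}
```

The product of two unsigned 32-bit values always fits in 64 bits, so no bits are lost in the lane result.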

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1805887594