RFR: 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction [v4]
Quan Anh Mai
qamai at openjdk.org
Fri Oct 18 05:41:41 UTC 2024
On Fri, 18 Oct 2024 05:35:28 GMT, Vladimir Ivanov <vlivanov at openjdk.org> wrote:
>> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
>>
>> - Review resolutions
>> - 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction
>
> src/hotspot/share/opto/vectornode.cpp line 2122:
>
>> 2120: // MulL (URShift SRC1 , 32) (URShift SRC2, 32)
>> 2121: // MulL (URShift SRC1 , 32) ( And SRC2, 0xFFFFFFFF)
>> 2122: // MulL ( And SRC1, 0xFFFFFFFF) (URShift SRC2 , 32)
>
> I don't understand how it works... According to the documentation, `VPMULDQ`/`VPMULUDQ` consume vectors of doublewords and produce a vector of quadwords. But it looks like `SRC1`/`SRC2` are always vectors of longs (quadwords). And `vmuludq_reg` in `x86.ad` just takes the immediate operands and passes them into `vpmuludq`, which doesn't look right...
`vpmuludq` performs a long (64-bit) multiplication but ignores the upper 32 bits of each operand, so it effectively computes `(x & max_juint) * (y & max_juint)` in every quadword lane.
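
To make the point concrete, here is a minimal scalar sketch of that behaviour in Java. It is only a model of one quadword lane, not the code from the PR; the class and method names (`VpmuludqSketch`, `mulLow32`, `mulLanes`) are made up for illustration. It shows that when the upper 32 bits of both inputs are already zero (for example after `>>> 32` or `& 0xFFFFFFFFL`, as in the patterns quoted above), the masked multiply and a plain 64-bit multiply give the same result.

```java
// Hypothetical scalar model of VPMULUDQ lane semantics (illustration only).
final class VpmuludqSketch {
    // One quadword lane: only the low 32 bits of each operand take part,
    // treated as unsigned, and the full 64-bit product is returned.
    static long mulLow32(long x, long y) {
        return (x & 0xFFFFFFFFL) * (y & 0xFFFFFFFFL);
    }

    // Lane-wise model over arrays of longs (assumes equal lengths).
    static long[] mulLanes(long[] a, long[] b) {
        long[] r = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mulLow32(a[i], b[i]);
        }
        return r;
    }

    public static void main(String[] args) {
        // The upper halves are discarded: only 3 * 5 is computed.
        System.out.println(mulLow32(0x1234567800000003L, 0xFFFFFFFF00000005L)); // 15

        // Inputs shaped like "MulL (URShift SRC1, 32) (And SRC2, 0xFFFFFFFF)":
        // both have zero upper halves, so the ordinary long multiply agrees.
        long a = 0xDEADBEEF00000007L >>> 32;
        long b = 0x0000000BCAFEBABEL & 0xFFFFFFFFL;
        System.out.println(a * b == mulLow32(a, b)); // true
    }
}
```

For inputs whose upper halves are not known to be zero, the two results differ in general, which is why the transformation is restricted to patterns that clear those bits.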
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1805887594