RFR: 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction [v2]
Jasmine Karthikeyan
jkarthikeyan at openjdk.org
Tue Oct 15 17:03:15 UTC 2024
On Wed, 9 Oct 2024 09:59:11 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> This patch optimizes LongVector multiplication by inferring the VPMULUDQ instruction for the following IR patterns.
>>
>>
>> MulL ( And SRC1, 0xFFFFFFFF) ( And SRC2, 0xFFFFFFFF)
>> MulL (URShift SRC1 , 32) (URShift SRC2, 32)
>> MulL (URShift SRC1 , 32) ( And SRC2, 0xFFFFFFFF)
>> MulL ( And SRC1, 0xFFFFFFFF) (URShift SRC2 , 32)
>>
>>
>>
>> A 64x64-bit multiplication produces a 128-bit result, and can be performed by individually multiplying the upper and lower doublewords of the multiplier with the multiplicand and assembling the partial products to compute the full-width result. Targets supporting vector quadword multiplication have separate instructions to compute the upper and lower quadwords of the 128-bit result. Therefore, the existing VectorAPI multiplication operator expects shape conformance between source and result vectors.
>>
>> If the upper 32 bits of the quadword multiplier and multiplicand are always zero, then the result of the multiplication depends only on the partial product of their lower doublewords and can be computed using an unsigned 32-bit multiplication instruction that produces a full quadword result. The patch matches this pattern in a target-dependent manner without introducing a new IR node.
>>
>> The VPMULUDQ instruction performs unsigned multiplication between the even-numbered doubleword lanes of two long vectors and produces 64-bit results. It has much lower latency than the full 64-bit multiplication instruction "VPMULLQ". In addition, non-AVX512DQ targets do not support direct quadword multiplication, so we can skip the redundant partial products for the zeroed-out upper 32 bits. This results in throughput improvements on both P-core and E-core Xeons.
>>
>> Please find below the performance of the [XXH3 hashing benchmark](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html) included with the patch:
>>
>>
>> Sierra Forest :-
>> ============
>> Baseline:-
>> Benchmark                                 (SIZE)   Mode  Cnt    Score   Error   Units
>> VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  806.228          ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  403.044          ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2  200.641          ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2  100.664          ops/ms
>>
>> With Optimization:-
>> Benchmark                                 (SIZE)   Mode  Cnt    Score   Error   Units
>> VectorXXH3HashingBenchmark.hashingKernel ...
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
>
> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8341137
> - 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction
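
For readers following along, the arithmetic behind the four quoted IR patterns can be sketched in scalar Java. This is only an illustration under my own assumptions, not code from the patch: the class and method names (`LowDwordMultiply`, `mulLow32Unsigned`, and the four pattern helpers) are hypothetical. It models what one 64-bit lane of VPMULUDQ computes, and shows that each pattern reduces to that operation once any shift has moved the relevant 32 bits into the low half.

```java
// Hypothetical scalar model of a single 64-bit lane; not taken from the patch.
final class LowDwordMultiply {
    static final long MASK = 0xFFFFFFFFL;

    // Models one VPMULUDQ lane: zero-extend the low 32 bits of each
    // operand and multiply. (2^32 - 1)^2 < 2^64, so the product always
    // fits in 64 bits without overflow or saturation.
    static long mulLow32Unsigned(long a, long b) {
        return (a & MASK) * (b & MASK);
    }

    // The four shapes from the PR description, written as plain long math.
    static long andAnd(long a, long b)     { return (a & MASK) * (b & MASK); }
    static long shiftShift(long a, long b) { return (a >>> 32) * (b >>> 32); }
    static long shiftAnd(long a, long b)   { return (a >>> 32) * (b & MASK); }
    static long andShift(long a, long b)   { return (a & MASK) * (b >>> 32); }

    public static void main(String[] args) {
        long a = 0x123456789ABCDEF0L;
        long b = 0xFEDCBA9876543210L;

        // Both operands of every pattern have zero upper halves, so the
        // full 64-bit multiply equals the low-doubleword multiply.
        boolean ok =
            andAnd(a, b)     == mulLow32Unsigned(a, b)
         && shiftShift(a, b) == mulLow32Unsigned(a >>> 32, b >>> 32)
         && shiftAnd(a, b)   == mulLow32Unsigned(a >>> 32, b)
         && andShift(a, b)   == mulLow32Unsigned(a, b >>> 32);
        System.out.println("all patterns match: " + ok);
    }
}
```

In the shifted patterns the unsigned right shift still has to happen before the multiply; only the full-width multiply itself is replaced by the cheaper low-doubleword one.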
I'm pretty ambivalent; I think implementing it either way would be alright. Especially with unit tests, I think the lowering implementation wouldn't be that difficult. Maybe another reviewer has an opinion?
About PhaseLowering though, I've found some more interesting things we could do with it, especially with improving vectorization support in the backend. @merykitty have you already started to work on it? I was thinking about prototyping it soon. Just wanted to make sure we're not doing the same work twice :)
-------------
PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2414553899
More information about the core-libs-dev
mailing list