RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction [v2]
Vladimir Ivanov
vlivanov at openjdk.org
Fri Nov 8 20:28:07 UTC 2024
On Fri, 8 Nov 2024 08:15:32 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> This patch optimizes LongVector multiplication by inferring the VPMUL[U]DQ instruction for the following IR patterns.
>>
>>
>> MulVL ( AndV SRC1, 0xFFFFFFFF) ( AndV SRC2, 0xFFFFFFFF)
>> MulVL (URShiftVL SRC1 , 32) (URShiftVL SRC2, 32)
>> MulVL (URShiftVL SRC1 , 32) ( AndV SRC2, 0xFFFFFFFF)
>> MulVL ( AndV SRC1, 0xFFFFFFFF) (URShiftVL SRC2 , 32)
>> MulVL (VectorCastI2X SRC1) (VectorCastI2X SRC2)
>> MulVL (RShiftVL SRC1 , 32) (RShiftVL SRC2, 32)
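>>
>> For illustration, a minimal Java Vector API sketch (class, method, and array names are illustrative) whose multiply should lower to the first pattern above, since both inputs have their upper 32 bits cleared before the MulVL:
>>
>> import jdk.incubator.vector.LongVector;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> class MulLow32Sketch {
>>     static final VectorSpecies<Long> S = LongVector.SPECIES_PREFERRED;
>>
>>     // Requires --add-modules jdk.incubator.vector.
>>     static void mulLow32(long[] a, long[] b, long[] r) {
>>         for (int i = 0; i < S.loopBound(a.length); i += S.length()) {
>>             LongVector va = LongVector.fromArray(S, a, i);
>>             LongVector vb = LongVector.fromArray(S, b, i);
>>             // Upper 32 bits of both inputs are zero, so the 64x64 multiply reduces
>>             // to an unsigned 32x32->64 multiply per lane (VPMULUDQ on x86).
>>             va.and(0xFFFFFFFFL).mul(vb.and(0xFFFFFFFFL)).intoArray(r, i);
>>         }
>>     }
>> }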
>>
>>
>>
>> A 64x64-bit multiplication produces a 128-bit result. It can be performed by individually multiplying the upper and lower doublewords of the multiplier with the multiplicand and assembling the partial products into the full-width result. Targets supporting vector quadword multiplication provide separate instructions to compute the upper and lower quadwords of the 128-bit result; therefore the existing VectorAPI multiplication operator expects shape conformance between source and result vectors.
>>
>> If the upper 32 bits of the quadword multiplier and multiplicand are always zero, then the result of the multiplication depends only on the partial product of their lower doublewords and can be computed using an unsigned 32-bit multiplication instruction that produces a quadword result. The patch matches this pattern in a target-dependent manner without introducing a new IR node.
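>>
>> As a plain-Java scalar sketch of that reasoning (values are arbitrary): writing each operand as hi * 2^32 + lo, the 2^64-truncated product is lo1*lo2 plus the shifted cross terms, and when both hi parts are zero only lo1*lo2 survives:
>>
>> long a = 0x0000_0000_9ABC_DEF0L, b = 0x0000_0000_8765_4321L;  // upper 32 bits already zero
>> long aLo = a & 0xFFFFFFFFL, aHi = a >>> 32;
>> long bLo = b & 0xFFFFFFFFL, bHi = b >>> 32;
>> // General decomposition of the truncated 64-bit product:
>> long full = aLo * bLo + ((aHi * bLo + aLo * bHi) << 32);  // == a * b (mod 2^64)
>> // Here aHi == 0 and bHi == 0, so the cross terms vanish:
>> long low  = aLo * bLo;                                    // == a * b as well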
>>
>> The VPMUL[U]DQ instruction performs [unsigned] multiplication between the even-numbered doubleword lanes of two long vectors and produces a 64-bit result per lane. It has much lower latency than the full 64-bit multiplication instruction VPMULLQ; in addition, non-AVX512DQ targets do not support direct quadword multiplication at all. Skipping the redundant partial products for the zeroed-out upper 32 bits results in throughput improvements on both P-core and E-core Xeons.
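>>
>> The signed variant covers the VectorCastI2X and RShiftVL-by-32 patterns: a 64-bit lane holding a sign-extended int is fully determined by its low doubleword, so the quadword product equals the signed 32x32->64 product that VPMULDQ computes. A scalar sketch in plain Java (arbitrary values):
>>
>> int x = -123_456, y = 987_654;
>> long sx = (long) x;   // sign-extended, as produced by VectorCastI2X or an arithmetic shift right by 32
>> long sy = (long) y;
>> long prod = sx * sy;  // signed 32x32->64 multiply; no full 64-bit multiplier (VPMULLQ) needed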
>>
>> Please find below the performance of the [XXH3 hashing benchmark](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html) included with the patch:
>>
>>
>> Sierra Forest
>> =============
>> Baseline:
>> Benchmark                                 (SIZE)   Mode  Cnt    Score   Error   Units
>> VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  806.228          ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  403.044          ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2  200.641          ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2  100.664          ops/ms
>>
>> With Optimizati...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> Creating specialized IR to shield pattern from subsequent transforms in optimization pipeline
In the latest version you added new Ideal nodes (`MulIVL` and `MulUIVL`). I don't see a compelling reason to do so. IMO matcher functionality is more than enough to cover the `VPMULDQ` case. `MulIVL` is equivalent to `MulVL` plus a `has_int_inputs()` predicate. For `MulUIVL` you additionally do input rewiring (using `forward_masked_input`), but (1) `(AndV src (Replicate 0xFFFFFFFF))` operands can easily be detected on the matcher side (with an extra AD instruction); and (2) such an optimization is limited because it is valid only for the `0xFFFFFFFF` case, while `has_uint_inputs() == true` for any `C <= 0xFFFFFFFF`.
So, IMO, the `MulIVL` and `MulUIVL` nodes just add noise to the Ideal graph without improving the situation during matching.
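A scalar sketch of that limitation (plain Java, arbitrary values): for a replicated mask narrower than `0xFFFFFFFF`, the masked product and the low-32x32 product a rewired `VPMULUDQ` would compute diverge, so the `AndV` cannot be bypassed even though `has_uint_inputs()` holds:

    long a = 0x1234_5678_9ABC_DEF0L;
    long b = 0x0FED_CBA9_8765_4321L;
    // Product the program asks for: both inputs masked to 16 bits.
    long masked  = (a & 0xFFFFL) * (b & 0xFFFFL);
    // Product computed if the AndV inputs are forwarded: the full low 32 bits of each lane.
    long rewired = (a & 0xFFFFFFFFL) * (b & 0xFFFFFFFFL);
    // masked != rewired here; forwarding is only result-preserving for the 0xFFFFFFFF mask.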
src/hotspot/share/opto/vectornode.cpp line 2132:
> 2130: // Directly forward masked inputs if
> 2131: if (n->Opcode() == Op_AndV) {
> 2132: return n->in(1)->Opcode() == Op_Replicate ? n->in(2) : n->in(1);
This particular check should ensure that the Replicate constant is `0xFFFFFFFF`.
-------------
PR Review: https://git.openjdk.org/jdk/pull/21244#pullrequestreview-2424864897
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1835023354