RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction [v4]

Vladimir Ivanov vlivanov at openjdk.org
Thu Nov 14 03:04:57 UTC 2024


On Wed, 13 Nov 2024 02:43:12 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> This patch optimizes LongVector multiplication by inferring the VPMUL[U]DQ instruction for the following IR patterns.
>>   
>> 
>>        MulVL  (AndV SRC1, 0xFFFFFFFF)   (AndV SRC2, 0xFFFFFFFF)
>>        MulVL  (URShiftVL SRC1, 32)      (URShiftVL SRC2, 32)
>>        MulVL  (URShiftVL SRC1, 32)      (AndV SRC2, 0xFFFFFFFF)
>>        MulVL  (AndV SRC1, 0xFFFFFFFF)   (URShiftVL SRC2, 32)
>>        MulVL  (VectorCastI2X SRC1)      (VectorCastI2X SRC2)
>>        MulVL  (RShiftVL SRC1, 32)       (RShiftVL SRC2, 32)
>> 
>> 
>> 
>> A 64x64-bit multiplication produces a 128-bit result and can be performed by individually multiplying the upper and lower doublewords of the multiplier with the multiplicand and assembling the partial products into the full-width result. Targets supporting vector quadword multiplication have separate instructions to compute the upper and lower quadwords of the 128-bit result. Therefore, the existing VectorAPI multiplication operator expects shape conformance between source and result vectors.
>> 
>> If the upper 32 bits of the quadword multiplier and multiplicand are always zero, then the result of the multiplication depends only on the partial product of their lower doublewords and can be computed by an unsigned 32-bit multiplication instruction producing a quadword result. The patch matches this pattern in a target-dependent manner without introducing a new IR node.
>>  
>> The VPMUL[U]DQ instruction performs [unsigned] multiplication of the even-numbered doubleword lanes of two long vectors and produces a 64-bit result per lane. It has much lower latency than the full 64-bit multiplication instruction VPMULLQ; in addition, non-AVX512DQ targets do not support direct quadword multiplication at all, and we also avoid computing the redundant partial products for the zeroed-out upper 32 bits. This results in throughput improvements on both P-core and E-core Xeons.
>> 
>> Please find below the performance of the [XXH3 hashing benchmark](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html) included with the patch:
>>  
>> 
>> Sierra Forest :-
>> ============
>> Baseline:-
>> Benchmark                                 (SIZE)   Mode  Cnt    Score   Error   Units
>> VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  806.228          ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  403.044          ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2  200.641          ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2  100.664          ops/ms
>> 
>> With Optimizati...
>
> Jatin Bhateja has refreshed the contents of this pull request, and previous commits have been removed. Incremental views are not available. The pull request now contains seven commits:
> 
>  - Removing target specific hooks
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8341137
>  - Review resolutions
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8341137
>  - Handle new I2L pattern, IR tests, Rewiring pattern inputs to MulVL further optimizes JIT code
>  - Review resolutions
>  - 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction
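
For the record, here is my reading of the transformation in scalar form (just a sketch; the class and method names are mine, not from the patch):

```java
public class MulLowDwordsSketch {
    // Scalar model of the optimized pattern:
    //   a * b = (aH*2^32 + aL) * (bH*2^32 + bL)
    //         = aH*bH*2^64 + (aH*bL + aL*bH)*2^32 + aL*bL      (mod 2^64)
    // When the matched AndV/URShiftVL/cast patterns guarantee aH == bH == 0,
    // only the aL*bL term survives, which is exactly what VPMUL[U]DQ computes
    // per 64-bit lane.
    static long mulLowDwords(long a, long b) {
        long aL = a & 0xFFFFFFFFL;   // upper 32 bits are zero by assumption
        long bL = b & 0xFFFFFFFFL;
        return aL * bL;              // at most (2^32 - 1)^2, fits in 64 bits
    }
}
```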

Overall, looks good.

Some minor refactoring suggestions follow.

src/hotspot/share/opto/vectornode.cpp line 2092:

> 2090:          n->in(1)->is_Con() &&
> 2091:          n->in(1)->bottom_type()->isa_long() &&
> 2092:          n->in(1)->bottom_type()->is_long()->get_con() <= 4294967295L;

Is it clearer to use the `0xFFFFFFFFL` representation here?
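That is, something along the lines of `n->in(1)->bottom_type()->is_long()->get_con() <= 0xFFFFFFFFL;`.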

src/hotspot/share/opto/vectornode.cpp line 2114:

> 2112: 
> 2113: static bool has_vector_elements_fit_int(Node* n) {
> 2114:   auto is_cast_integer_to_long_pattern = [](const Node* n) {

I like how you use lambda expressions for node predicates. 
Please, shape `has_vector_elements_fit_uint()` in a similar fashion.

test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 51:

> 49: 
> 50:     public static final int SIZE = 1024;
> 51:     public static final Random r = new Random(1024);

For reproducibility purposes, it's better to use `jdk.test.lib.Utils.getRandomInstance()`. It reports the seed and supports overriding it.
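
A minimal sketch of what that could look like (assuming the test declares `@library /test/lib` in its jtreg header; the rest of the class body is elided):

```java
import java.util.Random;

import jdk.test.lib.Utils;

public class VectorMultiplyOpt {
    public static final int SIZE = 1024;
    // getRandomInstance() logs the seed it uses and allows pinning it via a
    // system property, so a failing run can be reproduced with the same inputs.
    public static final Random r = Utils.getRandomInstance();
}
```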

test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 105:

> 103:             LongVector vsrc2 = LongVector.fromArray(LSP, lsrc2, i);
> 104:             vsrc1.lanewise(VectorOperators.AND, 0xFFFFFFFFL)
> 105:                  .lanewise(VectorOperators.MUL, vsrc2.lanewise(VectorOperators.AND, 0xFFFFFFFFL))

It would be nice to randomize the constants (masks and shifts) to improve test coverage.
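
One way this could look (a hypothetical sketch; the pools, field names, and kernel shape are illustrative, not taken from the patch): draw the mask and shift from a small pool that contains both the values the transform matches and neighbouring values it must leave alone, and use the same values in the scalar reference computation.

```java
import java.util.Random;

import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class RandomizedMaskShiftSketch {
    static final VectorSpecies<Long> LSP = LongVector.SPECIES_PREFERRED;
    static final Random r = new Random();  // or jdk.test.lib.Utils.getRandomInstance()

    // 0xFFFFFFFFL / 32 are the values the VPMUL[U]DQ matching fires on; the
    // neighbours keep the non-transformed path covered as well.
    static final long[] MASK_POOL  = {0xFFFFFFFFL, 0x7FFFFFFFL, 0x1FFFFFFFFL};
    static final int[]  SHIFT_POOL = {31, 32, 33};
    static final long MASK  = MASK_POOL[r.nextInt(MASK_POOL.length)];
    static final int  SHIFT = SHIFT_POOL[r.nextInt(SHIFT_POOL.length)];
    // SHIFT would be used the same way in the shift-based kernels
    // (VectorOperators.LSHR / ASHR).

    // Same shape as the kernel above, with the constant replaced; assumes the
    // array length is a multiple of the species length.
    static void kernel(long[] lsrc1, long[] lsrc2, long[] res) {
        for (int i = 0; i < lsrc1.length; i += LSP.length()) {
            LongVector vsrc1 = LongVector.fromArray(LSP, lsrc1, i);
            LongVector vsrc2 = LongVector.fromArray(LSP, lsrc2, i);
            vsrc1.lanewise(VectorOperators.AND, MASK)
                 .lanewise(VectorOperators.MUL, vsrc2.lanewise(VectorOperators.AND, MASK))
                 .intoArray(res, i);
        }
    }
}
```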

-------------

PR Review: https://git.openjdk.org/jdk/pull/21244#pullrequestreview-2434942773
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1841495626
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1841496794
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1841491826
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1841492208


More information about the core-libs-dev mailing list