RFR: 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction [v2]

Mon Oct 14 21:53:15 UTC 2024

On Wed, 9 Oct 2024 09:59:11 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> This patch optimizes LongVector multiplication by inferring VPMULUDQ instruction for following IR pallets.
>>   
>> 
>>        MulL   ( And  SRC1,  0xFFFFFFFF)   ( And  SRC2,  0xFFFFFFFF) 
>>        MulL   (URShift SRC1 , 32) (URShift SRC2, 32)
>>        MulL   (URShift SRC1 , 32)  ( And  SRC2,  0xFFFFFFFF)
>>        MulL   ( And  SRC1,  0xFFFFFFFF) (URShift SRC2 , 32)
>> 
>> 
>> 
>>  A  64x64 bit multiplication produces 128 bit result, and can be performed by individually multiplying upper and lower double word of multiplier with multiplicand and assembling the partial products to compute full width result. Targets supporting vector quadword multiplication have separate instructions to compute upper and lower quadwords for 128 bit result. Therefore existing VectorAPI multiplication operator expects shape conformance between source and result vectors.
>> 
>> If upper 32 bits of quadword multiplier and multiplicand is always set to zero then result of multiplication is only dependent on the partial product of their lower double words and can be performed using unsigned 32 bit multiplication instruction with quadword saturation. Patch matches this pattern in a target dependent manner without introducing new IR node.
>>  
>> VPMULUDQ instruction performs unsigned multiplication between even numbered doubleword lanes of two long vectors and produces 64 bit result.  It has much lower latency compared to full 64 bit multiplication instruction "VPMULLQ", in addition non-AVX512DQ targets does not support direct quadword multiplication, thus we can save redundant partial product for zeroed out upper 32 bits. This results into throughput improvements on both P and E core Xeons.
>> 
>> Please find below the performance of [XXH3 hashing benchmark ](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html)included with the patch:-
>>  
>> 
>> Sierra Forest :-
>> ============
>> Baseline:-
>> Benchmark                                 (SIZE)   Mode  Cnt    Score   Error   Units
>> VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  806.228          ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  403.044          ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2  200.641          ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2  100.664          ops/ms
>> 
>> With Optimization:-
>> Benchmark                                 (SIZE)   Mode  Cnt     Score   Error   Units
>> VectorXXH3HashingBenchmark.hashingKernel   ...
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
> 
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8341137
>  - 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction

> Hi @jaskarth , Bigger pattern matching is sensitive to [IR level node sharing](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L1724), thus it may not full proof for above 4 patterns. Current patch takes care of this limitation.

I think this is a good point. I've taken a look at the patch and added some comments below.

src/hotspot/cpu/x86/matcher_x86.hpp line 184:

> 182:   // Does the CPU supports doubleword multiplication with quadword saturation.
> 183:   static constexpr bool supports_double_word_mult_with_quadword_staturation(void) {
> 184:     return true;

Should this be `UseAVX > 0`? I'm wondering since we have a `MulVL` rule that applies when `UseAVX == 0`.

src/hotspot/share/opto/vectornode.cpp line 2089:

> 2087:   if (Matcher::supports_double_word_mult_with_quadword_staturation() &&
> 2088:       !is_mult_lower_double_word()) {
> 2089:     auto is_clear_upper_double_word_uright_shift_op = [](const Node *n) {

Suggestion:

    auto is_clear_upper_double_word_uright_shift_op = [](const Node* n) {

src/hotspot/share/opto/vectornode.cpp line 2093:

> 2091:              n->in(2)->Opcode() == Op_RShiftCntV && n->in(2)->in(1)->is_Con() &&
> 2092:              n->in(2)->in(1)->bottom_type()->isa_int() &&
> 2093:              n->in(2)->in(1)->bottom_type()->is_int()->get_con() == 32L;

Suggestion:

             n->in(2)->in(1)->bottom_type()->is_int()->get_con() == 32;

Since you are comparing with a `TypeInt` I think this shouldn't be `32L`.

src/hotspot/share/opto/vectornode.cpp line 2098:

> 2096:     auto is_lower_double_word_and_mask_op = [](const Node *n) {
> 2097:       if (n->Opcode() == Op_AndV) {
> 2098:         Node *replicate_operand = n->in(1)->Opcode() == Op_Replicate ? n->in(1)

Suggestion:

        Node* replicate_operand = n->in(1)->Opcode() == Op_Replicate ? n->in(1)

src/hotspot/share/opto/vectornode.cpp line 2124:

> 2122:     // MulL ( And  SRC1,  0xFFFFFFFF) (URShift SRC2 , 32)
> 2123:     if ((is_lower_double_word_and_mask_op(in(1)) ||
> 2124:          is_lower_double_word_and_mask_op(in(1)) ||

`is_lower_double_word_and_mask_op(in(1)) || is_lower_double_word_and_mask_op(in(1))` is redundant, right? Shouldn't you only need it once? Same for the other 3 calls, which are similarly repeated.

test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 41:

> 39:  */
> 40: 
> 41: public class VectorMultiplyOpt {

Could it be possible to also do IR verification in this test? It would be good to check that we don't generate `AndVL` or `URShiftVL` with this transform.

test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 43:

> 41: public class VectorMultiplyOpt {
> 42: 
> 43:     public static long [] src1;

Suggestion:

    public static long[] src1;

And for the rest of the `long []` in this file too.

test/micro/org/openjdk/bench/jdk/incubator/vector/VectorXXH3HashingBenchmark.java line 39:

> 37:     @Param({"1024", "2048", "4096", "8192"})
> 38:     private int  SIZE;
> 39:     private long [] accumulators;

Suggestion:

    private long[] accumulators;

-------------

PR Review: https://git.openjdk.org/jdk/pull/21244#pullrequestreview-2367683334
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800159123
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800153755
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800153568
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800153842
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800151177
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800167403
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800165261
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800169840