RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction

Jasmine Karthikeyan jkarthikeyan at openjdk.org
Wed Nov 6 17:39:27 UTC 2024


On Sun, 29 Sep 2024 04:21:19 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

> This patch optimizes LongVector multiplication by inferring VPMUL[U]DQ instruction for following IR pallets.
>   
> 
>        MulVL   ( AndV  SRC1,  0xFFFFFFFF)   ( AndV  SRC2,  0xFFFFFFFF) 
>        MulVL   (URShiftVL SRC1 , 32) (URShiftVL SRC2, 32)
>        MulVL   (URShiftVL SRC1 , 32)  ( AndV  SRC2,  0xFFFFFFFF)
>        MulVL   ( AndV  SRC1,  0xFFFFFFFF) (URShiftVL SRC2 , 32)
>        MulVL   (VectorCastI2X SRC1) (VectorCastI2X SRC2)
>        MulVL   (RShiftVL SRC1 , 32) (RShiftVL SRC2, 32)
> 
> 
> 
>  A  64x64 bit multiplication produces 128 bit result, and can be performed by individually multiplying upper and lower double word of multiplier with multiplicand and assembling the partial products to compute full width result. Targets supporting vector quadword multiplication have separate instructions to compute upper and lower quadwords for 128 bit result. Therefore existing VectorAPI multiplication operator expects shape conformance between source and result vectors.
> 
> If upper 32 bits of quadword multiplier and multiplicand is always set to zero then result of multiplication is only dependent on the partial product of their lower double words and can be performed using unsigned 32 bit multiplication instruction with quadword saturation. Patch matches this pattern in a target dependent manner without introducing new IR node.
>  
> VPMUL[U]DQ instruction performs [unsigned] multiplication between even numbered doubleword lanes of two long vectors and produces 64 bit result.  It has much lower latency compared to full 64 bit multiplication instruction "VPMULLQ", in addition non-AVX512DQ targets does not support direct quadword multiplication, thus we can save redundant partial product for zeroed out upper 32 bits. This results into throughput improvements on both P and E core Xeons.
> 
> Please find below the performance of [XXH3 hashing benchmark ](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html)included with the patch:-
>  
> 
> Sierra Forest :-
> ============
> Baseline:-
> Benchmark                                 (SIZE)   Mode  Cnt    Score   Error   Units
> VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  806.228          ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  403.044          ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2  200.641          ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2  100.664          ops/ms
> 
> With Optimization:-
> Benchmark                                 (SIZE)   Mode  ...

> Hi @jaskarth , Bigger pattern matching is sensitive to [IR level node sharing](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L1724), thus it may not full proof for above 4 patterns. Current patch takes care of this limitation.

I think this is a good point. I've taken a look at the patch and added some comments below.

Hmm, do you think this pattern could be matched in the ad-files instead of the middle end? I think that might be a lot cleaner since the backend already has systems for matching node trees, which could avoid a lot of the complexity here. I think it could make the patch a lot smaller and simpler.

For the record I think in this PR we could simply match the IR patterns in the ad file, since (from my understanding) the patterns we are matching could be supported there. We should do platform-specific lowering in a separate patch because it is pretty nuanced, and we could potentially move it to the new system afterwards.

I'm pretty ambivalent, I think implementing it either way would be alright. Especially with unit tests, I think the lowering implementation wouldn't be that difficult. Maybe another reviewer has an opinion?

About PhaseLowering though, I've found some more interesting things we could do with it, especially with improving vectorization support in the backend. @merykitty have you already started to work on it? I was thinking about prototyping it soon. Just wanted to make sure we're not doing the same work twice :)

BTW, from the last conversation I had started working on PhaseLowering myself, you can see my work so far on my branch: https://github.com/jaskarth/jdk/tree/phase-lowering. I think I can publish an RFE in the coming two or three days (there were some optimizations and cleanup I was prototyping, I will remove them before sending a PR.) Do you think we should continue with my branch or do you want to approach the problem from a different way? Just want to check again to make sure we don't end up re-doing the same work :)

src/hotspot/cpu/x86/matcher_x86.hpp line 184:

> 182:   // Does the CPU supports doubleword multiplication with quadword saturation.
> 183:   static constexpr bool supports_double_word_mult_with_quadword_staturation(void) {
> 184:     return true;

Should this be `UseAVX > 0`? I'm wondering since we have a `MulVL` rule that applies when `UseAVX == 0`.

src/hotspot/share/opto/vectornode.cpp line 2089:

> 2087:   if (Matcher::supports_double_word_mult_with_quadword_staturation() &&
> 2088:       !is_mult_lower_double_word()) {
> 2089:     auto is_clear_upper_double_word_uright_shift_op = [](const Node *n) {

Suggestion:

    auto is_clear_upper_double_word_uright_shift_op = [](const Node* n) {

src/hotspot/share/opto/vectornode.cpp line 2093:

> 2091:              n->in(2)->Opcode() == Op_RShiftCntV && n->in(2)->in(1)->is_Con() &&
> 2092:              n->in(2)->in(1)->bottom_type()->isa_int() &&
> 2093:              n->in(2)->in(1)->bottom_type()->is_int()->get_con() == 32L;

Suggestion:

             n->in(2)->in(1)->bottom_type()->is_int()->get_con() == 32;

Since you are comparing with a `TypeInt` I think this shouldn't be `32L`.

src/hotspot/share/opto/vectornode.cpp line 2098:

> 2096:     auto is_lower_double_word_and_mask_op = [](const Node *n) {
> 2097:       if (n->Opcode() == Op_AndV) {
> 2098:         Node *replicate_operand = n->in(1)->Opcode() == Op_Replicate ? n->in(1)

Suggestion:

        Node* replicate_operand = n->in(1)->Opcode() == Op_Replicate ? n->in(1)

src/hotspot/share/opto/vectornode.cpp line 2124:

> 2122:     // MulL ( And  SRC1,  0xFFFFFFFF) (URShift SRC2 , 32)
> 2123:     if ((is_lower_double_word_and_mask_op(in(1)) ||
> 2124:          is_lower_double_word_and_mask_op(in(1)) ||

`is_lower_double_word_and_mask_op(in(1)) || is_lower_double_word_and_mask_op(in(1))` is redundant, right? Shouldn't you only need it once? Same for the other 3 calls, which are similarly repeated.

test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 41:

> 39:  */
> 40: 
> 41: public class VectorMultiplyOpt {

Could it be possible to also do IR verification in this test? It would be good to check that we don't generate `AndVL` or `URShiftVL` with this transform.

test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 43:

> 41: public class VectorMultiplyOpt {
> 42: 
> 43:     public static long [] src1;

Suggestion:

    public static long[] src1;

And for the rest of the `long []` in this file too.

test/micro/org/openjdk/bench/jdk/incubator/vector/VectorXXH3HashingBenchmark.java line 39:

> 37:     @Param({"1024", "2048", "4096", "8192"})
> 38:     private int  SIZE;
> 39:     private long [] accumulators;

Suggestion:

    private long[] accumulators;

-------------

PR Review: https://git.openjdk.org/jdk/pull/21244#pullrequestreview-2367683334
PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2407658405
PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2411538179
PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2414553899
PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2422700344
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800159123
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800153755
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800153568
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800153842
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800151177
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800167403
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800165261
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800169840


More information about the hotspot-compiler-dev mailing list