[vectorIntrinsics+mask] RFR: 8270349: Initial X86 backend support for optimizing masking operations on AVX512 targets. [v2]
Sandhya Viswanathan
sviswanathan at openjdk.java.net
Thu Aug 12 00:07:57 UTC 2021
On Fri, 6 Aug 2021 17:06:07 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Intel targets supporting the AVX512 feature offer predicated vector instructions. These are vector operations performed on selected vector lanes under the control of an opmask register. For non-AVX512 targets, masked vector operations are supported by following the main vector operation with an explicit vector blend that performs the needed selection.
>>
>> This patch adds initial X86 backend support for predicated vector operations.
>>
>> Following is performance data for existing VectorAPI JMH benchmarks with the patch:
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake server 40C 2S)
>>
>> Benchmark | SIZE | Baseline (ops/ms) | WithOpts (ops/ms) | Gain
>> -- | -- | -- | -- | --
>> Int512Vector.ABSMasked | 1024 | 10132.664 | 10394.942 | 1.025884407
>> Int512Vector.ADDMasked | 1024 | 7785.805 | 8980.133 | 1.153398139
>> Int512Vector.ADDMaskedLanes | 1024 | 5809.455 | 6350.628 | 1.093153833
>> Int512Vector.ANDMasked | 1024 | 7776.729 | 8965.988 | 1.152925349
>> Int512Vector.ANDMaskedLanes | 1024 | 6717.202 | 7426.217 | 1.105552133
>> Int512Vector.AND_NOTMasked | 1024 | 7688.835 | 8988.659 | 1.169053439
>> Int512Vector.ASHRMasked | 1024 | 6808.185 | 7883.755 | 1.1579819
>> Int512Vector.ASHRMaskedShift | 1024 | 9523.164 | 12166.72 | 1.277592195
>> Int512Vector.BITWISE_BLENDMasked | 1024 | 5919.647 | 6864.988 | 1.159695502
>> Int512Vector.DIVMasked | 1024 | 237.174 | 236.014 | 0.995109076
>> Int512Vector.FIRST_NONZEROMasked | 1024 | 5387.315 | 7890.42 | 1.464629412
>> Int512Vector.LSHLMasked | 1024 | 6806.898 | 7881.315 | 1.157842383
>> Int512Vector.LSHLMaskedShift | 1024 | 9552.257 | 12153.769 | 1.272345269
>> Int512Vector.LSHRMasked | 1024 | 6776.605 | 7897.786 | 1.165448776
>> Int512Vector.LSHRMaskedShift | 1024 | 9500.087 | 12134.962 | 1.277352723
>> Int512Vector.MAXMaskedLanes | 1024 | 6993.149 | 7580.399 | 1.083975045
>> Int512Vector.MINMaskedLanes | 1024 | 6925.363 | 7450.814 | 1.075873424
>> Int512Vector.MULMasked | 1024 | 7732.753 | 8956.02 | 1.158192949
>> Int512Vector.MULMaskedLanes | 1024 | 4066.384 | 4152.375 | 1.021146798
>> Int512Vector.NEGMasked | 1024 | 8760.797 | 9255.063 | 1.056417926
>> Int512Vector.NOTMasked | 1024 | 8981.123 | 9229.573 | 1.027663578
>> Int512Vector.ORMasked | 1024 | 7786.787 | 8967.057 | 1.151573428
>> Int512Vector.ORMaskedLanes | 1024 | 6694.36 | 7450.106 | 1.112892943
>> Int512Vector.SUBMasked | 1024 | 7782.939 | 9001.692 | 1.156592901
>> Int512Vector.XORMasked | 1024 | 7785.031 | 9070.342 | 1.165100306
>> Int512Vector.XORMaskedLanes | 1024 | 6700.689 | 7454.73 | 1.112531861
>> Int512Vector.ZOMOMasked | 1024 | 6982.297 | 8313.51 | 1.190655453
>> Int512Vector.gatherMasked | 1024 | 361.497 | 1494.876 | 4.135237637
>> Int512Vector.scatterMasked | 1024 | 490.05 | 3120.425 | 6.367564534
>> Int512Vector.sliceMasked | 1024 | 1436.248 | 1597.805 | 1.112485448
>> Int512Vector.unsliceMasked | 1024 | 296.721 | 346.434 | 1.167541226
>> Float512Vector.ADDMasked | 1024 | 7645.873 | 9123.386 | 1.193243205
>> Float512Vector.ADDMaskedLanes | 1024 | 2404.371 | 2529.284 | 1.051952465
>> Float512Vector.DIVMasked | 1024 | 5134.602 | 5129.085 | 0.998925525
>> Float512Vector.FIRST_NONZEROMasked | 1024 | 5040.567 | 7078.828 | 1.404371373
>> Float512Vector.FMAMasked | 1024 | 5996.419 | 6902.626 | 1.151124696
>> Float512Vector.MAXMaskedLanes | 1024 | 1681.249 | 1727.444 | 1.027476596
>> Float512Vector.MINMaskedLanes | 1024 | 1610.115 | 1667.143 | 1.035418588
>> Float512Vector.MULMasked | 1024 | 7812.317 | 9054.137 | 1.158956683
>> Float512Vector.MULMaskedLanes | 1024 | 2406.81 | 2514.018 | 1.044543608
>> Float512Vector.NEGMasked | 1024 | 8248.933 | 9834.607 | 1.192227771
>> Float512Vector.SQRTMasked | 1024 | 4278.046 | 4281.009 | 1.000692606
>> Float512Vector.SUBMasked | 1024 | 7697.582 | 9044.305 | 1.174954031
>> Float512Vector.gatherMasked | 1024 | 428.428 | 1491.441 | 3.48119404
>> Float512Vector.scatterMasked | 1024 | 416.169 | 3216.628 | 7.729138883
>> Float512Vector.sliceMasked | 1024 | 1431.07 | 1609.12 | 1.124417394
>> Float512Vector.unsliceMasked | 1024 | 292.513 | 331.366 | 1.132824866
>>
>> PS: The above data shows the performance gains for two vector species, Int512 and Float512. In general, across all species we see 1.2x-2x gains on the various masking operations supported so far.
>> The new matcher routine `Matcher::match_rule_supported_vector_masked` lists the masking operations supported by this patch.
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits:
>
> - 8270349: Merge with latest vectorIntrinsics+mask tip + extend backend support for XorV,AndV,OrV and Compare masked operations.
> - 8270349: Fix for 32-bit build failure.
> - 8270349: Initial X86 backend support for optimizing masking operations on AVX512 targets.
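For context on the quoted description: on an AVX512 target a masked operation becomes a single predicated instruction under an opmask register, while the fallback composes the unmasked operation with a blend. A rough sketch in terms of the assembler interface (illustrative only; the registers and the blend-based lowering shown here are placeholders, not code from the patch):

    // AVX512: one predicated instruction, merge-masking under k1
    __ evpaddq(xmm0, k1, xmm1, xmm2, /* merge */ true, Assembler::AVX_512bit);
    // Non-AVX512 fallback: unmasked op, then select lanes via a blend on a mask vector
    __ vpaddq(xmm3, xmm1, xmm2, Assembler::AVX_256bit);
    __ vpblendvb(xmm0, xmm1, xmm3, xmm4, Assembler::AVX_256bit);   // xmm4 holds the byte-wise mask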
src/hotspot/cpu/x86/assembler_x86.cpp line 7569:
> 7567: }
> 7568:
> 7569: void Assembler::evpxord(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
Most of the added instructions are very similar, resulting in a lot of duplicated code. This could be modularized for easier maintenance and review.
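For instance, the register-register masked forms all share the same emit sequence and could go through one helper, with each public method reduced to its assert plus a single call. A rough sketch, with a hypothetical helper name (not proposed API):

    // Hypothetical shared helper: the per-instruction methods would differ only in
    // their asserts, the SIMD prefix, the opcode map and the opcode byte.
    void Assembler::emit_evex_masked_rr(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src,
                                        bool merge, int vector_len, bool vex_w,
                                        VexSimdPrefix pre, VexOpcode opc, int opcode_byte) {
      InstructionAttr attributes(vector_len, vex_w, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true);
      attributes.set_is_evex_instruction();
      attributes.set_embedded_opmask_register_specifier(mask);
      if (merge) {
        attributes.reset_is_clear_context();
      }
      int encode = vex_prefix_and_encode(dst->encoding(), nds->encoding(), src->encoding(), pre, opc, &attributes);
      emit_int16(opcode_byte, (0xC0 | encode));
    }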
src/hotspot/cpu/x86/assembler_x86.cpp line 7585:
> 7583:
> 7584: void Assembler::evpxorq(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 7585: assert(VM_Version::supports_evex(), "");
The following assert is missing from this and similar instructions:
assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");
Please make sure the asserts are consistent across these instructions.
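For example, the preamble would then look like (sketch):

    void Assembler::evpxorq(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
      assert(VM_Version::supports_evex(), "");
      assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");
      ...
    }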
src/hotspot/cpu/x86/assembler_x86.cpp line 7592:
> 7590: if (merge) {
> 7591: attributes.reset_is_clear_context();
> 7592: }
Isn't this needed only for instructions with a memory operand?
src/hotspot/cpu/x86/assembler_x86.cpp line 8310:
> 8308:
> 8309: void Assembler::evpaddq(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 8310: InstructionMark im(this);
InstructionMark is not needed for register-only instructions. There are multiple similar instances.
src/hotspot/cpu/x86/assembler_x86.cpp line 8561:
> 8559: }
> 8560:
> 8561: void Assembler::evpmulw(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
This should be evpmullw.
src/hotspot/cpu/x86/assembler_x86.cpp line 8574:
> 8572: }
> 8573:
> 8574: void Assembler::evpmulw(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
This should be evpmullw.
src/hotspot/cpu/x86/assembler_x86.cpp line 8589:
> 8587: }
> 8588:
> 8589: void Assembler::evpmuld(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
This should be evpmulld.
src/hotspot/cpu/x86/assembler_x86.cpp line 8602:
> 8600: }
> 8601:
> 8602: void Assembler::evpmuld(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
This should be evpmulld.
src/hotspot/cpu/x86/assembler_x86.cpp line 8617:
> 8615: }
> 8616:
> 8617: void Assembler::evpmulq(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
This should be evpmullq.
src/hotspot/cpu/x86/assembler_x86.cpp line 8630:
> 8628: }
> 8629:
> 8630: void Assembler::evpmulq(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
This should be evpmullq.
src/hotspot/cpu/x86/assembler_x86.cpp line 8877:
> 8875: }
> 8876:
> 8877: void Assembler::evfmaps(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
It would be good to specify the flavor of FMA in the name here, say evfmaps213ps, based on the opcode that you use.
Another point is that the 213 flavor does the following operation:
dst = src + dst * nds;
Wouldn't the 231 flavor be better?
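For reference, with the operand naming used by these methods (dst, nds, src), the two flavors compute:

    // vfmadd213ps: dst = (nds * dst) + src    (i.e. dst = src + dst * nds, as noted above)
    // vfmadd231ps: dst = (nds * src) + dst    (i.e. dst = dst + nds * src)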
src/hotspot/cpu/x86/assembler_x86.cpp line 8890:
> 8888: }
> 8889:
> 8890: void Assembler::evfmaps(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
It would be good to specify the flavor of FMA in the name here, say evfmaps213ps, based on the opcode that you use.
src/hotspot/cpu/x86/assembler_x86.cpp line 8905:
> 8903: }
> 8904:
> 8905: void Assembler::evfmapd(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
It would be good to specify the flavor of FMA in the name here, say evfmaps213pd, based on the opcode that you use.
src/hotspot/cpu/x86/assembler_x86.cpp line 8918:
> 8916: }
> 8917:
> 8918: void Assembler::evfmapd(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
It would be good to specify the flavor of FMA in the name here, say evfmaps213pd, based on the opcode that you use.
src/hotspot/cpu/x86/assembler_x86.cpp line 8933:
> 8931: }
> 8932:
> 8933: void Assembler::evppermb(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
This should be evpermb.
src/hotspot/cpu/x86/assembler_x86.cpp line 8946:
> 8944: }
> 8945:
> 8946: void Assembler::evppermb(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
This should be evpermb.
src/hotspot/cpu/x86/assembler_x86.cpp line 8960:
> 8958: }
> 8959:
> 8960: void Assembler::evppermw(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
This should be evpermw.
src/hotspot/cpu/x86/assembler_x86.cpp line 8973:
> 8971: }
> 8972:
> 8973: void Assembler::evppermw(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
This should be evpermw.
src/hotspot/cpu/x86/assembler_x86.cpp line 8987:
> 8985: }
> 8986:
> 8987: void Assembler::evppermd(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
This should be evpermd.
src/hotspot/cpu/x86/assembler_x86.cpp line 9000:
> 8998: }
> 8999:
> 9000: void Assembler::evppermd(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
This should be evpermd.
src/hotspot/cpu/x86/assembler_x86.cpp line 9014:
> 9012: }
> 9013:
> 9014: void Assembler::evppermq(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
This should be evpermq.
src/hotspot/cpu/x86/assembler_x86.cpp line 9027:
> 9025: }
> 9026:
> 9027: void Assembler::evppermq(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {
This should be evpermq.
src/hotspot/cpu/x86/assembler_x86.cpp line 9043:
> 9041: void Assembler::evpsllw(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 9042: assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 9043: InstructionAttr attributes(vector_len, /* vex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true);
So far WIG is encoded with vex_w as false. We could keep that consistent here and set vex_w to false.
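That is, the attributes line for these WIG forms would become:

    InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true);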
src/hotspot/cpu/x86/assembler_x86.cpp line 9079:
> 9077: void Assembler::evpsrlw(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 9078: assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 9079: InstructionAttr attributes(vector_len, /* vex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true);
So far WIG is encoded with vex_w as false. We could keep that consistent here and set vex_w to false.
src/hotspot/cpu/x86/assembler_x86.cpp line 9115:
> 9113: void Assembler::evpsraw(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 9114: assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 9115: InstructionAttr attributes(vector_len, /* vex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true);
So far WIG is encoded with vex_w as false. We could keep that consistent here and set vex_w to false.
src/hotspot/cpu/x86/assembler_x86.cpp line 9259:
> 9257: void Assembler::evpminsb(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 9258: assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 9259: InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ false, /* uses_vl */ true);
legacy_mode should be false here.
src/hotspot/cpu/x86/assembler_x86.cpp line 9285:
> 9283: void Assembler::evpminsw(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 9284: assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 9285: InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ false, /* uses_vl */ true);
legacy_mode should be false here.
src/hotspot/cpu/x86/assembler_x86.cpp line 9291:
> 9289: attributes.reset_is_clear_context();
> 9290: }
> 9291: int encode = vex_prefix_and_encode(dst->encoding(), nds->encoding(), src->encoding(), VEX_SIMD_66, VEX_OPCODE_0F_38, &attributes);
This should be VEX_OPCODE_0F.
src/hotspot/cpu/x86/assembler_x86.cpp line 9304:
> 9302: attributes.reset_is_clear_context();
> 9303: }
> 9304: vex_prefix(src, nds->encoding(), dst->encoding(), VEX_SIMD_66, VEX_OPCODE_0F_38, &attributes);
This should be VEX_OPCODE_0F.
src/hotspot/cpu/x86/assembler_x86.cpp line 9311:
> 9309: void Assembler::evpminsd(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 9310: assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");
> 9311: InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ false, /* uses_vl */ true);
legacy_mode should be false here.
src/hotspot/cpu/x86/assembler_x86.cpp line 9337:
> 9335: void Assembler::evpminsq(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 9336: assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");
> 9337: InstructionAttr attributes(vector_len, /* vex_w */ true, /* legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ false, /* uses_vl */ true);
legacy_mode should be false here.
src/hotspot/cpu/x86/assembler_x86.cpp line 9364:
> 9362: void Assembler::evpmaxsb(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 9363: assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 9364: InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ false, /* uses_vl */ true);
legacy_mode should be false here.
src/hotspot/cpu/x86/assembler_x86.cpp line 9377:
> 9375: assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 9376: InstructionMark im(this);
> 9377: InstructionAttr attributes(vector_len, /* vex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true);
So far WIG is encoded with vex_w as false. We could keep that consistent here and set vex_w to false.
src/hotspot/cpu/x86/assembler_x86.cpp line 9390:
> 9388: void Assembler::evpmaxsw(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 9389: assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 9390: InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ false, /* uses_vl */ true);
legacy_mode should be false here.
src/hotspot/cpu/x86/assembler_x86.cpp line 9409:
> 9407: attributes.reset_is_clear_context();
> 9408: }
> 9409: vex_prefix(src, nds->encoding(), dst->encoding(), VEX_SIMD_66, VEX_OPCODE_0F_38, &attributes);
VEX_OPCODE_0F_38 should be VEX_OPCODE_0F.
src/hotspot/cpu/x86/assembler_x86.cpp line 9416:
> 9414: void Assembler::evpmaxsd(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 9415: assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");
> 9416: InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ false, /* uses_vl */ true);
legacy_mode should be false here.
src/hotspot/cpu/x86/assembler_x86.cpp line 9442:
> 9440: void Assembler::evpmaxsq(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 9441: assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");
> 9442: InstructionAttr attributes(vector_len, /* vex_w */ true, /* legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ false, /* uses_vl */ true);
legacy_mode should be false here.
src/hotspot/cpu/x86/assembler_x86.cpp line 10388:
> 10386: _attributes->set_evex_encoding(evex_encoding);
> 10387:
> 10388: // P0: byte 2, initialized to RXBR00mm
This should be RXBR'00mm.
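For reference, the EVEX P0 byte layout (MSB to LSB) per the Intel SDM:

    // P0: R X B R' 0 0 m m
    //   bit 7:    R   (inverted, extends ModRM.reg)
    //   bit 6:    X   (inverted, extends SIB.index)
    //   bit 5:    B   (inverted, extends ModRM.rm / SIB.base)
    //   bit 4:    R'  (inverted, further extends ModRM.reg to reach zmm16-zmm31)
    //   bits 3-2: reserved, must be zero
    //   bits 1-0: mm  (opcode map select: 0F, 0F38, 0F3A)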
src/hotspot/cpu/x86/assembler_x86.cpp line 10410:
> 10408: 0 :
> 10409: _attributes->get_embedded_opmask_register_specifier();
> 10410: // EVEX.v for extending EVEX.vvvv or VIDX
This should be EVEX.v`.
-------------
PR: https://git.openjdk.java.net/panama-vector/pull/99