[vectorIntrinsics+mask] RFR: 8270349: Initial X86 backend support for optimizing masking operations on AVX512 targets. [v2]

Thu Aug 12 00:07:57 UTC 2021

On Fri, 6 Aug 2021 17:06:07 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Intel targets that support AVX512 offer predicated vector instructions, which operate only on the vector lanes selected by an opmask register. On non-AVX512 targets, masked vector operations are implemented with an explicit vector blend after the main vector operation, which performs the needed selection.
>> 
>> This patch adds initial X86 backend support for predicated vector operations.
>> 
>> Following is performance data for existing VectorAPI JMH benchmarks with the patch:
>> Test System:  Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake server 40C 2S)
>> 
>> Benchmark | SIZE | Baseline (ops/ms) | WithOpts (ops/ms) | Gain
>> -- | -- | -- | -- | --
>> Int512Vector.ABSMasked | 1024 | 10132.664 | 10394.942 | 1.025884407
>> Int512Vector.ADDMasked | 1024 | 7785.805 | 8980.133 | 1.153398139
>> Int512Vector.ADDMaskedLanes | 1024 | 5809.455 | 6350.628 | 1.093153833
>> Int512Vector.ANDMasked | 1024 | 7776.729 | 8965.988 | 1.152925349
>> Int512Vector.ANDMaskedLanes | 1024 | 6717.202 | 7426.217 | 1.105552133
>> Int512Vector.AND_NOTMasked | 1024 | 7688.835 | 8988.659 | 1.169053439
>> Int512Vector.ASHRMasked | 1024 | 6808.185 | 7883.755 | 1.1579819
>> Int512Vector.ASHRMaskedShift | 1024 | 9523.164 | 12166.72 | 1.277592195
>> Int512Vector.BITWISE_BLENDMasked | 1024 | 5919.647 | 6864.988 | 1.159695502
>> Int512Vector.DIVMasked | 1024 | 237.174 | 236.014 | 0.995109076
>> Int512Vector.FIRST_NONZEROMasked | 1024 | 5387.315 | 7890.42 | 1.464629412
>> Int512Vector.LSHLMasked | 1024 | 6806.898 | 7881.315 | 1.157842383
>> Int512Vector.LSHLMaskedShift | 1024 | 9552.257 | 12153.769 | 1.272345269
>> Int512Vector.LSHRMasked | 1024 | 6776.605 | 7897.786 | 1.165448776
>> Int512Vector.LSHRMaskedShift | 1024 | 9500.087 | 12134.962 | 1.277352723
>> Int512Vector.MAXMaskedLanes | 1024 | 6993.149 | 7580.399 | 1.083975045
>> Int512Vector.MINMaskedLanes | 1024 | 6925.363 | 7450.814 | 1.075873424
>> Int512Vector.MULMasked | 1024 | 7732.753 | 8956.02 | 1.158192949
>> Int512Vector.MULMaskedLanes | 1024 | 4066.384 | 4152.375 | 1.021146798
>> Int512Vector.NEGMasked | 1024 | 8760.797 | 9255.063 | 1.056417926
>> Int512Vector.NOTMasked | 1024 | 8981.123 | 9229.573 | 1.027663578
>> Int512Vector.ORMasked | 1024 | 7786.787 | 8967.057 | 1.151573428
>> Int512Vector.ORMaskedLanes | 1024 | 6694.36 | 7450.106 | 1.112892943
>> Int512Vector.SUBMasked | 1024 | 7782.939 | 9001.692 | 1.156592901
>> Int512Vector.XORMasked | 1024 | 7785.031 | 9070.342 | 1.165100306
>> Int512Vector.XORMaskedLanes | 1024 | 6700.689 | 7454.73 | 1.112531861
>> Int512Vector.ZOMOMasked | 1024 | 6982.297 | 8313.51 | 1.190655453
>> Int512Vector.gatherMasked | 1024 | 361.497 | 1494.876 | 4.135237637
>> Int512Vector.scatterMasked | 1024 | 490.05 | 3120.425 | 6.367564534
>> Int512Vector.sliceMasked | 1024 | 1436.248 | 1597.805 | 1.112485448
>> Int512Vector.unsliceMasked | 1024 | 296.721 | 346.434 | 1.167541226
>> Float512Vector.ADDMasked | 1024 | 7645.873 | 9123.386 | 1.193243205
>> Float512Vector.ADDMaskedLanes | 1024 | 2404.371 | 2529.284 | 1.051952465
>> Float512Vector.DIVMasked | 1024 | 5134.602 | 5129.085 | 0.998925525
>> Float512Vector.FIRST_NONZEROMasked | 1024 | 5040.567 | 7078.828 | 1.404371373
>> Float512Vector.FMAMasked | 1024 | 5996.419 | 6902.626 | 1.151124696
>> Float512Vector.MAXMaskedLanes | 1024 | 1681.249 | 1727.444 | 1.027476596
>> Float512Vector.MINMaskedLanes | 1024 | 1610.115 | 1667.143 | 1.035418588
>> Float512Vector.MULMasked | 1024 | 7812.317 | 9054.137 | 1.158956683
>> Float512Vector.MULMaskedLanes | 1024 | 2406.81 | 2514.018 | 1.044543608
>> Float512Vector.NEGMasked | 1024 | 8248.933 | 9834.607 | 1.192227771
>> Float512Vector.SQRTMasked | 1024 | 4278.046 | 4281.009 | 1.000692606
>> Float512Vector.SUBMasked | 1024 | 7697.582 | 9044.305 | 1.174954031
>> Float512Vector.gatherMasked | 1024 | 428.428 | 1491.441 | 3.48119404
>> Float512Vector.scatterMasked | 1024 | 416.169 | 3216.628 | 7.729138883
>> Float512Vector.sliceMasked | 1024 | 1431.07 | 1609.12 | 1.124417394
>> Float512Vector.unsliceMasked | 1024 | 292.513 | 331.366 | 1.132824866
>> 
>> 
>> 
>> PS: The data above shows the performance gains for two vector species, Int512 and Float512. In general, across all species we see 1.2-2.x gains on the masking operations supported so far.
>> The new matcher routine `Matcher::match_rule_supported_vector_masked` lists the masking operations supported by this patch.
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits:
> 
>  - 8270349: Merge with latest vectorIntrinsics+mask tip + extend backend support for XorV,AndV,OrV and Compare masked operations.
>  - 8270349: Fix for 32-bit build failure.
>  - 8270349: Initial X86 backend support for optimizing masking operations on AVX512 targets.

src/hotspot/cpu/x86/assembler_x86.cpp line 7569:

> 7567: }
> 7568: 
> 7569: void Assembler::evpxord(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {

Most of the added instructions are very similar, so there is a lot of duplicated code. It could be modularized for easier maintenance and review.
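
As a rough illustration of the kind of modularization meant here, the following is a self-contained toy model, not HotSpot code. `ToyAssembler`, `MaskedOp`, `Emitted` and `emit_masked_rr` are hypothetical names invented for this sketch, registers are plain integers, and the helper only records what would be encoded instead of emitting bytes. The point is that the per-instruction differences (opcode byte, W bit) become data, while the shared logic lives in one place.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model only: none of these types or functions exist in HotSpot.
struct MaskedOp {
  uint8_t opcode;  // primary opcode byte
  bool    vex_w;   // EVEX.W: false for the dword forms, true for the qword forms
};

struct Emitted {
  uint8_t opcode;
  bool    vex_w;
  bool    zeroing;  // true: masked-off lanes are cleared; false: they are merged
  int     dst, mask, nds, src;
};

struct ToyAssembler {
  std::vector<Emitted> log;

  // Single place for the logic shared by every masked register-register form.
  void emit_masked_rr(const MaskedOp& op, int dst, int mask, int nds, int src, bool merge) {
    assert(mask >= 1 && mask <= 7);  // k0 cannot be used as a write mask
    log.push_back(Emitted{op.opcode, op.vex_w, !merge, dst, mask, nds, src});
  }

  // Each instruction reduces to a one-line wrapper around the helper.
  void evpaddd(int dst, int mask, int nds, int src, bool merge) {
    emit_masked_rr(MaskedOp{0xFE, false}, dst, mask, nds, src, merge);
  }
  void evpaddq(int dst, int mask, int nds, int src, bool merge) {
    emit_masked_rr(MaskedOp{0xD4, true}, dst, mask, nds, src, merge);
  }
  void evpxord(int dst, int mask, int nds, int src, bool merge) {
    emit_masked_rr(MaskedOp{0xEF, false}, dst, mask, nds, src, merge);
  }
};
```

With this shape, a fix such as a missing assert or a wrong attribute only has to be made once in the helper instead of in every copy.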

src/hotspot/cpu/x86/assembler_x86.cpp line 7585:

> 7583: 
> 7584: void Assembler::evpxorq(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 7585:   assert(VM_Version::supports_evex(), "");

The following assert is missing from this and similar instructions:
assert(vector_len == AVX_512bit || VM_Version::supports_avx512vl(), "");
Please make sure that the asserts are similar.

src/hotspot/cpu/x86/assembler_x86.cpp line 7592:

> 7590:   if (merge) {
> 7591:     attributes.reset_is_clear_context();
> 7592:   }

Isn't this needed only for instructions with a memory operand?
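
For context, here is a per-lane scalar model of what the `merge` flag appears to select, assuming that `reset_is_clear_context()` switches the encoding from zeroing-masking to merging-masking. `masked_xor`, `Vec` and `kLanes` are hypothetical names for this sketch only. Architecturally, both masking modes are defined for any register destination, independent of whether the source operand is a register or memory.

```cpp
#include <array>
#include <cstdint>

constexpr int kLanes = 16;  // a 512-bit vector holds sixteen 32-bit lanes
using Vec = std::array<int32_t, kLanes>;

// Scalar model of a masked 32-bit XOR (illustrative helper, not HotSpot code).
// merge == true : lanes whose mask bit is 0 keep the old dst value.
// merge == false: lanes whose mask bit is 0 are cleared to zero.
Vec masked_xor(const Vec& dst, uint16_t mask, const Vec& nds, const Vec& src, bool merge) {
  Vec out{};
  for (int i = 0; i < kLanes; ++i) {
    const bool selected = ((mask >> i) & 1) != 0;
    if (selected) {
      out[i] = nds[i] ^ src[i];
    } else {
      out[i] = merge ? dst[i] : 0;
    }
  }
  return out;
}
```

For example, with every lane of `dst` set to 7, `nds` to 12, `src` to 10 and only mask bit 0 set, lane 0 becomes 6 in both modes, while lane 1 stays 7 when merging and becomes 0 when zeroing.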

src/hotspot/cpu/x86/assembler_x86.cpp line 8310:

> 8308: 
> 8309: void Assembler::evpaddq(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 8310:   InstructionMark im(this);

InstructionMark isn't needed for register-only instructions. There are multiple similar instances.

src/hotspot/cpu/x86/assembler_x86.cpp line 8561:

> 8559: }
> 8560: 
> 8561: void Assembler::evpmulw(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {

This should be evpmullw.

src/hotspot/cpu/x86/assembler_x86.cpp line 8574:

> 8572: }
> 8573: 
> 8574: void Assembler::evpmulw(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {

This should be evpmullw.

src/hotspot/cpu/x86/assembler_x86.cpp line 8589:

> 8587: }
> 8588: 
> 8589: void Assembler::evpmuld(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {

This should be evpmulld.

src/hotspot/cpu/x86/assembler_x86.cpp line 8602:

> 8600: }
> 8601: 
> 8602: void Assembler::evpmuld(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {

This should be evpmulld.

src/hotspot/cpu/x86/assembler_x86.cpp line 8617:

> 8615: }
> 8616: 
> 8617: void Assembler::evpmulq(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {

This should be evpmullq.

src/hotspot/cpu/x86/assembler_x86.cpp line 8630:

> 8628: }
> 8629: 
> 8630: void Assembler::evpmulq(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {

This should be evpmullq.

src/hotspot/cpu/x86/assembler_x86.cpp line 8877:

> 8875: }
> 8876: 
> 8877: void Assembler::evfmaps(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {

It would be good to specify the flavor of FMA here, say evfmaps213ps, based on the opcode that you use.
Another point is that the 213 flavor does the following operation:
   dst = src + dst * nds;
Wouldn't the 231 flavor be better?
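
To make the difference between the two flavors concrete, here is a per-lane scalar sketch. `fma213` and `fma231` are illustrative names only, and the operand order follows the (dst, nds, src) signatures under review.

```cpp
#include <cmath>

// Per-lane scalar models of the two FMA operand orderings (illustration only).

// 213 form: dst = nds * dst + src
inline float fma213(float dst, float nds, float src) {
  return std::fma(nds, dst, src);
}

// 231 form: dst = nds * src + dst
inline float fma231(float dst, float nds, float src) {
  return std::fma(nds, src, dst);
}
```

With dst = 2, nds = 3 and src = 4, the 213 form yields 10 and the 231 form yields 14. In the 213 form the destination is a multiplicand, while in the 231 form it is the addend. Under merge-masking the unselected lanes keep the old destination value, so the choice of flavor also decides which input survives in those lanes.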

src/hotspot/cpu/x86/assembler_x86.cpp line 8890:

> 8888: }
> 8889: 
> 8890: void Assembler::evfmaps(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {

It would be good to specify the flavor of FMA here, say evfmaps213ps, based on the opcode that you use.

src/hotspot/cpu/x86/assembler_x86.cpp line 8905:

> 8903: }
> 8904: 
> 8905: void Assembler::evfmapd(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {

It would be good to specify the flavor of FMA here, say evfmaps213pd, based on the opcode that you use.

src/hotspot/cpu/x86/assembler_x86.cpp line 8918:

> 8916: }
> 8917: 
> 8918: void Assembler::evfmapd(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {

It would be good to specify the flavor of FMA here, say evfmaps213pd, based on the opcode that you use.

src/hotspot/cpu/x86/assembler_x86.cpp line 8933:

> 8931: }
> 8932: 
> 8933: void Assembler::evppermb(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {

This should be evpermb.

src/hotspot/cpu/x86/assembler_x86.cpp line 8946:

> 8944: }
> 8945: 
> 8946: void Assembler::evppermb(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {

This should be evpermb.

src/hotspot/cpu/x86/assembler_x86.cpp line 8960:

> 8958: }
> 8959: 
> 8960: void Assembler::evppermw(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {

This should be evpermw.

src/hotspot/cpu/x86/assembler_x86.cpp line 8973:

> 8971: }
> 8972: 
> 8973: void Assembler::evppermw(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {

This should be evpermw.

src/hotspot/cpu/x86/assembler_x86.cpp line 8987:

> 8985: }
> 8986: 
> 8987: void Assembler::evppermd(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {

This should be evpermd.

src/hotspot/cpu/x86/assembler_x86.cpp line 9000:

> 8998: }
> 8999: 
> 9000: void Assembler::evppermd(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {

This should be evpermd.

src/hotspot/cpu/x86/assembler_x86.cpp line 9014:

> 9012: }
> 9013: 
> 9014: void Assembler::evppermq(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {

This should be evpermq.

src/hotspot/cpu/x86/assembler_x86.cpp line 9027:

> 9025: }
> 9026: 
> 9027: void Assembler::evppermq(XMMRegister dst, KRegister mask, XMMRegister nds, Address src, bool merge, int vector_len) {

This should be evpermq.

src/hotspot/cpu/x86/assembler_x86.cpp line 9043:

> 9041: void Assembler::evpsllw(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 9042:   assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 9043:   InstructionAttr attributes(vector_len, /* vex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true);

So far WIG is encoded with vex_w as false. We could keep that consistent here and set vex_w to false.

src/hotspot/cpu/x86/assembler_x86.cpp line 9079:

> 9077: void Assembler::evpsrlw(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 9078:   assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 9079:   InstructionAttr attributes(vector_len, /* vex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true);

So far WIG is encoded with vex_w as false. We could keep that consistent here and set vex_w to false.

src/hotspot/cpu/x86/assembler_x86.cpp line 9115:

> 9113: void Assembler::evpsraw(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 9114:   assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 9115:   InstructionAttr attributes(vector_len, /* vex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true);

So far WIG is encoded with vex_w as false. We could keep that consistent here and set vex_w to false.

src/hotspot/cpu/x86/assembler_x86.cpp line 9259:

> 9257: void Assembler::evpminsb(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {
> 9258:   assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
> 9259:   InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ false, /* uses_vl */ true);