[vectorIntrinsics+mask] RFR: 8270349: Initial X86 backend support for optimizing masking operations on AVX512 targets. [v3]

Thu Aug 12 12:26:39 UTC 2021

On Thu, 12 Aug 2021 12:17:50 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Intel targets supporting AVX512 feature offer predicated vector instructions. These are vector operations on selected vector lanes under the influence of opmask register. For non-AVX512 targets, masked vector operations are supported using an explicit vector blend operation after main vector operation which does the needed selection. 
>> 
>> This patch adds initial X86 backed support for predicated vector operations. 
>> 
>> Following is performance data for existing VectorAPI JMH benchmarks with the patch:
>> Test System:  Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake server 40C 2S)
>> 
>> Benchmark | SIZE | Baseline (ops/ms) | WithOpts (ops/ms) | Gain
>> -- | -- | -- | -- | --
>> Int512Vector.ABSMasked | 1024 | 10132.664 | 10394.942 | 1.025884407
>> Int512Vector.ADDMasked | 1024 | 7785.805 | 8980.133 | 1.153398139
>> Int512Vector.ADDMaskedLanes | 1024 | 5809.455 | 6350.628 | 1.093153833
>> Int512Vector.ANDMasked | 1024 | 7776.729 | 8965.988 | 1.152925349
>> Int512Vector.ANDMaskedLanes | 1024 | 6717.202 | 7426.217 | 1.105552133
>> Int512Vector.AND_NOTMasked | 1024 | 7688.835 | 8988.659 | 1.169053439
>> Int512Vector.ASHRMasked | 1024 | 6808.185 | 7883.755 | 1.1579819
>> Int512Vector.ASHRMaskedShift | 1024 | 9523.164 | 12166.72 | 1.277592195
>> Int512Vector.BITWISE_BLENDMasked | 1024 | 5919.647 | 6864.988 | 1.159695502
>> Int512Vector.DIVMasked | 1024 | 237.174 | 236.014 | 0.995109076
>> Int512Vector.FIRST_NONZEROMasked | 1024 | 5387.315 | 7890.42 | 1.464629412
>> Int512Vector.LSHLMasked | 1024 | 6806.898 | 7881.315 | 1.157842383
>> Int512Vector.LSHLMaskedShift | 1024 | 9552.257 | 12153.769 | 1.272345269
>> Int512Vector.LSHRMasked | 1024 | 6776.605 | 7897.786 | 1.165448776
>> Int512Vector.LSHRMaskedShift | 1024 | 9500.087 | 12134.962 | 1.277352723
>> Int512Vector.MAXMaskedLanes | 1024 | 6993.149 | 7580.399 | 1.083975045
>> Int512Vector.MINMaskedLanes | 1024 | 6925.363 | 7450.814 | 1.075873424
>> Int512Vector.MULMasked | 1024 | 7732.753 | 8956.02 | 1.158192949
>> Int512Vector.MULMaskedLanes | 1024 | 4066.384 | 4152.375 | 1.021146798
>> Int512Vector.NEGMasked | 1024 | 8760.797 | 9255.063 | 1.056417926
>> Int512Vector.NOTMasked | 1024 | 8981.123 | 9229.573 | 1.027663578
>> Int512Vector.ORMasked | 1024 | 7786.787 | 8967.057 | 1.151573428
>> Int512Vector.ORMaskedLanes | 1024 | 6694.36 | 7450.106 | 1.112892943
>> Int512Vector.SUBMasked | 1024 | 7782.939 | 9001.692 | 1.156592901
>> Int512Vector.XORMasked | 1024 | 7785.031 | 9070.342 | 1.165100306
>> Int512Vector.XORMaskedLanes | 1024 | 6700.689 | 7454.73 | 1.112531861
>> Int512Vector.ZOMOMasked | 1024 | 6982.297 | 8313.51 | 1.190655453
>> Int512Vector.gatherMasked | 1024 | 361.497 | 1494.876 | 4.135237637
>> Int512Vector.scatterMasked | 1024 | 490.05 | 3120.425 | 6.367564534
>> Int512Vector.sliceMasked | 1024 | 1436.248 | 1597.805 | 1.112485448
>> Int512Vector.unsliceMasked | 1024 | 296.721 | 346.434 | 1.167541226
>> Float512Vector.ADDMasked | 1024 | 7645.873 | 9123.386 | 1.193243205
>> Float512Vector.ADDMaskedLanes | 1024 | 2404.371 | 2529.284 | 1.051952465
>> Float512Vector.DIVMasked | 1024 | 5134.602 | 5129.085 | 0.998925525
>> Float512Vector.FIRST_NONZEROMasked | 1024 | 5040.567 | 7078.828 | 1.404371373
>> Float512Vector.FMAMasked | 1024 | 5996.419 | 6902.626 | 1.151124696
>> Float512Vector.MAXMaskedLanes | 1024 | 1681.249 | 1727.444 | 1.027476596
>> Float512Vector.MINMaskedLanes | 1024 | 1610.115 | 1667.143 | 1.035418588
>> Float512Vector.MULMasked | 1024 | 7812.317 | 9054.137 | 1.158956683
>> Float512Vector.MULMaskedLanes | 1024 | 2406.81 | 2514.018 | 1.044543608
>> Float512Vector.NEGMasked | 1024 | 8248.933 | 9834.607 | 1.192227771
>> Float512Vector.SQRTMasked | 1024 | 4278.046 | 4281.009 | 1.000692606
>> Float512Vector.SUBMasked | 1024 | 7697.582 | 9044.305 | 1.174954031
>> Float512Vector.gatherMasked | 1024 | 428.428 | 1491.441 | 3.48119404
>> Float512Vector.scatterMasked | 1024 | 416.169 | 3216.628 | 7.729138883
>> Float512Vector.sliceMasked | 1024 | 1431.07 | 1609.12 | 1.124417394
>> Float512Vector.unsliceMasked | 1024 | 292.513 | 331.366 | 1.132824866
>> 
>> 
>> 
>> PS: Above data shows the performance gains for two vector species Int512, Float512.  In general for all the species we see 1.2-2.x gains on various masking operation supported uptill now.
>> New matcher routine `Matcher::match_rule_supported_vector_masked`   lists making operations supported by this patch.
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits:
> 
>  - 8270349: Review comments resolution.
>  - Merge branch 'vectorIntrinsics+mask' of http://github.com/openjdk/panama-vector into JDK-8270349
>  - 8270349: Merge with latest vectorIntrinsics+mask tip + extend backend support for XorV,AndV,OrV and Compare masked operations.
>  - 8270349: Fix for 32-bit build failure.
>  - 8270349: Initial X86 backend support for optimizing masking operations on AVX512 targets.

Hi @sviswa7 , Thanks for take a look at this. your comments have been addressed.

-------------

PR: https://git.openjdk.java.net/panama-vector/pull/99