[vectorIntrinsics+mask] RFR: 8270349: Initial X86 backend support for optimizing masking operations on AVX512 targets. [v8]

Thu Aug 19 20:08:03 UTC 2021

On Thu, 19 Aug 2021 20:03:21 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Intel targets supporting AVX512 feature offer predicated vector instructions. These are vector operations on selected vector lanes under the influence of opmask register. For non-AVX512 targets, masked vector operations are supported using an explicit vector blend operation after main vector operation which does the needed selection. 
>> 
>> This patch adds initial X86 backed support for predicated vector operations. 
>> 
>> Following is performance data for existing VectorAPI JMH benchmarks with the patch:
>> Test System:  Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake server 40C 2S)
>> 
>> Benchmark | SIZE | Baseline (ops/ms) | WithOpts (ops/ms) | Gain
>> -- | -- | -- | -- | --
>> Int512Vector.ABSMasked | 1024 | 10132.664 | 10394.942 | 1.025884407
>> Int512Vector.ADDMasked | 1024 | 7785.805 | 8980.133 | 1.153398139
>> Int512Vector.ADDMaskedLanes | 1024 | 5809.455 | 6350.628 | 1.093153833
>> Int512Vector.ANDMasked | 1024 | 7776.729 | 8965.988 | 1.152925349
>> Int512Vector.ANDMaskedLanes | 1024 | 6717.202 | 7426.217 | 1.105552133
>> Int512Vector.AND_NOTMasked | 1024 | 7688.835 | 8988.659 | 1.169053439
>> Int512Vector.ASHRMasked | 1024 | 6808.185 | 7883.755 | 1.1579819
>> Int512Vector.ASHRMaskedShift | 1024 | 9523.164 | 12166.72 | 1.277592195
>> Int512Vector.BITWISE_BLENDMasked | 1024 | 5919.647 | 6864.988 | 1.159695502
>> Int512Vector.DIVMasked | 1024 | 237.174 | 236.014 | 0.995109076
>> Int512Vector.FIRST_NONZEROMasked | 1024 | 5387.315 | 7890.42 | 1.464629412
>> Int512Vector.LSHLMasked | 1024 | 6806.898 | 7881.315 | 1.157842383
>> Int512Vector.LSHLMaskedShift | 1024 | 9552.257 | 12153.769 | 1.272345269
>> Int512Vector.LSHRMasked | 1024 | 6776.605 | 7897.786 | 1.165448776
>> Int512Vector.LSHRMaskedShift | 1024 | 9500.087 | 12134.962 | 1.277352723
>> Int512Vector.MAXMaskedLanes | 1024 | 6993.149 | 7580.399 | 1.083975045
>> Int512Vector.MINMaskedLanes | 1024 | 6925.363 | 7450.814 | 1.075873424
>> Int512Vector.MULMasked | 1024 | 7732.753 | 8956.02 | 1.158192949
>> Int512Vector.MULMaskedLanes | 1024 | 4066.384 | 4152.375 | 1.021146798
>> Int512Vector.NEGMasked | 1024 | 8760.797 | 9255.063 | 1.056417926
>> Int512Vector.NOTMasked | 1024 | 8981.123 | 9229.573 | 1.027663578
>> Int512Vector.ORMasked | 1024 | 7786.787 | 8967.057 | 1.151573428
>> Int512Vector.ORMaskedLanes | 1024 | 6694.36 | 7450.106 | 1.112892943
>> Int512Vector.SUBMasked | 1024 | 7782.939 | 9001.692 | 1.156592901
>> Int512Vector.XORMasked | 1024 | 7785.031 | 9070.342 | 1.165100306
>> Int512Vector.XORMaskedLanes | 1024 | 6700.689 | 7454.73 | 1.112531861
>> Int512Vector.ZOMOMasked | 1024 | 6982.297 | 8313.51 | 1.190655453
>> Int512Vector.gatherMasked | 1024 | 361.497 | 1494.876 | 4.135237637
>> Int512Vector.scatterMasked | 1024 | 490.05 | 3120.425 | 6.367564534
>> Int512Vector.sliceMasked | 1024 | 1436.248 | 1597.805 | 1.112485448
>> Int512Vector.unsliceMasked | 1024 | 296.721 | 346.434 | 1.167541226
>> Float512Vector.ADDMasked | 1024 | 7645.873 | 9123.386 | 1.193243205
>> Float512Vector.ADDMaskedLanes | 1024 | 2404.371 | 2529.284 | 1.051952465
>> Float512Vector.DIVMasked | 1024 | 5134.602 | 5129.085 | 0.998925525
>> Float512Vector.FIRST_NONZEROMasked | 1024 | 5040.567 | 7078.828 | 1.404371373
>> Float512Vector.FMAMasked | 1024 | 5996.419 | 6902.626 | 1.151124696
>> Float512Vector.MAXMaskedLanes | 1024 | 1681.249 | 1727.444 | 1.027476596
>> Float512Vector.MINMaskedLanes | 1024 | 1610.115 | 1667.143 | 1.035418588
>> Float512Vector.MULMasked | 1024 | 7812.317 | 9054.137 | 1.158956683
>> Float512Vector.MULMaskedLanes | 1024 | 2406.81 | 2514.018 | 1.044543608
>> Float512Vector.NEGMasked | 1024 | 8248.933 | 9834.607 | 1.192227771
>> Float512Vector.SQRTMasked | 1024 | 4278.046 | 4281.009 | 1.000692606
>> Float512Vector.SUBMasked | 1024 | 7697.582 | 9044.305 | 1.174954031
>> Float512Vector.gatherMasked | 1024 | 428.428 | 1491.441 | 3.48119404
>> Float512Vector.scatterMasked | 1024 | 416.169 | 3216.628 | 7.729138883
>> Float512Vector.sliceMasked | 1024 | 1431.07 | 1609.12 | 1.124417394
>> Float512Vector.unsliceMasked | 1024 | 292.513 | 331.366 | 1.132824866
>> 
>> 
>> 
>> PS: Above data shows the performance gains for two vector species Int512, Float512.  In general for all the species we see 1.2-2.x gains on various masking operation supported uptill now.
>> New matcher routine `Matcher::match_rule_supported_vector_masked`   lists making operations supported by this patch.
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits:
> 
>  - 8270349: Jcheck error resolution (extra white space)
>  - Merge branch 'vectorIntrinsics+mask' of http://github.com/openjdk/panama-vector into JDK-8270349
>  - 8270349: Optimizing JIT sequence for alltrue/anytrue and maskAll operations.
>  - 8270349: Minor fix missed in last checkin.
>  - 8270349: Review comments resolution.
>  - 8270349: Review comments resolution.
>  - 8270349: Review comments resolution.
>  - Merge branch 'vectorIntrinsics+mask' of http://github.com/openjdk/panama-vector into JDK-8270349
>  - 8270349: Merge with latest vectorIntrinsics+mask tip + extend backend support for XorV,AndV,OrV and Compare masked operations.
>  - 8270349: Fix for 32-bit build failure.
>  - ... and 1 more: https://git.openjdk.java.net/panama-vector/compare/c1950c63...b001f939

Hi @sviswa7 , your comments have been addressed
Along with the latest merge, backend support for masked memory operations over byteArray and byteBuffers has also been added. New patterns for VectorReinterpret operation for mask types and minor IR change related to it.

-------------

PR: https://git.openjdk.java.net/panama-vector/pull/99