[vectorIntrinsics+mask] Integrated: 8270349: Initial X86 backend support for optimizing masking operations on AVX512 targets.

Mon Aug 23 09:59:32 UTC 2021

On Thu, 22 Jul 2021 06:31:06 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

> Intel targets supporting the AVX512 feature offer predicated vector instructions: vector operations that act only on the lanes selected by an opmask register. On non-AVX512 targets, masked vector operations are implemented by following the main vector operation with an explicit vector blend that performs the needed selection.
> 
> This patch adds initial X86 backend support for predicated vector operations.
> 
> The following is performance data for the existing VectorAPI JMH benchmarks with the patch:
> Test System:  Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake server 40C 2S)
> 
> Benchmark | SIZE | Baseline (ops/ms) | WithOpts (ops/ms) | Gain (WithOpts / Baseline)
> -- | -- | -- | -- | --
> Int512Vector.ABSMasked | 1024 | 10132.664 | 10394.942 | 1.025884407
> Int512Vector.ADDMasked | 1024 | 7785.805 | 8980.133 | 1.153398139
> Int512Vector.ADDMaskedLanes | 1024 | 5809.455 | 6350.628 | 1.093153833
> Int512Vector.ANDMasked | 1024 | 7776.729 | 8965.988 | 1.152925349
> Int512Vector.ANDMaskedLanes | 1024 | 6717.202 | 7426.217 | 1.105552133
> Int512Vector.AND_NOTMasked | 1024 | 7688.835 | 8988.659 | 1.169053439
> Int512Vector.ASHRMasked | 1024 | 6808.185 | 7883.755 | 1.1579819
> Int512Vector.ASHRMaskedShift | 1024 | 9523.164 | 12166.72 | 1.277592195
> Int512Vector.BITWISE_BLENDMasked | 1024 | 5919.647 | 6864.988 | 1.159695502
> Int512Vector.DIVMasked | 1024 | 237.174 | 236.014 | 0.995109076
> Int512Vector.FIRST_NONZEROMasked | 1024 | 5387.315 | 7890.42 | 1.464629412
> Int512Vector.LSHLMasked | 1024 | 6806.898 | 7881.315 | 1.157842383
> Int512Vector.LSHLMaskedShift | 1024 | 9552.257 | 12153.769 | 1.272345269
> Int512Vector.LSHRMasked | 1024 | 6776.605 | 7897.786 | 1.165448776
> Int512Vector.LSHRMaskedShift | 1024 | 9500.087 | 12134.962 | 1.277352723
> Int512Vector.MAXMaskedLanes | 1024 | 6993.149 | 7580.399 | 1.083975045
> Int512Vector.MINMaskedLanes | 1024 | 6925.363 | 7450.814 | 1.075873424
> Int512Vector.MULMasked | 1024 | 7732.753 | 8956.02 | 1.158192949
> Int512Vector.MULMaskedLanes | 1024 | 4066.384 | 4152.375 | 1.021146798
> Int512Vector.NEGMasked | 1024 | 8760.797 | 9255.063 | 1.056417926
> Int512Vector.NOTMasked | 1024 | 8981.123 | 9229.573 | 1.027663578
> Int512Vector.ORMasked | 1024 | 7786.787 | 8967.057 | 1.151573428
> Int512Vector.ORMaskedLanes | 1024 | 6694.36 | 7450.106 | 1.112892943
> Int512Vector.SUBMasked | 1024 | 7782.939 | 9001.692 | 1.156592901
> Int512Vector.XORMasked | 1024 | 7785.031 | 9070.342 | 1.165100306
> Int512Vector.XORMaskedLanes | 1024 | 6700.689 | 7454.73 | 1.112531861
> Int512Vector.ZOMOMasked | 1024 | 6982.297 | 8313.51 | 1.190655453
> Int512Vector.gatherMasked | 1024 | 361.497 | 1494.876 | 4.135237637
> Int512Vector.scatterMasked | 1024 | 490.05 | 3120.425 | 6.367564534
> Int512Vector.sliceMasked | 1024 | 1436.248 | 1597.805 | 1.112485448
> Int512Vector.unsliceMasked | 1024 | 296.721 | 346.434 | 1.167541226
> Float512Vector.ADDMasked | 1024 | 7645.873 | 9123.386 | 1.193243205
> Float512Vector.ADDMaskedLanes | 1024 | 2404.371 | 2529.284 | 1.051952465
> Float512Vector.DIVMasked | 1024 | 5134.602 | 5129.085 | 0.998925525
> Float512Vector.FIRST_NONZEROMasked | 1024 | 5040.567 | 7078.828 | 1.404371373
> Float512Vector.FMAMasked | 1024 | 5996.419 | 6902.626 | 1.151124696
> Float512Vector.MAXMaskedLanes | 1024 | 1681.249 | 1727.444 | 1.027476596
> Float512Vector.MINMaskedLanes | 1024 | 1610.115 | 1667.143 | 1.035418588
> Float512Vector.MULMasked | 1024 | 7812.317 | 9054.137 | 1.158956683
> Float512Vector.MULMaskedLanes | 1024 | 2406.81 | 2514.018 | 1.044543608
> Float512Vector.NEGMasked | 1024 | 8248.933 | 9834.607 | 1.192227771
> Float512Vector.SQRTMasked | 1024 | 4278.046 | 4281.009 | 1.000692606
> Float512Vector.SUBMasked | 1024 | 7697.582 | 9044.305 | 1.174954031
> Float512Vector.gatherMasked | 1024 | 428.428 | 1491.441 | 3.48119404
> Float512Vector.scatterMasked | 1024 | 416.169 | 3216.628 | 7.729138883
> Float512Vector.sliceMasked | 1024 | 1431.07 | 1609.12 | 1.124417394
> Float512Vector.unsliceMasked | 1024 | 292.513 | 331.366 | 1.132824866
> 
> 
> 
> PS: The data above shows the performance gains for two vector species, Int512 and Float512. In general, across all species we see 1.2-2.x gains on the various masking operations supported so far.
> The new matcher routine `Matcher::match_rule_supported_vector_masked` lists the masking operations supported by this patch.
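
To illustrate the two lowering strategies described in the quoted message, here is a minimal scalar sketch in plain Java. It is not the Vector API and not code from the patch. The class and method names (`MaskedAddModel`, `predicatedAdd`, `addThenBlend`) are hypothetical. The sketch assumes that lanes with an unset mask bit keep the value of the first operand, which is my understanding of how a masked `lanewise` ADD behaves in the Vector API. The first method models a predicated instruction that touches only the selected lanes. The second models the non-AVX512 fallback: compute every lane, then blend.

```java
import java.util.Arrays;

public class MaskedAddModel {
    // Models a predicated add: only lanes whose mask bit is set are computed;
    // unselected lanes keep the value from the first operand (assumed semantics).
    static int[] predicatedAdd(int[] a, int[] b, boolean[] mask) {
        int[] result = a.clone();
        for (int i = 0; i < a.length; i++) {
            if (mask[i]) {
                result[i] = a[i] + b[i];
            }
        }
        return result;
    }

    // Models the fallback: an unmasked add over all lanes, followed by an
    // explicit blend that selects between the sum and the first operand.
    static int[] addThenBlend(int[] a, int[] b, boolean[] mask) {
        int[] sum = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            sum[i] = a[i] + b[i];          // main vector operation
        }
        int[] result = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            result[i] = mask[i] ? sum[i] : a[i];   // blend step
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        boolean[] mask = {true, false, true, false};
        System.out.println(Arrays.toString(predicatedAdd(a, b, mask)));
        System.out.println(Arrays.toString(addThenBlend(a, b, mask)));
    }
}
```

Both methods produce the same result for any input, so the sketch shows only that the two lowerings are semantically equivalent. The gains reported in the table would come from the predicated form needing no separate blend step, which this scalar model does not measure.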

This pull request has now been integrated.

Changeset: 60aa8ca6
Author:    Jatin Bhateja <jbhateja at openjdk.org>
URL:       https://git.openjdk.java.net/panama-vector/commit/60aa8ca6dc0b3f1a3ee517db167f9660012858cd
Stats:     3419 lines in 18 files changed: 3243 ins; 70 del; 106 mod

8270349: Initial X86 backend support for optimizing masking operations on AVX512 targets.

Reviewed-by: sviswanathan

-------------

PR: https://git.openjdk.java.net/panama-vector/pull/99