[vectorIntrinsics+mask] RFR: 8272971: Intrinsification of VectorMask.cast operation for all compatible vector species [v2]

Thu Aug 26 22:35:43 UTC 2021

On Thu, 26 Aug 2021 20:54:19 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> - The patch intrinsifies the VectorMask.cast operation when the source and destination mask species are compatible, i.e. have the same vector length (lane count).
>> - It handles casting for both predicated and non-predicated targets.
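
For readers following the thread, the compatibility condition above can be sketched in plain Java. This is only an illustration under stated assumptions: the class `MaskCastCompat` and its methods are hypothetical names, not part of the patch or of jdk.incubator.vector, and the lane count is modelled as vector size in bits divided by element size in bits (the quantity the Vector API reports through `VectorSpecies.length()`).

```java
// Hypothetical helper for illustration only; not part of the patch
// and not part of the jdk.incubator.vector API.
public class MaskCastCompat {

    // Number of lanes in a vector of `vectorBits` bits whose elements
    // are `elementBits` bits wide (e.g. 128-bit byte vector -> 16 lanes).
    public static int laneCount(int vectorBits, int elementBits) {
        if (vectorBits <= 0 || elementBits <= 0 || vectorBits % elementBits != 0) {
            throw new IllegalArgumentException(
                "vector size must be a positive multiple of element size");
        }
        return vectorBits / elementBits;
    }

    // Assumption: two mask species are cast-compatible exactly when
    // their lane counts match, as described in the PR summary.
    public static boolean castCompatible(int srcVectorBits, int srcElementBits,
                                         int dstVectorBits, int dstElementBits) {
        return laneCount(srcVectorBits, srcElementBits)
            == laneCount(dstVectorBits, dstElementBits);
    }

    public static void main(String[] args) {
        // Byte128 -> Integer512: 16 lanes on both sides.
        System.out.println(castCompatible(128, 8, 512, 32));
        // Byte128 -> Integer256: 16 lanes versus 8 lanes.
        System.out.println(castCompatible(128, 8, 256, 32));
    }
}
```

Every benchmark name in the table below pairs species that satisfy this check, for example Byte64ToLong512 (8 lanes each) and Short256ToByte128 (16 lanes each).
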
>> 
>> The following is the performance data for the new JMH benchmark included with the patch.
>> 
>> System: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (28C 2S Cascade Lake Server)
>> Benchmark | Baseline AVX512 (ops/ms) | With opt AVX512 (ops/ms) | Gain ratio AVX512 | Baseline AVX2 (ops/ms) | With opt AVX2 (ops/ms) | Gain ratio AVX2
>> -- | -- | -- | -- | -- | -- | --
>> microMaskCastByte128ToInteger512 | 54516.035 | 112778.756 | 2.068726311 | 56144.479 | 48677.988 | 0.867012908
>> microMaskCastByte128ToShort256 | 55216.805 | 114020.66 | 2.064963013 | 52357.222 | 113713.843 | 2.171884578
>> microMaskCastByte256ToShort512 | 47392.839 | 90946.115 | 1.918984322 | 46976.122 | 44040.585 | 0.937510018
>> microMaskCastByte64ToInteger256 | 62578.981 | 128643.386 | 2.055696401 | 64291.206 | 125241.322 | 1.948031928
>> microMaskCastByte64ToLong512 | 65725.522 | 123135.03 | 1.873473595 | 63500.39 | 57353.881 | 0.903205177
>> microMaskCastByte64ToShort128 | 62440.621 | 121789.41 | 1.950483644 | 68406.484 | 129829.223 | 1.897908143
>> microMaskCastInteger128ToLong256 | 68458.06 | 130204.293 | 1.901957096 | 73194.15 | 129671.204 | 1.771606119
>> microMaskCastInteger128ToShort64 | 67889.419 | 126591.52 | 1.864672314 | 72413.82 | 129555.214 | 1.789095148
>> microMaskCastInteger256ToByte64 | 60895.223 | 130321.893 | 2.140100431 | 64238.202 | 126321.452 | 1.966453731
>> microMaskCastInteger256ToLong512 | 65975.311 | 129705.935 | 1.965976864 | 68179.69 | 57691.751 | 0.846172093
>> microMaskCastInteger256ToShort128 | 67545.659 | 125688.394 | 1.860791587 | 63548.106 | 122347.947 | 1.925280779
>> microMaskCastInteger512ToByte128 | 51766.31 | 115913.374 | 2.239166245 | 55993.494 | 49020.628 | 0.875470068
>> microMaskCastInteger512ToShort256 | 52156.663 | 109821.213 | 2.105602749 | 56366.012 | 48907.786 | 0.867682212
>> microMaskCastInteger64ToLong128 | 73578.517 | 63373.966 | 0.861310727 | 74174.816 | 63532.575 | 0.856524875
>> microMaskCastLong128ToInteger64 | 74027.908 | 63708.687 | 0.860603639 | 68350.908 | 64608.882 | 0.945252724
>> microMaskCastLong256ToInteger128 | 71876.726 | 123125.286 | 1.713006321 | 69808.173 | 129450.203 | 1.854370304
>> microMaskCastLong256ToShort64 | 72947.678 | 127544.459 | 1.748437545 | 72577.142 | 129282.92 | 1.781317319
>> microMaskCastLong512ToByte64 | 66746.009 | 126422.173 | 1.894078386 | 68758.915 | 58392.958 | 0.849241993
>> microMaskCastLong512ToInteger256 | 66989.512 | 120517.044 | 1.799043468 | 62663.689 | 58091.934 | 0.927042996
>> microMaskCastLong512ToShort128 | 66560.838 | 126906.819 | 1.906628925 | 64319.673 | 58479.2 | 0.909196165
>> microMaskCastShort128ToByte64 | 62698.789 | 126292.593 | 2.014274837 | 68764.768 | 131602.165 | 1.91380221
>> microMaskCastShort128ToInteger256 | 62545.978 | 130594.425 | 2.087974786 | 63122.811 | 131626.603 | 2.085246219
>> microMaskCastShort128ToLong512 | 65828.219 | 125557.859 | 1.90735616 | 68457.314 | 58924.963 | 0.86075482
>> microMaskCastShort256ToByte128 | 51423.139 | 116624.494 | 2.267938058 | 55950.597 | 111919.098 | 2.000319997
>> microMaskCastShort256ToInteger512 | 51563.845 | 110798.412 | 2.148761637 | 54465.523 | 48667.357 | 0.893544289
>> microMaskCastShort512ToByte256 | 47761.772 | 91753.708 | 1.921070014 | 47341.838 | 44144.299 | 0.932458495
>> microMaskCastShort64ToInteger128 | 69075.232 | 129302.738 | 1.871911744 | 71861.612 | 125784.021 | 1.75036459
>> microMaskCastShort64ToLong256 | 68596.655 | 130142.777 | 1.897217539 | 72313.591 | 130789.753 | 1.808646911
>> 
>> 
>> PS: Gains of around 2x are seen in all cases on the fast path (C2 inline expansion), with a slight degradation on AVX2 on the slow path (interpreted) in cases where the target does not support 512-bit vectors, due to the additional call overhead.
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8272971: Optimizing IR for mask-casting over non-predicated targets.

Marked as reviewed by sviswanathan (Committer).

The patch looks good to me.

-------------

PR: https://git.openjdk.java.net/panama-vector/pull/113