[vectorIntrinsics+mask] RFR: 8272971: Intrinsification of VectorMask.cast operation for all compatible vector species [v2]

Mon Aug 30 03:57:07 UTC 2021

On Thu, 26 Aug 2021 20:54:19 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> - Patch intrinsifies VectorMask.cast operation if source and destination mask species are compatible i.e. have same vector length.
>> - Handles casting for both predicated/non-predicated targets.
>> 
>> Following is the performance data for new JMH benchmark included with the patch.
>> 
>> System: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (28C 2S Cascadelake Server)
>> Benchmark | Baseline AVX512 (ops/ms) | Withopt AVX512 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | --
>> microMaskCastByte128ToInteger512 | 54516.035 | 112778.756 | 2.068726311 | 56144.479 | 48677.988 | 0.867012908
>> microMaskCastByte128ToShort256 | 55216.805 | 114020.66 | 2.064963013 | 52357.222 | 113713.843 | 2.171884578
>> microMaskCastByte256ToShort512 | 47392.839 | 90946.115 | 1.918984322 | 46976.122 | 44040.585 | 0.937510018
>> microMaskCastByte64ToInteger256 | 62578.981 | 128643.386 | 2.055696401 | 64291.206 | 125241.322 | 1.948031928
>> microMaskCastByte64ToLong512 | 65725.522 | 123135.03 | 1.873473595 | 63500.39 | 57353.881 | 0.903205177
>> microMaskCastByte64ToShort128 | 62440.621 | 121789.41 | 1.950483644 | 68406.484 | 129829.223 | 1.897908143
>> microMaskCastInteger128ToLong256 | 68458.06 | 130204.293 | 1.901957096 | 73194.15 | 129671.204 | 1.771606119
>> microMaskCastInteger128ToShort64 | 67889.419 | 126591.52 | 1.864672314 | 72413.82 | 129555.214 | 1.789095148
>> microMaskCastInteger256ToByte64 | 60895.223 | 130321.893 | 2.140100431 | 64238.202 | 126321.452 | 1.966453731
>> microMaskCastInteger256ToLong512 | 65975.311 | 129705.935 | 1.965976864 | 68179.69 | 57691.751 | 0.846172093
>> microMaskCastInteger256ToShort128 | 67545.659 | 125688.394 | 1.860791587 | 63548.106 | 122347.947 | 1.925280779
>> microMaskCastInteger512ToByte128 | 51766.31 | 115913.374 | 2.239166245 | 55993.494 | 49020.628 | 0.875470068
>> microMaskCastInteger512ToShort256 | 52156.663 | 109821.213 | 2.105602749 | 56366.012 | 48907.786 | 0.867682212
>> microMaskCastInteger64ToLong128 | 73578.517 | 63373.966 | 0.861310727 | 74174.816 | 63532.575 | 0.856524875
>> microMaskCastLong128ToInteger64 | 74027.908 | 63708.687 | 0.860603639 | 68350.908 | 64608.882 | 0.945252724
>> microMaskCastLong256ToInteger128 | 71876.726 | 123125.286 | 1.713006321 | 69808.173 | 129450.203 | 1.854370304
>> microMaskCastLong256ToShort64 | 72947.678 | 127544.459 | 1.748437545 | 72577.142 | 129282.92 | 1.781317319
>> microMaskCastLong512ToByte64 | 66746.009 | 126422.173 | 1.894078386 | 68758.915 | 58392.958 | 0.849241993
>> microMaskCastLong512ToInteger256 | 66989.512 | 120517.044 | 1.799043468 | 62663.689 | 58091.934 | 0.927042996
>> microMaskCastLong512ToShort128 | 66560.838 | 126906.819 | 1.906628925 | 64319.673 | 58479.2 | 0.909196165
>> microMaskCastShort128ToByte64 | 62698.789 | 126292.593 | 2.014274837 | 68764.768 | 131602.165 | 1.91380221
>> microMaskCastShort128ToInteger256 | 62545.978 | 130594.425 | 2.087974786 | 63122.811 | 131626.603 | 2.085246219
>> microMaskCastShort128ToLong512 | 65828.219 | 125557.859 | 1.90735616 | 68457.314 | 58924.963 | 0.86075482
>> microMaskCastShort256ToByte128 | 51423.139 | 116624.494 | 2.267938058 | 55950.597 | 111919.098 | 2.000319997
>> microMaskCastShort256ToInteger512 | 51563.845 | 110798.412 | 2.148761637 | 54465.523 | 48667.357 | 0.893544289
>> microMaskCastShort512ToByte256 | 47761.772 | 91753.708 | 1.921070014 | 47341.838 | 44144.299 | 0.932458495
>> microMaskCastShort64ToInteger128 | 69075.232 | 129302.738 | 1.871911744 | 71861.612 | 125784.021 | 1.75036459
>> microMaskCastShort64ToLong256 | 68596.655 | 130142.777 | 1.897217539 | 72313.591 | 130789.753 | 1.808646911
>> 
>> 
>> PS:  Around 2x gains is seen in all cases for fast path (C2 inline expansion) and slight degradation over AVX2 on slow path (interpreted) in cases where target do not support 512 bit vector due to additional call overhead.
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8272971: Optimizing IR for mask-casting over non-predicated targets.

src/hotspot/share/opto/vectorIntrinsics.cpp line 2320:

> 2318:   if (is_mask && (type2aelembytes(elem_bt_from) != type2aelembytes(elem_bt_to))) {
> 2319:     return false; // elem size mismatch
> 2320:   }

Since Arm SVE has different mask representation than AVX-512 for different types, I think this change will break existing assertion in VectorMaskCastNode constructor (https://github.com/openjdk/panama-vector/blob/vectorIntrinsics%2Bmask/src/hotspot/share/opto/vectornode.hpp#L1427)?

test/micro/org/openjdk/bench/jdk/incubator/vector/MaskCastOperationsBenchmark.java line 33:

> 31: @OutputTimeUnit(TimeUnit.MILLISECONDS)
> 32: @State(Scope.Thread)
> 33: public class MaskCastOperationsBenchmark {

Add jvmArgs "--add-modules=jdk.incubator.vector" like other benchmarks?

-------------

PR: https://git.openjdk.java.net/panama-vector/pull/113