[vectorIntrinsics+mask] RFR: 8272971: Intrinsification of VectorMask.cast operation for all compatible vector species [v2]

Paul Sandoz psandoz at openjdk.java.net
Fri Aug 27 17:19:45 UTC 2021


On Thu, 26 Aug 2021 20:54:19 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> - Patch intrinsifies VectorMask.cast operation if source and destination mask species are compatible i.e. have same vector length.
>> - Handles casting for both predicated/non-predicated targets.
>> 
>> Following is the performance data for new JMH benchmark included with the patch.
>> 
>> System: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (28C 2S Cascadelake Server)
>> Benchmark | Baseline AVX512 (ops/ms) | Withopt AVX512 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | --
>> microMaskCastByte128ToInteger512 | 54516.035 | 112778.756 | 2.068726311 | 56144.479 | 48677.988 | 0.867012908
>> microMaskCastByte128ToShort256 | 55216.805 | 114020.66 | 2.064963013 | 52357.222 | 113713.843 | 2.171884578
>> microMaskCastByte256ToShort512 | 47392.839 | 90946.115 | 1.918984322 | 46976.122 | 44040.585 | 0.937510018
>> microMaskCastByte64ToInteger256 | 62578.981 | 128643.386 | 2.055696401 | 64291.206 | 125241.322 | 1.948031928
>> microMaskCastByte64ToLong512 | 65725.522 | 123135.03 | 1.873473595 | 63500.39 | 57353.881 | 0.903205177
>> microMaskCastByte64ToShort128 | 62440.621 | 121789.41 | 1.950483644 | 68406.484 | 129829.223 | 1.897908143
>> microMaskCastInteger128ToLong256 | 68458.06 | 130204.293 | 1.901957096 | 73194.15 | 129671.204 | 1.771606119
>> microMaskCastInteger128ToShort64 | 67889.419 | 126591.52 | 1.864672314 | 72413.82 | 129555.214 | 1.789095148
>> microMaskCastInteger256ToByte64 | 60895.223 | 130321.893 | 2.140100431 | 64238.202 | 126321.452 | 1.966453731
>> microMaskCastInteger256ToLong512 | 65975.311 | 129705.935 | 1.965976864 | 68179.69 | 57691.751 | 0.846172093
>> microMaskCastInteger256ToShort128 | 67545.659 | 125688.394 | 1.860791587 | 63548.106 | 122347.947 | 1.925280779
>> microMaskCastInteger512ToByte128 | 51766.31 | 115913.374 | 2.239166245 | 55993.494 | 49020.628 | 0.875470068
>> microMaskCastInteger512ToShort256 | 52156.663 | 109821.213 | 2.105602749 | 56366.012 | 48907.786 | 0.867682212
>> microMaskCastInteger64ToLong128 | 73578.517 | 63373.966 | 0.861310727 | 74174.816 | 63532.575 | 0.856524875
>> microMaskCastLong128ToInteger64 | 74027.908 | 63708.687 | 0.860603639 | 68350.908 | 64608.882 | 0.945252724
>> microMaskCastLong256ToInteger128 | 71876.726 | 123125.286 | 1.713006321 | 69808.173 | 129450.203 | 1.854370304
>> microMaskCastLong256ToShort64 | 72947.678 | 127544.459 | 1.748437545 | 72577.142 | 129282.92 | 1.781317319
>> microMaskCastLong512ToByte64 | 66746.009 | 126422.173 | 1.894078386 | 68758.915 | 58392.958 | 0.849241993
>> microMaskCastLong512ToInteger256 | 66989.512 | 120517.044 | 1.799043468 | 62663.689 | 58091.934 | 0.927042996
>> microMaskCastLong512ToShort128 | 66560.838 | 126906.819 | 1.906628925 | 64319.673 | 58479.2 | 0.909196165
>> microMaskCastShort128ToByte64 | 62698.789 | 126292.593 | 2.014274837 | 68764.768 | 131602.165 | 1.91380221
>> microMaskCastShort128ToInteger256 | 62545.978 | 130594.425 | 2.087974786 | 63122.811 | 131626.603 | 2.085246219
>> microMaskCastShort128ToLong512 | 65828.219 | 125557.859 | 1.90735616 | 68457.314 | 58924.963 | 0.86075482
>> microMaskCastShort256ToByte128 | 51423.139 | 116624.494 | 2.267938058 | 55950.597 | 111919.098 | 2.000319997
>> microMaskCastShort256ToInteger512 | 51563.845 | 110798.412 | 2.148761637 | 54465.523 | 48667.357 | 0.893544289
>> microMaskCastShort512ToByte256 | 47761.772 | 91753.708 | 1.921070014 | 47341.838 | 44144.299 | 0.932458495
>> microMaskCastShort64ToInteger128 | 69075.232 | 129302.738 | 1.871911744 | 71861.612 | 125784.021 | 1.75036459
>> microMaskCastShort64ToLong256 | 68596.655 | 130142.777 | 1.897217539 | 72313.591 | 130789.753 | 1.808646911
>> 
>> 
>> PS:  Around 2x gains is seen in all cases for fast path (C2 inline expansion) and slight degradation over AVX2 on slow path (interpreted) in cases where target do not support 512 bit vector due to additional call overhead.
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8272971: Optimizing IR for mask-casting over non-predicated targets.

Java changes look good, there is a simplification i think we should make before integrating.

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java line 655:

> 653:                 species.maskType(), species.elementType(), VLENGTH,
> 654:                 this, species,
> 655:                 Byte128Mask::defaultMaskCast);

We can inline defaultMaskCheck and remove it e.g.:

(m, s) -> s.maskFactory(m.toArray()).check(s);

-------------

Marked as reviewed by psandoz (Committer).

PR: https://git.openjdk.java.net/panama-vector/pull/113


More information about the panama-dev mailing list