[vectorIntrinsics+mask] RFR: 8272971: Intrinsification of VectorMask.cast operation for all compatible vector species [v2]
Jatin Bhateja
jbhateja at openjdk.java.net
Thu Aug 26 20:54:19 UTC 2021
> - The patch intrinsifies the VectorMask.cast operation when the source and destination mask species are compatible, i.e. have the same vector length (number of lanes).
> - It handles casting for both predicated and non-predicated targets.
>
> The following is performance data for the new JMH benchmark included with the patch.
>
> System: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (28C 2S Cascade Lake Server)
>
> Benchmark | Baseline AVX512 (ops/ms) | With opt AVX512 (ops/ms) | AVX512 gain ratio | Baseline AVX2 (ops/ms) | With opt AVX2 (ops/ms) | AVX2 gain ratio
> -- | -- | -- | -- | -- | -- | --
> microMaskCastByte128ToInteger512 | 54516.035 | 112778.756 | 2.068726311 | 56353.873 | 48970.14 | 0.868975589
> microMaskCastByte128ToShort256 | 55216.805 | 114020.66 | 2.064963013 | 56834.785 | 114543.015 | 2.015368141
> microMaskCastByte256ToShort512 | 47392.839 | 90946.115 | 1.918984322 | 47412.246 | 43539.56 | 0.918318866
> microMaskCastByte64ToInteger256 | 62578.981 | 128643.386 | 2.055696401 | 68103.798 | 131857.429 | 1.936124458
> microMaskCastByte64ToLong512 | 65725.522 | 123135.03 | 1.873473595 | 68663.899 | 58686.52 | 0.854692507
> microMaskCastByte64ToShort128 | 62440.621 | 121789.41 | 1.950483644 | 68775.626 | 130610.463 | 1.899080686
> microMaskCastInteger128ToLong256 | 68458.06 | 130204.293 | 1.901957096 | 72769.986 | 132547.949 | 1.821464539
> microMaskCastInteger128ToShort64 | 67889.419 | 126591.52 | 1.864672314 | 72177.696 | 128137.316 | 1.775303495
> microMaskCastInteger256ToByte64 | 60895.223 | 130321.893 | 2.140100431 | 67920.658 | 130120.344 | 1.915769779
> microMaskCastInteger256ToLong512 | 65975.311 | 129705.935 | 1.965976864 | 69022.136 | 58334.489 | 0.845156241
> microMaskCastInteger256ToShort128 | 67545.659 | 125688.394 | 1.860791587 | 63210.734 | 130065.762 | 2.057653088
> microMaskCastInteger512ToByte128 | 51766.31 | 115913.374 | 2.239166245 | 56546.461 | 49007.128 | 0.866670118
> microMaskCastInteger512ToShort256 | 52156.663 | 109821.213 | 2.105602749 | 53074.581 | 48768.727 | 0.918871635
> microMaskCastInteger64ToLong128 | 73578.517 | 63373.966 | 0.861310727 | 74943.574 | 65043.773 | 0.867903271
> microMaskCastLong128ToInteger64 | 74027.908 | 63708.687 | 0.860603639 | 75332.964 | 65311.575 | 0.86697206
> microMaskCastLong256ToInteger128 | 71876.726 | 123125.286 | 1.713006321 | 73417.365 | 132380.982 | 1.803129028
> microMaskCastLong256ToShort64 | 72947.678 | 127544.459 | 1.748437545 | 73740.351 | 131155.599 | 1.778613706
> microMaskCastLong512ToByte64 | 66746.009 | 126422.173 | 1.894078386 | 68695.16 | 58410.73 | 0.85028887
> microMaskCastLong512ToInteger256 | 66989.512 | 120517.044 | 1.799043468 | 64162.04 | 58806.579 | 0.916532252
> microMaskCastLong512ToShort128 | 66560.838 | 126906.819 | 1.906628925 | 68702.838 | 58956.922 | 0.85814391
> microMaskCastShort128ToByte64 | 62698.789 | 126292.593 | 2.014274837 | 67675.889 | 128556.324 | 1.899588257
> microMaskCastShort128ToInteger256 | 62545.978 | 130594.425 | 2.087974786 | 67611.643 | 126309.927 | 1.868168283
> microMaskCastShort128ToLong512 | 65828.219 | 125557.859 | 1.90735616 | 69019.63 | 57951.985 | 0.839644968
> microMaskCastShort256ToByte128 | 51423.139 | 116624.494 | 2.267938058 | 56031.712 | 116504.228 | 2.07925519
> microMaskCastShort256ToInteger512 | 51563.845 | 110798.412 | 2.148761637 | 56541.831 | 49175.688 | 0.869722242
> microMaskCastShort512ToByte256 | 47761.772 | 91753.708 | 1.921070014 | 45410.684 | 42683.147 | 0.939936227
> microMaskCastShort64ToInteger128 | 69075.232 | 129302.738 | 1.871911744 | 72453.087 | 126654.897 | 1.748095247
> microMaskCastShort64ToLong256 | 68596.655 | 130142.777 | 1.897217539 | 72278.575 | 127633.658 | 1.76585742
>
> PS: Gains of around 2x are seen in all cases on the fast path (C2 inline expansion), with a slight degradation on AVX2 on the slow path (interpreted) in cases where the target does not support 512-bit vectors, due to additional call overhead.
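
For readers unfamiliar with the compatibility rule in the description above (source and destination mask species must have the same vector length), here is a minimal sketch of the lane-count arithmetic behind the benchmark names. The class MaskCastLanes and its methods are hypothetical helpers written for illustration; they are not part of the patch or of the Vector API, and they assume that "vector length" means vector size in bits divided by element size in bits.

```java
// Hypothetical helper for illustration only; not part of the patch or the Vector API.
// Assumption: "vector length" = vector size in bits / element size in bits (lane count).
final class MaskCastLanes {

    // Number of lanes for a species with the given element and vector sizes (in bits).
    static int laneCount(int elementBits, int vectorBits) {
        if (elementBits <= 0 || vectorBits <= 0 || vectorBits % elementBits != 0) {
            throw new IllegalArgumentException("invalid shape: " + elementBits + "/" + vectorBits);
        }
        return vectorBits / elementBits;
    }

    // Two mask species are treated as cast-compatible when their lane counts match.
    static boolean compatible(int srcElemBits, int srcVecBits, int dstElemBits, int dstVecBits) {
        return laneCount(srcElemBits, srcVecBits) == laneCount(dstElemBits, dstVecBits);
    }

    public static void main(String[] args) {
        // Byte128 -> Integer512, as in microMaskCastByte128ToInteger512: 16 lanes on both sides.
        System.out.println(compatible(Byte.SIZE, 128, Integer.SIZE, 512));
        // Byte64 -> Long512, as in microMaskCastByte64ToLong512: 8 lanes on both sides.
        System.out.println(compatible(Byte.SIZE, 64, Long.SIZE, 512));
        // Byte128 -> Integer256 is not among the benchmarks: 16 lanes versus 8 lanes.
        System.out.println(compatible(Byte.SIZE, 128, Integer.SIZE, 256));
    }
}
```

Every source/destination pair in the table satisfies this check, which is consistent with the patch covering only casts between species of equal lane count.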
Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
8272971: Optimizing IR for mask-casting over non-predicated targets.
-------------
Changes:
- all: https://git.openjdk.java.net/panama-vector/pull/113/files
- new: https://git.openjdk.java.net/panama-vector/pull/113/files/5f041f21..3b18e774
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=panama-vector&pr=113&range=01
- incr: https://webrevs.openjdk.java.net/?repo=panama-vector&pr=113&range=00-01
Stats: 21 lines in 1 file changed: 18 ins; 0 del; 3 mod
Patch: https://git.openjdk.java.net/panama-vector/pull/113.diff
Fetch: git fetch https://git.openjdk.java.net/panama-vector pull/113/head:pull/113
PR: https://git.openjdk.java.net/panama-vector/pull/113