RFR: 8303762: [vectorapi] Intrinsification of Vector.slice
Quan Anh Mai
qamai at openjdk.org
Tue Mar 7 18:34:01 UTC 2023
----------------------------------------------------------------------
On Tue, 7 Mar 2023 18:23:42 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:
> `Vector::slice` is a method of the top-level `Vector` class in the Vector API. It concatenates the 2 inputs into an intermediate composite and extracts a window the same size as the inputs as the result. It is used in vector conversion methods where the part number is not 0, to slice the parts into the correct positions. Slicing is also used in text processing such as UTF-8 and UTF-16 validation. x86, starting from SSSE3, has `palignr`, which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation and to intrinsify the `Vector::slice` method.
>
> A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)`, which requires preparing the index vector and the blending mask. Even with those preparations hoisted out of the loops, microbenchmarks show an improvement when using the slice intrinsic. Some cases show a tremendous increase in throughput because a mask of length 2 cannot currently be intrinsified, so the existing code falls back to the Java implementation.
>
> Please take a look and leave a review. Thank you very much.
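
For readers unfamiliar with the operation, the description above can be sketched as a scalar model. This is only an illustration, not code from the PR: plain `int[]` arrays stand in for vectors, the class and method names (`SliceModel`, `sliceByWindow`, `sliceByRearrangeBlend`) are hypothetical, and the exact form of the index vector and blend mask (a wrapped index `(origin + i) % n` and a mask `i < n - origin`) is an assumption about how the rearrange-and-blend formulation works.

```java
import java.util.Arrays;

// Scalar model of slice semantics; plain arrays stand in for vectors.
// Both inputs are assumed to have the same length, and origin is in [0, length].
final class SliceModel {
    // Concatenate v1 and v2, then take a window of v1.length lanes starting at origin.
    static int[] sliceByWindow(int[] v1, int[] v2, int origin) {
        int n = v1.length;
        int[] composite = new int[2 * n];
        System.arraycopy(v1, 0, composite, 0, n);
        System.arraycopy(v2, 0, composite, n, n);
        return Arrays.copyOfRange(composite, origin, origin + n);
    }

    // Same result via a wrapped per-lane index (the "rearrange" step)
    // followed by a per-lane select between the two inputs (the "blend" step).
    static int[] sliceByRearrangeBlend(int[] v1, int[] v2, int origin) {
        int n = v1.length;
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            int iota = (origin + i) % n;      // assumed wrapped index
            boolean fromV1 = i < n - origin;  // assumed blend mask
            result[i] = fromV1 ? v1[iota] : v2[iota];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v1 = {0, 1, 2, 3};
        int[] v2 = {4, 5, 6, 7};
        // Print both formulations for every valid origin, plus whether they agree.
        for (int origin = 0; origin <= v1.length; origin++) {
            int[] a = sliceByWindow(v1, v2, origin);
            int[] b = sliceByRearrangeBlend(v1, v2, origin);
            System.out.println(origin + ": " + Arrays.toString(a) + " " + Arrays.equals(a, b));
        }
    }
}
```

The second method needs an index array and a mask for every slice, which is the preparation cost mentioned above; the windowed form is the one that maps naturally onto a single instruction such as `palignr`.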
Benchmark results:
                                                         Before               After
Benchmark                            (size)  Mode   Cnt     Score      Error     Score      Error  Units     Change
Byte128Vector.sliceBinaryConstant      1024  thrpt    5  5058.760 ± 2214.115  8315.263 ±  102.169  ops/ms   +64.37%
Byte256Vector.sliceBinaryConstant      1024  thrpt    5  6986.299 ± 1028.257  8440.387 ±   30.163  ops/ms   +20.81%
Byte64Vector.sliceBinaryConstant       1024  thrpt    5  2944.869 ±  849.548  5926.054 ±  493.146  ops/ms  +101.23%
ByteMaxVector.sliceBinaryConstant      1024  thrpt    5  7269.226 ±  366.246  8201.184 ±  309.539  ops/ms   +12.82%
Double128Vector.sliceBinaryConstant    1024  thrpt    5    10.204 ±    0.508   979.287 ±   19.991  ops/ms    x95.97
Double256Vector.sliceBinaryConstant    1024  thrpt    5   868.085 ±   26.378   967.799 ±   10.224  ops/ms   +11.49%
DoubleMaxVector.sliceBinaryConstant    1024  thrpt    5   813.646 ±   74.468   978.150 ±   14.316  ops/ms   +20.22%
Float128Vector.sliceBinaryConstant     1024  thrpt    5  1297.281 ±   23.650  1850.995 ±   29.741  ops/ms   +42.68%
Float256Vector.sliceBinaryConstant     1024  thrpt    5  1796.121 ±   26.662  2011.362 ±   38.418  ops/ms   +11.98%
Float64Vector.sliceBinaryConstant      1024  thrpt    5    10.381 ±    0.194  1628.510 ±    8.752  ops/ms   x156.87
FloatMaxVector.sliceBinaryConstant     1024  thrpt    5  1820.161 ±   26.802  1988.085 ±   41.835  ops/ms    +9.23%
Int128Vector.sliceBinaryConstant       1024  thrpt    5  1394.911 ±   40.815  1864.818 ±   33.792  ops/ms   +33.69%
Int256Vector.sliceBinaryConstant       1024  thrpt    5  1874.496 ±   60.541  1864.818 ±   33.792  ops/ms    -0.52%
Int64Vector.sliceBinaryConstant        1024  thrpt    5    10.942 ±    0.377  1621.849 ±   56.538  ops/ms   x148.22
IntMaxVector.sliceBinaryConstant       1024  thrpt    5  1870.746 ±   40.665  2027.041 ±   25.880  ops/ms    +8.35%
Long128Vector.sliceBinaryConstant      1024  thrpt    5    10.595 ±    0.306   991.969 ±   15.033  ops/ms    x93.63
Long256Vector.sliceBinaryConstant      1024  thrpt    5   815.689 ±   12.243   989.365 ±   25.969  ops/ms   +21.29%
LongMaxVector.sliceBinaryConstant      1024  thrpt    5   822.060 ±   12.337   977.061 ±   31.968  ops/ms   +18.86%
Short128Vector.sliceBinaryConstant     1024  thrpt    5  3062.676 ±  124.796  3890.796 ±  326.767  ops/ms   +27.04%
Short256Vector.sliceBinaryConstant     1024  thrpt    5  3747.778 ±  119.356  4125.463 ±   33.602  ops/ms   +10.08%
Short64Vector.sliceBinaryConstant      1024  thrpt    5  1879.203 ±   69.160  2899.515 ±   57.870  ops/ms   +54.29%
ShortMaxVector.sliceBinaryConstant     1024  thrpt    5  3717.217 ±   48.876  4035.455 ±  102.725  ops/ms    +8.56%
-------------
PR: https://git.openjdk.org/jdk/pull/12909