RFR: 8303762: [vectorapi] Intrinsification of Vector.slice

Tue Mar 7 18:34:01 UTC 2023

----------------------------------------------------------------------
On Tue, 7 Mar 2023 18:23:42 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

> `Vector::slice` is a method of the top-level `Vector` class in the Vector API. It concatenates its 2 inputs into an intermediate composite and extracts a window of the same size as each input as the result. It is used in vector conversion methods, where a nonzero part number requires slicing the parts into the correct positions. Slicing is also used in text processing, such as UTF-8 and UTF-16 validation. x86 has had `palignr` since SSSE3, which performs vector slicing very efficiently. I therefore think it is beneficial to add a C2 node for this operation and to intrinsify the `Vector::slice` method.
> 
> A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)`, which requires preparing the index vector and the blending mask. Even with that preparation hoisted out of the loops, microbenchmarks show an improvement with the slice intrinsic. Some cases show tremendous increases in throughput because a mask of length 2 cannot currently be intrinsified, so those cases previously fell back to the Java implementation.
> 
> Please take a look and leave some reviews. Thank you very much.
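
For readers unfamiliar with the operation, here is a minimal scalar sketch of the semantics described above, using plain `int[]` arrays in place of vectors. It is not the code from the PR. The class `SliceModel` and its helper methods are hypothetical names made up for illustration, and the index vector and mask are built the way the quoted expression suggests (a wrapping iota starting at `origin`, and a mask selecting the first `length - origin` lanes), which is an assumption about details the description does not spell out.

```java
import java.util.Arrays;

// Scalar model of a two-input vector slice (illustrative only, not JDK code).
class SliceModel {
    // Direct definition: take a window of n lanes, starting at lane `origin`,
    // from the concatenation v1 ++ v2.
    static int[] slice(int[] v1, int[] v2, int origin) {
        int n = v1.length;
        if (v2.length != n || origin < 0 || origin > n) {
            throw new IllegalArgumentException("bad length or origin");
        }
        int[] r = new int[n];
        for (int i = 0; i < n; i++) {
            int j = i + origin;
            r[i] = j < n ? v1[j] : v2[j - n];
        }
        return r;
    }

    // r[i] = v[idx[i]], a scalar stand-in for a lane rearrange.
    static int[] rearrange(int[] v, int[] idx) {
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = v[idx[i]];
        }
        return r;
    }

    // Lane i comes from b where mask[i] is set, otherwise from a.
    static int[] blend(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }

    // Models v2.rearrange(iota).blend(v1.rearrange(iota), blendMask).
    // Assumption: iota wraps around starting at origin, and the mask
    // selects the lanes that still fall inside v1.
    static int[] sliceViaRearrangeBlend(int[] v1, int[] v2, int origin) {
        int n = v1.length;
        int[] iota = new int[n];
        boolean[] blendMask = new boolean[n];
        for (int i = 0; i < n; i++) {
            iota[i] = (i + origin) % n;
            blendMask[i] = i < n - origin;
        }
        return blend(rearrange(v2, iota), rearrange(v1, iota), blendMask);
    }

    public static void main(String[] args) {
        int[] v1 = {0, 1, 2, 3};
        int[] v2 = {4, 5, 6, 7};
        System.out.println(Arrays.toString(slice(v1, v2, 1)));                  // [1, 2, 3, 4]
        System.out.println(Arrays.toString(sliceViaRearrangeBlend(v1, v2, 1))); // [1, 2, 3, 4]
    }
}
```

The second formulation needs two rearranges, a blend, and the setup of `iota` and `blendMask`, whereas the direct definition is a single windowed read, which is roughly what an instruction such as `palignr` provides.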

Benchmark results:

                                                                   Before                After
    Benchmark                            (size)   Mode  Cnt     Score      Error     Score     Error   Units    Change
    Byte128Vector.sliceBinaryConstant      1024  thrpt    5  5058.760 ± 2214.115  8315.263 ± 102.169  ops/ms   +64.37%
    Byte256Vector.sliceBinaryConstant      1024  thrpt    5  6986.299 ± 1028.257  8440.387 ±  30.163  ops/ms   +20.81%
    Byte64Vector.sliceBinaryConstant       1024  thrpt    5  2944.869 ±  849.548  5926.054 ± 493.146  ops/ms  +101.23%
    ByteMaxVector.sliceBinaryConstant      1024  thrpt    5  7269.226 ±  366.246  8201.184 ± 309.539  ops/ms   +12.82%
    Double128Vector.sliceBinaryConstant    1024  thrpt    5    10.204 ±    0.508   979.287 ±  19.991  ops/ms    x95.97
    Double256Vector.sliceBinaryConstant    1024  thrpt    5   868.085 ±   26.378   967.799 ±  10.224  ops/ms   +11.49%
    DoubleMaxVector.sliceBinaryConstant    1024  thrpt    5   813.646 ±   74.468   978.150 ±  14.316  ops/ms   +20.22%
    Float128Vector.sliceBinaryConstant     1024  thrpt    5  1297.281 ±   23.650  1850.995 ±  29.741  ops/ms   +42.68%
    Float256Vector.sliceBinaryConstant     1024  thrpt    5  1796.121 ±   26.662  2011.362 ±  38.418  ops/ms   +11.98%
    Float64Vector.sliceBinaryConstant      1024  thrpt    5    10.381 ±    0.194  1628.510 ±   8.752  ops/ms   x156.87
    FloatMaxVector.sliceBinaryConstant     1024  thrpt    5  1820.161 ±   26.802  1988.085 ±  41.835  ops/ms    +9.23%
    Int128Vector.sliceBinaryConstant       1024  thrpt    5  1394.911 ±   40.815  1864.818 ±  33.792  ops/ms   +33.69%
    Int256Vector.sliceBinaryConstant       1024  thrpt    5  1874.496 ±   60.541  1864.818 ±  33.792  ops/ms    -0.52%
    Int64Vector.sliceBinaryConstant        1024  thrpt    5    10.942 ±    0.377  1621.849 ±  56.538  ops/ms   x148.22
    IntMaxVector.sliceBinaryConstant       1024  thrpt    5  1870.746 ±   40.665  2027.041 ±  25.880  ops/ms    +8.35%
    Long128Vector.sliceBinaryConstant      1024  thrpt    5    10.595 ±    0.306   991.969 ±  15.033  ops/ms    x93.63
    Long256Vector.sliceBinaryConstant      1024  thrpt    5   815.689 ±   12.243   989.365 ±  25.969  ops/ms   +21.29%
    LongMaxVector.sliceBinaryConstant      1024  thrpt    5   822.060 ±   12.337   977.061 ±  31.968  ops/ms   +18.86%
    Short128Vector.sliceBinaryConstant     1024  thrpt    5  3062.676 ±  124.796  3890.796 ± 326.767  ops/ms   +27.04%
    Short256Vector.sliceBinaryConstant     1024  thrpt    5  3747.778 ±  119.356  4125.463 ±  33.602  ops/ms   +10.08%
    Short64Vector.sliceBinaryConstant      1024  thrpt    5  1879.203 ±   69.160  2899.515 ±  57.870  ops/ms   +54.29%
    ShortMaxVector.sliceBinaryConstant     1024  thrpt    5  3717.217 ±   48.876  4035.455 ± 102.725  ops/ms    +8.56%
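
The Change column mixes two notations: a signed percentage for moderate differences and an `x` factor for the very large ones. As a small sketch of how both can be derived from the Before and After scores (the class and method names are hypothetical, and where the table switches from one notation to the other is not stated in the original):

```java
import java.util.Locale;

// Derives the two notations used in the Change column (illustrative only).
class ChangeColumn {
    // Relative change in percent: (after / before - 1) * 100.
    static String percentChange(double before, double after) {
        return String.format(Locale.ROOT, "%+.2f%%", (after / before - 1.0) * 100.0);
    }

    // Speedup factor: after / before.
    static String speedupFactor(double before, double after) {
        return String.format(Locale.ROOT, "x%.2f", after / before);
    }

    public static void main(String[] args) {
        // Byte128Vector row
        System.out.println(percentChange(5058.760, 8315.263)); // +64.37%
        // Double128Vector row
        System.out.println(speedupFactor(10.204, 979.287));    // x95.97
    }
}
```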

-------------

PR: https://git.openjdk.org/jdk/pull/12909