RFR: 8312425: [vectorapi] AArch64: Optimize vector math operations with SLEEF

Xiaohong Gong xgong at openjdk.org
Wed Oct 18 06:19:23 UTC 2023


On Wed, 18 Oct 2023 06:12:29 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

> Currently the vector floating-point math APIs such as `VectorOperators.SIN/COS/TAN...` are not intrinsified on AArch64, which causes a large performance gap on that platform. Note that those APIs are optimized by the C2 compiler on x86 platforms by calling Intel's SVML code [1]. To close the gap, we would like to optimize these APIs for AArch64 by calling a third-party vector library called libsleef [2], which is available in mainstream Linux distros (e.g. [3] [4]).
> 
> SLEEF supports multiple accuracies. To match the Vector API's requirements and implement the math ops on AArch64, we 1) call, by default, the libsleef stubs that provide 1.0 ULP accuracy and use FMA instructions for most of the operations, and 2) add a vector calling convention for the runtime calls to the stub code in libsleef. Note that for those APIs for which libsleef does not support 1.0 ULP, we choose 0.5 ULP instead.
> 
> To help load the expected libsleef library, this patch also adds an experimental JVM option (i.e. `-XX:UseSleefLib`) for AArch64 platforms. Users can use it to specify the libsleef path/name explicitly. By default, it points to the system-installed library. If the library does not exist or loading it dynamically at runtime fails, the vector math ops fall back to the default scalar version without error. However, a warning is printed if a user explicitly specifies a nonexistent library.
> 
> Note that this is part of the patch originally proposed on panama-dev [5], with some initial review comments addressed. We would now like to get wider feedback from more HotSpot experts.
> 
> [1] https://github.com/openjdk/jdk/pull/3638
> [2] https://sleef.org/
> [3] https://packages.fedoraproject.org/pkgs/sleef/sleef/
> [4] https://packages.debian.org/bookworm/libsleef3
> [5] https://mail.openjdk.org/pipermail/panama-dev/2022-December/018172.html
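
For readers unfamiliar with the accuracy terms in the description above: a ULP error expresses how far a computed result is from the exact one, measured in units of the spacing between adjacent floating-point values near that result. The following is a minimal sketch of one common way to estimate it for a `float` result, using a `double` computation as the reference. It is not taken from the patch; the class and method names are hypothetical, and it ignores edge cases such as zero, infinite, or NaN references.

```java
// Hypothetical helper, for illustration only (not part of the patch or of libsleef).
final class UlpErrorSketch {

    // Distance between a float result and a higher-precision reference,
    // expressed in float ULPs taken at the reference value.
    static double ulpError(float approx, double exact) {
        double ulp = Math.ulp((float) exact);   // spacing of floats near the reference
        return Math.abs((double) approx - exact) / ulp;
    }

    public static void main(String[] args) {
        float x = 0.5f;
        float approx = (float) Math.sin(x);     // float result under test
        double exact = Math.sin((double) x);    // double result used as the reference
        System.out.println("ULP error: " + ulpError(approx, exact));
    }
}
```

Under this definition, a routine described as having 1.0 ULP accuracy is expected to keep this value at or below 1.0 over its supported input range.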

Here are the performance improvements for the JMH benchmarks [1] [2] after enabling libsleef for AArch64 NEON and SVE:

NEON:

Benchmark               (size)  Mode  Cnt   Gain
DoubleMaxVector.ACOS     1024  thrpt   5   1.775
DoubleMaxVector.ASIN     1024  thrpt   5   2.134
DoubleMaxVector.ATAN     1024  thrpt   5   2.376
DoubleMaxVector.ATAN2    1024  thrpt   5   2.799
DoubleMaxVector.CBRT     1024  thrpt   5   1.588
DoubleMaxVector.COS      1024  thrpt   5   1.751
DoubleMaxVector.COSH     1024  thrpt   5   1.756
DoubleMaxVector.EXP      1024  thrpt   5   8.257
DoubleMaxVector.EXPM1    1024  thrpt   5   2.028
DoubleMaxVector.HYPOT    1024  thrpt   5   2.132
DoubleMaxVector.LOG      1024  thrpt   5   4.017
DoubleMaxVector.LOG10    1024  thrpt   5   5.693
DoubleMaxVector.LOG1P    1024  thrpt   5   2.788
DoubleMaxVector.POW      1024  thrpt   5   3.494
DoubleMaxVector.SIN      1024  thrpt   5   2.010
DoubleMaxVector.SINH     1024  thrpt   5   1.697
DoubleMaxVector.TAN      1024  thrpt   5   3.448
DoubleMaxVector.TANH     1024  thrpt   5   0.984
FloatMaxVector.ACOS      1024  thrpt   5   2.310
FloatMaxVector.ASIN      1024  thrpt   5   2.887
FloatMaxVector.ATAN      1024  thrpt   5   3.076
FloatMaxVector.ATAN2     1024  thrpt   5   4.162
FloatMaxVector.CBRT      1024  thrpt   5   2.941
FloatMaxVector.COS       1024  thrpt   5   1.832
FloatMaxVector.COSH      1024  thrpt   5   2.681
FloatMaxVector.EXP       1024  thrpt   5  15.758
FloatMaxVector.EXPM1     1024  thrpt   5   3.061
FloatMaxVector.HYPOT     1024  thrpt   5   3.428
FloatMaxVector.LOG       1024  thrpt   5  12.364
FloatMaxVector.LOG10     1024  thrpt   5  11.267
FloatMaxVector.LOG1P     1024  thrpt   5   5.819
FloatMaxVector.POW       1024  thrpt   5   6.710
FloatMaxVector.SIN       1024  thrpt   5   1.906
FloatMaxVector.SINH      1024  thrpt   5   2.505
FloatMaxVector.TAN       1024  thrpt   5   4.975
FloatMaxVector.TANH      1024  thrpt   5   1.157
Float64Vector.ACOS       1024  thrpt   5   1.855
Float64Vector.ASIN       1024  thrpt   5   2.294
Float64Vector.ATAN       1024  thrpt   5   2.082
Float64Vector.ATAN2      1024  thrpt   5   2.849
Float64Vector.CBRT       1024  thrpt   5   1.781
Float64Vector.COS        1024  thrpt   5   1.224
Float64Vector.COSH       1024  thrpt   5   1.793
Float64Vector.EXP        1024  thrpt   5   9.000
Float64Vector.EXPM1      1024  thrpt   5   2.096
Float64Vector.HYPOT      1024  thrpt   5   2.589
Float64Vector.LOG        1024  thrpt   5   5.582
Float64Vector.LOG10      1024  thrpt   5   5.495
Float64Vector.LOG1P      1024  thrpt   5   3.594
Float64Vector.POW        1024  thrpt   5   3.254
Float64Vector.SIN        1024  thrpt   5   1.254
Float64Vector.SINH       1024  thrpt   5   1.719
Float64Vector.TAN        1024  thrpt   5   2.670
Float64Vector.TANH       1024  thrpt   5   1.020


SVE 512-bit vector size:

Benchmark               (size)  Mode  Cnt   Gain
DoubleMaxVector.ACOS     1024  thrpt   5   1.731
DoubleMaxVector.ASIN     1024  thrpt   5   2.046
DoubleMaxVector.ATAN     1024  thrpt   5   4.932
DoubleMaxVector.ATAN2    1024  thrpt   5   6.032
DoubleMaxVector.CBRT     1024  thrpt   5   6.883
DoubleMaxVector.COS      1024  thrpt   5   5.512
DoubleMaxVector.COSH     1024  thrpt   5   2.796
DoubleMaxVector.EXP      1024  thrpt   5  42.490
DoubleMaxVector.EXPM1    1024  thrpt   5   6.188
DoubleMaxVector.HYPOT    1024  thrpt   5   2.195
DoubleMaxVector.LOG      1024  thrpt   5  19.532
DoubleMaxVector.LOG10    1024  thrpt   5  19.229
DoubleMaxVector.LOG1P    1024  thrpt   5  10.477
DoubleMaxVector.POW      1024  thrpt   5  11.887
DoubleMaxVector.SIN      1024  thrpt   5   6.073
DoubleMaxVector.SINH     1024  thrpt   5   2.994
DoubleMaxVector.TAN      1024  thrpt   5  15.417
FloatMaxVector.ACOS      1024  thrpt   5   3.867
FloatMaxVector.ASIN      1024  thrpt   5   4.291
FloatMaxVector.ATAN      1024  thrpt   5  11.786
FloatMaxVector.ATAN2     1024  thrpt   5  14.734
FloatMaxVector.CBRT      1024  thrpt   5  11.622
FloatMaxVector.COS       1024  thrpt   5   6.477
FloatMaxVector.COSH      1024  thrpt   5   3.571
FloatMaxVector.EXP       1024  thrpt   5  53.020
FloatMaxVector.EXPM1     1024  thrpt   5   6.348
FloatMaxVector.HYPOT     1024  thrpt   5   4.722
FloatMaxVector.LOG       1024  thrpt   5  41.263
FloatMaxVector.LOG10     1024  thrpt   5  47.685
FloatMaxVector.LOG1P     1024  thrpt   5  22.481
FloatMaxVector.POW       1024  thrpt   5  24.896
FloatMaxVector.SIN       1024  thrpt   5   6.768
FloatMaxVector.SINH      1024  thrpt   5   3.429

[1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L1068
[2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/DoubleMaxVector.java#L1068
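
For context, the kind of code these math ops appear in looks roughly like the sketch below: a loop that loads a vector from an array, applies a lanewise math operator, and stores the result. This is an illustrative kernel, not a copy of the linked benchmark sources. The class and method names are hypothetical, and the use of `DoubleVector.SPECIES_MAX` is an assumption based on the `DoubleMaxVector` benchmark name. It needs `--add-modules jdk.incubator.vector` to compile and run.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical kernel, for illustration only.
final class VectorSinSketch {

    // Assumed species: the widest double vector the platform supports.
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_MAX;

    // Computes r[i] = sin(a[i]) for every element of a.
    static void sin(double[] a, double[] r) {
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        // Vectorized main loop: one lanewise SIN per full vector.
        for (; i < bound; i += SPECIES.length()) {
            DoubleVector av = DoubleVector.fromArray(SPECIES, a, i);
            av.lanewise(VectorOperators.SIN).intoArray(r, i);
        }
        // Scalar tail for any leftover elements.
        for (; i < a.length; i++) {
            r[i] = Math.sin(a[i]);
        }
    }

    public static void main(String[] args) {
        double[] a = new double[1024];
        for (int i = 0; i < a.length; i++) {
            a[i] = i * 0.01;
        }
        double[] r = new double[a.length];
        sin(a, r);
        System.out.println("sin(" + a[100] + ") = " + r[100]);
    }
}
```

The `lanewise(VectorOperators.SIN)` call is the operation that, per the description above, this patch routes to a libsleef stub on AArch64 when the library can be loaded.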

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16234#issuecomment-1767727028


More information about the hotspot-dev mailing list