RFR: 8256820: AArch64: Optimize vector rotate (immediate) with shift and insert instructions
Dong Bo
dongbo at openjdk.java.net
Mon Dec 14 06:02:02 UTC 2020
This patch optimizes vector rotate (immediate) on aarch64 using the shift-and-insert instructions SLI and SRI.
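To illustrate why shift-and-insert helps, here is a minimal scalar sketch of one 32-bit lane. It assumes the usual SLI/SRI semantics (shift the source, then insert it into the destination while preserving the destination bits the shift does not cover). The class and method names are hypothetical, and the exact instruction pairing chosen in the patch may differ from the one modeled here.

```java
public class ShiftInsertModel {
    // Model of SLI on one 32-bit lane: shift src left and insert it into dst,
    // keeping the low `shift` bits of dst unchanged. Valid for shift in 0..31.
    public static int sli(int dst, int src, int shift) {
        int keepMask = (1 << shift) - 1;
        return (dst & keepMask) | (src << shift);
    }

    // Model of SRI on one 32-bit lane: shift src right (logically) and insert it
    // into dst, keeping the high `shift` bits of dst unchanged. Valid for shift in 1..31 here.
    public static int sri(int dst, int src, int shift) {
        int keepMask = ~(-1 >>> shift);
        return (dst & keepMask) | (src >>> shift);
    }

    // Rotate left by n (1..31) as a plain shift followed by one shift-and-insert,
    // instead of two shifts plus an OR.
    public static int rotateLeftViaSli(int x, int n) {
        int dst = x >>> (32 - n); // plain logical shift right
        return sli(dst, x, n);    // insert x << n above the low n bits
    }

    // Rotate right by n (1..31), mirrored with SRI.
    public static int rotateRightViaSri(int x, int n) {
        int dst = x << (32 - n);  // plain shift left
        return sri(dst, x, n);    // insert x >>> n below the high n bits
    }

    public static void main(String[] args) {
        int x = 0x12345678;
        for (int n = 1; n < 32; n++) {
            if (rotateLeftViaSli(x, n) != Integer.rotateLeft(x, n)
                    || rotateRightViaSri(x, n) != Integer.rotateRight(x, n)) {
                throw new AssertionError("mismatch at n=" + n);
            }
        }
        System.out.println("shift-and-insert model matches Integer.rotateLeft/rotateRight");
    }
}
```

The model only covers the immediate case, since SLI and SRI take the shift amount as an immediate operand.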
The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
Tests under `test/hotspot/jtreg/compiler/c2/cr6340864/` were also run specifically to check correctness, and passed.
The JMH micro `test/micro/org/openjdk/bench/java/lang/RotateBenchmark.java` was used for performance testing.
We observed a ~15.4% performance improvement on Kunpeng920 (CPU tsv110), but a ~15.8% regression on Kunpeng916 (CPU A72).
Therefore a switch `UseSIMDShiftInsertForRotation` is introduced on aarch64. It defaults to `false` and is set to `true` for Kunpeng920.
The `RotateBenchmark.java` JMH micro-benchmark results on Kunpeng920:
Benchmark                            (SHIFT)  (TESTSIZE)   Mode  Cnt     Score    Error   Units
# Kunpeng920, -XX:-UseSIMDShiftInsertForRotation
RotateBenchmark.testRotateLeftI           20        1024  thrpt   10  3524.840 ±  2.365  ops/ms
RotateBenchmark.testRotateLeftIImm        20        1024  thrpt   10  3961.288 ±  0.897  ops/ms
RotateBenchmark.testRotateLeftL           20        1024  thrpt   10  1704.321 ± 11.309  ops/ms
RotateBenchmark.testRotateLeftLImm        20        1024  thrpt   10  2137.924 ±  2.215  ops/ms
RotateBenchmark.testRotateRightI          20        1024  thrpt   10  3536.960 ±  7.945  ops/ms
RotateBenchmark.testRotateRightIImm       20        1024  thrpt   10  3961.552 ±  0.673  ops/ms
RotateBenchmark.testRotateRightL          20        1024  thrpt   10  1729.868 ±  0.468  ops/ms
RotateBenchmark.testRotateRightLImm       20        1024  thrpt   10  2132.458 ±  3.385  ops/ms
# Kunpeng920, default, -XX:+UseSIMDShiftInsertForRotation
RotateBenchmark.testRotateLeftI           20        1024  thrpt   10  3504.602 ± 21.609  ops/ms
RotateBenchmark.testRotateLeftIImm        20        1024  thrpt   10  4569.820 ±  7.455  ops/ms
RotateBenchmark.testRotateLeftL           20        1024  thrpt   10  1730.735 ±  0.701  ops/ms
RotateBenchmark.testRotateLeftLImm        20        1024  thrpt   10  2469.796 ±  0.981  ops/ms
RotateBenchmark.testRotateRightI          20        1024  thrpt   10  3540.899 ±  7.679  ops/ms
RotateBenchmark.testRotateRightIImm       20        1024  thrpt   10  4571.876 ±  0.879  ops/ms
RotateBenchmark.testRotateRightL          20        1024  thrpt   10  1731.499 ±  0.877  ops/ms
RotateBenchmark.testRotateRightLImm       20        1024  thrpt   10  2469.454 ±  0.705  ops/ms
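In the results above only the `*Imm` rows change (by roughly 15%), while the variable-shift rows stay within noise. For readers unfamiliar with the benchmark, the sketch below shows the kind of loop pair this distinction presumably refers to: one with a compile-time constant rotate distance and one with a runtime distance. It is an assumption about the shape of such kernels, not a copy of `RotateBenchmark.java`; the class and method names are hypothetical.

```java
public class RotateKernelSketch {
    // Constant rotate distance, mirroring the SHIFT=20 parameter in the table.
    static final int SHIFT = 20;

    // Immediate form: the rotate distance is a constant the JIT can see,
    // which is the case an immediate-only instruction can serve.
    public static void rotateLeftImm(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Integer.rotateLeft(src[i], SHIFT);
        }
    }

    // Variable form: the rotate distance is only known at run time.
    public static void rotateLeftVar(int[] src, int[] dst, int shift) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Integer.rotateLeft(src[i], shift);
        }
    }

    public static void main(String[] args) {
        int[] src = new int[1024]; // same size as TESTSIZE in the table
        for (int i = 0; i < src.length; i++) {
            src[i] = i * 0x9E3779B9; // arbitrary deterministic test data
        }
        int[] a = new int[src.length];
        int[] b = new int[src.length];
        rotateLeftImm(src, a);
        rotateLeftVar(src, b, SHIFT);
        System.out.println(java.util.Arrays.equals(a, b));
    }
}
```

Both loops compute the same values. Whether the switch is on or off is a HotSpot command-line choice (`-XX:+UseSIMDShiftInsertForRotation` or `-XX:-UseSIMDShiftInsertForRotation`, as in the table comments) and is not visible in the Java source.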
The patch also moves all logical and shift NEON instructions from `aarch64.ad` to `aarch64_neon.ad`,
and makes two minor improvements: support for vector length 4 in `vsraa8B_imm` and `vsrla8B_imm`, and for vector length 2 in `vsraa4S_imm` and `vsrla4S_imm`.
-------------
Commit messages:
- 8256820: AArch64: Optimize vector rotate (immediate) with shift and insert instructions
Changes: https://git.openjdk.java.net/jdk/pull/1761/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=1761&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8256820
Stats: 2899 lines in 9 files changed: 1561 ins; 1014 del; 324 mod
Patch: https://git.openjdk.java.net/jdk/pull/1761.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/1761/head:pull/1761
PR: https://git.openjdk.java.net/jdk/pull/1761