RFR: 8256820: AArch64: Optimize vector rotate (immediate) with shift and insert instructions
Andrew Haley
aph at openjdk.java.net
Mon Dec 14 10:02:55 UTC 2020
On Mon, 14 Dec 2020 05:57:36 GMT, Dong Bo <dongbo at openjdk.org> wrote:
> This patch optimizes vector rotate (immediate) on aarch64 using the shift-and-insert instructions SLI and SRI.
>
> Patch passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build.
> The tests under `test/hotspot/jtreg/compiler/c2/cr6340864/` were also run specifically to check correctness, and they passed.
>
> The JMH microbenchmark `test/micro/org/openjdk/bench/java/lang/RotateBenchmark.java` is used for performance testing.
> We observed a ~15.4% performance improvement on Kunpeng920 (CPU tsv110), but a ~15.8% regression on Kunpeng916 (CPU A72).
> So a switch, `UseSIMDShiftInsertForRotation`, is introduced on aarch64. It defaults to `false` and is set to `true` on Kunpeng920.
>
> The `RotateBenchmark.java` JMH micro-benchmark results on Kunpeng920:
> Benchmark                            (SHIFT)  (TESTSIZE)   Mode  Cnt     Score     Error  Units
>
> # kunpeng 920, -XX:-UseSIMDShiftInsertForRotation
> RotateBenchmark.testRotateLeftI           20        1024  thrpt   10  3524.840  ±  2.365  ops/ms
> RotateBenchmark.testRotateLeftIImm        20        1024  thrpt   10  3961.288  ±  0.897  ops/ms
> RotateBenchmark.testRotateLeftL           20        1024  thrpt   10  1704.321  ± 11.309  ops/ms
> RotateBenchmark.testRotateLeftLImm        20        1024  thrpt   10  2137.924  ±  2.215  ops/ms
> RotateBenchmark.testRotateRightI          20        1024  thrpt   10  3536.960  ±  7.945  ops/ms
> RotateBenchmark.testRotateRightIImm       20        1024  thrpt   10  3961.552  ±  0.673  ops/ms
> RotateBenchmark.testRotateRightL          20        1024  thrpt   10  1729.868  ±  0.468  ops/ms
> RotateBenchmark.testRotateRightLImm       20        1024  thrpt   10  2132.458  ±  3.385  ops/ms
>
> # kunpeng 920, default, -XX:+UseSIMDShiftInsertForRotation
> RotateBenchmark.testRotateLeftI           20        1024  thrpt   10  3504.602  ± 21.609  ops/ms
> RotateBenchmark.testRotateLeftIImm        20        1024  thrpt   10  4569.820  ±  7.455  ops/ms
> RotateBenchmark.testRotateLeftL           20        1024  thrpt   10  1730.735  ±  0.701  ops/ms
> RotateBenchmark.testRotateLeftLImm        20        1024  thrpt   10  2469.796  ±  0.981  ops/ms
> RotateBenchmark.testRotateRightI          20        1024  thrpt   10  3540.899  ±  7.679  ops/ms
> RotateBenchmark.testRotateRightIImm       20        1024  thrpt   10  4571.876  ±  0.879  ops/ms
> RotateBenchmark.testRotateRightL          20        1024  thrpt   10  1731.499  ±  0.877  ops/ms
> RotateBenchmark.testRotateRightLImm       20        1024  thrpt   10  2469.454  ±  0.705  ops/ms
>
> This patch also moves all logical and shift NEON instructions from `aarch64.ad` to `aarch64_neon.ad`,
> and makes two minor improvements: it supports vector length 4 for `vsraa8B_imm` and `vsrla8B_imm`, and vector length 2 for `vsraa4S_imm` and `vsrla4S_imm`.
This patch is very hard to review because much of it is just moving things around. Please do this as two PRs, one which does all the moves and one with the substantive changes. Thanks.
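
For readers unfamiliar with SLI and SRI, here is a minimal scalar sketch in Java of how a rotate by an immediate can be built from one plain shift plus one shift-and-insert. It models a single 32-bit lane only. The class and method names are hypothetical, and the pairing shown (USHR then SLI for rotate left, SHL then SRI for rotate right) is an assumption for illustration; the patch itself may pair the instructions differently.

```java
// Scalar model of one 32-bit vector lane; all names here are illustrative,
// not taken from the patch.
final class RotateViaShiftInsert {

    // SLI model: shift src left by `shift` and insert into dst,
    // keeping the low `shift` bits of dst unchanged.
    static int sli(int dst, int src, int shift) {
        int keep = (1 << shift) - 1;          // low `shift` bits of dst survive
        return (dst & keep) | (src << shift);
    }

    // SRI model: shift src right (unsigned) by `shift` and insert into dst,
    // keeping the high `shift` bits of dst unchanged.
    static int sri(int dst, int src, int shift) {
        int keep = ~(-1 >>> shift);           // high `shift` bits of dst survive
        return (dst & keep) | (src >>> shift);
    }

    // Rotate left by s (1..31): unsigned shift right, then shift-left-insert.
    static int rotlShiftInsert(int x, int s) {
        int tmp = x >>> (32 - s);             // wrapped-around bits land in the low s bits
        return sli(tmp, x, s);                // remaining bits inserted above them
    }

    // Rotate right by s (1..31): shift left, then shift-right-insert.
    static int rotrShiftInsert(int x, int s) {
        int tmp = x << (32 - s);              // wrapped-around bits land in the high s bits
        return sri(tmp, x, s);                // remaining bits inserted below them
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 0x80000001, 0x12345678, 0xDEADBEEF};
        for (int x : samples) {
            for (int s = 1; s < 32; s++) {
                if (rotlShiftInsert(x, s) != Integer.rotateLeft(x, s)
                        || rotrShiftInsert(x, s) != Integer.rotateRight(x, s)) {
                    throw new AssertionError("mismatch for x=" + x + ", s=" + s);
                }
            }
        }
        System.out.println("all rotations match");
    }
}
```

The insert step merges the two halves of the rotation directly into the destination, so no separate OR of two shifted temporaries is needed.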
-------------
PR: https://git.openjdk.java.net/jdk/pull/1761