RFR: 8256820: AArch64: Optimize vector rotate (immediate) with shift and insert instructions

Mon Dec 14 10:02:55 UTC 2020

On Mon, 14 Dec 2020 05:57:36 GMT, Dong Bo <dongbo at openjdk.org> wrote:

> This patch optimizes vector rotate (immediate) on aarch64 by using the shift-and-insert instructions SLI and SRI.
> 
> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
> The tests under `test/hotspot/jtreg/compiler/c2/cr6340864/` were also run specifically to check correctness, and they passed.
> 
> The JMH micro-benchmark `test/micro/org/openjdk/bench/java/lang/RotateBenchmark.java` was used to measure performance.
> It shows a ~15.4% performance improvement on Kunpeng920 (CPU tsv110), but a ~15.8% regression on Kunpeng916 (CPU A72).
> So a switch, `UseSIMDShiftInsertForRotation`, is introduced on aarch64. It defaults to `false` and is set to `true` for Kunpeng920.
> 
> The `RotateBenchmark.java` JMH micro-benchmark results on Kunpeng920:
> Benchmark                            (SHIFT)  (TESTSIZE)   Mode  Cnt     Score    Error   Units
> 
> # kunpeng 920, -XX:-UseSIMDShiftInsertForRotation
> RotateBenchmark.testRotateLeftI           20        1024  thrpt   10  3524.840 ±  2.365  ops/ms
> RotateBenchmark.testRotateLeftIImm        20        1024  thrpt   10  3961.288 ±  0.897  ops/ms
> RotateBenchmark.testRotateLeftL           20        1024  thrpt   10  1704.321 ± 11.309  ops/ms
> RotateBenchmark.testRotateLeftLImm        20        1024  thrpt   10  2137.924 ±  2.215  ops/ms
> RotateBenchmark.testRotateRightI          20        1024  thrpt   10  3536.960 ±  7.945  ops/ms
> RotateBenchmark.testRotateRightIImm       20        1024  thrpt   10  3961.552 ±  0.673  ops/ms
> RotateBenchmark.testRotateRightL          20        1024  thrpt   10  1729.868 ±  0.468  ops/ms
> RotateBenchmark.testRotateRightLImm       20        1024  thrpt   10  2132.458 ±  3.385  ops/ms
> 
> # kunpeng 920, default, -XX:+UseSIMDShiftInsertForRotation
> RotateBenchmark.testRotateLeftI           20        1024  thrpt   10  3504.602 ± 21.609  ops/ms
> RotateBenchmark.testRotateLeftIImm        20        1024  thrpt   10  4569.820 ±  7.455  ops/ms
> RotateBenchmark.testRotateLeftL           20        1024  thrpt   10  1730.735 ±  0.701  ops/ms
> RotateBenchmark.testRotateLeftLImm        20        1024  thrpt   10  2469.796 ±  0.981  ops/ms
> RotateBenchmark.testRotateRightI          20        1024  thrpt   10  3540.899 ±  7.679  ops/ms
> RotateBenchmark.testRotateRightIImm       20        1024  thrpt   10  4571.876 ±  0.879  ops/ms
> RotateBenchmark.testRotateRightL          20        1024  thrpt   10  1731.499 ±  0.877  ops/ms
> RotateBenchmark.testRotateRightLImm       20        1024  thrpt   10  2469.454 ±  0.705  ops/ms
> 
> This patch also moves all logical and shift NEON instructions from `aarch64.ad` to `aarch64_neon.ad`,
> and makes two minor improvements: it adds support for vector length 4 to `vsraa8B_imm` and `vsrla8B_imm`, and for vector length 2 to `vsraa4S_imm` and `vsrla4S_imm`.
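
For readers unfamiliar with the idea behind the quoted description, here is a minimal scalar sketch of how a rotate by an immediate can be built from a shift followed by a shift-and-insert. It models a single 32-bit lane in plain Java. The class and method names are hypothetical, and the SHL + SRI pairing is only an assumption about one possible instruction sequence; it is not taken from the patch.

```java
// Scalar model of one 32-bit lane; all names here are illustrative only.
public class RotateViaShiftInsert {

    // Models SRI on one lane: shift src right by `shift` and insert the result
    // into dst, keeping the top `shift` bits of dst. Restricted to shift in 1..31.
    static int sri32(int dst, int src, int shift) {
        int keepMask = -1 << (32 - shift);   // top `shift` bits of dst are preserved
        return (dst & keepMask) | (src >>> shift);
    }

    // Rotate left by n (1..31) as a left shift followed by a shift-right-insert.
    static int rotateLeftShiftInsert(int x, int n) {
        int dst = x << n;                    // plain left shift (models SHL)
        return sri32(dst, x, 32 - n);        // fill the low n bits from x's high bits
    }

    // The conventional three-operation form: two shifts and an OR.
    static int rotateLeftShiftOr(int x, int n) {
        return (x << n) | (x >>> (32 - n));
    }

    public static void main(String[] args) {
        int x = 0x12345678;
        boolean allMatch = true;
        for (int n = 1; n < 32; n++) {
            int expected = Integer.rotateLeft(x, n);
            allMatch &= rotateLeftShiftInsert(x, n) == expected;
            allMatch &= rotateLeftShiftOr(x, n) == expected;
        }
        System.out.println("all rotations match Integer.rotateLeft: " + allMatch);
    }
}
```

The insert step takes the place of the second shift and the OR in the conventional form, so the sequence is one operation shorter per vector. Whether that is faster depends on the microarchitecture, which is consistent with the mixed results reported above.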

This patch is very hard to review because much of it is just moving things around. Please do this as two PRs, one which does all the moves and one with the substantive changes. Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/1761