RFR: 8256820: AArch64: Optimize vector rotate (immediate) with shift and insert instructions

Mon Dec 14 06:02:02 UTC 2020

This patch optimizes vector rotate (immediate) on AArch64 using the shift-and-insert instructions SLI and SRI.
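
To illustrate why shift and insert helps, here is a minimal scalar sketch in Java of a single 32-bit lane. The helper names `sli` and `sri` are hypothetical models of the instructions' per-lane behaviour, not HotSpot code, and the instruction pairing shown (USHR + SLI, SHL + SRI) is one possible sequence rather than necessarily the exact one the patch emits. The point is that a rotate by a constant needs two instructions instead of the usual shift, shift, OR.

```java
public class ShiftInsertRotate {
    // Model of SLI on one 32-bit lane: shift src left and insert it into dst,
    // preserving the low `shift` bits of dst. Valid here for shift in 1..31.
    static int sli(int dst, int src, int shift) {
        int keepMask = (1 << shift) - 1;
        return (dst & keepMask) | (src << shift);
    }

    // Model of SRI on one 32-bit lane: shift src right (logically) and insert
    // it into dst, preserving the high `shift` bits of dst. Valid for 1..31.
    static int sri(int dst, int src, int shift) {
        int keepMask = ~(-1 >>> shift);
        return (dst & keepMask) | (src >>> shift);
    }

    // Rotate left by a constant n (1..31): USHR then SLI.
    static int rotateLeft(int x, int n) {
        int tmp = x >>> (32 - n);   // high n bits of x moved to the bottom
        return sli(tmp, x, n);      // x << n inserted above them
    }

    // Rotate right by a constant n (1..31): SHL then SRI.
    static int rotateRight(int x, int n) {
        int tmp = x << (32 - n);    // low n bits of x moved to the top
        return sri(tmp, x, n);      // x >>> n inserted below them
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 0x12345678, 0x80000001, 0xDEADBEEF};
        for (int x : samples) {
            for (int n = 1; n < 32; n++) {
                if (rotateLeft(x, n) != Integer.rotateLeft(x, n)
                        || rotateRight(x, n) != Integer.rotateRight(x, n)) {
                    throw new AssertionError("mismatch for x=" + x + ", n=" + n);
                }
            }
        }
        System.out.println("shift-and-insert model matches Integer.rotateLeft/rotateRight");
    }
}
```

The same identity holds for 64-bit lanes with 64 in place of 32.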

The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
The tests under `test/hotspot/jtreg/compiler/c2/cr6340864/` were also run specifically to check correctness, and they passed.

The JMH micro-benchmark `test/micro/org/openjdk/bench/java/lang/RotateBenchmark.java` is used for performance testing.
We observed a ~15.4% performance improvement on Kunpeng920 (CPU tsv110), but a ~15.8% regression on Kunpeng916 (CPU A72).
A switch `UseSIMDShiftInsertForRotation` is therefore introduced on AArch64; it defaults to `false` and is set to `true` for Kunpeng920.

The `RotateBenchmark.java` JMH micro-benchmark results on Kunpeng920:
Benchmark                            (SHIFT)  (TESTSIZE)   Mode  Cnt     Score    Error   Units

# kunpeng 920, -XX:-UseSIMDShiftInsertForRotation
RotateBenchmark.testRotateLeftI           20        1024  thrpt   10  3524.840 ±  2.365  ops/ms
RotateBenchmark.testRotateLeftIImm        20        1024  thrpt   10  3961.288 ±  0.897  ops/ms
RotateBenchmark.testRotateLeftL           20        1024  thrpt   10  1704.321 ± 11.309  ops/ms
RotateBenchmark.testRotateLeftLImm        20        1024  thrpt   10  2137.924 ±  2.215  ops/ms
RotateBenchmark.testRotateRightI          20        1024  thrpt   10  3536.960 ±  7.945  ops/ms
RotateBenchmark.testRotateRightIImm       20        1024  thrpt   10  3961.552 ±  0.673  ops/ms
RotateBenchmark.testRotateRightL          20        1024  thrpt   10  1729.868 ±  0.468  ops/ms
RotateBenchmark.testRotateRightLImm       20        1024  thrpt   10  2132.458 ±  3.385  ops/ms

# kunpeng 920, default, -XX:+UseSIMDShiftInsertForRotation
RotateBenchmark.testRotateLeftI           20        1024  thrpt   10  3504.602 ± 21.609  ops/ms
RotateBenchmark.testRotateLeftIImm        20        1024  thrpt   10  4569.820 ±  7.455  ops/ms
RotateBenchmark.testRotateLeftL           20        1024  thrpt   10  1730.735 ±  0.701  ops/ms
RotateBenchmark.testRotateLeftLImm        20        1024  thrpt   10  2469.796 ±  0.981  ops/ms
RotateBenchmark.testRotateRightI          20        1024  thrpt   10  3540.899 ±  7.679  ops/ms
RotateBenchmark.testRotateRightIImm       20        1024  thrpt   10  4571.876 ±  0.879  ops/ms
RotateBenchmark.testRotateRightL          20        1024  thrpt   10  1731.499 ±  0.877  ops/ms
RotateBenchmark.testRotateRightLImm       20        1024  thrpt   10  2469.454 ±  0.705  ops/ms
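
As a sanity check on the numbers above, the following small Java sketch recomputes the relative throughput gain of the four `*Imm` benchmarks from the two tables. The class and method names are hypothetical and only the scores listed above are used.

```java
public class RotateSpeedup {
    // Relative throughput gain in percent, given baseline and optimized ops/ms.
    static double improvementPercent(double baseline, double optimized) {
        return (optimized / baseline - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        String[] names = {
            "testRotateLeftIImm", "testRotateLeftLImm",
            "testRotateRightIImm", "testRotateRightLImm"
        };
        // Scores with -XX:-UseSIMDShiftInsertForRotation (from the first table).
        double[] off = {3961.288, 2137.924, 3961.552, 2132.458};
        // Scores with -XX:+UseSIMDShiftInsertForRotation (from the second table).
        double[] on = {4569.820, 2469.796, 4571.876, 2469.454};

        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-22s %+.1f%%%n", names[i], improvementPercent(off[i], on[i]));
        }
    }
}
```

All four immediate-shift variants come out between roughly 15.4% and 15.8%, while the variable-shift variants (`testRotateLeftI`, `testRotateLeftL`, and so on) change by only about 1.5% or less.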

This patch also moves all logical and shift NEON instructions from `aarch64.ad` to `aarch64_neon.ad`,
and makes two minor improvements: support for vector length 4 in `vsraa8B_imm` and `vsrla8B_imm`, and for vector length 2 in `vsraa4S_imm` and `vsrla4S_imm`.

-------------

Commit messages:
 - 8256820: AArch64: Optimize vector rotate (immediate) with shift and insert instructions

Changes: https://git.openjdk.java.net/jdk/pull/1761/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=1761&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8256820
  Stats: 2899 lines in 9 files changed: 1561 ins; 1014 del; 324 mod
  Patch: https://git.openjdk.java.net/jdk/pull/1761.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/1761/head:pull/1761

PR: https://git.openjdk.java.net/jdk/pull/1761