RFR: 8296999: AArch64: scalar intrinsics for reverse method in Integer and Long

Fri Jan 27 22:40:58 UTC 2023

The x86 backend already implements the scalar intrinsics for the
 reverse() methods in java.lang.Integer and java.lang.Long.
 See JDK-8290034 [1].

In this patch, we implement the AArch64 backend part
 using the `rbit` instruction [2].

TestReverseBitsVector.java was introduced in [1] to verify the
 IR test results of auto-vectorization and mid-end optimizations.
In this patch, we update it to test AArch64 as well.

Tests:
1: These scalar intrinsics are covered by existing jtreg cases,
 e.g. [3][4]. Hence, we don't add new ones in this patch.
2: tier1~3 pass on Linux/AArch64 and Linux/x86. There are no new failures.
3: All the vector test cases under the following directories pass on
 128-bit and 256-bit SVE machines.

  test/hotspot/jtreg/compiler/vectorapi/
  test/jdk/jdk/incubator/vector/
  test/hotspot/jtreg/compiler/vectorization/

4: JMH case
We initially use the JMH cases from [1] (i.e. Integers.reverse
 and Longs.reverse) to evaluate the performance uplift after
 enabling these scalar intrinsics. The data below shows speedups
 of about 5.7x and 6.3x respectively.

Benchmark              (size) Mode  Before      After       Units
Integers.reverse        500   avgt  0.456±0.002 0.080±0.001 us/op
Longs.reverse           500   avgt  0.898±0.009 0.142±0.001 us/op

An in-depth analysis shows that this benefit comes from improved
 auto-vectorization (SLP). Note that the loops in the two benchmarks
 can be vectorized by SLP. Without the scalar intrinsics, the vector
 version of the Java implementation [5][6] is generated. Below is a
 code snippet of it.

and   v17.16b, v16.16b, v18.16b
ushr  v16.4s, v16.4s, #1
and   v16.16b, v16.16b, v18.16b
shl   v17.4s, v17.4s, #1
orr   v16.16b, v17.16b, v16.16b
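
For readers who do not have [5] open, the snippet above maps onto one
 swap step of the shift-and-mask bit reversal. The following is a minimal
 sketch of that technique in Java. The class and method names are
 hypothetical, and the assumption that v18 holds the 0x55555555 mask is
 ours (the snippet does not show how v18 is loaded).

```java
// Hypothetical illustration; not the JDK source itself.
final class ReverseSketch {

    // One swap step: exchange every pair of adjacent bits.
    // Assumed correspondence with the snippet above:
    //   and  -> (i & mask)            shl -> << 1
    //   ushr -> (i >>> 1)             and -> & mask
    //   orr  -> combine both halves
    static int swapAdjacentBits(int i) {
        return (i & 0x55555555) << 1 | (i >>> 1) & 0x55555555;
    }

    // Full 32-bit reversal: swap bits, then bit pairs, then nibbles,
    // and finally reverse the byte order.
    static int reverse(int i) {
        i = (i & 0x55555555) << 1 | (i >>> 1) & 0x55555555;
        i = (i & 0x33333333) << 2 | (i >>> 2) & 0x33333333;
        i = (i & 0x0f0f0f0f) << 4 | (i >>> 4) & 0x0f0f0f0f;
        return Integer.reverseBytes(i);
    }

    public static void main(String[] args) {
        int x = 0x12345678;
        // Compare the sketch against the library method.
        System.out.println(reverse(x) == Integer.reverse(x));
    }
}
```

Each of the three swap steps needs the five-instruction pattern shown
 above when vectorized, which is why a single reverse-bits instruction
 is cheaper.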

With the introduction of the scalar intrinsics, ReverseI and ReverseL
 IR nodes can be created in the mid-end. As a result, SLP can generate
 a ReverseV node, i.e. the "rbitv" instruction, which is much more
 efficient than the previous instruction sequence. Hence, these two
 scalar intrinsics let SLP generate better code. This is an indirect
 effect of this patch.

Furthermore, in order to evaluate the direct effect of the scalar
 intrinsics, we
(1) evaluate a small test case which is not auto-vectorization friendly.
(2) evaluate Integers.reverse and Longs.reverse in [1] with JVM option
 "-XX:-UseSuperWord" to disable SLP.

In both cases, we observe about 5x performance uplifts after enabling
 the scalar intrinsics.

Benchmark (SLP disabled) (size) Mode  Before      After       Units
Integers.reverse          500   avgt  1.072±0.002 0.212±0.001 us/op
Longs.reverse             500   avgt  1.073±0.002 0.212±0.001 us/op
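
To make the distinction between the indirect (SLP) and direct (scalar)
 effects concrete, here is a hedged sketch of the two loop shapes
 involved. Neither method is taken from the JMH benchmarks in [1] or from
 the small test case mentioned above. The names are hypothetical, and
 whether C2 actually vectorizes a given loop depends on the JVM build and
 flags.

```java
// Hypothetical loop shapes; not the benchmark code from [1].
final class ReverseLoops {

    // Independent iterations over arrays: the kind of loop that SLP
    // may vectorize once ReverseI nodes exist in the IR.
    static void reverseAll(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Integer.reverse(src[i]);
        }
    }

    // Loop-carried dependency: every iteration needs the previous
    // result, so this loop is not auto-vectorization friendly and
    // only the scalar intrinsic can help.
    static int reverseChain(int seed, int rounds) {
        int acc = seed;
        for (int i = 0; i < rounds; i++) {
            acc = Integer.reverse(acc) + i;
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] src = {1, 0, -1};
        int[] dst = new int[src.length];
        reverseAll(src, dst);
        System.out.println(java.util.Arrays.toString(dst));
        System.out.println(reverseChain(1, 2));
    }
}
```

Running the first shape with and without "-XX:-UseSuperWord" separates
 the two effects in the same way as the measurements above.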

[1] https://bugs.openjdk.org/browse/JDK-8290034
[2] https://developer.arm.com/documentation/ddi0602/2022-12/Base-Instructions/RBIT--Reverse-Bits-?lang=en
[3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/IntMaxVectorTests.java#L1228
[4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/LongMaxVectorTests.java#L1250
[5] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/Integer.java#L1766
[6] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/Long.java#L1905

-------------

Commit messages:
 - 8296999: AArch64: scalar intrinsics for reverse method in Integer and Long

Changes: https://git.openjdk.org/jdk/pull/11962/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11962&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8296999
  Stats: 31 lines in 2 files changed: 23 ins; 0 del; 8 mod
  Patch: https://git.openjdk.org/jdk/pull/11962.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/11962/head:pull/11962

PR: https://git.openjdk.org/jdk/pull/11962