RFR: 8296999: AArch64: scalar intrinsics for reverse method in Integer and Long [v2]

Mon Jan 30 02:59:39 UTC 2023

> x86 implemented the scalar intrinsics for reverse() method in
>  java.lang.Integer and java.lang.Long. See JDK-8290034 [1].
> 
> In this patch, we implement the AArch64 backend part
>  using `rbit` instruction [2].
> 
> TestReverseBitsVector.java was introduced in [1] to verify the
>  IR test results of auto-vectorization and mid-end optimizations.
> In this patch, we update it to test AArch64 as well.
> 
> Tests:
> 1: These scalar intrinsics can be covered by existing Jtreg cases,
>  e.g. [3][4]. Hence, we don't add a new one in this patch.
> 2: tier1~3 pass on Linux/AArch64 and Linux/x86. There are no new failures.
> 3: All the vector test cases under the following directories pass on
>  128-bit and 256-bit SVE machines.
> 
> 
>   test/hotspot/jtreg/compiler/vectorapi/
>   test/jdk/jdk/incubator/vector/
>   test/hotspot/jtreg/compiler/vectorization/
> 
> 
> 4: JMH case
> We initially use the JMH case from [1] (i.e. Integers.reverse
>  and Longs.reverse) to evaluate the performance uplifts after
> enabling these scalar intrinsics. From the data shown below,
>  about 5x and 6x performance uplifts can be perceived respectively.
> 
> 
> Benchmark              (size) Mode  Before      After       Units
> Integers.reverse        500   avgt  0.456±0.002 0.080±0.001 us/op
> Longs.reverse           500   avgt  0.898±0.009 0.142±0.001 us/op
> 
> 
> With an in-depth analysis, we notice that the benefit comes from
>  auto-vectorization (SLP) improvement. Note that the loops in the
>  two benchmarks can be vectorized by SLP. Without the scalar intrinsics,
>  the vector version of the Java implementation [5][6] would be generated.
>  Below is a code snippet of it.
> 
> 
> and   v17.16b, v16.16b, v18.16b
> ushr  v16.4s, v16.4s, #1
> and   v16.16b, v16.16b, v18.16b
> shl   v17.4s, v17.4s, #1
> orr   v16.16b, v17.16b, v16.16b
> 
> 
> With the introduction of scalar intrinsics, ReverseI and ReverseL
>  IR nodes can be created at mid-end. As a result, SLP could generate
>  ReverseV node, i.e. generating "rbitv" instruction, which is much
>  more efficient than previous instruction sequence. Hence, we can say
>  that the introduction of these two scalar intrinsics can improve SLP
>  to generate better code. It's an indirect effect of this patch.
> 
> Furthermore, in order to evaluate the direct effect of the scalar
>  intrinsics, we
> (1) evaluate a small test case which is not auto-vectorization friendly.
> (2) evaluate Integers.reverse and Longs.reverse in [1] with JVM option
>  "-XX:-UseSuperWord" to disable SLP.
> 
> In both cases, we observe about 5x performance uplifts after enabling
>  the scalar intrinsics.
> 
> 
> Benchmark              (size) Mode  Before      After       Units
> Integers.reverse        500   avgt  1.072±0.002 0.212±0.001 us/op
> (disable SLP)
> Longs.reverse           500   avgt  1.073±0.002 0.212±0.001 us/op
> (disable SLP)
> 
> 
> [1] https://bugs.openjdk.org/browse/JDK-8290034
> [2] https://developer.arm.com/documentation/ddi0602/2022-12/Base-Instructions/RBIT--Reverse-Bits-?lang=en
> [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/IntMaxVectorTests.java#L1228
> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/LongMaxVectorTests.java#L1250
> [5] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/Integer.java#L1766
> [6] https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/Long.java#L1905
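
For readers following the quoted analysis, the five-instruction vector sequence (and, ushr, and, shl, orr with a shift of 1) has the shape of one swap stage of a shift-and-mask bit reversal. The following is a minimal sketch of that technique in plain Java, not a verbatim copy of the JDK sources. The class and method names are hypothetical, and the mask 0x55555555 is an assumption, since the snippet does not show the contents of v18.

```java
// Hypothetical illustration; names are not taken from the patch.
class ReverseSketch {
    // One swap stage: exchange every pair of adjacent bits.
    // Assumed to correspond to and/ushr/and/shl/orr with shift #1.
    static int swapAdjacentBits(int i) {
        return (i & 0x55555555) << 1 | (i >>> 1) & 0x55555555;
    }

    // Full 32-bit reversal built from swap stages of growing width,
    // finished by a byte reversal.
    static int reverseSketch(int i) {
        i = swapAdjacentBits(i);                              // swap single bits
        i = (i & 0x33333333) << 2 | (i >>> 2) & 0x33333333;   // swap 2-bit pairs
        i = (i & 0x0f0f0f0f) << 4 | (i >>> 4) & 0x0f0f0f0f;   // swap nibbles
        return Integer.reverseBytes(i);                       // swap bytes
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 0x12345678, Integer.MIN_VALUE};
        for (int x : samples) {
            if (reverseSketch(x) != Integer.reverse(x)) {
                throw new AssertionError("mismatch for " + x);
            }
        }
        System.out.println("ok");
    }
}
```

Each stage costs several logical and shift operations, which is why replacing the whole sequence with a single bit-reverse instruction can pay off in both the scalar and the vector case.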

Chang Peng has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision:

 - Merge branch 'openjdk:master' into add_reverse_bit
 - 8296999: AArch64: scalar intrinsics for reverse method in Integer and Long

   x86 implemented the scalar intrinsics for reverse() method in
   java.lang.Integer and java.lang.Long. See JDK-8290034 [1].

   In this patch, we implement the AArch64 backend part
   using `rbit` instruction [2].

   TestReverseBitsVector.java was introduced in [1] to verify the
   IR test results of auto-vectorization and mid-end optimizations.
   In this patch, we update it to test AArch64 as well.

   Tests:
   1: These scalar intrinsics can be covered by existing Jtreg cases,
   e.g. [3][4]. Hence, we don't add a new one in this patch.
   2: tier1~3 pass on Linux/AArch64 and Linux/x86. There are no new failures.
   3: All the vector test cases under the following directories pass on
   128-bit and 256-bit SVE machines.

   ```
     test/hotspot/jtreg/compiler/vectorapi/
     test/jdk/jdk/incubator/vector/
     test/hotspot/jtreg/compiler/vectorization/
   ```

   4: JMH results: we initially use the JMH case from [1] (i.e.
   Integers.reverse and Longs.reverse) to evaluate the performance
   uplifts after enabling these scalar intrinsics. From the data
   shown below, about 5x and 6x performance uplifts can be obtained
   respectively. However, the benefit comes from auto-vectorization (SLP).
   That is, ReverseV node can be generated after enabling the newly
   added scalar intrinsics.

   Therefore, in order to evaluate the scalar intrinsics, we firstly
   evaluate the performance on test cases designed to be not
   auto-vectorization friendly. Then, we evaluate the performance
   on Integers.reverse and Longs.reverse when using JVM option
   "-XX:-UseSuperWord" to disable SLP. In both cases, we get about
   5x performance uplifts.

   ```
   Benchmark              (size) Mode  Before      After        Units
   Integers.reverse        500   avgt  0.456±0.002 0.080±0.001  us/op
   Longs.reverse           500   avgt  0.898±0.009 0.142±0.001  us/op
   Integers.reverse        500   avgt  1.072±0.002 0.212±0.001  us/op
   (disable SLP)
   Longs.reverse           500   avgt  1.073±0.002 0.212±0.001  us/op
   (disable SLP)
   ```

   [1] https://bugs.openjdk.org/browse/JDK-8290034
   [2] https://developer.arm.com/documentation/ddi0602/2022-12/Base-Instructions/RBIT--Reverse-Bits-?lang=en
   [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/IntMaxVectorTests.java#L1228
   [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/LongMaxVectorTests.java#L1250

   TEST_LABEL: x86_64&&ubuntu&&conformance
   JDK_SCOPE: hotspot:compiler/vectorization/TestReverseBitsVector.java

   Jira: ENTLLT-5736
   CustomizedGitHooks: yes
   Change-Id: Ic6620d81e787def391d19db07fce53e1e82a0e43
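
The commit message mentions test cases designed to be not auto-vectorization friendly but does not show them. As a rough illustration only, one way to build such a loop is to add a loop-carried dependency so that each call consumes the previous result. The class and method names below are hypothetical and are not the test cases used for the numbers above.

```java
// Hypothetical sketch of a loop that SLP is unlikely to vectorize.
class ScalarReverseSketch {
    // Each iteration depends on the previous value of acc, so the
    // Integer.reverse calls cannot simply be packed into vector lanes.
    static int chainedReverse(int[] data) {
        int acc = 0;
        for (int v : data) {
            acc = Integer.reverse(acc ^ v);
        }
        return acc;
    }

    // Same idea for the 64-bit variant.
    static long chainedReverse(long[] data) {
        long acc = 0L;
        for (long v : data) {
            acc = Long.reverse(acc ^ v);
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] ints = new int[500];
        long[] longs = new long[500];
        for (int i = 0; i < ints.length; i++) {
            ints[i] = i * 31 + 7;
            longs[i] = i * 31L + 7L;
        }
        System.out.println(chainedReverse(ints));
        System.out.println(chainedReverse(longs));
    }
}
```

A loop of this shape would be expected to exercise the scalar intrinsic directly, which is the same effect the "-XX:-UseSuperWord" runs aim for on the existing benchmarks.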

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/11962/files
  - new: https://git.openjdk.org/jdk/pull/11962/files/550634e7..f8cf8c1f

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=11962&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=11962&range=00-01

  Stats: 41719 lines in 2013 files changed: 16962 ins; 6221 del; 18536 mod
  Patch: https://git.openjdk.org/jdk/pull/11962.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/11962/head:pull/11962

PR: https://git.openjdk.org/jdk/pull/11962