RFR: 8297172: Fix some issues of auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`

Fei Gao fgao at openjdk.org
Tue Nov 29 02:32:47 UTC 2022


Background:

The Java API[1] for `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` returns int, while the Vector API[2] for the same operations returns long. Currently, to support auto-vectorization of the Java API and the Vector API at the same time, some vector platforms, namely aarch64 and x86, provide two kinds of vector nodes that take long input: one produces a long vector for the Vector API, and the other produces an int vector by casting the long result of the first.

We can move the casting work for auto-vectorization of Java API to the mid-end so that we can unify the vector implementation in the backend, reducing extra code. The patch does the refactoring and also fixes several issues below.

1. Refine the auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`

With the patch, when superword generates the vector node for a candidate pack, it emits two consecutive vector nodes to implement the complete behavior of these Java APIs: the first, the same node the Vector API uses, does the actual computation and produces a long vector; the second casts that result to an int vector.

On platforms that already vectorized these Java APIs correctly, the patch has no real impact on the final generated assembly code and, consequently, causes no performance regression.
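
The two-node shape described above can be modeled in plain Java. This is only a scalar sketch of the semantics, not C2 code: the class and method names are hypothetical, and each loop stands in for one vector node operating lane by lane.

```java
public class LongBitOpsTwoStep {
    // Step 1 (models the long-producing vector node): each lane holds the
    // bit count as a long, as the Vector API variant would return it.
    static long[] popCountLong(long[] src) {
        long[] out = new long[src.length];
        for (int i = 0; i < src.length; i++) {
            out[i] = Long.bitCount(src[i]); // int result widened to long (0..64)
        }
        return out;
    }

    // Step 2 (models the cast node): narrow every long lane to int.
    static int[] castLongToInt(long[] src) {
        int[] out = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            out[i] = (int) src[i];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] ia = {0L, 1L, -1L, 0x00FF00FF00FF00FFL, Long.MIN_VALUE};
        int[] ic = castLongToInt(popCountLong(ia));
        // The composition must agree with the scalar Java API for every lane.
        for (int i = 0; i < ia.length; i++) {
            if (ic[i] != Long.bitCount(ia[i])) {
                throw new AssertionError("mismatch at lane " + i);
            }
        }
        System.out.println(java.util.Arrays.toString(ic));
    }
}
```

Because the long result is always in the range 0..64, the narrowing cast never loses information, which is why the composition matches the Java API.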

2. Fix the IR check failure of `compiler/vectorization/TestPopCountVectorLong.java` on 128-bit sve platform

These Java APIs take a long and produce an int, just as conversion nodes between different data sizes do. In superword, the alignment of their input nodes differs from their own alignment. As a result, these APIs can't be vectorized when `-XX:MaxVectorSize=16`, so the IR check for vector nodes in `compiler/vectorization/TestPopCountVectorLong.java` fails. To fix the alignment issue, the patch corrects the alignment of these nodes in the same way as is already done for conversion nodes between different data sizes. After the patch, these Java APIs can be vectorized on 128-bit platforms, as long as the auto-vectorization is profitable.
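
The size mismatch behind the alignment issue can be made concrete with simple arithmetic. The sketch below is an illustration only and does not reproduce superword's alignment computation; the class and method names are hypothetical.

```java
public class LaneSizeSketch {
    // Number of long elements that fit in one vector of the given byte size.
    static int longLanes(int maxVectorSizeBytes) {
        return maxVectorSizeBytes / Long.BYTES;
    }

    // Bytes occupied by the int results produced from those long lanes.
    static int intResultBytes(int maxVectorSizeBytes) {
        return longLanes(maxVectorSizeBytes) * Integer.BYTES;
    }

    public static void main(String[] args) {
        for (int size : new int[] {16, 32, 64}) {
            System.out.println("MaxVectorSize=" + size
                    + ": long lanes=" + longLanes(size)
                    + ", int result bytes=" + intResultBytes(size));
        }
    }
}
```

With `-XX:MaxVectorSize=16`, a full input vector holds two longs, but the two int results fill only 8 bytes, so input and output packs cannot share one alignment measured in bytes.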

3. Fix the incorrect vectorization of `numberOfTrailingZeros/numberOfLeadingZeros()` on aarch64 platforms with vectors wider than 128 bits

Although `Long.numberOfLeadingZeros/numberOfTrailingZeros()` could be vectorized on sve platforms with `-XX:MaxVectorSize=32` or `-XX:MaxVectorSize=64` even before the patch, the aarch64 backend didn't provide a dedicated vector implementation for the Java API, so the generated code was incorrect, for example:

LOOP:
  sxtw  x13, w12
  add   x14, x15, x13, uxtx #3
  add   x17, x14, #0x10
  ld1d  {z16.d}, p7/z, [x17]
  // Incorrectly use integer rbit/clz insn for long type vector
 *rbit  z16.s, p7/m, z16.s
 *clz   z16.s, p7/m, z16.s
  add   x13, x16, x13, uxtx #2
  str   q16, [x13, #16]
  ...
  add   w12, w12, #0x20
  cmp   w12, w3
  b.lt  LOOP


This causes a runtime failure of the test case `compiler/vectorization/TestNumberOfContinuousZeros.java`, which is added by the patch. After the refactoring, the test case passes and the generated code is correct:

LOOP:
  sxtw  x13, w12
  add   x14, x15, x13, uxtx #3
  add   x17, x14, #0x10
  ld1d  {z16.d}, p7/z, [x17]
  // Compute with long vector type and convert to int vector type
 *rbit  z16.d, p7/m, z16.d
 *clz   z16.d, p7/m, z16.d
 *mov   z24.d, #0
 *uzp1  z25.s, z16.s, z24.s
  add   x13, x16, x13, uxtx #2
  str   q25, [x13, #16]
  ...
  add   w12, w12, #0x20
  cmp   w12, w3
  b.lt  LOOP
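
The difference between the two listings can be modeled in scalar Java. This is a rough sketch under stated assumptions, not a simulation of real SVE hardware: it assumes a 256-bit vector (four 64-bit lanes), little-endian layout of each long as a low and a high 32-bit half, and hypothetical class and method names.

```java
import java.util.Arrays;

public class LaneWidthMismatchSketch {
    static final int LANES = 4; // assumption: 256-bit vector, four long lanes

    // Models the incorrect code: rbit+clz on 32-bit (.s) lanes, then a
    // 128-bit store that writes only the first four 32-bit results.
    static int[] trailingZerosWrongLaneWidth(long[] ia) {
        int[] halves = new int[2 * LANES];
        for (int i = 0; i < LANES; i++) {
            halves[2 * i] = Integer.numberOfTrailingZeros((int) ia[i]);              // low half
            halves[2 * i + 1] = Integer.numberOfTrailingZeros((int) (ia[i] >>> 32)); // high half
        }
        return Arrays.copyOf(halves, LANES);
    }

    // Models the corrected code: rbit+clz on 64-bit (.d) lanes, then
    // narrowing each long result to int before the store.
    static int[] trailingZerosCorrect(long[] ia) {
        int[] ic = new int[LANES];
        for (int i = 0; i < LANES; i++) {
            long wide = Long.numberOfTrailingZeros(ia[i]);
            ic[i] = (int) wide;
        }
        return ic;
    }

    public static void main(String[] args) {
        long[] ia = {1L << 40, 8L, 0L, 1L};
        System.out.println("wrong:   " + Arrays.toString(trailingZerosWrongLaneWidth(ia)));
        System.out.println("correct: " + Arrays.toString(trailingZerosCorrect(ia)));
    }
}
```

In this model the wrong variant reports per-half counts of only the first two longs, which illustrates why a test comparing against the scalar results fails at runtime.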


4. Fix an assertion failure on x86 avx2 platform

Before the patch, on x86 avx2 platforms, an assertion failed when C2 tried to vectorize loops like:

//  long[] ia;
//  int[] ic;
    for (int i = 0; i < LENGTH; ++i) {
      ic[i] = Long.numberOfLeadingZeros(ia[i]);
    }


The x86 backend supports vectorizing `numberOfLeadingZeros()` on avx2 platforms, but it uses `evpmovqd()` to do the cast for `CountLeadingZerosV`[3], and that instruction can only be used when `UseAVX > 2`[4]. Because the refactoring moves the cast out of the backend rule, the failure goes away as a side effect.

Tier 1~3 passed with no new failures on Linux AArch64/X86 platforms.
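
A self-contained version of the loop above might look like the following sketch. The class name, the value of `LENGTH`, the input pattern, and the iteration count are assumptions made for illustration; whether C2 compiles and vectorizes the loop, and whether the assertion is hit, depends on the JVM build and flags (for instance a debug build run with `-XX:UseAVX=2`).

```java
public class LeadingZerosLoopSketch {
    static final int LENGTH = 1024; // assumed array length

    // The loop shape from the description: long input, int output.
    static void countLeadingZeros(long[] ia, int[] ic) {
        for (int i = 0; i < LENGTH; ++i) {
            ic[i] = Long.numberOfLeadingZeros(ia[i]);
        }
    }

    public static void main(String[] args) {
        long[] ia = new long[LENGTH];
        int[] ic = new int[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            ia[i] = 1L << (i % 64); // exactly one bit set, at position i % 64
        }
        // Repeat the call so the JIT has a chance to compile the loop.
        for (int iter = 0; iter < 20_000; iter++) {
            countLeadingZeros(ia, ic);
        }
        // A single set bit at position k has 63 - k leading zeros.
        for (int i = 0; i < LENGTH; i++) {
            int expected = 63 - (i % 64);
            if (ic[i] != expected) {
                throw new AssertionError("mismatch at index " + i);
            }
        }
        System.out.println("ok");
    }
}
```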

[1] https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#bitCount(long)
    https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfTrailingZeros(long)
    https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfLeadingZeros(long)
[2] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/LongVector.java#L687
[3] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/hotspot/cpu/x86/x86.ad#L9418
[4] https://github.com/openjdk/jdk/blob/fc616588c1bf731150a9d9b80033bb589bcb231f/src/hotspot/cpu/x86/assembler_x86.cpp#L2239

-------------

Commit messages:
 - 8297172: Fix some issues of auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`

Changes: https://git.openjdk.org/jdk/pull/11405/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11405&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8297172
  Stats: 303 lines in 11 files changed: 161 ins; 131 del; 11 mod
  Patch: https://git.openjdk.org/jdk/pull/11405.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/11405/head:pull/11405

PR: https://git.openjdk.org/jdk/pull/11405

