RFR: 8297172: Fix some issues of auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`
Fei Gao
fgao at openjdk.org
Tue Nov 29 02:32:47 UTC 2022
Background:
The Java API[1] for `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` returns int, while the Vector API[2] for the same operations returns long. Currently, to support auto-vectorization of the Java API and the Vector API at the same time, some vector platforms, namely aarch64 and x86, provide two kinds of vector nodes that take long input: one produces a long vector for the Vector API, and the other produces an int vector by casting the long result of the first one.
We can move the casting work for auto-vectorization of Java API to the mid-end so that we can unify the vector implementation in the backend, reducing extra code. The patch does the refactoring and also fixes several issues below.
1. Refine the auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`
With the patch, when superword generates the vector node for a candidate pack, it emits two consecutive vector nodes to implement the complete behavior of these Java APIs: the first one, the same as for the Vector API, does the real computation and produces a long vector, and the second one casts that result to an int vector.
For platforms that already vectorized these Java APIs correctly, the patch has no real impact on the final generated assembly code and, consequently, causes no performance regression.
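The two-node scheme described above can be modelled in scalar Java. This is only an illustrative sketch, not code from the patch: the class and method names are hypothetical, and plain arrays stand in for vector lanes.

```java
import java.util.Arrays;

// Hypothetical scalar model of the two consecutive vector nodes.
class TwoStepBitCount {
    // Step 1: lane-wise operation with a long-typed result,
    // matching what the Vector API returns.
    static long[] popCountLongLanes(long[] src) {
        long[] lanes = new long[src.length];
        for (int i = 0; i < src.length; i++) {
            lanes[i] = Long.bitCount(src[i]); // int result widened to long
        }
        return lanes;
    }

    // Step 2: narrow every long lane to int, modelling the cast node.
    static int[] castLongLanesToInt(long[] lanes) {
        int[] out = new int[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = (int) lanes[i];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] src = {0L, 1L, -1L, 0x00FF00FF00FF00FFL, Long.MIN_VALUE};
        int[] twoStep = castLongLanesToInt(popCountLongLanes(src));
        // The composition must agree with the Java API for every element.
        for (int i = 0; i < src.length; i++) {
            if (twoStep[i] != Long.bitCount(src[i])) {
                throw new AssertionError("mismatch at index " + i);
            }
        }
        System.out.println(Arrays.toString(twoStep));
    }
}
```

Because the results of these operations never exceed 64, the narrowing cast in step 2 cannot lose information.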
2. Fix the IR check failure of `compiler/vectorization/TestPopCountVectorLong.java` on 128-bit sve platform
These Java APIs take a long and produce an int, just as conversion nodes between different data sizes do. In superword, the alignment of their input nodes therefore differs from their own. As a result, these APIs can't be vectorized when `-XX:MaxVectorSize=16`, and the IR check for vector nodes in `compiler/vectorization/TestPopCountVectorLong.java` fails. To fix the alignment issue, the patch corrects their alignment in the same way as is already done for conversion nodes between different data sizes. After the patch, these Java APIs can be vectorized on 128-bit platforms, as long as auto-vectorization is profitable.
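The size mismatch behind the alignment issue comes down to simple lane arithmetic. The sketch below is only a back-of-the-envelope illustration with hypothetical names, not superword code. It assumes the number of lanes in a pack is set by the wider (long) input, which is my reading of the paragraph above.

```java
// Hypothetical helper illustrating lane counts for long input and int output.
class LaneMath {
    // Number of elements of the given size that fit in one vector.
    static int lanes(int maxVectorSizeBytes, int elemBytes) {
        return maxVectorSizeBytes / elemBytes;
    }

    // Bytes occupied by the int results when the lane count
    // is dictated by the long input vector (an assumption, see lead-in).
    static int intResultBytes(int maxVectorSizeBytes) {
        return lanes(maxVectorSizeBytes, Long.BYTES) * Integer.BYTES;
    }

    public static void main(String[] args) {
        for (int size : new int[] {16, 32, 64}) {
            System.out.println("MaxVectorSize=" + size
                + ": long lanes=" + lanes(size, Long.BYTES)
                + ", int result bytes=" + intResultBytes(size));
        }
    }
}
```

With `-XX:MaxVectorSize=16`, two long lanes fill the whole vector, while their two int results fill only 8 bytes, so the input and output packs do not share the same byte width.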
3. Fix the incorrect vectorization of `numberOfTrailingZeros/numberOfLeadingZeros()` in aarch64 platforms with more than 128 bits
Although `Long.numberOfLeadingZeros/numberOfTrailingZeros()` could be vectorized on sve platforms with `-XX:MaxVectorSize=32` or `-XX:MaxVectorSize=64` even before the patch, the aarch64 backend didn't provide a special vector implementation for the Java API, so the generated code was not correct. For example:
LOOP:
sxtw x13, w12
add x14, x15, x13, uxtx #3
add x17, x14, #0x10
ld1d {z16.d}, p7/z, [x17]
// Incorrectly use integer rbit/clz insn for long type vector
*rbit z16.s, p7/m, z16.s
*clz z16.s, p7/m, z16.s
add x13, x16, x13, uxtx #2
str q16, [x13, #16]
...
add w12, w12, #0x20
cmp w12, w3
b.lt LOOP
This causes a runtime failure of the test case `compiler/vectorization/TestNumberOfContinuousZeros.java`, which is added in the patch. After the refactoring, the test passes and the generated code is correct:
LOOP:
sxtw x13, w12
add x14, x15, x13, uxtx #3
add x17, x14, #0x10
ld1d {z16.d}, p7/z, [x17]
// Compute with long vector type and convert to int vector type
*rbit z16.d, p7/m, z16.d
*clz z16.d, p7/m, z16.d
*mov z24.d, #0
*uzp1 z25.s, z16.s, z24.s
add x13, x16, x13, uxtx #2
str q25, [x13, #16]
...
add w12, w12, #0x20
cmp w12, w3
b.lt LOOP
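The difference between the two listings can be reproduced in scalar Java. This is a rough model under stated assumptions, not the backend code: the class and method names are hypothetical, `rbit` followed by `clz` is modelled as reverse-then-count-leading-zeros (which yields a trailing-zero count), and the `uzp1` narrowing is modelled as truncation of each 64-bit lane to its low 32 bits.

```java
import java.util.Arrays;

// Hypothetical model of 32-bit vs 64-bit element handling of long data.
class WrongLaneWidth {
    // Models rbit + clz on 32-bit elements: each long is treated
    // as two independent ints (low half first).
    static int[] trailingZerosPer32BitLane(long x) {
        int lo = (int) x;
        int hi = (int) (x >>> 32);
        return new int[] {
            Integer.numberOfLeadingZeros(Integer.reverse(lo)),
            Integer.numberOfLeadingZeros(Integer.reverse(hi))
        };
    }

    // Models rbit + clz on 64-bit elements, then narrowing to int.
    static int trailingZerosPer64BitLane(long x) {
        long wide = Long.numberOfLeadingZeros(Long.reverse(x));
        return (int) wide;
    }

    public static void main(String[] args) {
        long x = 1L << 40;
        System.out.println("Java API:     " + Long.numberOfTrailingZeros(x));
        System.out.println("64-bit lanes: " + trailingZerosPer64BitLane(x));
        System.out.println("32-bit lanes: " + Arrays.toString(trailingZerosPer32BitLane(x)));
    }
}
```

For `1L << 40`, the 64-bit model agrees with `Long.numberOfTrailingZeros()`, while neither of the two 32-bit results does.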
4. Fix an assertion failure on x86 avx2 platform
Previously, on x86 avx2 platforms, an assertion failed when C2 tried to vectorize loops like:
// long[] ia;
// int[] ic;
for (int i = 0; i < LENGTH; ++i) {
ic[i] = Long.numberOfLeadingZeros(ia[i]);
}
The x86 backend supports vectorizing `numberOfLeadingZeros()` on avx2 platforms, but it uses `evpmovqd()` to do the cast for `CountLeadingZerosV`[3], and that instruction can only be used when `UseAVX > 2`[4]. The refactoring fixes the failure as a side effect.
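For anyone who wants to exercise this loop shape locally, a minimal standalone harness might look like the following. It is only a sketch: the class name, the array length, the iteration count and the bit-by-bit reference method are my own choices, and whether C2 actually vectorizes the kernel depends on the platform and JVM flags.

```java
import java.util.Random;

// Hypothetical harness around the loop shown above.
class LeadingZerosLoop {
    static final int LENGTH = 1024;

    // Same loop shape as in the description: long input, int output.
    static void kernel(long[] ia, int[] ic) {
        for (int i = 0; i < LENGTH; ++i) {
            ic[i] = Long.numberOfLeadingZeros(ia[i]);
        }
    }

    // Bit-by-bit reference that does not use the intrinsic.
    static int referenceNlz(long x) {
        int n = 0;
        for (long mask = Long.MIN_VALUE; mask != 0 && (x & mask) == 0; mask >>>= 1) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        long[] ia = new long[LENGTH];
        int[] ic = new int[LENGTH];
        Random r = new Random(42);
        for (int i = 0; i < LENGTH; i++) {
            ia[i] = r.nextLong() >>> r.nextInt(64); // vary the leading-zero count
        }
        // Call the kernel repeatedly so that the JIT has a chance to compile it.
        for (int iter = 0; iter < 20_000; iter++) {
            kernel(ia, ic);
        }
        for (int i = 0; i < LENGTH; i++) {
            if (ic[i] != referenceNlz(ia[i])) {
                throw new AssertionError("mismatch at index " + i);
            }
        }
        System.out.println("all results match");
    }
}
```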
Tier 1~3 passed with no new failures on Linux AArch64 and x86 platforms.
[1] https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#bitCount(long)
https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfTrailingZeros(long)
https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfLeadingZeros(long)
[2] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/LongVector.java#L687
[3] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/hotspot/cpu/x86/x86.ad#L9418
[4] https://github.com/openjdk/jdk/blob/fc616588c1bf731150a9d9b80033bb589bcb231f/src/hotspot/cpu/x86/assembler_x86.cpp#L2239
-------------
Commit messages:
- 8297172: Fix some issues of auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`
Changes: https://git.openjdk.org/jdk/pull/11405/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11405&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8297172
Stats: 303 lines in 11 files changed: 161 ins; 131 del; 11 mod
Patch: https://git.openjdk.org/jdk/pull/11405.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/11405/head:pull/11405
PR: https://git.openjdk.org/jdk/pull/11405