RFR: 8297172: Fix some issues of auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`
Vladimir Kozlov
kvn at openjdk.org
Mon Dec 5 19:50:32 UTC 2022
On Tue, 29 Nov 2022 02:22:35 GMT, Fei Gao <fgao at openjdk.org> wrote:
> Background:
>
> Java API[1] for `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` returns int type, while Vector API[2] for them returns long type. Currently, to support auto-vectorization of Java API and Vector API at the same time, some vector platforms, namely aarch64 and x86, provides two types of vector nodes taking long type: One produces long vector type for vector API, and the other one produces int vector type by casting long-type result from the first one.
>
> We can move the casting work for auto-vectorization of Java API to the mid-end so that we can unify the vector implementation in the backend, reducing extra code. The patch does the refactoring and also fixes several issues below.
>
> 1. Refine the auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`
>
> In the patch, during the stage of generating vector node for the candidate pack, to implement the complete behavior of these Java APIs, superword will make two consecutive vector nodes: the first one, the same as Vector API, does the real execution to produce long-type result, and the second one casts the result to int vector type.
>
> For those platforms, which have supported correctly vectorizing these java APIs before, the patch has no real impact on final generated assembly code and, consequently, has no performance regression.
>
> 2. Fix the IR check failure of `compiler/vectorization/TestPopCountVectorLong.java` on 128-bit sve platform
>
> These Java APIs take a long type and produce an int type, like conversion nodes between different data sizes do. In superword, the alignment of their input nodes is different from their own. It results in that these APIs can't be vectorized when
> `-XX:MaxVectorSize=16`. So, the IR check for vector nodes in `compiler/vectorization/TestPopCountVectorLong.java` would fail. To fix the issue of alignment, the patch corrects their related alignment, just like it did for conversion nodes between different data sizes. After the patch, these Java APIs can be vectorized on 128-bit platforms, as long as the auto-vectorization is profitable.
>
> 3. Fix the incorrect vectorization of `numberOfTrailingZeros/numberOfLeadingZeros()` in aarch64 platforms with more than 128 bits
>
> Although `Long.NumberOfLeadingZeros/NumberOfTrailingZeros()` can be vectorized on sve platforms when
> `-XX:MaxVectorSize=32` or `-XX:MaxVectorSize=64` even before the patch, aarch64 backend didn't provide special vector implementation for Java API and thus the generated code is not correct, like:
>
> LOOP:
> sxtw x13, w12
> add x14, x15, x13, uxtx #3
> add x17, x14, #0x10
> ld1d {z16.d}, p7/z, [x17]
> // Incorrectly use integer rbit/clz insn for long type vector
> *rbit z16.s, p7/m, z16.s
> *clz z16.s, p7/m, z16.s
> add x13, x16, x13, uxtx #2
> str q16, [x13, #16]
> ...
> add w12, w12, #0x20
> cmp w12, w3
> b.lt LOOP
>
>
> It causes a runtime failure of the testcase `compiler/vectorization/TestNumberOfContinuousZeros.java` added in the patch. After the refactoring, the testcase can pass and the code is corrected:
>
> LOOP:
> sxtw x13, w12
> add x14, x15, x13, uxtx #3
> add x17, x14, #0x10
> ld1d {z16.d}, p7/z, [x17]
> // Compute with long vector type and convert to int vector type
> *rbit z16.d, p7/m, z16.d
> *clz z16.d, p7/m, z16.d
> *mov z24.d, #0
> *uzp1 z25.s, z16.s, z24.s
> add x13, x16, x13, uxtx #2
> str q25, [x13, #16]
> ...
> add w12, w12, #0x20
> cmp w12, w3
> b.lt LOOP
>
>
> 4. Fix an assertion failure on x86 avx2 platform
>
> Before, on x86 avx2 platform, there is an assertion failure when C2 tries to vectorize the loops like:
>
> // long[] ia;
> // int[] ic;
> for (int i = 0; i < LENGTH; ++i) {
> ic[i] = Long.numberOfLeadingZeros(ia[i]);
> }
>
>
> X86 backend supports vectorizing `numberOfLeadingZeros()` on avx2 platform, but it uses `evpmovqd()` to do casting for `CountLeadingZerosV`[3], which can only be used when `UseAVX > 2`[4]. After the refactoring, the failure can be fixed naturally.
>
> Tier 1~3 passed with no new failures on Linux AArch64/X86 platform.
>
> [1] https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#bitCount(long)
> https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfTrailingZeros(long)
> https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfLeadingZeros(long)
> [2] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/LongVector.java#L687
> [3] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/hotspot/cpu/x86/x86.ad#L9418
> [4] https://github.com/openjdk/jdk/blob/fc616588c1bf731150a9d9b80033bb589bcb231f/src/hotspot/cpu/x86/assembler_x86.cpp#L2239
Very nicely done.
I suggest to wait approval from @TobiHartmann after his testing is finished.
-------------
Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.org/jdk/pull/11405
More information about the hotspot-compiler-dev
mailing list