Integrated: 8297172: Fix some issues of auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`

Fei Gao fgao at openjdk.org
Tue Dec 6 09:40:23 UTC 2022


On Tue, 29 Nov 2022 02:22:35 GMT, Fei Gao <fgao at openjdk.org> wrote:

> Background:
> 
> The Java API[1] for `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` returns int type, while the Vector API[2] for them returns long type. Currently, to support auto-vectorization of the Java API and the Vector API at the same time, some vector platforms, namely aarch64 and x86, provide two kinds of vector nodes taking long type: one produces a long vector type for the Vector API, and the other produces an int vector type by casting the long-type result of the first one.
> 
> We can move the casting work for auto-vectorization of Java API to the mid-end so that we can unify the vector implementation in the backend, reducing extra code. The patch does the refactoring and also fixes several issues below.
> 
> 1. Refine the auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`
> 
> In the patch, when generating vector nodes for a candidate pack, superword creates two consecutive vector nodes to implement the complete behavior of these Java APIs: the first one, the same as for the Vector API, does the actual computation and produces a long-type result, and the second one casts that result to int vector type.
> 
> For platforms that already vectorized these Java APIs correctly, the patch has no real impact on the final generated assembly code and, consequently, causes no performance regression.
> 
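The two-node shape described in item 1 can be modelled in plain Java. This is only a scalar sketch of the idea, not C2 code: the class and method names are hypothetical, and each array stands in for a vector whose elements are lanes.

```java
// Scalar model of the two consecutive vector nodes (hypothetical names).
final class TwoStepBitCount {
    // Step 1: lane-wise operation that yields long results,
    // matching the shape the Vector API exposes.
    static long[] bitCountLong(long[] in) {
        long[] out = new long[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = Long.bitCount(in[i]); // int result widened to a long lane
        }
        return out;
    }

    // Step 2: narrowing cast from long lanes to int lanes.
    static int[] castToInt(long[] in) {
        int[] out = new int[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (int) in[i];
        }
        return out;
    }

    // Java API behaviour = step 1 followed by step 2.
    static int[] bitCount(long[] in) {
        return castToInt(bitCountLong(in));
    }

    public static void main(String[] args) {
        int[] r = bitCount(new long[] {0L, 1L, -1L, 0xF0F0L});
        System.out.println(java.util.Arrays.toString(r));
    }
}
```

Because every result fits in 0..64, the narrowing cast in step 2 never loses information, which is why the cast can be split off from the computation.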
> 2. Fix the IR check failure of `compiler/vectorization/TestPopCountVectorLong.java` on 128-bit sve platform
> 
> These Java APIs take a long type and produce an int type, like conversion nodes between different data sizes do. In superword, the alignment of their input nodes is different from their own. As a result, these APIs can't be vectorized when `-XX:MaxVectorSize=16`, so the IR check for vector nodes in `compiler/vectorization/TestPopCountVectorLong.java` fails. To fix the alignment issue, the patch corrects their alignment, just as is already done for conversion nodes between different data sizes. After the patch, these Java APIs can be vectorized on 128-bit platforms, as long as auto-vectorization is profitable.
> 
> 3. Fix the incorrect vectorization of `numberOfTrailingZeros/numberOfLeadingZeros()` on aarch64 platforms with vectors wider than 128 bits
> 
> Although `Long.numberOfLeadingZeros/numberOfTrailingZeros()` could be vectorized on sve platforms with `-XX:MaxVectorSize=32` or `-XX:MaxVectorSize=64` even before the patch, the aarch64 backend didn't provide a dedicated vector implementation for the Java API, so the generated code was incorrect, for example:
> 
> LOOP:
>   sxtw  x13, w12
>   add   x14, x15, x13, uxtx #3
>   add   x17, x14, #0x10
>   ld1d  {z16.d}, p7/z, [x17]
>   // Incorrectly use integer rbit/clz insn for long type vector
>  *rbit  z16.s, p7/m, z16.s
>  *clz   z16.s, p7/m, z16.s
>   add   x13, x16, x13, uxtx #2
>   str   q16, [x13, #16]
>   ...
>   add   w12, w12, #0x20
>   cmp   w12, w3
>   b.lt  LOOP
> 
> 
> This causes a runtime failure of the testcase `compiler/vectorization/TestNumberOfContinuousZeros.java` added in the patch. After the refactoring, the testcase passes and the generated code is correct:
> 
> LOOP:
>   sxtw  x13, w12
>   add   x14, x15, x13, uxtx #3
>   add   x17, x14, #0x10
>   ld1d  {z16.d}, p7/z, [x17]
>   // Compute with long vector type and convert to int vector type
>  *rbit  z16.d, p7/m, z16.d
>  *clz   z16.d, p7/m, z16.d
>  *mov   z24.d, #0
>  *uzp1  z25.s, z16.s, z24.s
>   add   x13, x16, x13, uxtx #2
>   str   q25, [x13, #16]
>   ...
>   add   w12, w12, #0x20
>   cmp   w12, w3
>   b.lt  LOOP
> 
> 
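To see why the lane width matters in item 3, the two instruction sequences can be modelled in plain Java. This is a simplified sketch, not the actual testcase: the class and method names are hypothetical, 64-bit lanes are assumed to be viewed little-endian as pairs of 32-bit lanes, and predication is ignored. `rbit` followed by `clz` is modelled as a trailing-zero count, and `uzp1` with a zero vector is modelled as keeping the even-indexed 32-bit lanes.

```java
// Lane-level model of the incorrect and corrected sequences (hypothetical names).
final class LaneWidthSketch {
    // View n 64-bit lanes as 2n 32-bit lanes, low half first.
    static int[] asIntLanes(long[] v) {
        int[] s = new int[v.length * 2];
        for (int i = 0; i < v.length; i++) {
            s[2 * i] = (int) v[i];
            s[2 * i + 1] = (int) (v[i] >>> 32);
        }
        return s;
    }

    // Incorrect shape: rbit/clz applied to .s lanes, so each 32-bit half
    // of a long is counted on its own.
    static int[] ntzOn32BitLanes(long[] v) {
        int[] s = asIntLanes(v);
        for (int i = 0; i < s.length; i++) {
            s[i] = Integer.numberOfTrailingZeros(s[i]);
        }
        return s;
    }

    // Corrected shape: count on .d lanes, then keep the even 32-bit lanes
    // (uzp1 against a zero vector); the upper half of the result stays zero.
    static int[] ntzOn64BitLanesThenNarrow(long[] v) {
        long[] d = new long[v.length];
        for (int i = 0; i < v.length; i++) {
            d[i] = Long.numberOfTrailingZeros(v[i]);
        }
        int[] s = asIntLanes(d);
        int[] out = new int[s.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = s[2 * i];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] v = {1L << 40, 8L};
        System.out.println(java.util.Arrays.toString(ntzOn32BitLanes(v)));
        System.out.println(java.util.Arrays.toString(ntzOn64BitLanesThenNarrow(v)));
    }
}
```

In this model only the leading `v.length` ints of the corrected result are meaningful, which corresponds to the narrower `str q25` store in the listing above.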
> 4. Fix an assertion failure on x86 avx2 platform
> 
> Before the patch, on x86 avx2 platforms, C2 hit an assertion failure when trying to vectorize loops like:
> 
> //  long[] ia;
> //  int[] ic;
>     for (int i = 0; i < LENGTH; ++i) {
>       ic[i] = Long.numberOfLeadingZeros(ia[i]);
>     }
> 
> 
> The x86 backend supports vectorizing `numberOfLeadingZeros()` on avx2 platforms, but it uses `evpmovqd()` to do the casting for `CountLeadingZerosV`[3], which can only be used when `UseAVX > 2`[4]. After the refactoring, the failure is fixed as a natural consequence.
> 
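A loop of the kind shown in item 4 can be checked against an independent scalar reference. The following is a minimal sketch under stated assumptions and is not the testcase added by the patch: the class name, `LENGTH` value, seed and iteration count are all hypothetical. Whether C2 actually vectorizes the kernel depends on the platform and flags, and the assertion itself would only show up in a debug build on the affected hardware.

```java
// Hypothetical harness: repeatedly runs the loop so the JIT may compile it,
// and compares every result with a bit-by-bit reference.
final class LeadingZerosLoopCheck {
    static final int LENGTH = 1024; // assumed size, not taken from the patch

    // The loop shape from the description above.
    static void kernel(long[] ia, int[] ic) {
        for (int i = 0; i < LENGTH; ++i) {
            ic[i] = Long.numberOfLeadingZeros(ia[i]);
        }
    }

    // Independent reference: shift right until the value is zero.
    static int referenceNlz(long x) {
        int n = 64;
        while (x != 0) {
            x >>>= 1;
            n--;
        }
        return n;
    }

    // Returns true if every iteration matches the reference.
    static boolean run(int iterations) {
        java.util.Random r = new java.util.Random(42);
        long[] ia = new long[LENGTH];
        int[] ic = new int[LENGTH];
        for (int i = 1; i < LENGTH; i++) {
            ia[i] = r.nextLong() >>> r.nextInt(64); // vary the leading-zero count
        }
        // ia[0] stays 0 to cover the all-zero input.
        for (int it = 0; it < iterations; it++) {
            kernel(ia, ic);
            for (int i = 0; i < LENGTH; i++) {
                if (ic[i] != referenceNlz(ia[i])) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(run(20_000) ? "results match" : "MISMATCH");
    }
}
```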
> Tier 1~3 passed with no new failures on Linux AArch64/X86 platform.
> 
> [1] https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#bitCount(long)
>     https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfTrailingZeros(long)
>     https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfLeadingZeros(long)
> [2] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/LongVector.java#L687
> [3] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/hotspot/cpu/x86/x86.ad#L9418
> [4] https://github.com/openjdk/jdk/blob/fc616588c1bf731150a9d9b80033bb589bcb231f/src/hotspot/cpu/x86/assembler_x86.cpp#L2239

This pull request has now been integrated.

Changeset: 4458de95
Author:    Fei Gao <fgao at openjdk.org>
Committer: Pengfei Li <pli at openjdk.org>
URL:       https://git.openjdk.org/jdk/commit/4458de95f845c036c1c8e28df7043e989beaee98
Stats:     303 lines in 11 files changed: 161 ins; 131 del; 11 mod

8297172: Fix some issues of auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`

Reviewed-by: kvn, thartmann

-------------

PR: https://git.openjdk.org/jdk/pull/11405

