RFR: 8353266: C2: Wrong execution with Integer.bitCount(int) intrinsic on AArch64

Sat May 31 14:31:50 UTC 2025

On Fri, 30 May 2025 15:33:14 GMT, Marc Chevalier <mchevalier at openjdk.org> wrote:

> ### Problem
> 
> On Aarch64, using  `Integer.bitCount` can modify its argument. The problem comes from the implementation of `popCountI` on Aarch64. For instance, that's what we get with the reproducer `Reduced.java` on the related issue:
> 
> ; Load lFld into local x
> ldr  x11,      [x10, #120]
> ; popCountI
> mov  w11,      w11
> mov  v16.d[0], x11
> cnt  v16.8b,   v16.8b
> addv b16,      v16.8b
> mov  x13,      v16.d[0]
> ; [...]
> ; store local x (which is believed to still contain lFld) into result
> str  x11,      [x10, #128]
> 
> 
> The instruction `mov  w11, w11` is used to cut the 32 higher bits of `x11` since we use `popCountI` (from `Integer.bitCount`): on aarch64 (like other architectures), assigning the 32 lower bits of a register reset the 32 higher bits. Short: the input is modified, but the implementation of `popCountI` doesn't declare it:
> 
> instruct popCountI(iRegINoSp dst, iRegIorL2I src, vRegF tmp) %{
>   match(Set dst (PopCountI src));
>   effect(TEMP tmp);
>   [...]
> %}
> 
> 
> But then, why resetting the upper word of `x11`? It all starts with vector instructions:
> 
> cnt  v16.8b,   v16.8b
> addv b16,      v16.8b
> 
> The `8b` specifies that it operates on the 8 lower bytes of `v16`, it would be nice to simply use `4b`, but that doesn't exist: vector instructions can only work on either the whole 128-bit register, or the 64 lower bits (by blocks of 1, 2, 4, 8 or 16 bytes). There is no suffix (and encoding) for a vector instruction to work only on the 32 lower bits, so not to pollute the bit count, we need to reset the 32 higher bits of `v16.d[0]` (aka `d16`), that is `v16.s[1]`, that is `v16[32:63]` in a more bit-explicit notation. Moreover, unlike with general purpose register doing
> 
> mov  v16.s[0], w11
> 
> would set `v16[0:31]` to `w11`, but not reset `v16[32:63]`. Which makes sense! Otherwise, using vector registers would be impractical if writing any piece would reset the rest... So we indeed need to set all of `v16[0:63]`, which
> 
> mov  w11,      w11
> mov  v16.d[0], x11
> 
> does, but by destroying `x11`.
> 
> ### Solution
> 
> Simply adding `USE_KILL src` in the effects would be nice, but unfortunately not possible: `iRegIorL2I` is an operand class (either a 32-bit register or a L2I of a 64-bit register) and those cannot be used in effect lists.
> 
> The way I went for is rather not to modify the source, but rather do write the two lower words of `v16` we are interested in separately:
> 
> mov  v16.s[1], wzr      ; Reset the 1-indexed word of v16, that is v16[32:63] <- 0
> mov  v16.s[0], w11      ; Set the 0-ind...

src/hotspot/cpu/aarch64/aarch64.ad line 7771:

> 7769:   ins_encode %{
> 7770:     __ mov($tmp$$FloatRegister, __ S, 1, zr);             // tmp[32:63] <- 0
> 7771:     __ mov($tmp$$FloatRegister, __ S, 0, $src$$Register); // tmp[ 0:31] <- src

"Where the entire 128-bit wide register is not fully utilized, the vector or scalar quantity is held in the least significant bits of the register, with the most significant bits being cleared to zero on a write."

Suggestion:

    __ fmovs($tmp$$FloatRegister, $src$$Register);

should do it.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/25551#discussion_r2117909516