RFR: 8353266: C2: Wrong execution with Integer.bitCount(int) intrinsic on AArch64

Sat May 31 03:13:53 UTC 2025

On Fri, 30 May 2025 15:33:14 GMT, Marc Chevalier <mchevalier at openjdk.org> wrote:

> ### Problem
> 
> On Aarch64, using  `Integer.bitCount` can modify its argument. The problem comes from the implementation of `popCountI` on Aarch64. For instance, that's what we get with the reproducer `Reduced.java` on the related issue:
> 
> ; Load lFld into local x
> ldr  x11,      [x10, #120]
> ; popCountI
> mov  w11,      w11
> mov  v16.d[0], x11
> cnt  v16.8b,   v16.8b
> addv b16,      v16.8b
> mov  x13,      v16.d[0]
> ; [...]
> ; store local x (which is believed to still contain lFld) into result
> str  x11,      [x10, #128]
> 
> 
> The instruction `mov  w11, w11` is used to cut the 32 higher bits of `x11` since we use `popCountI` (from `Integer.bitCount`): on aarch64 (like other architectures), assigning the 32 lower bits of a register reset the 32 higher bits. Short: the input is modified, but the implementation of `popCountI` doesn't declare it:
> 
> instruct popCountI(iRegINoSp dst, iRegIorL2I src, vRegF tmp) %{
>   match(Set dst (PopCountI src));
>   effect(TEMP tmp);
>   [...]
> %}
> 
> 
> But then, why resetting the upper word of `x11`? It all starts with vector instructions:
> 
> cnt  v16.8b,   v16.8b
> addv b16,      v16.8b
> 
> The `8b` specifies that it operates on the 8 lower bytes of `v16`, it would be nice to simply use `4b`, but that doesn't exist: vector instructions can only work on either the whole 128-bit register, or the 64 lower bits (by blocks of 1, 2, 4, 8 or 16 bytes). There is no suffix (and encoding) for a vector instruction to work only on the 32 lower bits, so not to pollute the bit count, we need to reset the 32 higher bits of `v16.d[0]` (aka `d16`), that is `v16.s[1]`, that is `v16[32:63]` in a more bit-explicit notation. Moreover, unlike with general purpose register doing
> 
> mov  v16.s[0], w11
> 
> would set `v16[0:31]` to `w11`, but not reset `v16[32:63]`. Which makes sense! Otherwise, using vector registers would be impractical if writing any piece would reset the rest... So we indeed need to set all of `v16[0:63]`, which
> 
> mov  w11,      w11
> mov  v16.d[0], x11
> 
> does, but by destroying `x11`.
> 
> ### Solution
> 
> Simply adding `USE_KILL src` in the effects would be nice, but unfortunately not possible: `iRegIorL2I` is an operand class (either a 32-bit register or a L2I of a 64-bit register) and those cannot be used in effect lists.
> 
> The way I went for is rather not to modify the source, but rather do write the two lower words of `v16` we are interested in separately:
> 
> mov  v16.s[1], wzr      ; Reset the 1-indexed word of v16, that is v16[32:63] <- 0
> mov  v16.s[0], w11      ; Set the 0-ind...

test/hotspot/jtreg/compiler/intrinsics/BitCountIAarch64PreservesArgument.java line 58:

> 56:             if (result != 0xfedc_ba98_7654_3210L) {
> 57:                 // Wrongly outputs the cut input 0x7654_3210 == 1985229328
> 58:                 throw new RuntimeException("Wrong result. lFld=" + lFld + "; result=" + result);

How about:

throw new RuntimeException("Wrong result. Expected result = " + lFld + "; Actual result = " + result);

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/25551#discussion_r2117132388