RFR: 8353266: C2: Wrong execution with Integer.bitCount(int) intrinsic on AArch64

Fri May 30 15:42:03 UTC 2025

### Problem

On Aarch64, using  `Integer.bitCount` can modify its argument. The problem comes from the implementation of `popCountI` on Aarch64. For instance, that's what we get with the reproducer `Reduced.java` on the related issue:

; Load lFld into local x
ldr  x11,      [x10, #120]
; popCountI
mov  w11,      w11
mov  v16.d[0], x11
cnt  v16.8b,   v16.8b
addv b16,      v16.8b
mov  x13,      v16.d[0]
; [...]
; store local x (which is believed to still contain lFld) into result
str  x11,      [x10, #128]

The instruction `mov  w11, w11` is used to cut the 32 higher bits of `x11` since we use `popCountI` (from `Integer.bitCount`): on aarch64 (like other architectures), assigning the 32 lower bits of a register reset the 32 higher bits. Short: the input is modified, but the implementation of `popCountI` doesn't declare it:

instruct popCountI(iRegINoSp dst, iRegIorL2I src, vRegF tmp) %{
  match(Set dst (PopCountI src));
  effect(TEMP tmp);
  [...]
%}

But then, why resetting the upper word of `x11`? It all starts with vector instructions:

cnt  v16.8b,   v16.8b
addv b16,      v16.8b

The `8b` specifies that it operates on the 8 lower bytes of `v16`, it would be nice to simply use `4b`, but that doesn't exist: vector instructions can only work on either the whole 128-bit register, or the 64 lower bits (by blocks of 1, 2, 4, 8 or 16 bytes). There is no suffix (and encoding) for a vector instruction to work only on the 32 lower bits, so not to pollute the bit count, we need to reset the 32 higher bits of `v16.d[0]` (aka `d16`), that is `v16.s[1]`, that is `v16[32:63]` in a more bit-explicit notation. Moreover, unlike with general purpose register doing

mov  v16.s[0], w11

would set `v16[0:31]` to `w11`, but not reset `v16[32:63]`. Which makes sense! Otherwise, using vector registers would be impractical if writing any piece would reset the rest... So we indeed need to set all of `v16[0:63]`, which

mov  w11,      w11
mov  v16.d[0], x11

does, but by destroying `x11`.

### Solution

Simply adding `USE_KILL src` in the effects would be nice, but unfortunately not possible: `iRegIorL2I` is an operand class (either a 32-bit register or a L2I of a 64-bit register) and those cannot be used in effect lists.

The way I went for is rather not to modify the source, but rather do write the two lower words of `v16` we are interested in separately:

mov  v16.s[1], wzr      ; Reset the 1-indexed word of v16, that is v16[32:63] <- 0
mov  v16.s[0], w11      ; Set the 0-indexed word of v16 to w11, that is v[0:31] <- w11
cnt  v16.8b,   v16.8b
addv b16,      v16.8b
mov  x13,      v16.s[0]

Unlike other solutions, this is relatively straightforward as it doesn't write twice the same bits, as for instance, this would:

mov  v16.d[0], xzr      ; Reset the 0-indexed double word of v16, that is v16[0:63] <- 0
mov  v16.s[0], w11      ; Set the 0-indexed word of v16 to w11, that is v[0:31] <- w11

and it doesn't use additional temporaries, like this would:

mov  w12,      w11      ; Using a fresh register x12
mov  v16.d[0], x12

Using the zero register rather than an immediate is convenient as it allows to set 32 bits at once, while a 32-bit immediate would not fit in a single instruction.

### Format

The printing of this instruction is not very satisfactory. We used to have something that renders in OptoAssembly

movw l2i(R29), l2i(R29)
mov  V16, l2i(R29) # vector (1D)
cnt  V16, V16      # vector (8B)
addv V16, V16      # vector (8B)
mov  R13, V16      # vector (1D)

This is... somewhat arguable. With context, I can understand or guess what `movw l2i(R29), l2i(R29)` means, but I don't think it's a very nice printout. Also, it's not clear that the second instruction works on the lower word of `V16`. Alas, my new version is not much better:

mov  V16, zr       # vector (1S)
mov  V16, l2i(R29) # vector (1S)
cnt  V16, V16      # vector (8B)
addv V16, V16      # vector (8B)
mov  R13, V16      # vector (1D)

It's not clear that the first instruction is on the 1-indexed word of `V16` while the second is on the 0-indexed word. I couldn't find a nicer example in a similar situation, so I'm open to suggestions! Maybe simply hardcoding it in the format? as such:

format %{ "mov    $tmp.s[1], zr\t# vector (1S)\n\t"
          "mov    $tmp.s[0], $src\t# vector (1S)\n\t"
          "cnt    $tmp, $tmp\t# vector (8B)\n\t"
          "addv   $tmp, $tmp\t# vector (8B)\n\t"
          "mov    $dst, $tmp\t# vector (1D)" %}

Not sure what's the best practice here.

-------------

Commit messages:
 - Add randomization
 - Adapt test
 - Add test
 - Don't change src, set directly in the vector register

Changes: https://git.openjdk.org/jdk/pull/25551/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25551&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8353266
  Stats: 84 lines in 2 files changed: 80 ins; 0 del; 4 mod
  Patch: https://git.openjdk.org/jdk/pull/25551.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/25551/head:pull/25551

PR: https://git.openjdk.org/jdk/pull/25551