RFR: 8353266: C2: Wrong execution with Integer.bitCount(int) intrinsic on AArch64
Marc Chevalier
mchevalier at openjdk.org
Fri May 30 15:42:03 UTC 2025
### Problem
On Aarch64, using `Integer.bitCount` can modify its argument. The problem comes from the implementation of `popCountI` on Aarch64. For instance, that's what we get with the reproducer `Reduced.java` on the related issue:
; Load lFld into local x
ldr x11, [x10, #120]
; popCountI
mov w11, w11
mov v16.d[0], x11
cnt v16.8b, v16.8b
addv b16, v16.8b
mov x13, v16.d[0]
; [...]
; store local x (which is believed to still contain lFld) into result
str x11, [x10, #128]
The instruction `mov w11, w11` is used to cut the 32 higher bits of `x11` since we use `popCountI` (from `Integer.bitCount`): on aarch64 (like other architectures), assigning the 32 lower bits of a register reset the 32 higher bits. Short: the input is modified, but the implementation of `popCountI` doesn't declare it:
instruct popCountI(iRegINoSp dst, iRegIorL2I src, vRegF tmp) %{
match(Set dst (PopCountI src));
effect(TEMP tmp);
[...]
%}
But then, why resetting the upper word of `x11`? It all starts with vector instructions:
cnt v16.8b, v16.8b
addv b16, v16.8b
The `8b` specifies that it operates on the 8 lower bytes of `v16`, it would be nice to simply use `4b`, but that doesn't exist: vector instructions can only work on either the whole 128-bit register, or the 64 lower bits (by blocks of 1, 2, 4, 8 or 16 bytes). There is no suffix (and encoding) for a vector instruction to work only on the 32 lower bits, so not to pollute the bit count, we need to reset the 32 higher bits of `v16.d[0]` (aka `d16`), that is `v16.s[1]`, that is `v16[32:63]` in a more bit-explicit notation. Moreover, unlike with general purpose register doing
mov v16.s[0], w11
would set `v16[0:31]` to `w11`, but not reset `v16[32:63]`. Which makes sense! Otherwise, using vector registers would be impractical if writing any piece would reset the rest... So we indeed need to set all of `v16[0:63]`, which
mov w11, w11
mov v16.d[0], x11
does, but by destroying `x11`.
### Solution
Simply adding `USE_KILL src` in the effects would be nice, but unfortunately not possible: `iRegIorL2I` is an operand class (either a 32-bit register or a L2I of a 64-bit register) and those cannot be used in effect lists.
The way I went for is rather not to modify the source, but rather do write the two lower words of `v16` we are interested in separately:
mov v16.s[1], wzr ; Reset the 1-indexed word of v16, that is v16[32:63] <- 0
mov v16.s[0], w11 ; Set the 0-indexed word of v16 to w11, that is v[0:31] <- w11
cnt v16.8b, v16.8b
addv b16, v16.8b
mov x13, v16.s[0]
Unlike other solutions, this is relatively straightforward as it doesn't write twice the same bits, as for instance, this would:
mov v16.d[0], xzr ; Reset the 0-indexed double word of v16, that is v16[0:63] <- 0
mov v16.s[0], w11 ; Set the 0-indexed word of v16 to w11, that is v[0:31] <- w11
and it doesn't use additional temporaries, like this would:
mov w12, w11 ; Using a fresh register x12
mov v16.d[0], x12
Using the zero register rather than an immediate is convenient as it allows to set 32 bits at once, while a 32-bit immediate would not fit in a single instruction.
### Format
The printing of this instruction is not very satisfactory. We used to have something that renders in OptoAssembly
movw l2i(R29), l2i(R29)
mov V16, l2i(R29) # vector (1D)
cnt V16, V16 # vector (8B)
addv V16, V16 # vector (8B)
mov R13, V16 # vector (1D)
This is... somewhat arguable. With context, I can understand or guess what `movw l2i(R29), l2i(R29)` means, but I don't think it's a very nice printout. Also, it's not clear that the second instruction works on the lower word of `V16`. Alas, my new version is not much better:
mov V16, zr # vector (1S)
mov V16, l2i(R29) # vector (1S)
cnt V16, V16 # vector (8B)
addv V16, V16 # vector (8B)
mov R13, V16 # vector (1D)
It's not clear that the first instruction is on the 1-indexed word of `V16` while the second is on the 0-indexed word. I couldn't find a nicer example in a similar situation, so I'm open to suggestions! Maybe simply hardcoding it in the format? as such:
format %{ "mov $tmp.s[1], zr\t# vector (1S)\n\t"
"mov $tmp.s[0], $src\t# vector (1S)\n\t"
"cnt $tmp, $tmp\t# vector (8B)\n\t"
"addv $tmp, $tmp\t# vector (8B)\n\t"
"mov $dst, $tmp\t# vector (1D)" %}
Not sure what's the best practice here.
-------------
Commit messages:
- Add randomization
- Adapt test
- Add test
- Don't change src, set directly in the vector register
Changes: https://git.openjdk.org/jdk/pull/25551/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25551&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8353266
Stats: 84 lines in 2 files changed: 80 ins; 0 del; 4 mod
Patch: https://git.openjdk.org/jdk/pull/25551.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/25551/head:pull/25551
PR: https://git.openjdk.org/jdk/pull/25551
More information about the hotspot-compiler-dev
mailing list