RFR: 8353266: C2: Wrong execution with Integer.bitCount(int) intrinsic on AArch64 [v2]
Andrew Haley
aph at openjdk.org
Mon Jun 2 11:56:52 UTC 2025
On Mon, 2 Jun 2025 10:37:11 GMT, Marc Chevalier <mchevalier at openjdk.org> wrote:
>> ### Problem
>>
>> On Aarch64, using `Integer.bitCount` can modify its argument. The problem comes from the implementation of `popCountI` on Aarch64. For instance, that's what we get with the reproducer `Reduced.java` on the related issue:
>>
>> ; Load lFld into local x
>> ldr x11, [x10, #120]
>> ; popCountI
>> mov w11, w11
>> mov v16.d[0], x11
>> cnt v16.8b, v16.8b
>> addv b16, v16.8b
>> mov x13, v16.d[0]
>> ; [...]
>> ; store local x (which is believed to still contain lFld) into result
>> str x11, [x10, #128]
>>
>>
>> The instruction `mov w11, w11` is used to cut the 32 higher bits of `x11` since we use `popCountI` (from `Integer.bitCount`): on aarch64 (like other architectures), assigning the 32 lower bits of a register reset the 32 higher bits. Short: the input is modified, but the implementation of `popCountI` doesn't declare it:
>>
>> instruct popCountI(iRegINoSp dst, iRegIorL2I src, vRegF tmp) %{
>> match(Set dst (PopCountI src));
>> effect(TEMP tmp);
>> [...]
>> %}
>>
>>
>> But then, why resetting the upper word of `x11`? It all starts with vector instructions:
>>
>> cnt v16.8b, v16.8b
>> addv b16, v16.8b
>>
>> The `8b` specifies that it operates on the 8 lower bytes of `v16`, it would be nice to simply use `4b`, but that doesn't exist: vector instructions can only work on either the whole 128-bit register, or the 64 lower bits (by blocks of 1, 2, 4, 8 or 16 bytes). There is no suffix (and encoding) for a vector instruction to work only on the 32 lower bits, so not to pollute the bit count, we need to reset the 32 higher bits of `v16.d[0]` (aka `d16`), that is `v16.s[1]`, that is `v16[32:63]` in a more bit-explicit notation. Moreover, unlike with general purpose register doing
>>
>> mov v16.s[0], w11
>>
>> would set `v16[0:31]` to `w11`, but not reset `v16[32:63]`. Which makes sense! Otherwise, using vector registers would be impractical if writing any piece would reset the rest... So we indeed need to set all of `v16[0:63]`, which
>>
>> mov w11, w11
>> mov v16.d[0], x11
>>
>> does, but by destroying `x11`.
>>
>> ### Solution
>>
>> Simply adding `USE_KILL src` in the effects would be nice, but unfortunately not possible: `iRegIorL2I` is an operand class (either a 32-bit register or a L2I of a 64-bit register) and those cannot be used in effect lists.
>>
>> The way I went for is rather not to modify the source, but rather do write the two lower words of `v16` we are interested in separately:
>>
>> mov v16.s[1], wzr ...
>
> Marc Chevalier has updated the pull request incrementally with one additional commit since the last revision:
>
> Apply suggestions
Marked as reviewed by aph (Reviewer).
-------------
PR Review: https://git.openjdk.org/jdk/pull/25551#pullrequestreview-2888056446
More information about the hotspot-compiler-dev
mailing list