RFR: 8353237: [AArch64] Incorrect result of VectorizedHashCode intrinsic on Cortex-A53

Wed Apr 9 13:02:32 UTC 2025

On Mon, 7 Apr 2025 12:39:40 GMT, Aleksei Voitylov <avoitylov at openjdk.org> wrote:

> The root of the problem is that the VectorizedHashCode intrinsic introduced by JDK-8341194 is not aware of JDK-8079203. JDK-8079203 emits an additional nop with each madd instruction on Cortex-A53 as a workaround for Cortex-A53 erratum 835769, "AArch64 multiply-accumulate instruction might produce incorrect result". The current VectorizedHashCode intrinsic calculates a byte offset to jump into the unrolled loop code. It assumes 2 instructions per unrolled iteration (load and madd). JDK-8079203 adds an extra nop on Cortex-A53, which breaks the offset calculation.
>  
> The current offset calculation uses a shift instead of a multiplication, which works because each unrolled loop iteration contains a power-of-2 number of instructions. To keep it simple, this fix adds one more nop to each loop iteration on Cortex-A53 so that there are 4 instructions per iteration, which is also a power of 2. To account for that, the shift amount in the offset calculation is increased by 1, because each loop iteration has twice as many instructions on Cortex-A53.
>  
> This fix was tested on a Raspberry Pi 3 (based on Cortex-A53) by running the initially reported application and the hotspot jtreg tests (not a single test could be run on Cortex-A53 before the fix). After the fix, the specialized test hotspot/jtreg/compiler/intrinsics/TestArraysHashCode.java passes.
> 
> The performance gain from the intrinsic is also observed on Cortex-A53 using the ArraysHashCode benchmark.
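
For readers following the arithmetic in the description above, here is a minimal Java sketch of why the per-iteration instruction count has to stay a power of 2 when the offset is computed with a shift. It is not the HotSpot code: the class and method names are hypothetical, and the only facts it relies on are that AArch64 instructions are 4 bytes wide and that an iteration has 2 instructions normally and 4 on Cortex-A53 after the fix.

```java
// Illustrative sketch only; names are hypothetical and this is not the
// MacroAssembler code from the PR.
public final class UnrolledLoopOffset {
    // AArch64 instructions have a fixed width of 4 bytes.
    public static final int INSN_BYTES = 4;

    // Shift amount that replaces a multiplication by the byte size of one
    // unrolled iteration. Only defined when that size is a power of 2.
    public static int shiftFor(int insnsPerIteration) {
        int bytes = insnsPerIteration * INSN_BYTES;
        if (Integer.bitCount(bytes) != 1) {
            throw new IllegalArgumentException(
                bytes + " bytes per iteration is not a power of 2");
        }
        return Integer.numberOfTrailingZeros(bytes);
    }

    // Byte offset spanning the given number of unrolled iterations.
    public static int byteOffset(int iterations, int insnsPerIteration) {
        return iterations << shiftFor(insnsPerIteration);
    }

    public static void main(String[] args) {
        // load + madd: 2 instructions, 8 bytes, shift by 3
        System.out.println(byteOffset(3, 2));
        // load + madd + 2 nops on Cortex-A53: 4 instructions, 16 bytes, shift by 4
        System.out.println(byteOffset(3, 4));
    }
}
```

With 3 instructions per iteration (load, madd, and only the erratum nop) the iteration size would be 12 bytes, and `shiftFor(3)` throws because no shift amount can express it. Padding to 4 instructions moves the shift from 3 to 4, which is the "increased by 1" change described in the quoted text.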

It will have to be backported to 21u, as the intrinsic is already in 21u-dev. I'll do that once we address 8353237 in jdk/jdk.

Our cross compiler has, among other errata flags, `-mfix-cortex-a53-835769` on by default. IIRC Linaro also does this, but it is hard to tell whether all vendors building OpenJDK do the same.

Note that JDK-8079203 does not check the specific erratum flags in a processor. Instead, it adds the nop based on processor type, penalizing all A53s. I could not find the history of that decision; it may be because not all A53 implementors used the erratum flags, or for the sake of simplicity. It's also consistent with a typical GCC setting, so I decided not to touch this part.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24489#issuecomment-2789622909