RFR: 8322770: Implement C2 VectorizedHashCode on AArch64
Andrew Haley
aph at openjdk.org
Wed Apr 24 16:33:28 UTC 2024
On Tue, 26 Mar 2024 13:59:12 GMT, Mikhail Ablakatov <duke at openjdk.org> wrote:
> Hello,
>
> Please review the following PR for [JDK-8322770 Implement C2 VectorizedHashCode on AArch64](https://bugs.openjdk.org/browse/JDK-8322770). It follows previous work done in https://github.com/openjdk/jdk/pull/16629 and https://github.com/openjdk/jdk/pull/10847 for RISC-V and x86 respectively.
>
> The code to calculate a hash code consists of two parts: a vectorized loop of Neon instructions that processes 4 or 8 elements per iteration, depending on the data type, and a fully unrolled scalar "loop" that processes up to 7 tail elements.
>
> At the time of writing I don't see potential benefits from providing an SVE/SVE2 implementation, but one could be added as a follow-up or independently later if required.
>
> # Performance
>
> ## Neoverse N1
>
>
> --------------------------------------------------------------------------------------------
> Version Baseline This patch
> --------------------------------------------------------------------------------------------
> Benchmark (size) Mode Cnt Score Error Score Error Units
> --------------------------------------------------------------------------------------------
> ArraysHashCode.bytes 1 avgt 15 1.249 ± 0.060 1.247 ± 0.062 ns/op
> ArraysHashCode.bytes 10 avgt 15 8.754 ± 0.028 4.387 ± 0.015 ns/op
> ArraysHashCode.bytes 100 avgt 15 98.596 ± 0.051 26.655 ± 0.097 ns/op
> ArraysHashCode.bytes 10000 avgt 15 10150.578 ± 1.352 2649.962 ± 216.744 ns/op
> ArraysHashCode.chars 1 avgt 15 1.286 ± 0.062 1.246 ± 0.054 ns/op
> ArraysHashCode.chars 10 avgt 15 8.731 ± 0.002 5.344 ± 0.003 ns/op
> ArraysHashCode.chars 100 avgt 15 98.632 ± 0.048 23.023 ± 0.142 ns/op
> ArraysHashCode.chars 10000 avgt 15 10150.658 ± 3.374 2410.504 ± 8.872 ns/op
> ArraysHashCode.ints 1 avgt 15 1.189 ± 0.005 1.187 ± 0.001 ns/op
> ArraysHashCode.ints 10 avgt 15 8.730 ± 0.002 5.676 ± 0.001 ns/op
> ArraysHashCode.ints 100 avgt 15 98.559 ± 0.016 24.378 ± 0.006 ns/op
> ArraysHashCode.ints 10000 avgt 15 10148.752 ± 1.336 2419.015 ± 0.492 ns/op
> ArraysHashCode.multibytes 1 avgt 15 1.037 ± 0.001 1.037 ± 0.001 ns/op
> ArraysHashCode.multibytes 10 avgt 15 5.4...
> > You only need one load, add, and multiply per iteration.
> > You don't need to add across columns until the end.
>
> @theRealAph , I've tried to follow the suggested approach, please find the patch and result in [mikabl-arm at e352f30](https://github.com/mikabl-arm/jdk/commit/e352f30d89e99417231ae7bb66b325c68a76eef9) .
>
> So far I haven't been able to see any performance benefit compared to the implementation from this PR.
Yeah, true. I can see why that's happening from -prof perfnorm:
4.30% ↗ 0x0000ffff70b3cdec: mul v1.4s, v1.4s, v3.4s
0.45% │ 0x0000ffff70b3cdf0: ld1 {v0.4s}, [x1], #16
81.54% │ 0x0000ffff70b3cdf4: add v1.4s, v1.4s, v0.4s
4.83% │ 0x0000ffff70b3cdf8: subs w2, w2, #4
3.55% ╰ 0x0000ffff70b3cdfc: b.hs #0xffff70b3cdec
ArraysHashCode.ints:IPC 1024 avgt 1.395 insns/clk
This is 1.4 insns/clk on a machine that can run 8 insns/clk. Because we do one load, then the MAC, then another load after the MAC, then a MAC that depends on that load, we stall the whole core waiting for the next load. Everything is serialized. Neoverse looks the same as Apple M1 here.
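One way to break that serial chain is to keep several independent accumulators and only combine them at the end. Here's a hypothetical scalar sketch at the Java level (class and method names are mine, not from this PR, and this is not the intrinsic itself) modeling the idea with four accumulators, each stepped by 31^4 per iteration:

```java
import java.util.Arrays;

public class Hash4 {
    // Hypothetical sketch, not the intrinsic: four independent
    // accumulators break the multiply-add dependency chain of the
    // naive h = h * 31 + a[i] loop.
    static int hash4(int[] a) {
        int h0 = 0, h1 = 0, h2 = 0, h3 = 0;
        final int P4 = 31 * 31 * 31 * 31;       // 31^4
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {     // four independent MACs per iteration
            h0 = h0 * P4 + a[i];
            h1 = h1 * P4 + a[i + 1];
            h2 = h2 * P4 + a[i + 2];
            h3 = h3 * P4 + a[i + 3];
        }
        // Combine once at the end: lane j carries weight 31^(3-j).
        int h = h0 * (31 * 31 * 31) + h1 * (31 * 31) + h2 * 31 + h3;
        for (; i < a.length; i++)               // scalar tail
            h = h * 31 + a[i];
        // Fold in Arrays.hashCode's initial value of 1, worth 31^n.
        int p = 1;
        for (int k = 0; k < a.length; k++) p *= 31;
        return p + h;
    }

    public static void main(String[] args) {
        int[] a = java.util.stream.IntStream.range(0, 37).toArray();
        System.out.println(hash4(a) == Arrays.hashCode(a)); // prints true
    }
}
```

The four MACs in the loop body have no dependencies on each other, so an out-of-order core can have all of them in flight at once instead of waiting on a single accumulator.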
I guess the real question here is what we want. x86's engineers get this:
Benchmark (size) Mode Cnt Score Error Units
ArraysHashCode.ints 1 avgt 5 0.834 ± 0.001 ns/op
ArraysHashCode.ints 10 avgt 5 5.500 ± 0.016 ns/op
ArraysHashCode.ints 100 avgt 5 20.330 ± 0.103 ns/op
ArraysHashCode.ints 10000 avgt 5 1365.347 ± 1.045 ns/op
(And that's on my desktop box from 2018, an inferior piece of hardware.)
This is how they do it:
↗ 0x00007f0634c21c17: imul ebx,r11d
0.02% │ 0x00007f0634c21c1b: vmovdqu ymm12,YMMWORD PTR [rdi+rsi*4]
│ 0x00007f0634c21c20: vmovdqu ymm2,YMMWORD PTR [rdi+rsi*4+0x20]
5.36% │ 0x00007f0634c21c26: vmovdqu ymm0,YMMWORD PTR [rdi+rsi*4+0x40]
│ 0x00007f0634c21c2c: vmovdqu ymm1,YMMWORD PTR [rdi+rsi*4+0x60]
0.05% │ 0x00007f0634c21c32: vpmulld ymm8,ymm8,ymm3
11.12% │ 0x00007f0634c21c37: vpaddd ymm8,ymm8,ymm12
4.97% │ 0x00007f0634c21c3c: vpmulld ymm9,ymm9,ymm3
15.09% │ 0x00007f0634c21c41: vpaddd ymm9,ymm9,ymm2
5.16% │ 0x00007f0634c21c45: vpmulld ymm10,ymm10,ymm3
15.51% │ 0x00007f0634c21c4a: vpaddd ymm10,ymm10,ymm0
5.44% │ 0x00007f0634c21c4e: vpmulld ymm11,ymm11,ymm3
16.39% │ 0x00007f0634c21c53: vpaddd ymm11,ymm11,ymm1
4.80% │ 0x00007f0634c21c57: add esi,0x20
│ 0x00007f0634c21c5a: cmp esi,ecx
╰ 0x00007f0634c21c5c: jl 0x00007f0634c21c17
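In scalar terms, that loop is 32 independent accumulators (4 ymm registers of 8 int lanes), each stepped by 31^32 per iteration, with a single reduction across lanes at the end, i.e. "don't add across columns until the end". A hedged Java model of that structure (my own sketch, not code from the PR):

```java
import java.util.Arrays;

public class Hash32 {
    static final int LANES = 32;   // 4 ymm registers x 8 int lanes

    static int pow31(int n) {      // 31^n with normal int wraparound
        int p = 1;
        for (int i = 0; i < n; i++) p *= 31;
        return p;
    }

    // Scalar model of the unrolled vector loop: every accumulator is
    // independent, so all the multiplies and adds can be in flight at once.
    static int hash(int[] a) {
        int[] acc = new int[LANES];
        final int step = pow31(LANES);          // the broadcast constant (ymm3 above)
        int i = 0;
        for (; i + LANES <= a.length; i += LANES)
            for (int j = 0; j < LANES; j++)
                acc[j] = acc[j] * step + a[i + j];
        int h = 0;
        for (int j = 0; j < LANES; j++)         // reduce across lanes once
            h = h * 31 + acc[j];
        for (; i < a.length; i++)               // scalar tail
            h = h * 31 + a[i];
        return pow31(a.length) + h;             // Arrays.hashCode's seed of 1
    }

    public static void main(String[] args) {
        int[] a = java.util.stream.IntStream.range(0, 100).toArray();
        System.out.println(hash(a) == Arrays.hashCode(a)); // prints true
    }
}
```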
So, do we want to try to beat them on Arm, or not? They surely want to beat Arm.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/18487#issuecomment-2075376464
More information about the hotspot-dev
mailing list