RFR: 8322770: Implement C2 VectorizedHashCode on AArch64

Andrew Haley aph at openjdk.org
Wed Apr 24 16:33:28 UTC 2024


On Tue, 26 Mar 2024 13:59:12 GMT, Mikhail Ablakatov <duke at openjdk.org> wrote:

> Hello,
> 
> Please review the following PR for [JDK-8322770 Implement C2 VectorizedHashCode on AArch64](https://bugs.openjdk.org/browse/JDK-8322770). It follows previous work done in https://github.com/openjdk/jdk/pull/16629 and https://github.com/openjdk/jdk/pull/10847 for RISC-V and x86 respectively. 
> 
> The code to calculate a hash code consists of two parts: a vectorized Neon loop that processes 4 or 8 elements per iteration depending on the data type, and a fully unrolled scalar "loop" that processes up to 7 tail elements.
> 
> At the time of writing I don't see potential benefits from providing an SVE/SVE2 implementation, but one could be added as a follow-up or independently later if required.
> 
> # Performance
> 
> ## Neoverse N1
> 
> 
>   --------------------------------------------------------------------------------------------
>   Version                                            Baseline           This patch
>   --------------------------------------------------------------------------------------------
>   Benchmark                   (size)  Mode  Cnt      Score    Error     Score     Error  Units
>   --------------------------------------------------------------------------------------------
>   ArraysHashCode.bytes             1  avgt   15      1.249 ±  0.060     1.247 ±   0.062  ns/op
>   ArraysHashCode.bytes            10  avgt   15      8.754 ±  0.028     4.387 ±   0.015  ns/op
>   ArraysHashCode.bytes           100  avgt   15     98.596 ±  0.051    26.655 ±   0.097  ns/op
>   ArraysHashCode.bytes         10000  avgt   15  10150.578 ±  1.352  2649.962 ± 216.744  ns/op
>   ArraysHashCode.chars             1  avgt   15      1.286 ±  0.062     1.246 ±   0.054  ns/op
>   ArraysHashCode.chars            10  avgt   15      8.731 ±  0.002     5.344 ±   0.003  ns/op
>   ArraysHashCode.chars           100  avgt   15     98.632 ±  0.048    23.023 ±   0.142  ns/op
>   ArraysHashCode.chars         10000  avgt   15  10150.658 ±  3.374  2410.504 ±   8.872  ns/op
>   ArraysHashCode.ints              1  avgt   15      1.189 ±  0.005     1.187 ±   0.001  ns/op
>   ArraysHashCode.ints             10  avgt   15      8.730 ±  0.002     5.676 ±   0.001  ns/op
>   ArraysHashCode.ints            100  avgt   15     98.559 ±  0.016    24.378 ±   0.006  ns/op
>   ArraysHashCode.ints          10000  avgt   15  10148.752 ±  1.336  2419.015 ±   0.492  ns/op
>   ArraysHashCode.multibytes        1  avgt   15      1.037 ±  0.001     1.037 ±   0.001  ns/op
>   ArraysHashCode.multibytes       10  avgt   15      5.4...

> > You only need one load, add, and multiply per iteration.
> > You don't need to add across columns until the end.
> 
> @theRealAph, I've tried to follow the suggested approach; please find the patch and results in [mikabl-arm at e352f30](https://github.com/mikabl-arm/jdk/commit/e352f30d89e99417231ae7bb66b325c68a76eef9).
> 
> So far I haven't been able to see any performance benefit compared to the implementation from this PR.

Yeah, true. I can see why that's happening from -prof perfnorm:


   4.30%  ↗   0x0000ffff70b3cdec:   mul		v1.4s, v1.4s, v3.4s
   0.45%  │   0x0000ffff70b3cdf0:   ld1		{v0.4s}, [x1], #16
  81.54%  │   0x0000ffff70b3cdf4:   add		v1.4s, v1.4s, v0.4s
   4.83%  │   0x0000ffff70b3cdf8:   subs		w2, w2, #4
   3.55%  ╰   0x0000ffff70b3cdfc:   b.hs		#0xffff70b3cdec

 ArraysHashCode.ints:IPC               1024  avgt          1.395          insns/clk


This is 1.4 insns/clk on a machine that can run 8 insns/clk. Because we do one load, then the MAC, then another load, then a MAC that depends on that load, we stall the whole core waiting for the next load. Everything is serialized. Neoverse looks the same as Apple M1 here.
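
For reference, here's a scalar Java model of what that Neon loop computes
(a sketch of the math, not the PR's actual code; I'm assuming v3.4s holds
31^4 in every lane). The single accumulator vector is exactly the problem:
iteration m's mul can't start until iteration m-1's add has completed, so
the loop runs at the latency of one mul+add chain, not at machine width.


    // Scalar model of the 4-lane Neon loop; acc[] stands in for v1.4s.
    static int neonLoopModel(int[] a) {
        final int P4 = 31 * 31 * 31 * 31;   // 31^4, the multiplier in v3.4s
        int[] acc = {0, 0, 0, 1};           // hash seed (1) rides in the last lane
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {
            for (int lane = 0; lane < 4; lane++) {
                acc[lane] = acc[lane] * P4 + a[i + lane];   // the mul + add pair above
            }
        }
        // Recombine lanes: h = acc0*31^3 + acc1*31^2 + acc2*31 + acc3
        int h = ((acc[0] * 31 + acc[1]) * 31 + acc[2]) * 31 + acc[3];
        for (; i < a.length; i++) {
            h = 31 * h + a[i];              // scalar tail, as in Arrays.hashCode
        }
        return h;
    }


This returns the same value as Arrays.hashCode(int[]) for non-null input.
The point is that all four lanes live in one register, so there's only one
dependency chain no matter how wide the vector is.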

I guess the real question here is what we want. x86's engineers get this:


Benchmark            (size)  Mode  Cnt     Score    Error  Units
ArraysHashCode.ints       1  avgt    5     0.834 ±  0.001  ns/op
ArraysHashCode.ints      10  avgt    5     5.500 ±  0.016  ns/op
ArraysHashCode.ints     100  avgt    5    20.330 ±  0.103  ns/op
ArraysHashCode.ints   10000  avgt    5  1365.347 ±  1.045  ns/op


(And that's on my desktop box from 2018, an inferior piece of hardware.)

This is how they do it:


          ↗  0x00007f0634c21c17:   imul   ebx,r11d
   0.02%  │  0x00007f0634c21c1b:   vmovdqu ymm12,YMMWORD PTR [rdi+rsi*4]
          │  0x00007f0634c21c20:   vmovdqu ymm2,YMMWORD PTR [rdi+rsi*4+0x20]
   5.36%  │  0x00007f0634c21c26:   vmovdqu ymm0,YMMWORD PTR [rdi+rsi*4+0x40]
          │  0x00007f0634c21c2c:   vmovdqu ymm1,YMMWORD PTR [rdi+rsi*4+0x60]
   0.05%  │  0x00007f0634c21c32:   vpmulld ymm8,ymm8,ymm3
  11.12%  │  0x00007f0634c21c37:   vpaddd ymm8,ymm8,ymm12
   4.97%  │  0x00007f0634c21c3c:   vpmulld ymm9,ymm9,ymm3
  15.09%  │  0x00007f0634c21c41:   vpaddd ymm9,ymm9,ymm2
   5.16%  │  0x00007f0634c21c45:   vpmulld ymm10,ymm10,ymm3
  15.51%  │  0x00007f0634c21c4a:   vpaddd ymm10,ymm10,ymm0
   5.44%  │  0x00007f0634c21c4e:   vpmulld ymm11,ymm11,ymm3
  16.39%  │  0x00007f0634c21c53:   vpaddd ymm11,ymm11,ymm1
   4.80%  │  0x00007f0634c21c57:   add    esi,0x20
          │  0x00007f0634c21c5a:   cmp    esi,ecx
          ╰  0x00007f0634c21c5c:   jl     0x00007f0634c21c17
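
The trick, as I read the disassembly: four independent accumulator vectors,
ymm8..ymm11, each with its own mul/add chain, so the four chains overlap in
the out-of-order core instead of stalling on one. In scalar Java the same
decomposition looks like this (a hedged sketch: each scalar below stands for
a whole 8-lane ymm register, so the real loop consumes 32 ints per trip and
multiplies by 31^32 rather than 31^4):


    // Four independent chains, modelling ymm8..ymm11 at 1/8 width.
    static int fourChainModel(int[] a) {
        final int P4 = 31 * 31 * 31 * 31;     // per-trip multiplier for 4 chains
        int h0 = 0, h1 = 0, h2 = 0, h3 = 1;   // seed rides in the last chain
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {
            h0 = h0 * P4 + a[i];              // chain 0 (ymm8): independent of h1..h3
            h1 = h1 * P4 + a[i + 1];          // chain 1 (ymm9)
            h2 = h2 * P4 + a[i + 2];          // chain 2 (ymm10)
            h3 = h3 * P4 + a[i + 3];          // chain 3 (ymm11)
        }
        int h = ((h0 * 31 + h1) * 31 + h2) * 31 + h3;   // recombine with powers of 31
        for (; i < a.length; i++) {
            h = 31 * h + a[i];                // scalar tail
        }
        return h;
    }
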

So, do we want to try to beat them on Arm, or not? They surely want to beat Arm.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18487#issuecomment-2075376464

