RFR: 8322770: Implement C2 VectorizedHashCode on AArch64

Mon Apr 22 12:32:28 UTC 2024

On Tue, 26 Mar 2024 13:59:12 GMT, Mikhail Ablakatov <duke at openjdk.org> wrote:

> Hello,
> 
> Please review the following PR for [JDK-8322770 Implement C2 VectorizedHashCode on AArch64](https://bugs.openjdk.org/browse/JDK-8322770). It follows previous work done in https://github.com/openjdk/jdk/pull/16629 and https://github.com/openjdk/jdk/pull/10847 for RISC-V and x86 respectively. 
> 
> The code to calculate a hash code consists of two parts: a vectorized loop of Neon instruction that process 4 or 8 elements per iteration depending on the data type and a fully unrolled scalar "loop" that processes up to 7 tail elements.
> 
> At the time of writing this I don't see potential benefits from providing SVE/SVE2 implementation, but it could be added as a follow-up or independently later if required.
> 
> # Performance
> 
> ## Neoverse N1
> 
> 
>   --------------------------------------------------------------------------------------------
>   Version                                            Baseline           This patch
>   --------------------------------------------------------------------------------------------
>   Benchmark                   (size)  Mode  Cnt      Score    Error     Score     Error  Units
>   --------------------------------------------------------------------------------------------
>   ArraysHashCode.bytes             1  avgt   15      1.249 ?  0.060     1.247 ?   0.062  ns/op
>   ArraysHashCode.bytes            10  avgt   15      8.754 ?  0.028     4.387 ?   0.015  ns/op
>   ArraysHashCode.bytes           100  avgt   15     98.596 ?  0.051    26.655 ?   0.097  ns/op
>   ArraysHashCode.bytes         10000  avgt   15  10150.578 ?  1.352  2649.962 ? 216.744  ns/op
>   ArraysHashCode.chars             1  avgt   15      1.286 ?  0.062     1.246 ?   0.054  ns/op
>   ArraysHashCode.chars            10  avgt   15      8.731 ?  0.002     5.344 ?   0.003  ns/op
>   ArraysHashCode.chars           100  avgt   15     98.632 ?  0.048    23.023 ?   0.142  ns/op
>   ArraysHashCode.chars         10000  avgt   15  10150.658 ?  3.374  2410.504 ?   8.872  ns/op
>   ArraysHashCode.ints              1  avgt   15      1.189 ?  0.005     1.187 ?   0.001  ns/op
>   ArraysHashCode.ints             10  avgt   15      8.730 ?  0.002     5.676 ?   0.001  ns/op
>   ArraysHashCode.ints            100  avgt   15     98.559 ?  0.016    24.378 ?   0.006  ns/op
>   ArraysHashCode.ints          10000  avgt   15  10148.752 ?  1.336  2419.015 ?   0.492  ns/op
>   ArraysHashCode.multibytes        1  avgt   15      1.037 ?  0.001     1.037 ?   0.001  ns/op
>   ArraysHashCode.multibytes       10  avgt   15      5.4...

You only need one load, add, and multiply per iteration.
You don't need to add across columns until the end.

This is an example of how to do it. The full thing is at https://gist.github.com/theRealAph/cbc85299d6cd24101d46a06c12a97ce6.

    public static int vectorizedHashCode(int result, int[] a, int fromIndex, int length) {
        if (length < WIDTH) {
            return hashCode(result, a, fromIndex, length);
        }
        int offset = fromIndex;
        int[] sum = new int[WIDTH];
        sum[WIDTH - 1] = result;
        int[] temp = new int[WIDTH];
        int remaining = length;
        while (remaining >= WIDTH * 2) {
            vmult(sum, sum, n31powerWIDTH);
            vload(temp, a, offset);
            vadd(sum, sum, temp);
            offset += WIDTH;
            remaining -= WIDTH;
        }
        vmult(sum, sum, n31powerWIDTH);
        vload(temp, a, offset);
        vadd(sum, sum, temp);
        vmult(sum, sum, n31powersToWIDTH);
        offset += WIDTH;
        remaining -= WIDTH;
        result = vadd(sum);
        return hashCode(result, a, fromIndex + offset, remaining);
    }

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18487#issuecomment-2069274761