RFR: 8322770: Implement C2 VectorizedHashCode on AArch64 [v2]

Wed Aug 21 14:53:10 UTC 2024

On Tue, 20 Aug 2024 12:21:05 GMT, Mikhail Ablakatov <duke at openjdk.org> wrote:

>> Hello,
>> 
>> Please review the following PR for [JDK-8322770 Implement C2 VectorizedHashCode on AArch64](https://bugs.openjdk.org/browse/JDK-8322770). It follows previous work done in https://github.com/openjdk/jdk/pull/16629 and https://github.com/openjdk/jdk/pull/10847 for RISC-V and x86 respectively. 
>> 
>> The code to calculate a hash code consists of two parts: a vectorized loop of Neon instruction that process 4 or 8 elements per iteration depending on the data type and a fully unrolled scalar "loop" that processes up to 7 tail elements.
>> 
>> At the time of writing this I don't see potential benefits from providing SVE/SVE2 implementation, but it could be added as a follow-up or independently later if required.
>> 
>> # Performance
>> 
>> ## Neoverse N1
>> 
>> 
>>   --------------------------------------------------------------------------------------------
>>   Version                                            Baseline           This patch
>>   --------------------------------------------------------------------------------------------
>>   Benchmark                   (size)  Mode  Cnt      Score    Error     Score     Error  Units
>>   --------------------------------------------------------------------------------------------
>>   ArraysHashCode.bytes             1  avgt   15      1.249 ?  0.060     1.247 ?   0.062  ns/op
>>   ArraysHashCode.bytes            10  avgt   15      8.754 ?  0.028     4.387 ?   0.015  ns/op
>>   ArraysHashCode.bytes           100  avgt   15     98.596 ?  0.051    26.655 ?   0.097  ns/op
>>   ArraysHashCode.bytes         10000  avgt   15  10150.578 ?  1.352  2649.962 ? 216.744  ns/op
>>   ArraysHashCode.chars             1  avgt   15      1.286 ?  0.062     1.246 ?   0.054  ns/op
>>   ArraysHashCode.chars            10  avgt   15      8.731 ?  0.002     5.344 ?   0.003  ns/op
>>   ArraysHashCode.chars           100  avgt   15     98.632 ?  0.048    23.023 ?   0.142  ns/op
>>   ArraysHashCode.chars         10000  avgt   15  10150.658 ?  3.374  2410.504 ?   8.872  ns/op
>>   ArraysHashCode.ints              1  avgt   15      1.189 ?  0.005     1.187 ?   0.001  ns/op
>>   ArraysHashCode.ints             10  avgt   15      8.730 ?  0.002     5.676 ?   0.001  ns/op
>>   ArraysHashCode.ints            100  avgt   15     98.559 ?  0.016    24.378 ?   0.006  ns/op
>>   ArraysHashCode.ints          10000  avgt   15  10148.752 ?  1.336  2419.015 ?   0.492  ns/op
>>   ArraysHashCode.multibytes        1  avgt   15      1.037 ?  0.001     1.037 ?   0.001  ...
>
> Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains one commit:
> 
>   8322770: AArch64: C2: Implement VectorizedHashCode
>   
>   The code to calculate a hash code consists of two parts: a stub method that
>   implements a vectorized loop using Neon instruction which processes 16 or 32
>   elements per iteration depending on the data type; and an unrolled inlined
>   scalar loop that processes remaining tail elements.
>   
>   [Performance]
>   
>   [[Neoverse V2]]
>   ```
>                                                               |  328a053 (master) |  dc2909f (this)  |
>   ----------------------------------------------------------------------------------------------------------
>     Benchmark                               (size)  Mode  Cnt |    Score    Error |    Score   Error | Units
>   ----------------------------------------------------------------------------------------------------------
>     ArraysHashCode.bytes                         1  avgt   15 |    0.805 ?  0.206 |    0.815 ? 0.141 | ns/op
>     ArraysHashCode.bytes                        10  avgt   15 |    4.362 ?  0.013 |    3.522 ? 0.124 | ns/op
>     ArraysHashCode.bytes                       100  avgt   15 |   78.374 ?  0.136 |   12.935 ? 0.016 | ns/op
>     ArraysHashCode.bytes                     10000  avgt   15 | 9247.335 ? 13.691 | 1344.770 ? 1.898 | ns/op
>     ArraysHashCode.chars                         1  avgt   15 |    0.731 ?  0.035 |    0.723 ? 0.046 | ns/op
>     ArraysHashCode.chars                        10  avgt   15 |    4.359 ?  0.007 |    3.385 ? 0.004 | ns/op
>     ArraysHashCode.chars                       100  avgt   15 |   78.374 ?  0.117 |   11.903 ? 0.023 | ns/op
>     ArraysHashCode.chars                     10000  avgt   15 | 9248.328 ? 13.644 | 1344.007 ? 1.795 | ns/op
>     ArraysHashCode.ints                          1  avgt   15 |    0.746 ?  0.083 |    0.631 ? 0.020 | ns/op
>     ArraysHashCode.ints                         10  avgt   15 |    4.357 ?  0.009 |    3.387 ? 0.005 | ns/op
>     ArraysHashCode.ints                        100  avgt   15 |   78.391 ?  0.103 |   10.934 ? 0.015 | ns/op
>     ArraysHashCode.ints                      10000  avgt   15 | 9248.125 ? 12.583 | 1340.644 ? 1.869 | ns/op
>     ArraysHashCode.multibytes                    1  avgt   15 |    0.555 ?  0.020 |    0.559 ? 0.020 | ns/op
>     ArraysHashCode.multibytes                   10  avgt   15 |    2.681 ?  0.020 |    2.175 ? 0.045 | ns/op
>     ArraysHas...

src/hotspot/share/utilities/intpow.hpp line 58:

> 56: struct intpow<T, v, 1, no_overflow> {
> 57:   static const T value = v;
> 58: };

> There's no need for any of this metaprogramming. A constexpr function would be better.

The main advantage of this implementation over one using a `constexpr` function is the ability to verify input parameter values and detect overflows at compile time using `static_assert`. This feature helps prevent user errors that could otherwise be difficult to debug. I chose this approach for its functional benefits over a `constexpr` function. If you have any suggestions on how to implement this functionality in a `constexpr` function, they would be highly welcome, as I might be missing something here. 😃

Below is a similar `constexpr` implementation. **Please don't accept it through the UI**. I kept a `typedef` template parameter to ensure the return value has the desired width, if needed, without applying a mask at the call site.

If you'd still prefer a `constexpr` function, please let me know if you have any comments on the implementation below (I'll rename it, remove `no_overflow` parameter and the comments).

Suggestion:

#include <limits>

template <typename T>
static constexpr T intpow_c(T v, unsigned p, bool no_overflow = false) {
  // Can't be used in constexpr function
  // static_assert(v || p, "0^0 is not defined");

  if (p == 0) {
    return 1;
  }

  T a = intpow_c(v, p / 2, no_overflow);
  T b = (p % 2) ? v : 1;

  // Can't be used in constexpr function
  // static_assert(!no_overflow || a <= std::numeric_limits<T>::max() / a, "Integer overflow");
  // static_assert(!no_overflow || a * a <= std::numeric_limits<T>::max() / b, "Integer overflow");

  return a * a * b;
}

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18487#discussion_r1725193686