RFR: 8322770: Implement C2 VectorizedHashCode on AArch64 [v2]
Mikhail Ablakatov
duke at openjdk.org
Wed Aug 21 14:53:10 UTC 2024
On Tue, 20 Aug 2024 12:21:05 GMT, Mikhail Ablakatov <duke at openjdk.org> wrote:
>> Hello,
>>
>> Please review the following PR for [JDK-8322770 Implement C2 VectorizedHashCode on AArch64](https://bugs.openjdk.org/browse/JDK-8322770). It follows previous work done in https://github.com/openjdk/jdk/pull/16629 and https://github.com/openjdk/jdk/pull/10847 for RISC-V and x86 respectively.
>>
>> The code to calculate a hash code consists of two parts: a vectorized loop of Neon instruction that process 4 or 8 elements per iteration depending on the data type and a fully unrolled scalar "loop" that processes up to 7 tail elements.
>>
>> At the time of writing this I don't see potential benefits from providing SVE/SVE2 implementation, but it could be added as a follow-up or independently later if required.
>>
>> # Performance
>>
>> ## Neoverse N1
>>
>>
>> --------------------------------------------------------------------------------------------
>> Version Baseline This patch
>> --------------------------------------------------------------------------------------------
>> Benchmark (size) Mode Cnt Score Error Score Error Units
>> --------------------------------------------------------------------------------------------
>> ArraysHashCode.bytes 1 avgt 15 1.249 ? 0.060 1.247 ? 0.062 ns/op
>> ArraysHashCode.bytes 10 avgt 15 8.754 ? 0.028 4.387 ? 0.015 ns/op
>> ArraysHashCode.bytes 100 avgt 15 98.596 ? 0.051 26.655 ? 0.097 ns/op
>> ArraysHashCode.bytes 10000 avgt 15 10150.578 ? 1.352 2649.962 ? 216.744 ns/op
>> ArraysHashCode.chars 1 avgt 15 1.286 ? 0.062 1.246 ? 0.054 ns/op
>> ArraysHashCode.chars 10 avgt 15 8.731 ? 0.002 5.344 ? 0.003 ns/op
>> ArraysHashCode.chars 100 avgt 15 98.632 ? 0.048 23.023 ? 0.142 ns/op
>> ArraysHashCode.chars 10000 avgt 15 10150.658 ? 3.374 2410.504 ? 8.872 ns/op
>> ArraysHashCode.ints 1 avgt 15 1.189 ? 0.005 1.187 ? 0.001 ns/op
>> ArraysHashCode.ints 10 avgt 15 8.730 ? 0.002 5.676 ? 0.001 ns/op
>> ArraysHashCode.ints 100 avgt 15 98.559 ? 0.016 24.378 ? 0.006 ns/op
>> ArraysHashCode.ints 10000 avgt 15 10148.752 ? 1.336 2419.015 ? 0.492 ns/op
>> ArraysHashCode.multibytes 1 avgt 15 1.037 ? 0.001 1.037 ? 0.001 ...
>
> Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains one commit:
>
> 8322770: AArch64: C2: Implement VectorizedHashCode
>
> The code to calculate a hash code consists of two parts: a stub method that
> implements a vectorized loop using Neon instruction which processes 16 or 32
> elements per iteration depending on the data type; and an unrolled inlined
> scalar loop that processes remaining tail elements.
>
> [Performance]
>
> [[Neoverse V2]]
> ```
> | 328a053 (master) | dc2909f (this) |
> ----------------------------------------------------------------------------------------------------------
> Benchmark (size) Mode Cnt | Score Error | Score Error | Units
> ----------------------------------------------------------------------------------------------------------
> ArraysHashCode.bytes 1 avgt 15 | 0.805 ? 0.206 | 0.815 ? 0.141 | ns/op
> ArraysHashCode.bytes 10 avgt 15 | 4.362 ? 0.013 | 3.522 ? 0.124 | ns/op
> ArraysHashCode.bytes 100 avgt 15 | 78.374 ? 0.136 | 12.935 ? 0.016 | ns/op
> ArraysHashCode.bytes 10000 avgt 15 | 9247.335 ? 13.691 | 1344.770 ? 1.898 | ns/op
> ArraysHashCode.chars 1 avgt 15 | 0.731 ? 0.035 | 0.723 ? 0.046 | ns/op
> ArraysHashCode.chars 10 avgt 15 | 4.359 ? 0.007 | 3.385 ? 0.004 | ns/op
> ArraysHashCode.chars 100 avgt 15 | 78.374 ? 0.117 | 11.903 ? 0.023 | ns/op
> ArraysHashCode.chars 10000 avgt 15 | 9248.328 ? 13.644 | 1344.007 ? 1.795 | ns/op
> ArraysHashCode.ints 1 avgt 15 | 0.746 ? 0.083 | 0.631 ? 0.020 | ns/op
> ArraysHashCode.ints 10 avgt 15 | 4.357 ? 0.009 | 3.387 ? 0.005 | ns/op
> ArraysHashCode.ints 100 avgt 15 | 78.391 ? 0.103 | 10.934 ? 0.015 | ns/op
> ArraysHashCode.ints 10000 avgt 15 | 9248.125 ? 12.583 | 1340.644 ? 1.869 | ns/op
> ArraysHashCode.multibytes 1 avgt 15 | 0.555 ? 0.020 | 0.559 ? 0.020 | ns/op
> ArraysHashCode.multibytes 10 avgt 15 | 2.681 ? 0.020 | 2.175 ? 0.045 | ns/op
> ArraysHas...
src/hotspot/share/utilities/intpow.hpp line 58:
> 56: struct intpow<T, v, 1, no_overflow> {
> 57: static const T value = v;
> 58: };
> There's no need for any of this metaprogramming. A constexpr function would be better.
The main advantage of this implementation over one using a `constexpr` function is the ability to verify input parameter values and detect overflows at compile time using `static_assert`. This feature helps prevent user errors that could otherwise be difficult to debug. I chose this approach for its functional benefits over a `constexpr` function. If you have any suggestions on how to implement this functionality in a `constexpr` function, they would be highly welcome, as I might be missing something here. 😃
Below is a similar `constexpr` implementation. **Please don't accept it through the UI**. I kept a `typedef` template parameter to ensure the return value has the desired width, if needed, without applying a mask at the call site.
If you'd still prefer a `constexpr` function, please let me know if you have any comments on the implementation below (I'll rename it, remove `no_overflow` parameter and the comments).
Suggestion:
#include <limits>
template <typename T>
static constexpr T intpow_c(T v, unsigned p, bool no_overflow = false) {
// Can't be used in constexpr function
// static_assert(v || p, "0^0 is not defined");
if (p == 0) {
return 1;
}
T a = intpow_c(v, p / 2, no_overflow);
T b = (p % 2) ? v : 1;
// Can't be used in constexpr function
// static_assert(!no_overflow || a <= std::numeric_limits<T>::max() / a, "Integer overflow");
// static_assert(!no_overflow || a * a <= std::numeric_limits<T>::max() / b, "Integer overflow");
return a * a * b;
}
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/18487#discussion_r1725193686
More information about the hotspot-dev
mailing list