RFR: 8322770: Implement C2 VectorizedHashCode on AArch64
Mikhail Ablakatov
duke at openjdk.org
Fri Jul 5 17:25:34 UTC 2024
On Thu, 16 May 2024 12:40:30 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> Hi,
>>
>>> I can update the patch with current results on Monday and we could decide how to proceed with this PR after that. Sounds good?
>>
>> Yes, that's right.
>
>> Hi @theRealAph ! You may find the latest version here: [mikabl-arm at b3db421](https://github.com/mikabl-arm/jdk/commit/b3db421c795f683db1a001853990026bafc2ed4b) . I gave a short explanation in the commit message, feel free to ask for more details if required.
>>
>> Unfortunately, it still contains critical bugs and I won't be able to take a look into the issue before the next week at best. Until it's fixed, it's not possible to run the benchmarks. Although I expect it to improve performance on longer integer arrays based on a benchmark I've written in C++ and Assembly. The results aren't comparable to the jmh results, so I won't post them here.
>
> OK. One small thing, I think it's possible to rearrange things a bit to use `mlav`, which may help performance. No need for that until the code is correct, though.
Hi @theRealAph ! This took a while, but please find a fixed version here: https://github.com/mikabl-arm/jdk/tree/285826-vmul
Here are performance numbers collected on Neoverse V2, comparing the fixed version against the common baseline and the latest state of this PR:
                                                          | d2ea6b1e657       | f19203015fb       | 5504227bfe3       |
                                                          | baseline          | PR                | 285826-vmul       |
----------------------------------------------------------|-------------------|-------------------|-------------------|------
Benchmark                               (size)  Mode  Cnt |    Score ±  Error |    Score ±  Error |    Score ±  Error | Units
----------------------------------------------------------|-------------------|-------------------|-------------------|------
ArraysHashCode.bytes                         1  avgt   15 |    0.859 ±  0.166 |    0.720 ±  0.103 |    0.732 ±  0.105 | ns/op
ArraysHashCode.bytes                        10  avgt   15 |    4.440 ±  0.013 |    2.262 ±  0.009 |    3.454 ±  0.057 | ns/op
ArraysHashCode.bytes                       100  avgt   15 |   78.642 ±  0.119 |   15.997 ±  0.023 |   12.753 ±  0.072 | ns/op
ArraysHashCode.bytes                     10000  avgt   15 | 9248.961 ± 11.332 | 1879.905 ± 11.609 | 1345.014 ±  1.947 | ns/op
ArraysHashCode.chars                         1  avgt   15 |    0.695 ±  0.036 |    0.694 ±  0.035 |    0.682 ±  0.036 | ns/op
ArraysHashCode.chars                        10  avgt   15 |    4.436 ±  0.015 |    2.428 ±  0.034 |    3.352 ±  0.031 | ns/op
ArraysHashCode.chars                       100  avgt   15 |   78.660 ±  0.113 |   14.508 ±  0.075 |   11.784 ±  0.088 | ns/op
ArraysHashCode.chars                     10000  avgt   15 | 9253.807 ± 13.660 | 2010.053 ±  3.549 | 1344.716 ±  1.936 | ns/op
ArraysHashCode.ints                          1  avgt   15 |    0.635 ±  0.022 |    0.640 ±  0.022 |    0.640 ±  0.022 | ns/op
ArraysHashCode.ints                         10  avgt   15 |    4.424 ±  0.006 |    2.752 ±  0.012 |    3.388 ±  0.004 | ns/op
ArraysHashCode.ints                        100  avgt   15 |   78.680 ±  0.120 |   14.794 ±  0.131 |   11.090 ±  0.055 | ns/op
ArraysHashCode.ints                      10000  avgt   15 | 9249.520 ± 13.305 | 1997.441 ±  3.299 | 1340.916 ±  1.843 | ns/op
ArraysHashCode.multibytes                    1  avgt   15 |    0.566 ±  0.023 |    0.563 ±  0.021 |    0.554 ±  0.012 | ns/op
ArraysHashCode.multibytes                   10  avgt   15 |    2.679 ±  0.018 |    1.798 ±  0.038 |    1.973 ±  0.021 | ns/op
ArraysHashCode.multibytes                  100  avgt   15 |   36.934 ±  0.055 |    9.118 ±  0.018 |   12.712 ±  0.026 | ns/op
ArraysHashCode.multibytes                10000  avgt   15 | 4861.700 ±  6.563 | 1005.809 ±  2.260 |  721.366 ±  1.570 | ns/op
ArraysHashCode.multichars                    1  avgt   15 |    0.557 ±  0.016 |    0.552 ±  0.001 |    0.563 ±  0.021 | ns/op
ArraysHashCode.multichars                   10  avgt   15 |    2.700 ±  0.018 |    1.840 ±  0.024 |    1.978 ±  0.008 | ns/op
ArraysHashCode.multichars                  100  avgt   15 |   36.932 ±  0.054 |    8.633 ±  0.020 |    8.678 ±  0.052 | ns/op
ArraysHashCode.multichars                10000  avgt   15 | 4859.462 ±  6.693 | 1063.788 ±  3.057 |  752.857 ±  5.262 | ns/op
ArraysHashCode.multiints                     1  avgt   15 |    0.574 ±  0.023 |    0.554 ±  0.011 |    0.559 ±  0.017 | ns/op
ArraysHashCode.multiints                    10  avgt   15 |    2.707 ±  0.028 |    1.907 ±  0.031 |    1.992 ±  0.036 | ns/op
ArraysHashCode.multiints                   100  avgt   15 |   36.942 ±  0.056 |    9.141 ±  0.013 |    8.174 ±  0.029 | ns/op
ArraysHashCode.multiints                 10000  avgt   15 | 4872.540 ±  7.479 | 1187.393 ± 12.083 |  785.256 ±  9.472 | ns/op
ArraysHashCode.multishorts                   1  avgt   15 |    0.558 ±  0.016 |    0.555 ±  0.012 |    0.566 ±  0.022 | ns/op
ArraysHashCode.multishorts                  10  avgt   15 |    2.696 ±  0.015 |    1.854 ±  0.027 |    1.983 ±  0.009 | ns/op
ArraysHashCode.multishorts                 100  avgt   15 |   36.930 ±  0.051 |    8.652 ±  0.011 |    8.681 ±  0.039 | ns/op
ArraysHashCode.multishorts               10000  avgt   15 | 4863.966 ±  6.736 | 1068.627 ±  1.902 |  760.280 ±  5.150 | ns/op
ArraysHashCode.shorts                        1  avgt   15 |    0.665 ±  0.058 |    0.644 ±  0.022 |    0.636 ±  0.023 | ns/op
ArraysHashCode.shorts                       10  avgt   15 |    4.431 ±  0.006 |    2.432 ±  0.024 |    3.332 ±  0.026 | ns/op
ArraysHashCode.shorts                      100  avgt   15 |   78.630 ±  0.103 |   14.521 ±  0.077 |   11.783 ±  0.093 | ns/op
ArraysHashCode.shorts                    10000  avgt   15 | 9249.908 ± 12.039 | 2010.461 ±  2.548 | 1344.441 ±  1.818 | ns/op
StringHashCode.Algorithm.defaultLatin1       1  avgt   15 |    0.770 ±  0.001 |    0.770 ±  0.001 |    0.770 ±  0.001 | ns/op
StringHashCode.Algorithm.defaultLatin1      10  avgt   15 |    4.305 ±  0.009 |    2.260 ±  0.009 |    3.433 ±  0.015 | ns/op
StringHashCode.Algorithm.defaultLatin1     100  avgt   15 |   78.355 ±  0.102 |   16.140 ±  0.038 |   12.767 ±  0.023 | ns/op
StringHashCode.Algorithm.defaultLatin1   10000  avgt   15 | 9269.665 ± 13.817 | 1893.354 ±  3.677 | 1345.571 ±  1.930 | ns/op
StringHashCode.Algorithm.defaultUTF16        1  avgt   15 |    0.736 ±  0.100 |    0.653 ±  0.083 |    0.690 ±  0.101 | ns/op
StringHashCode.Algorithm.defaultUTF16       10  avgt   15 |    4.280 ±  0.018 |    2.374 ±  0.021 |    3.394 ±  0.010 | ns/op
StringHashCode.Algorithm.defaultUTF16      100  avgt   15 |   78.312 ±  0.118 |   14.603 ±  0.103 |   11.837 ±  0.016 | ns/op
StringHashCode.Algorithm.defaultUTF16    10000  avgt   15 | 9249.562 ± 13.113 | 2011.717 ±  4.097 | 1344.715 ±  1.896 | ns/op
StringHashCode.cached                      N/A  avgt   15 |    0.539 ±  0.027 |    0.525 ±  0.018 |    0.525 ±  0.018 | ns/op
StringHashCode.empty                       N/A  avgt   15 |    0.861 ±  0.163 |    0.670 ±  0.079 |    0.694 ±  0.093 | ns/op
StringHashCode.notCached                   N/A  avgt   15 |    0.698 ±  0.108 |    0.648 ±  0.024 |    0.637 ±  0.023 | ns/op
There are several known issues:
- [ ] For arrays shorter than the number of elements processed by a single iteration of the Neon loop, performance is not optimal, though it is still better than the baseline's.
- [ ] The intrinsic takes 364 bytes in the worst case (for BYTE/BOOLEAN types), which may either significantly increase code size or limit inlining opportunities.
- [ ] As mentioned before, the implementation might be affected by https://bugs.openjdk.org/browse/JDK-8139457 .
To address the first two, we could implement the vectorized part of the algorithm as a separate stub method. Please let me know if this sounds like the right approach or if you have other suggestions.
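
For readers less familiar with the algorithm, here is a rough Java sketch of the general idea behind vectorizing the `31 * h + a[i]` hash: independent per-lane accumulators, a final combine with powers of 31, and a scalar tail. It is only an illustration under assumptions, not the code in this PR: the class and method names (`HashCodeSketch`, `scalarHash`, `laneHash`) are hypothetical, and the lane count of 4 is chosen for readability rather than taken from the Neon loop.

```java
final class HashCodeSketch {
    // Scalar reference: the same recurrence as Arrays.hashCode(int[]).
    static int scalarHash(int[] a) {
        int h = 1;
        for (int v : a) {
            h = 31 * h + v;
        }
        return h;
    }

    // Emulates a 4-lane vector loop with plain ints (hypothetical lane count).
    static int laneHash(int[] a) {
        final int LANES = 4;
        final int P1 = 31;
        final int P2 = 31 * 31;
        final int P3 = 31 * 31 * 31;
        final int P4 = 31 * 31 * 31 * 31;

        int h = 1;
        int i = 0;
        // Number of elements covered by whole LANES-sized blocks.
        int mainEnd = a.length - (a.length % LANES);

        if (mainEnd > 0) {
            // The incoming hash rides in the last lane, whose final weight is 31^0.
            int acc0 = 0, acc1 = 0, acc2 = 0, acc3 = h;
            for (; i < mainEnd; i += LANES) {
                // Multiply-accumulate per lane: acc = acc * 31^4 + element.
                acc0 = acc0 * P4 + a[i];
                acc1 = acc1 * P4 + a[i + 1];
                acc2 = acc2 * P4 + a[i + 2];
                acc3 = acc3 * P4 + a[i + 3];
            }
            // Horizontal combine: lane k is weighted by 31^(LANES - 1 - k).
            h = acc0 * P3 + acc1 * P2 + acc2 * P1 + acc3;
        }

        // Scalar tail; arrays shorter than LANES only ever run this loop.
        for (; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }
}
```

In this sketch an array shorter than one block never enters the main loop, which mirrors the first known issue above, and the main loop together with the combine step is the part that could plausibly move into a separate stub. For any `int[]` input, `laneHash` is expected to agree with `java.util.Arrays.hashCode`, since `int` arithmetic wraps consistently in both forms.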
-------------
PR Comment: https://git.openjdk.org/jdk/pull/18487#issuecomment-2211186951