RFR: 8310268: RISC-V: misaligned memory access in String.Compare intrinsic

Mon Jun 26 03:52:02 UTC 2023

On Mon, 19 Jun 2023 05:57:05 GMT, Vladimir Kempik <vkempik at openjdk.org> wrote:

> Please review this fix. it eliminates misaligned loads in String.compare on risc-v
> 
> for small compares ( <= 72 bytes), the instrinsic in c2_MacroAssembler.cpp is used,
> it reads ( in case of UU/LL) 8 bytes per loop, and at then end, it reads tail - misaligned load of last 8 bytes from the string.
> 
> so if string length is not 8x bytes long then last load is misaligned, also it performs read/compare of some data which already was processed.
> 
> I have changed that to compare only last length%8 bytes using SHORT_STRING part of intrinsic for UL/LU. But for UU/LL I have made an optimised version.
> 
> Thanks to optimisations for conditional branching at line [947](https://github.com/openjdk/jdk/pull/14534/files#diff-35eb1d2f1e2f0514dd46bd7fbad49ff2c87703d5a3041a6433956df00a3fe6e6R947) I’ve got no perf drop on thead ( with +AvoidUnalignedAccesses) which supports misaligned access.
> 
> Improvements to inflate_XX() methods gives 3-5% improvements for UL/LU cases on thead, almost no perf change on hifive.
> 
> for large strings, the instrinsics from stubGenerator.cpp is used
> for UU/LL - generate_compare_long_string_same_encoding, I have just replaced misaligned ld with load_long_misaligned. Since this tail reading is not on hot path, this give some small penalty for thead when -XX:+AvoidUnalignedAccesses.
> 
> large LU/UL comparision is done in compare_long_string_different_encoding in sutbGenerator.cpp:
> These changes are partially based on feilongjiang's patch, but I have changed tail reading to prevent reading past the end of string. I have observed no perf difference between feilongjiang's and my version.
> 
> This also enables regression test for string.Compare which previously was aarch64-only
> 
> Testing: tier1 and tier2 clean on hifive.
> 
> JMH testing, hifive:
> before:
> 
> Benchmark                                   (delta)  (size)  Mode  Cnt     Score    Error  Units
> StringCompareToDifferentLength.compareToLL        2      24  avgt    9     6.474 ±  1.475  ms/op
> StringCompareToDifferentLength.compareToLL        2      36  avgt    9   125.823 ±  1.947  ms/op
> StringCompareToDifferentLength.compareToLL        2      72  avgt    9    10.512 ±  0.236  ms/op
> StringCompareToDifferentLength.compareToLL        2     128  avgt    9    13.032 ±  0.821  ms/op
> StringCompareToDifferentLength.compareToLL        2     256  avgt    9    18.983 ±  0.318  ms/op
> StringCompareToDifferentLength.compareToLL        2     512  avgt    9    29.925 ±  0.458  ms/op
> StringCompareToDifferentLength.compareToLL        2 ...

Hi, I am looking at this but it really takes quite some time for me understand this.  I have to say that the added logic for handling unaligned accesses makes the already complex assembly code even harder to read. And I am also worried about code maintainability as previously mentioned by @theRealAph about increasing checkings for AvoidUnalignedAccesses.

After a cursory look at this change, it seems that we could make use of the newly-added `load_X_misaligned` assemblers for the possibly unaligned memory accesses instead of adding the extra logic under AvoidUnalignedAccesses.  This might not offer us the optimal performance but it does reduce code complexity.

PS: At the same time, I think it's time to distinguish support for unaligned access for different platforms when measuring performance. We might want to manually set AvoidUnalignedAccesses to false for platforms like T-HEAD running older kernels. Thanks to: https://bugs.openjdk.org/browse/JDK-8309258, we have the proper detection for new kernels. We would add more vender-specific settings in the future.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14534#issuecomment-1606548362