RFR: 8310268: RISC-V: misaligned memory access in String.Compare intrinsic [v3]

Mon Jul 24 06:14:45 UTC 2023

On Thu, 20 Jul 2023 08:31:56 GMT, Vladimir Kempik <vkempik at openjdk.org> wrote:

>> Please review this fix. it eliminates misaligned loads in String.compare on risc-v
>> 
>> for small compares ( <= 72 bytes), the instrinsic in c2_MacroAssembler.cpp is used,
>> it reads ( in case of UU/LL) 8 bytes per loop, and at then end, it reads tail - misaligned load of last 8 bytes from the string.
>> 
>> so if string length is not 8x bytes long then last load is misaligned, also it performs read/compare of some data which already was processed.
>> 
>> I have changed that to compare only last length%8 bytes using SHORT_STRING part of intrinsic for UL/LU. But for UU/LL I have made an optimised version.
>> 
>> Thanks to optimisations for conditional branching at line [947](https://github.com/openjdk/jdk/pull/14534/files#diff-35eb1d2f1e2f0514dd46bd7fbad49ff2c87703d5a3041a6433956df00a3fe6e6R947) I’ve got no perf drop on thead ( with +AvoidUnalignedAccesses) which supports misaligned access.
>> 
>> Improvements to inflate_XX() methods gives 3-5% improvements for UL/LU cases on thead, almost no perf change on hifive.
>> 
>> for large strings, the instrinsics from stubGenerator.cpp is used
>> for UU/LL - generate_compare_long_string_same_encoding, I have just replaced misaligned ld with load_long_misaligned. Since this tail reading is not on hot path, this give some small penalty for thead when -XX:+AvoidUnalignedAccesses.
>> 
>> large LU/UL comparision is done in compare_long_string_different_encoding in sutbGenerator.cpp:
>> These changes are partially based on feilongjiang's patch, but I have changed tail reading to prevent reading past the end of string. I have observed no perf difference between feilongjiang's and my version.
>> 
>> This also enables regression test for string.Compare which previously was aarch64-only
>> 
>> Testing: tier1 and tier2 clean on hifive.
>> 
>> JMH testing, hifive:
>> before:
>> 
>> Benchmark                                   (delta)  (size)  Mode  Cnt     Score    Error  Units
>> StringCompareToDifferentLength.compareToLL        2      24  avgt    9     6.474 ±  1.475  ms/op
>> StringCompareToDifferentLength.compareToLL        2      36  avgt    9   125.823 ±  1.947  ms/op
>> StringCompareToDifferentLength.compareToLL        2      72  avgt    9    10.512 ±  0.236  ms/op
>> StringCompareToDifferentLength.compareToLL        2     128  avgt    9    13.032 ±  0.821  ms/op
>> StringCompareToDifferentLength.compareToLL        2     256  avgt    9    18.983 ±  0.318  ms/op
>> StringCompareToDifferentLength.compareToLL        2     512  avgt    9    29.925 ± ...
>
> Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Simplify case for long LU UL compares

Hi, Thanks for the update. Taking another look.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 2315:

> 2313:   void compare_string_8_x_LU(Register tmpL, Register tmpU, Register strL, Register strU, Label& DIFF) {
> 2314:     const Register tmp = x30;
> 2315:     __ ld(tmpL, Address(strL));

Could we make use of another tmp register (maybe `x7`) and use that as the destination register for `ld` instead? Then we could inflate_lo32/hi32 from this tmp register and put the result in `tmpL`. That way would help remove the two `mv` instructions here in this function and the one located at label `DIFF`.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 2368:

> 2366: 
> 2367:     // make sure main loop is 8 byte-aligned, we should load another 4 bytes from strL
> 2368:     __ beqz(cnt2, DONE);  // no characters left

I don't quite understand this newly-added branch. If the intention was to ensure that we have at lease 4 chars left, shoudn't we compare with 4 instead? But given that the const `STUB_THRESHOLD` is (64 + 8) in C2_MacroAssembler::string_compare, I think neigher will be taken.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 2379:

> 2377:     __ addi(cnt2, cnt2, -wordSize / 2);
> 2378: 
> 2379:     __ beqz(cnt2, DONE);  // no character left

Similar here. Will this newly-added branch ever be taken?

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 2540:

> 2538:       __ ld(tmp2, Address(str2));
> 2539:       __ addi(str2, str2, 8);
> 2540:       __ beqz(cnt2, LAST_CHECK_AND_LENGTH_DIFF);

I wonder when this branch instruction will taken?

-------------

Changes requested by fyang (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/14534#pullrequestreview-1542726484
PR Review Comment: https://git.openjdk.org/jdk/pull/14534#discussion_r1271781757
PR Review Comment: https://git.openjdk.org/jdk/pull/14534#discussion_r1271786219
PR Review Comment: https://git.openjdk.org/jdk/pull/14534#discussion_r1271787329
PR Review Comment: https://git.openjdk.org/jdk/pull/14534#discussion_r1271776226