RFR: 8310268: RISC-V: misaligned memory access in String.Compare intrinsic [v3]

Vladimir Kempik vkempik at openjdk.org
Thu Jul 20 08:54:43 UTC 2023


On Thu, 20 Jul 2023 08:31:56 GMT, Vladimir Kempik <vkempik at openjdk.org> wrote:

>> Please review this fix. It eliminates misaligned loads in String.compare on RISC-V.
>> 
>> For small compares (<= 72 bytes), the intrinsic in c2_MacroAssembler.cpp is used.
>> It reads (in the UU/LL case) 8 bytes per loop iteration, and at the end it reads the tail with a misaligned load of the last 8 bytes of the string.
>> 
>> So if the string length is not a multiple of 8 bytes, the last load is misaligned. It also re-reads and re-compares some data that was already processed.
>> 
>> I have changed that to compare only the last length%8 bytes, using the SHORT_STRING part of the intrinsic for UL/LU. For UU/LL I have made an optimised version.
>> 
>> Thanks to optimisations for conditional branching at line [947](https://github.com/openjdk/jdk/pull/14534/files#diff-35eb1d2f1e2f0514dd46bd7fbad49ff2c87703d5a3041a6433956df00a3fe6e6R947), I’ve got no perf drop on thead (with +AvoidUnalignedAccesses), which supports misaligned access.
>> 
>> Improvements to the inflate_XX() methods give a 3-5% improvement for the UL/LU cases on thead, and almost no perf change on hifive.
>> 
>> For large strings, the intrinsics from stubGenerator.cpp are used.
>> For UU/LL (generate_compare_long_string_same_encoding), I have just replaced the misaligned ld with load_long_misaligned. Since this tail reading is not on the hot path, this gives only a small penalty on thead with -XX:+AvoidUnalignedAccesses.
>> 
>> Large LU/UL comparison is done in compare_long_string_different_encoding in stubGenerator.cpp.
>> These changes are partially based on feilongjiang's patch, but I have changed tail reading to prevent reading past the end of string. I have observed no perf difference between feilongjiang's and my version.
>> 
>> This also enables the regression test for String.compare, which was previously aarch64-only.
>> 
>> Testing: tier1 and tier2 clean on hifive.
>> 
>> JMH testing, hifive:
>> before:
>> 
>> Benchmark                                   (delta)  (size)  Mode  Cnt     Score    Error  Units
>> StringCompareToDifferentLength.compareToLL        2      24  avgt    9     6.474 ±  1.475  ms/op
>> StringCompareToDifferentLength.compareToLL        2      36  avgt    9   125.823 ±  1.947  ms/op
>> StringCompareToDifferentLength.compareToLL        2      72  avgt    9    10.512 ±  0.236  ms/op
>> StringCompareToDifferentLength.compareToLL        2     128  avgt    9    13.032 ±  0.821  ms/op
>> StringCompareToDifferentLength.compareToLL        2     256  avgt    9    18.983 ±  0.318  ms/op
>> StringCompareToDifferentLength.compareToLL        2     512  avgt    9    29.925 ± ...
>
> Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Simplify case for long LU UL compares
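
To illustrate the tail handling described in the quoted text above, here is a minimal Java sketch. It is not the intrinsic itself, which is RISC-V assembly emitted from c2_MacroAssembler.cpp; it only models the idea for the LL case: compare 8 bytes per iteration, then compare just the remaining length%8 bytes instead of re-reading the last 8 bytes with an overlapping (and possibly misaligned) load. The class and method names are made up for this example.

```java
import java.nio.ByteBuffer;

// Hypothetical model of the LL compare loop; not HotSpot code.
class TailCompareSketch {
    // Compares two Latin-1 byte arrays with String.compareTo-like semantics.
    static int compare(byte[] a, byte[] b) {
        int min = Math.min(a.length, b.length);
        ByteBuffer ba = ByteBuffer.wrap(a);
        ByteBuffer bb = ByteBuffer.wrap(b);

        int i = 0;
        // Main loop: one 8-byte word per iteration, always at offsets 0, 8, 16, ...
        for (; i + 8 <= min; i += 8) {
            if (ba.getLong(i) != bb.getLong(i)) {
                // The words differ, so a differing byte exists in [i, i + 8).
                return firstDiff(a, b, i, i + 8);
            }
        }

        // Tail: only the remaining min % 8 bytes are examined, so nothing
        // that was already compared is read again.
        int d = firstDiff(a, b, i, min);
        return d != 0 ? d : a.length - b.length;
    }

    // Returns the difference of the first mismatching bytes (unsigned) in
    // [from, to), or 0 if the range is identical.
    static int firstDiff(byte[] a, byte[] b, int from, int to) {
        for (int k = from; k < to; k++) {
            int d = (a[k] & 0xff) - (b[k] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return 0;
    }
}
```

The earlier approach described in the PR would instead issue one more 8-byte load ending at the last byte of the string, which overlaps the previous word whenever the length is not a multiple of 8.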

Results for latest update, from thead
+AvoidUnaligned

Benchmark                                   (delta)  (size)  Mode  Cnt    Score   Error  Units
StringCompareToDifferentLength.compareToLL        2      24  avgt    9    4.000 ± 0.106  ms/op
StringCompareToDifferentLength.compareToLL        2      36  avgt    9    4.562 ± 0.089  ms/op
StringCompareToDifferentLength.compareToLL        2      72  avgt    9    7.536 ± 0.085  ms/op
StringCompareToDifferentLength.compareToLL        2     128  avgt    9   10.341 ± 0.287  ms/op
StringCompareToDifferentLength.compareToLL        2     256  avgt    9   15.275 ± 0.249  ms/op
StringCompareToDifferentLength.compareToLL        2     512  avgt    9   21.731 ± 0.413  ms/op
StringCompareToDifferentLength.compareToLL        2     520  avgt    9   20.255 ± 0.287  ms/op
StringCompareToDifferentLength.compareToLL        2     523  avgt    9   22.114 ± 0.641  ms/op
StringCompareToDifferentLength.compareToLU        2      24  avgt    9    7.615 ± 0.032  ms/op
StringCompareToDifferentLength.compareToLU        2      36  avgt    9   10.566 ± 0.096  ms/op
StringCompareToDifferentLength.compareToLU        2      72  avgt    9   21.975 ± 0.288  ms/op
StringCompareToDifferentLength.compareToLU        2     128  avgt    9   36.078 ± 0.419  ms/op
StringCompareToDifferentLength.compareToLU        2     256  avgt    9   65.567 ± 0.715  ms/op
StringCompareToDifferentLength.compareToLU        2     512  avgt    9  124.196 ± 0.636  ms/op
StringCompareToDifferentLength.compareToLU        2     520  avgt    9  126.580 ± 1.431  ms/op
StringCompareToDifferentLength.compareToLU        2     523  avgt    9  129.830 ± 1.857  ms/op
StringCompareToDifferentLength.compareToUL        2      24  avgt    9   10.386 ± 0.368  ms/op
StringCompareToDifferentLength.compareToUL        2      36  avgt    9   12.981 ± 0.271  ms/op
StringCompareToDifferentLength.compareToUL        2      72  avgt    9   23.726 ± 0.532  ms/op
StringCompareToDifferentLength.compareToUL        2     128  avgt    9   37.997 ± 0.482  ms/op
StringCompareToDifferentLength.compareToUL        2     256  avgt    9   67.834 ± 0.915  ms/op
StringCompareToDifferentLength.compareToUL        2     512  avgt    9  126.500 ± 0.771  ms/op
StringCompareToDifferentLength.compareToUL        2     520  avgt    9  128.853 ± 2.059  ms/op
StringCompareToDifferentLength.compareToUL        2     523  avgt    9  132.825 ± 3.318  ms/op
StringCompareToDifferentLength.compareToUU        2      24  avgt    9    4.013 ± 0.012  ms/op
StringCompareToDifferentLength.compareToUU        2      36  avgt    9    4.845 ± 0.148  ms/op
StringCompareToDifferentLength.compareToUU        2      72  avgt    9   10.276 ± 0.313  ms/op
StringCompareToDifferentLength.compareToUU        2     128  avgt    9   14.338 ± 0.201  ms/op
StringCompareToDifferentLength.compareToUU        2     256  avgt    9   20.912 ± 0.550  ms/op
StringCompareToDifferentLength.compareToUU        2     512  avgt    9   34.264 ± 0.660  ms/op
StringCompareToDifferentLength.compareToUU        2     520  avgt    9   34.557 ± 0.252  ms/op
StringCompareToDifferentLength.compareToUU        2     523  avgt    9   34.841 ± 0.380  ms/op

-AvoidUnaligned

Benchmark                                   (delta)  (size)  Mode  Cnt    Score   Error  Units
StringCompareToDifferentLength.compareToLL        2      24  avgt    9    2.557 ± 0.034  ms/op
StringCompareToDifferentLength.compareToLL        2      36  avgt    9    3.507 ± 0.035  ms/op
StringCompareToDifferentLength.compareToLL        2      72  avgt    9    7.513 ± 0.033  ms/op
StringCompareToDifferentLength.compareToLL        2     128  avgt    9    9.095 ± 0.210  ms/op
StringCompareToDifferentLength.compareToLL        2     256  avgt    9   13.666 ± 0.134  ms/op
StringCompareToDifferentLength.compareToLL        2     512  avgt    9   20.131 ± 0.234  ms/op
StringCompareToDifferentLength.compareToLL        2     520  avgt    9   20.115 ± 0.065  ms/op
StringCompareToDifferentLength.compareToLL        2     523  avgt    9   20.865 ± 0.224  ms/op
StringCompareToDifferentLength.compareToLU        2      24  avgt    9    7.091 ± 0.067  ms/op
StringCompareToDifferentLength.compareToLU        2      36  avgt    9    9.883 ± 0.109  ms/op
StringCompareToDifferentLength.compareToLU        2      72  avgt    9   22.037 ± 0.327  ms/op
StringCompareToDifferentLength.compareToLU        2     128  avgt    9   35.914 ± 0.307  ms/op
StringCompareToDifferentLength.compareToLU        2     256  avgt    9   65.673 ± 1.075  ms/op
StringCompareToDifferentLength.compareToLU        2     512  avgt    9  124.257 ± 0.722  ms/op
StringCompareToDifferentLength.compareToLU        2     520  avgt    9  126.128 ± 0.453  ms/op
StringCompareToDifferentLength.compareToLU        2     523  avgt    9  129.413 ± 1.567  ms/op
StringCompareToDifferentLength.compareToUL        2      24  avgt    9    9.661 ± 0.440  ms/op
StringCompareToDifferentLength.compareToUL        2      36  avgt    9   12.106 ± 0.290  ms/op
StringCompareToDifferentLength.compareToUL        2      72  avgt    9   23.903 ± 0.441  ms/op
StringCompareToDifferentLength.compareToUL        2     128  avgt    9   38.722 ± 1.049  ms/op
StringCompareToDifferentLength.compareToUL        2     256  avgt    9   67.640 ± 0.957  ms/op
StringCompareToDifferentLength.compareToUL        2     512  avgt    9  126.744 ± 1.904  ms/op
StringCompareToDifferentLength.compareToUL        2     520  avgt    9  129.400 ± 2.463  ms/op
StringCompareToDifferentLength.compareToUL        2     523  avgt    9  130.664 ± 1.380  ms/op
StringCompareToDifferentLength.compareToUU        2      24  avgt    9    3.662 ± 0.166  ms/op
StringCompareToDifferentLength.compareToUU        2      36  avgt    9    4.552 ± 0.217  ms/op
StringCompareToDifferentLength.compareToUU        2      72  avgt    9    9.399 ± 0.270  ms/op
StringCompareToDifferentLength.compareToUU        2     128  avgt    9   13.688 ± 0.294  ms/op
StringCompareToDifferentLength.compareToUU        2     256  avgt    9   20.033 ± 0.290  ms/op
StringCompareToDifferentLength.compareToUU        2     512  avgt    9   33.512 ± 0.433  ms/op
StringCompareToDifferentLength.compareToUU        2     520  avgt    9   33.796 ± 0.435  ms/op
StringCompareToDifferentLength.compareToUU        2     523  avgt    9   33.983 ± 0.152  ms/op
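
As background for the +AvoidUnaligned and -AvoidUnaligned runs above, the following Java sketch models the general idea of assembling a 64-bit value from single-byte loads when a plain 8-byte load at that address would be misaligned. This is an illustration under my own assumptions, not the code of load_long_misaligned; the class and method names are hypothetical.

```java
// Hypothetical model of a byte-wise 64-bit load; not HotSpot code.
class MisalignedLoadSketch {
    // Builds a little-endian 64-bit value from 8 single-byte reads starting
    // at 'offset'. Each read is trivially aligned, at the cost of more
    // instructions than a single 8-byte load.
    static long loadLongBytewise(byte[] mem, int offset) {
        long v = 0;
        for (int i = 7; i >= 0; i--) {
            // Shift in bytes from the highest address down, so that
            // mem[offset] ends up in the least significant position.
            v = (v << 8) | (mem[offset + i] & 0xffL);
        }
        return v;
    }
}
```

The extra shifts and ORs are the kind of overhead that matters little when the load sits off the hot path, as with the tail read mentioned in the PR description.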


Hifive results to follow soon.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14534#issuecomment-1643534917

