[aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable
Aleksei Voitylov
aleksei.voitylov at bell-sw.com
Wed Nov 6 16:53:02 UTC 2019
Hi Patrick,
I like the fact that this patch does not add much to the complexity of
the code. Here are some experiments that you could find useful.
Cortex A73 Size base (ns/op) patched (ns/op) Diff
StringCompareBench.StringCompareLL 256 14422257,98
15302300,24 -6,10%
StringCompareBench.StringCompareLL 512 27998036,21
28317818,08 -1,14%
ThunderX2 Size base (ns/op) patched (ns/op) Diff
StringCompareBench.StringCompareLL 128 4265122,232
13099099,67 -207,12%
StringCompareBench.StringCompareLL 256 3539452,533
3599407,432 -1,69%
StringCompareBench.StringCompareUU 128 6899938,75
7174601,241 -3,98%
StringCompareBench.StringCompareUU 256 7654538,841
7826599,466 -2,25%
StringCompareBench.cachedStringCompareLL 128 19,673 21,242
-7,98%
StringCompareBench.cachedStringCompareLL 256 34,179 36,452
-6,65%
StringCompareBench.cachedStringCompareLL 512 59,574 64,088
-7,58%
StringCompareBench.cachedStringCompareLL 1024 110,37 118,477
-7,35%
StringCompareBench.cachedStringCompareLL 1000000 114028,907
115388,681 -1,19%
StringCompareBench.cachedStringCompareUU 128 33,752 36,922
-9,39%
StringCompareBench.cachedStringCompareUU 256 60,939 64,096
-5,18%
StringCompareBench.cachedStringCompareUU 512 115,328 118,48
-2,73%
StringCompareBench.cachedStringCompareUU 1024 239,332 242,97
-1,52%
StringCompareBench.cachedStringCompareUU 1000000 226491,096
233638,328 -3,16%
It might be the case that the newly added branch is the culprit:
+ __ subs(rscratch2, cnt2, largeLoopExitCondition);
+ __ br(__ LT, NO_PREFETCH);
Maybe you could skip it when CompareLongStringLimitLatin and
CompareLongStringLimitUTF are large enough (then stub code is only
called with string length large enough to skip branch above). Then (the
properly commented) code would look like:
if ((stub_threshold-wordSize/(isLL ? 1 : 2)) < largeLoopExitCondition) {
__ subs(rscratch2, cnt2, largeLoopExitCondition);
__ br(__ LT, NO_PREFETCH);
}
and in this case we shouldn't see any performance penalties.
-Aleksei
On 29/10/2019 12:58, Patrick Zhang OS wrote:
> Hi,
>
> Could you please review this patch, thanks.
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8229351
> Webrev: http://cr.openjdk.java.net/~qpzhang/8229351/webrev.02
> (this starts from .02 since there had been some internal review and updates)
>
> Changes:
>
> 1. Split the STUB_THRESHOLD from the hard-coded 72 to be CompareLongStringLimitLatin and CompareLongStringLimitUTF as a more flexible control over the stub thresholds for string_compare intrinsics, especially for various uArchs.
>
> 2. MacroAssembler::string_compare LL and UU shared the same threshold, actually UU may only require the half (length of chars) of that of LL's, because one character has two-bytes for UU, while for compacted LL strings, one character means one byte. In addition, LU/UL may need a separated threshold, as the stub function is different from the same encoding one, and the performance may vary as well.
>
> 3. In generate_compare_long_string_same_encoding, the hard-coded 72 was originally able to ensure that there can be always 64 bytes at least for the prefetch code path. However once a smaller stub threshold is set, a new condition is needed to tell if this would be still valid, or has to go to the NO_PREFETCH branch. This change can ensure the correctness.
>
> 4. In generate_compare_long_string_different_encoding, some temp vars for handling the last 4 characters are not valid any longer, cleaned up strU and strL, and related pointers initialization to the next U (cnt1) and L (tmp2).
>
> 5. In compare_string_16_x_LU, the reference to r10 (tmp1) is not needed, as tmpU or tmpL point to the same register.
>
> Tests:
>
> 1. For function check, I have run
>
> jdk jtreg tier1 tests, with default vm flags
>
> hotspot jtreg tests: runtime/compiler/gc parts, with "-Xcomp -XX:-TieredCompilation"
>
> jck10/api/java.lang 1609 cases and other selected modules, no new failures found, with default vm flags and "-Xcomp -XX:-TieredCompilation" respectively;
>
> some specific test cases had been carefully executed to double check, i.e., TestStringCompareToDifferentLength.java [1] and TestStringCompareToDifferentLength.java [1] introduced by [2], StrCmpTest.java [3] introduced by [4].
>
> 1. For performance check, I have run
>
> string-density-bench/CompareToBench.java [5] and StringCompareBench.java [6] respectively,
>
> and SPECjbb2015.jar, no obvious performance change has been found (since the default threshold is NOT changed within this patch).
>
> FYI. with Ampere eMAG system, microbenchmarks [5][6] can have 1.5x consistent perf gain with LU/UL comparison for shorter strings (<72 chars, smaller stub thresholds), and slight improvement (5~10%) with LL/UU cases.
>
> Refs:
> [1] http://hg.openjdk.java.net/jdk/jdk/file/3df2bf731a87/test/hotspot/jtreg/compiler/intrinsics/string
> [2] https://bugs.openjdk.java.net/browse/JDK-8218966 AArch64: String.compareTo() can read memory after string
> [3] http://cr.openjdk.java.net/~dpochepk/8202326/StrCmpTest.java, contributed by Dmitrij Pochepko
> [4] https://bugs.openjdk.java.net/browse/JDK-8202326 AARCH64: optimize string compare intrinsic
> [5] http://cr.openjdk.java.net/~shade/density/string-density-bench.jar, contributed by Aleksey Shipilev
> [6] http://cr.openjdk.java.net/~dpochepk/8202326/StringCompareBench.java, contributed by Dmitrij Pochepko
>
> Regards
> Patrick
>
More information about the aarch64-port-dev
mailing list