[aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable
Aleksei Voitylov
aleksei.voitylov at bell-sw.com
Tue Nov 12 16:10:28 UTC 2019
Hi Patrick,
First, I'm a trespasser, not a reviewer. Reviewers will need to look at
this.
On the technical side:
This additional branch in v3 is still painful. You can reduce the amount
of branches in code path for lengths 128 and 256 by using that fact that
CompareLongStringLimitLatin and CompareLongStringLimitUTF are at least
24. Then we don't have to jump to NO_PREFETCH label, where check for
small string size is done. Instead we can jump to SMALL_LOOP label
(assuming cnt2 counter is updated accordingly). In this case NO_PREFETCH
label is not needed and we have 1 less branch.
Rough sketch of affected part based on v3 looks as follows. This version
was checked on ThunderX2 and it looks fine on length 128 and 256
perf-wise. I also added alignment for small loop, which also helps a
bit. Please keep in mind it's a sketch.
-Aleksei
@@ -4172,19 +4168,34 @@
Register result = r0, str1 = r1, cnt1 = r2, str2 = r3, cnt2 = r4,
tmp1 = r10, tmp2 = r11;
Label SMALL_LOOP, LARGE_LOOP_PREFETCH, CHECK_LAST, DIFF2, TAIL,
- LENGTH_DIFF, DIFF, LAST_CHECK_AND_LENGTH_DIFF,
+ LENGTH_DIFF, DIFF, LAST_CHECK_AND_LENGTH_DIFF, SMALL_LOOP_CHECK,
DIFF_LAST_POSITION, DIFF_LAST_POSITION2;
// exit from large loop when less than 64 bytes left to read or
we're about
// to prefetch memory behind array border
int largeLoopExitCondition = MAX(64,
SoftwarePrefetchHintDistance)/(isLL ? 1 : 2);
+ // calculate the remaining limit in chars which manages if this
stub should be called,
+ // if the limit is large enough (>= largeLoopExitCondition), below
large loop with prefetching
+ // can be executed at least once, and there is no need to do any
extra checking at the entrance.
+ int remainingLimit = (isLL ? CompareLongStringLimitLatin :
CompareLongStringLimitUTF) -
+ (wordSize / (isLL ? 1 : 2));
// cnt1/cnt2 contains amount of characters to compare. cnt1 can be
re-used
// update cnt2 counter with already loaded 8 bytes
- __ sub(cnt2, cnt2, wordSize/(isLL ? 1 : 2));
+ if (SoftwarePrefetchHintDistance >= 0 && remainingLimit <
largeLoopExitCondition) {
+ __ sub(cnt2, cnt2, isLL ? 24 : 12);
+ } else {
+ __ sub(cnt2, cnt2, isLL ? 8 : 4);
+ }
// update pointers, because of previous read
__ add(str1, str1, wordSize);
__ add(str2, str2, wordSize);
if (SoftwarePrefetchHintDistance >= 0) {
- __ bind(LARGE_LOOP_PREFETCH);
+ if (remainingLimit < largeLoopExitCondition) {
+ // there could be fewer bytes left and invalid for this large
loop with prefetching
+ __ subs(rscratch2, cnt2, largeLoopExitCondition - (isLL ? 16 :
8));
+ __ br(__ LT, SMALL_LOOP);
+ __ add(cnt2, cnt2, isLL ? 16 : 8);
+ }
+ __ bind(LARGE_LOOP_PREFETCH); // 64 bytes loop
__ prfm(Address(str1, SoftwarePrefetchHintDistance));
__ prfm(Address(str2, SoftwarePrefetchHintDistance));
compare_string_16_bytes_same(DIFF, DIFF2);
@@ -4196,11 +4207,11 @@
__ br(__ GT, LARGE_LOOP_PREFETCH);
__ cbz(cnt2, LAST_CHECK_AND_LENGTH_DIFF); // no more chars left?
}
- // less than 16 bytes left?
- __ subs(cnt2, cnt2, isLL ? 16 : 8);
- __ br(__ LT, TAIL);
+ __ b(SMALL_LOOP_CHECK); // check if less than 16 bytes left
+ __ align(OptoLoopAlignment);
__ bind(SMALL_LOOP);
compare_string_16_bytes_same(DIFF, DIFF2);
+ __ bind(SMALL_LOOP_CHECK);
__ subs(cnt2, cnt2, isLL ? 16 : 8);
__ br(__ GE, SMALL_LOOP);
__ bind(TAIL);
On 12/11/2019 12:52, Patrick Zhang OS wrote:
> Ping...
>
> Hi Aleksei,
>
> Does the potential regression on my test system still exist? with the new patch webrev.03? If the added largeLoopExitCondition condition excluded your >128 chars strings from the large loop, and still caused performance drops, maybe hacking all the software prefetch hint distance and the checking condition to hardcoded 64, can be a good try. Although I think it would not be a right thing to do, in comparison with the similar logic in generate_compare_long_string_different_encoding. Thanks.
>
> http://cr.openjdk.java.net/~qpzhang/8229351/webrev.03/jdk.changeset
> if (SoftwarePrefetchHintDistance >= 0) {
> - __ bind(LARGE_LOOP_PREFETCH);
> + if (remainingLimit < largeLoopExitCondition) {
> + // there could be fewer bytes left and invalid for this large loop with prefetching
> + __ subs(rscratch2, cnt2, largeLoopExitCondition); // => subs(rscratch2, cnt2, 64); ??
> + __ br(__ LT, NO_PREFETCH);
> + }
> + __ bind(LARGE_LOOP_PREFETCH); // 64 bytes loop
> __ prfm(Address(str1, SoftwarePrefetchHintDistance)); // => __ prfm(Address(str1, 64));
> __ prfm(Address(str2, SoftwarePrefetchHintDistance)); // => __ prfm(Address(str2, 64));
>
> Regards
> Patrick
>
> -----Original Message-----
> From: aarch64-port-dev <aarch64-port-dev-bounces at openjdk.java.net> On Behalf Of Patrick Zhang OS
> Sent: Thursday, November 7, 2019 6:56 PM
> To: Aleksei Voitylov <aleksei.voitylov at bell-sw.com>
> Cc: aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable
>
> Hi Aleksei,
>
> Thanks for testing it and the data. I only had the source of StringCompareBench.java [6], my numbers (the diffs) are within 2%, while the -207.12% looks quite weird. I initially did not add the condition to control the br, since generate_ compare_long_string_different_encoding has the similar unconditional br. By the way, the original logic allowed prefetching the memory behind array border, for the first 64 bytes. I think securing the prefetch is the right thing to do, but it could certainly stop some cases from going to the large loop with prefetching. Welcome further comments, thanks.
>
> http://cr.openjdk.java.net/~qpzhang/8229351/webrev.03/
>
> Regards
> Patrick
>
> From: Aleksei Voitylov <aleksei.voitylov at bell-sw.com>
> Sent: Thursday, November 7, 2019 12:53 AM
> To: Patrick Zhang OS <patrick at os.amperecomputing.com>
> Cc: aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable
>
>
> Hi Patrick,
>
> I like the fact that this patch does not add much to the complexity of the code. Here are some experiments that you could find useful.
> Cortex A73 Size base (ns/op) patched (ns/op) Diff
> StringCompareBench.StringCompareLL 256 14422257,98 15302300,24 -6,10%
> StringCompareBench.StringCompareLL 512 27998036,21 28317818,08 -1,14%
>
> ThunderX2 Size base (ns/op) patched (ns/op) Diff
> StringCompareBench.StringCompareLL 128 4265122,232 13099099,67 -207,12%
> StringCompareBench.StringCompareLL 256 3539452,533 3599407,432 -1,69%
>
> StringCompareBench.StringCompareUU 128 6899938,75 7174601,241 -3,98%
> StringCompareBench.StringCompareUU 256 7654538,841 7826599,466 -2,25%
>
> StringCompareBench.cachedStringCompareLL 128 19,673 21,242 -7,98%
> StringCompareBench.cachedStringCompareLL 256 34,179 36,452 -6,65%
> StringCompareBench.cachedStringCompareLL 512 59,574 64,088 -7,58%
> StringCompareBench.cachedStringCompareLL 1024 110,37 118,477 -7,35%
> StringCompareBench.cachedStringCompareLL 1000000 114028,907 115388,681 -1,19%
>
> StringCompareBench.cachedStringCompareUU 128 33,752 36,922 -9,39%
> StringCompareBench.cachedStringCompareUU 256 60,939 64,096 -5,18%
> StringCompareBench.cachedStringCompareUU 512 115,328 118,48 -2,73%
> StringCompareBench.cachedStringCompareUU 1024 239,332 242,97 -1,52%
> StringCompareBench.cachedStringCompareUU 1000000 226491,096 233638,328 -3,16%
> It might be the case that the newly added branch is the culprit:
>
> + __ subs(rscratch2, cnt2, largeLoopExitCondition);
> + __ br(__ LT, NO_PREFETCH);
>
> Maybe you could skip it when CompareLongStringLimitLatin and CompareLongStringLimitUTF are large enough (then stub code is only called with string length large enough to skip branch above). Then (the properly commented) code would look like:
>
> if ((stub_threshold-wordSize/(isLL ? 1 : 2)) < largeLoopExitCondition) {
> __ subs(rscratch2, cnt2, largeLoopExitCondition);
> __ br(__ LT, NO_PREFETCH);
> }
>
> and in this case we shouldn't see any performance penalties.
>
> -Aleksei
>
> On 29/10/2019 12:58, Patrick Zhang OS wrote:
>
> Hi,
>
>
>
> Could you please review this patch, thanks.
>
>
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8229351
>
> Webrev: http://cr.openjdk.java.net/~qpzhang/8229351/webrev.02
>
> (this starts from .02 since there had been some internal review and updates)
>
>
>
> Changes:
>
>
>
> 1. Split the STUB_THRESHOLD from the hard-coded 72 to be CompareLongStringLimitLatin and CompareLongStringLimitUTF as a more flexible control over the stub thresholds for string_compare intrinsics, especially for various uArchs.
>
>
>
> 2. MacroAssembler::string_compare LL and UU shared the same threshold, actually UU may only require the half (length of chars) of that of LL's, because one character has two-bytes for UU, while for compacted LL strings, one character means one byte. In addition, LU/UL may need a separated threshold, as the stub function is different from the same encoding one, and the performance may vary as well.
>
>
>
> 3. In generate_compare_long_string_same_encoding, the hard-coded 72 was originally able to ensure that there can be always 64 bytes at least for the prefetch code path. However once a smaller stub threshold is set, a new condition is needed to tell if this would be still valid, or has to go to the NO_PREFETCH branch. This change can ensure the correctness.
>
>
>
> 4. In generate_compare_long_string_different_encoding, some temp vars for handling the last 4 characters are not valid any longer, cleaned up strU and strL, and related pointers initialization to the next U (cnt1) and L (tmp2).
>
>
>
> 5. In compare_string_16_x_LU, the reference to r10 (tmp1) is not needed, as tmpU or tmpL point to the same register.
>
>
>
> Tests:
>
>
>
> 1. For function check, I have run
>
>
>
> jdk jtreg tier1 tests, with default vm flags
>
>
>
> hotspot jtreg tests: runtime/compiler/gc parts, with "-Xcomp -XX:-TieredCompilation"
>
>
>
> jck10/api/java.lang 1609 cases and other selected modules, no new failures found, with default vm flags and "-Xcomp -XX:-TieredCompilation" respectively;
>
>
>
> some specific test cases had been carefully executed to double check, i.e., TestStringCompareToDifferentLength.java [1] and TestStringCompareToDifferentLength.java [1] introduced by [2], StrCmpTest.java [3] introduced by [4].
>
>
>
> 1. For performance check, I have run
>
>
>
> string-density-bench/CompareToBench.java [5] and StringCompareBench.java [6] respectively,
>
>
>
> and SPECjbb2015.jar, no obvious performance change has been found (since the default threshold is NOT changed within this patch).
>
>
>
> FYI. with Ampere eMAG system, microbenchmarks [5][6] can have 1.5x consistent perf gain with LU/UL comparison for shorter strings (<72 chars, smaller stub thresholds), and slight improvement (5~10%) with LL/UU cases.
>
>
>
> Refs:
>
> [1] http://hg.openjdk.java.net/jdk/jdk/file/3df2bf731a87/test/hotspot/jtreg/compiler/intrinsics/string
>
> [2] https://bugs.openjdk.java.net/browse/JDK-8218966 AArch64: String.compareTo() can read memory after string
>
> [3] http://cr.openjdk.java.net/~dpochepk/8202326/StrCmpTest.java, contributed by Dmitrij Pochepko
>
> [4] https://bugs.openjdk.java.net/browse/JDK-8202326 AARCH64: optimize string compare intrinsic
>
> [5] http://cr.openjdk.java.net/~shade/density/string-density-bench.jar, contributed by Aleksey Shipilev
>
> [6] http://cr.openjdk.java.net/~dpochepk/8202326/StringCompareBench.java, contributed by Dmitrij Pochepko
>
>
>
> Regards
>
> Patrick
>
>
More information about the aarch64-port-dev
mailing list