[aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

Patrick Zhang OS patrick at os.amperecomputing.com
Tue Nov 12 09:52:04 UTC 2019


Ping...

Hi Aleksei,

Does the potential regression on my test system still exist? with the new patch webrev.03? If the added largeLoopExitCondition condition excluded your >128 chars strings from the large loop, and still caused performance drops, maybe hacking all the software prefetch hint distance and the checking condition to hardcoded 64, can be a good try. Although I think it would not be a right thing to do, in comparison with the similar logic in generate_compare_long_string_different_encoding. Thanks. 

http://cr.openjdk.java.net/~qpzhang/8229351/webrev.03/jdk.changeset 
     if (SoftwarePrefetchHintDistance >= 0) {
-      __ bind(LARGE_LOOP_PREFETCH);
+      if (remainingLimit < largeLoopExitCondition) {
+        // there could be fewer bytes left and invalid for this large loop with prefetching
+        __ subs(rscratch2, cnt2, largeLoopExitCondition); // => subs(rscratch2, cnt2, 64); ??
+        __ br(__ LT, NO_PREFETCH);
+      }
+      __ bind(LARGE_LOOP_PREFETCH); // 64 bytes loop
         __ prfm(Address(str1, SoftwarePrefetchHintDistance)); //  => __ prfm(Address(str1, 64));
         __ prfm(Address(str2, SoftwarePrefetchHintDistance)); //  => __ prfm(Address(str2, 64));

Regards
Patrick

-----Original Message-----
From: aarch64-port-dev <aarch64-port-dev-bounces at openjdk.java.net> On Behalf Of Patrick Zhang OS
Sent: Thursday, November 7, 2019 6:56 PM
To: Aleksei Voitylov <aleksei.voitylov at bell-sw.com>
Cc: aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

Hi Aleksei,

Thanks for testing it and the data. I only had the source of StringCompareBench.java [6], my numbers (the diffs) are within 2%, while the -207.12% looks quite weird. I initially did not add the condition to control the br, since generate_ compare_long_string_different_encoding has the similar unconditional br. By the way, the original logic allowed prefetching the memory behind array border, for the first 64 bytes. I think securing the prefetch is the right thing to do, but it could certainly stop some cases from going to the large loop with prefetching. Welcome further comments, thanks.

http://cr.openjdk.java.net/~qpzhang/8229351/webrev.03/

Regards
Patrick

From: Aleksei Voitylov <aleksei.voitylov at bell-sw.com>
Sent: Thursday, November 7, 2019 12:53 AM
To: Patrick Zhang OS <patrick at os.amperecomputing.com>
Cc: aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable


Hi Patrick,

I like the fact that this patch does not add much to the complexity of the code. Here are some experiments that you could find useful.
Cortex A73    Size    base (ns/op)    patched (ns/op)    Diff
StringCompareBench.StringCompareLL    256    14422257,98    15302300,24    -6,10%
StringCompareBench.StringCompareLL    512    27998036,21    28317818,08    -1,14%

ThunderX2    Size    base (ns/op)    patched (ns/op)    Diff
StringCompareBench.StringCompareLL    128    4265122,232    13099099,67    -207,12%
StringCompareBench.StringCompareLL    256    3539452,533    3599407,432    -1,69%

StringCompareBench.StringCompareUU    128    6899938,75    7174601,241    -3,98%
StringCompareBench.StringCompareUU    256    7654538,841    7826599,466    -2,25%

StringCompareBench.cachedStringCompareLL    128    19,673    21,242    -7,98%
StringCompareBench.cachedStringCompareLL    256    34,179    36,452    -6,65%
StringCompareBench.cachedStringCompareLL    512    59,574    64,088    -7,58%
StringCompareBench.cachedStringCompareLL    1024    110,37    118,477    -7,35%
StringCompareBench.cachedStringCompareLL    1000000    114028,907    115388,681    -1,19%

StringCompareBench.cachedStringCompareUU    128    33,752    36,922    -9,39%
StringCompareBench.cachedStringCompareUU    256    60,939    64,096    -5,18%
StringCompareBench.cachedStringCompareUU    512    115,328    118,48    -2,73%
StringCompareBench.cachedStringCompareUU    1024    239,332    242,97    -1,52%
StringCompareBench.cachedStringCompareUU    1000000    226491,096    233638,328    -3,16%
It might be the case that the newly added branch is the culprit:

+      __ subs(rscratch2, cnt2, largeLoopExitCondition);
+      __ br(__ LT, NO_PREFETCH);

Maybe you could skip it when CompareLongStringLimitLatin and CompareLongStringLimitUTF are large enough (then stub code is only called with string length large enough to skip branch above). Then (the properly commented) code would look like:

if ((stub_threshold-wordSize/(isLL ? 1 : 2)) < largeLoopExitCondition) {
     __ subs(rscratch2, cnt2, largeLoopExitCondition);
     __ br(__ LT, NO_PREFETCH);
}

and in this case we shouldn't see any performance penalties.

-Aleksei

On 29/10/2019 12:58, Patrick Zhang OS wrote:

Hi,



Could you please review this patch, thanks.



JBS: https://bugs.openjdk.java.net/browse/JDK-8229351

Webrev: http://cr.openjdk.java.net/~qpzhang/8229351/webrev.02

(this starts from .02 since there had been some internal review and updates)



Changes:



1.       Split the STUB_THRESHOLD from the hard-coded 72 to be CompareLongStringLimitLatin and CompareLongStringLimitUTF as a more flexible control over the stub thresholds for string_compare intrinsics, especially for various uArchs.



2.       MacroAssembler::string_compare LL and UU shared the same threshold, actually UU may only require the half (length of chars) of that of LL's, because one character has two-bytes for UU, while for compacted LL strings, one character means one byte. In addition, LU/UL may need a separated threshold, as the stub function is different from the same encoding one, and the performance may vary as well.



3.       In generate_compare_long_string_same_encoding, the hard-coded 72 was originally able to ensure that there can be always 64 bytes at least for the prefetch code path. However once a smaller stub threshold is set, a new condition is needed to tell if this would be still valid, or has to go to the NO_PREFETCH branch. This change can ensure the correctness.



4.       In generate_compare_long_string_different_encoding, some temp vars for handling the last 4 characters are not valid any longer, cleaned up strU and strL, and related pointers initialization to the next U (cnt1) and L (tmp2).



5.       In compare_string_16_x_LU, the reference to r10 (tmp1) is not needed, as tmpU or tmpL point to the same register.



Tests:



  1.  For function check, I have run



jdk jtreg tier1 tests, with default vm flags



hotspot jtreg tests: runtime/compiler/gc parts, with "-Xcomp -XX:-TieredCompilation"



jck10/api/java.lang 1609 cases and other selected modules, no new failures found, with default vm flags and "-Xcomp -XX:-TieredCompilation" respectively;



some specific test cases had been carefully executed to double check, i.e., TestStringCompareToDifferentLength.java [1] and TestStringCompareToDifferentLength.java [1] introduced by [2], StrCmpTest.java [3] introduced by [4].



  1.  For performance check, I have run



string-density-bench/CompareToBench.java [5] and StringCompareBench.java [6] respectively,



and SPECjbb2015.jar, no obvious performance change has been found (since the default threshold is NOT changed within this patch).



FYI. with Ampere eMAG system, microbenchmarks [5][6] can have 1.5x consistent perf gain with LU/UL comparison for shorter strings (<72 chars, smaller stub thresholds), and slight improvement (5~10%) with LL/UU cases.



Refs:

[1] http://hg.openjdk.java.net/jdk/jdk/file/3df2bf731a87/test/hotspot/jtreg/compiler/intrinsics/string

[2] https://bugs.openjdk.java.net/browse/JDK-8218966 AArch64: String.compareTo() can read memory after string

[3] http://cr.openjdk.java.net/~dpochepk/8202326/StrCmpTest.java, contributed by Dmitrij Pochepko

[4] https://bugs.openjdk.java.net/browse/JDK-8202326 AARCH64: optimize string compare intrinsic

[5] http://cr.openjdk.java.net/~shade/density/string-density-bench.jar, contributed by Aleksey Shipilev

[6] http://cr.openjdk.java.net/~dpochepk/8202326/StringCompareBench.java, contributed by Dmitrij Pochepko



Regards

Patrick




More information about the aarch64-port-dev mailing list