[aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

Tue Nov 12 16:10:28 UTC 2019

Hi Patrick,

First, I'm a trespasser, not a reviewer. Reviewers will need to look at
this.

On the technical side:

This additional branch in v3 is still painful. You can reduce the amount
of branches in code path for lengths 128 and 256 by using that fact that
CompareLongStringLimitLatin and CompareLongStringLimitUTF are at least
24. Then we don't have to jump to NO_PREFETCH label, where check for
small string size is done. Instead we can jump to SMALL_LOOP label
(assuming cnt2 counter is updated accordingly). In this case NO_PREFETCH
label is not needed and we have 1 less branch.

Rough sketch of affected part based on v3 looks as follows. This version
was checked on ThunderX2 and it looks fine on length 128 and 256
perf-wise. I also added alignment for small loop, which also helps a
bit. Please keep in mind it's a sketch.

-Aleksei

@@ -4172,19 +4168,34 @@
     Register result = r0, str1 = r1, cnt1 = r2, str2 = r3, cnt2 = r4,
         tmp1 = r10, tmp2 = r11;
     Label SMALL_LOOP, LARGE_LOOP_PREFETCH, CHECK_LAST, DIFF2, TAIL,
-        LENGTH_DIFF, DIFF, LAST_CHECK_AND_LENGTH_DIFF,
+        LENGTH_DIFF, DIFF, LAST_CHECK_AND_LENGTH_DIFF, SMALL_LOOP_CHECK,
         DIFF_LAST_POSITION, DIFF_LAST_POSITION2;
     // exit from large loop when less than 64 bytes left to read or
we're about
     // to prefetch memory behind array border
     int largeLoopExitCondition = MAX(64,
SoftwarePrefetchHintDistance)/(isLL ? 1 : 2);
+    // calculate the remaining limit in chars which manages if this
stub should be called,
+    // if the limit is large enough (>= largeLoopExitCondition), below
large loop with prefetching
+    // can be executed at least once, and there is no need to do any
extra checking at the entrance.
+    int remainingLimit = (isLL ? CompareLongStringLimitLatin :
CompareLongStringLimitUTF) -
+                         (wordSize / (isLL ? 1 : 2));
     // cnt1/cnt2 contains amount of characters to compare. cnt1 can be
re-used
     // update cnt2 counter with already loaded 8 bytes
-    __ sub(cnt2, cnt2, wordSize/(isLL ? 1 : 2));
+    if (SoftwarePrefetchHintDistance >= 0 && remainingLimit <
largeLoopExitCondition) {
+      __ sub(cnt2, cnt2, isLL ? 24 : 12);
+    } else {
+      __ sub(cnt2, cnt2, isLL ? 8 : 4);
+    }
     // update pointers, because of previous read
     __ add(str1, str1, wordSize);
     __ add(str2, str2, wordSize);
     if (SoftwarePrefetchHintDistance >= 0) {
-      __ bind(LARGE_LOOP_PREFETCH);
+      if (remainingLimit < largeLoopExitCondition) {
+        // there could be fewer bytes left and invalid for this large
loop with prefetching
+        __ subs(rscratch2, cnt2, largeLoopExitCondition - (isLL ? 16 :
8));
+        __ br(__ LT, SMALL_LOOP);
+        __ add(cnt2, cnt2, isLL ? 16 : 8);
+      }
+      __ bind(LARGE_LOOP_PREFETCH); // 64 bytes loop
         __ prfm(Address(str1, SoftwarePrefetchHintDistance));
         __ prfm(Address(str2, SoftwarePrefetchHintDistance));
         compare_string_16_bytes_same(DIFF, DIFF2);
@@ -4196,11 +4207,11 @@
         __ br(__ GT, LARGE_LOOP_PREFETCH);
         __ cbz(cnt2, LAST_CHECK_AND_LENGTH_DIFF); // no more chars left?
     }
-    // less than 16 bytes left?
-    __ subs(cnt2, cnt2, isLL ? 16 : 8);
-    __ br(__ LT, TAIL);
+    __ b(SMALL_LOOP_CHECK); // check if less than 16 bytes left
+    __ align(OptoLoopAlignment);
     __ bind(SMALL_LOOP);
       compare_string_16_bytes_same(DIFF, DIFF2);
+      __ bind(SMALL_LOOP_CHECK);
       __ subs(cnt2, cnt2, isLL ? 16 : 8);
       __ br(__ GE, SMALL_LOOP);
     __ bind(TAIL);



On 12/11/2019 12:52, Patrick Zhang OS wrote:
> Ping...
>
> Hi Aleksei,
>
> Does the potential regression on my test system still exist? with the new patch webrev.03? If the added largeLoopExitCondition condition excluded your >128 chars strings from the large loop, and still caused performance drops, maybe hacking all the software prefetch hint distance and the checking condition to hardcoded 64, can be a good try. Although I think it would not be a right thing to do, in comparison with the similar logic in generate_compare_long_string_different_encoding. Thanks. 
>
> http://cr.openjdk.java.net/~qpzhang/8229351/webrev.03/jdk.changeset 
>      if (SoftwarePrefetchHintDistance >= 0) {
> -      __ bind(LARGE_LOOP_PREFETCH);
> +      if (remainingLimit < largeLoopExitCondition) {
> +        // there could be fewer bytes left and invalid for this large loop with prefetching
> +        __ subs(rscratch2, cnt2, largeLoopExitCondition); // => subs(rscratch2, cnt2, 64); ??
> +        __ br(__ LT, NO_PREFETCH);
> +      }
> +      __ bind(LARGE_LOOP_PREFETCH); // 64 bytes loop
>          __ prfm(Address(str1, SoftwarePrefetchHintDistance)); //  => __ prfm(Address(str1, 64));
>          __ prfm(Address(str2, SoftwarePrefetchHintDistance)); //  => __ prfm(Address(str2, 64));
>
> Regards
> Patrick
>
> -----Original Message-----
> From: aarch64-port-dev <aarch64-port-dev-bounces at openjdk.java.net> On Behalf Of Patrick Zhang OS
> Sent: Thursday, November 7, 2019 6:56 PM
> To: Aleksei Voitylov <aleksei.voitylov at bell-sw.com>
> Cc: aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable
>
> Hi Aleksei,
>
> Thanks for testing it and the data. I only had the source of StringCompareBench.java [6], my numbers (the diffs) are within 2%, while the -207.12% looks quite weird. I initially did not add the condition to control the br, since generate_ compare_long_string_different_encoding has the similar unconditional br. By the way, the original logic allowed prefetching the memory behind array border, for the first 64 bytes. I think securing the prefetch is the right thing to do, but it could certainly stop some cases from going to the large loop with prefetching. Welcome further comments, thanks.
>
> http://cr.openjdk.java.net/~qpzhang/8229351/webrev.03/
>
> Regards
> Patrick
>
> From: Aleksei Voitylov <aleksei.voitylov at bell-sw.com>
> Sent: Thursday, November 7, 2019 12:53 AM
> To: Patrick Zhang OS <patrick at os.amperecomputing.com>
> Cc: aarch64-port-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable
>
>
> Hi Patrick,
>
> I like the fact that this patch does not add much to the complexity of the code. Here are some experiments that you could find useful.
> Cortex A73    Size    base (ns/op)    patched (ns/op)    Diff
> StringCompareBench.StringCompareLL    256    14422257,98    15302300,24    -6,10%
> StringCompareBench.StringCompareLL    512    27998036,21    28317818,08    -1,14%
>
> ThunderX2    Size    base (ns/op)    patched (ns/op)    Diff
> StringCompareBench.StringCompareLL    128    4265122,232    13099099,67    -207,12%
> StringCompareBench.StringCompareLL    256    3539452,533    3599407,432    -1,69%
>
> StringCompareBench.StringCompareUU    128    6899938,75    7174601,241    -3,98%
> StringCompareBench.StringCompareUU    256    7654538,841    7826599,466    -2,25%
>
> StringCompareBench.cachedStringCompareLL    128    19,673    21,242    -7,98%
> StringCompareBench.cachedStringCompareLL    256    34,179    36,452    -6,65%
> StringCompareBench.cachedStringCompareLL    512    59,574    64,088    -7,58%
> StringCompareBench.cachedStringCompareLL    1024    110,37    118,477    -7,35%
> StringCompareBench.cachedStringCompareLL    1000000    114028,907    115388,681    -1,19%
>
> StringCompareBench.cachedStringCompareUU    128    33,752    36,922    -9,39%
> StringCompareBench.cachedStringCompareUU    256    60,939    64,096    -5,18%
> StringCompareBench.cachedStringCompareUU    512    115,328    118,48    -2,73%
> StringCompareBench.cachedStringCompareUU    1024    239,332    242,97    -1,52%
> StringCompareBench.cachedStringCompareUU    1000000    226491,096    233638,328    -3,16%
> It might be the case that the newly added branch is the culprit:
>
> +      __ subs(rscratch2, cnt2, largeLoopExitCondition);
> +      __ br(__ LT, NO_PREFETCH);
>
> Maybe you could skip it when CompareLongStringLimitLatin and CompareLongStringLimitUTF are large enough (then stub code is only called with string length large enough to skip branch above). Then (the properly commented) code would look like:
>
> if ((stub_threshold-wordSize/(isLL ? 1 : 2)) < largeLoopExitCondition) {
>      __ subs(rscratch2, cnt2, largeLoopExitCondition);
>      __ br(__ LT, NO_PREFETCH);
> }
>
> and in this case we shouldn't see any performance penalties.
>
> -Aleksei
>
> On 29/10/2019 12:58, Patrick Zhang OS wrote:
>
> Hi,
>
>
>
> Could you please review this patch, thanks.
>
>
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8229351
>
> Webrev: http://cr.openjdk.java.net/~qpzhang/8229351/webrev.02
>
> (this starts from .02 since there had been some internal review and updates)
>
>
>
> Changes:
>
>
>
> 1.       Split the STUB_THRESHOLD from the hard-coded 72 to be CompareLongStringLimitLatin and CompareLongStringLimitUTF as a more flexible control over the stub thresholds for string_compare intrinsics, especially for various uArchs.
>
>
>
> 2.       MacroAssembler::string_compare LL and UU shared the same threshold, actually UU may only require the half (length of chars) of that of LL's, because one character has two-bytes for UU, while for compacted LL strings, one character means one byte. In addition, LU/UL may need a separated threshold, as the stub function is different from the same encoding one, and the performance may vary as well.
>
>
>
> 3.       In generate_compare_long_string_same_encoding, the hard-coded 72 was originally able to ensure that there can be always 64 bytes at least for the prefetch code path. However once a smaller stub threshold is set, a new condition is needed to tell if this would be still valid, or has to go to the NO_PREFETCH branch. This change can ensure the correctness.
>
>
>
> 4.       In generate_compare_long_string_different_encoding, some temp vars for handling the last 4 characters are not valid any longer, cleaned up strU and strL, and related pointers initialization to the next U (cnt1) and L (tmp2).
>
>
>
> 5.       In compare_string_16_x_LU, the reference to r10 (tmp1) is not needed, as tmpU or tmpL point to the same register.
>
>
>
> Tests:
>
>
>
>   1.  For function check, I have run
>
>
>
> jdk jtreg tier1 tests, with default vm flags
>
>
>
> hotspot jtreg tests: runtime/compiler/gc parts, with "-Xcomp -XX:-TieredCompilation"
>
>
>
> jck10/api/java.lang 1609 cases and other selected modules, no new failures found, with default vm flags and "-Xcomp -XX:-TieredCompilation" respectively;
>
>
>
> some specific test cases had been carefully executed to double check, i.e., TestStringCompareToDifferentLength.java [1] and TestStringCompareToDifferentLength.java [1] introduced by [2], StrCmpTest.java [3] introduced by [4].
>
>
>
>   1.  For performance check, I have run
>
>
>
> string-density-bench/CompareToBench.java [5] and StringCompareBench.java [6] respectively,
>
>
>
> and SPECjbb2015.jar, no obvious performance change has been found (since the default threshold is NOT changed within this patch).
>
>
>
> FYI. with Ampere eMAG system, microbenchmarks [5][6] can have 1.5x consistent perf gain with LU/UL comparison for shorter strings (<72 chars, smaller stub thresholds), and slight improvement (5~10%) with LL/UU cases.
>
>
>
> Refs:
>
> [1] http://hg.openjdk.java.net/jdk/jdk/file/3df2bf731a87/test/hotspot/jtreg/compiler/intrinsics/string
>
> [2] https://bugs.openjdk.java.net/browse/JDK-8218966 AArch64: String.compareTo() can read memory after string
>
> [3] http://cr.openjdk.java.net/~dpochepk/8202326/StrCmpTest.java, contributed by Dmitrij Pochepko
>
> [4] https://bugs.openjdk.java.net/browse/JDK-8202326 AARCH64: optimize string compare intrinsic
>
> [5] http://cr.openjdk.java.net/~shade/density/string-density-bench.jar, contributed by Aleksey Shipilev
>
> [6] http://cr.openjdk.java.net/~dpochepk/8202326/StringCompareBench.java, contributed by Dmitrij Pochepko
>
>
>
> Regards
>
> Patrick
>
>