[aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

Thu Nov 28 03:11:48 UTC 2019

Hi Andrew,

I collected the timings and did a comparison, please see the spread sheet in [1].
Per the comments from Dmitrij in another thread, and rethought the concerns you and Andrew Dinn reminded, I revised the patch to drop the tunable flags and the extra overprefech checking for LL/UU, then updated the shared STUB_THRESHOLD for UU/LU/UL respectively, according to the experimental data (but note that I only have one aarch64 system, the coverage might be limited). 
Please review.

JBS: https://bugs.openjdk.java.net/browse/JDK-8229351 
Webrev: http://cr.openjdk.java.net/~qpzhang/8229351/webrev.05

1. LL is the most common use case, especially the shorter strings, I will not change this as the tests cannot consistently produce a very positive data. So this part is same as what Dmitrij worked on, nothing changed.
2. UU used the same threshold 72 (chars) as LL, which meant 144 bytes for UU and 72 bytes for LL (with -XX:+CompactStrings by default). I updated it from 72 to 36 so the limit is fair to UU now. The test shows (see figure [2]) there is in average ~10% perf gains with [36, 71] characters, other lengths are same.
3. LU/UL, updated the threshold from 72 (chars) to 24 (chars). According to the algorithm in generate_compare_long_string_different_encoding, 24 is the minimum length that can take the advantage of compare_string_16_x_LU function to process 16 chars (32 bytes) in a loop, and can be faster than the outer function which processes every 8 bytes in the main loop. See figure [3], the perf gains are up to 60% (the secondary axis at the right)
4. Added " align(OptoLoopAlignment);" for main loops in the stub code, per early suggestions from Aleksei.
5. Updated the two relevant test cases under test/hotspot/jtreg/compiler/intrinsics/string, with additional string lengths that can better cover the cases referred in this patch.

More about the figures in the spread sheet and the JPG files (same):
The lengths are scatter points [4] suggested by Andrew, the main axis (at left) shows the times (multiplied by a const, so don't use the absolute values with ns/op), the blue dots shows the base trend, the orange dots belong to the patch, while the yellow dots (secondary axis) stand for the perf gains (patch vs base). For example, in [3], with the patch, the orange curve (patch) becomes be "monotonically increasing" with [24, 71] chars, which is better than the shape of blue curve (base) and it is what I (we) want.

Tests: jtreg tier1 all, hotspot_all, string-density-bench.jar, no regression found.

[1] http://cr.openjdk.java.net/~qpzhang/8229351/8229351-strcmp-perf.xlsx 
[2] http://cr.openjdk.java.net/~qpzhang/8229351/perf-strcmp-UU.JPG 
[3] http://cr.openjdk.java.net/~qpzhang/8229351/perf-strcmp-LU.JPG 
[4] 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 19, 21, 23, 25, 28, 30, 34, 37, 41, 45, 49, 54, 60, 66, 72, 80, 88, 97, 106, 117, 129, 142, 156, 171, 189, 207, 228, 251

Regards
Patrick

-----Original Message-----
From: aarch64-port-dev <aarch64-port-dev-bounces at openjdk.java.net> On Behalf Of Patrick Zhang OS
Sent: Friday, November 15, 2019 4:05 PM
To: Andrew Haley <aph at redhat.com>; Andrew Dinn <adinn at redhat.com>
Cc: aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

To avoid future confusion, I am going to split the patch, take out the updates for generate_compare_long_string_different_encoding, which drops two redundant temp Register vars and related unused instructions, then create a new for your review. It has nothing to do with the proposed option. 

And I will continue working the remaining parts according to your comments and suggestions..

Regards
Patrick

-----Original Message-----
From: aarch64-port-dev <aarch64-port-dev-bounces at openjdk.java.net> On Behalf Of Patrick Zhang OS
Sent: Thursday, November 14, 2019 7:14 PM
To: Andrew Haley <aph at redhat.com>; aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

>> Why do we care about out-of-boundary prefetching for LL/UU? I don't think we do if it requires any extra logic.
I was thinking out-of-boundary prefetching should be prevented, and UL/LU has the same condition, if no need, we could force set largeLoopExitCondition to be 64 so that more cases can freely stay in the large loop. I don't think so.
http://hg.openjdk.java.net/jdk/jdk/file/355f4f42dda5/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#l4194 
    if (SoftwarePrefetchHintDistance >= 0) {
      __ bind(LARGE_LOOP_PREFETCH);
        __ prfm(Address(str1, SoftwarePrefetchHintDistance));
        __ prfm(Address(str2, SoftwarePrefetchHintDistance));
        compare_string_16_bytes_same(DIFF, DIFF2);
        compare_string_16_bytes_same(DIFF, DIFF2);
        __ sub(cnt2, cnt2, isLL ? 64 : 32);
        compare_string_16_bytes_same(DIFF, DIFF2);
-        __ subs(rscratch2, cnt2, largeLoopExitCondition);
+        __ subs(rscratch2, cnt2, 64);
        compare_string_16_bytes_same(DIFF, DIFF2);
        __ br(__ GT, LARGE_LOOP_PREFETCH);
        __ cbz(cnt2, LAST_CHECK_AND_LENGTH_DIFF); // no more chars left?
    }

>> Do you have a theory that LU/UL cases are common? Why?
The only "theory" can be compare_string_16_x_LU (in the stub) is fater than the 8 bytes main loop (out of the stub) (http://hg.openjdk.java.net/jdk/jdk/file/355f4f42dda5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#l4997), even not in the large loop of stub, the small loop can be faster as well since it is able to process more bytes within fewer instructions (http://hg.openjdk.java.net/jdk/jdk/file/355f4f42dda5/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#l4203). 

I can prepare a new patch with the updates to tests, and plot the timings soon latter.

Regards
Patrick

-----Original Message-----
From: Andrew Haley <aph at redhat.com>
Sent: Thursday, November 14, 2019 6:33 PM
To: Patrick Zhang OS <patrick at os.amperecomputing.com>; aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

On 11/14/19 9:20 AM, Patrick Zhang OS wrote:
> Thanks for the comments, see my answers below please.
> 
>>> 1. This patch seems to do rather a lot.

> Yes, it enables tweaking the stub parameters (not really changed any 
> in this patch), fixed an out-of-boundary prefetching for LL/UU, and 
> fixed some redundant instructions in LU/UL code path.  The latter two 
> are code-quality-wise, if splitting the patch could make the changes 
> clearer, I'd like to do.

Why do we care about out-of-boundary prefetching for LL/UU? I don't think we do if it requires any extra logic.

>>> 2. Are the thresholds bytes or characters?

> All thresholds are (and should be) in characters. This was a little 
> bit misleading, for LL/LU/UL, the const STUB_THRESHOLD meant chars, 
> while for UU it could be explained as bytes. If specified 
> -XX:-CompactStrings, all code path going to UU would make the 
> threshold mean bytes, which might confuse developers. This patch can 
> clarify it, and the description of tunable options can provide further 
> guidance.

It must. Without some commentary both maintainers and developers are lost. Unless there is some very strong reason, all counts must specify units.

>>> 3. How are we supposed to test with these different thresholds?

> There are two jtreg tests for checking the impacts of 
> SoftwarePrefetchHintDistance over the intrinsics, I have locally added 
> non-default thresholds inside and tested with many lengths (took days 
> on a test system). This has not been included in the proposed patch, 
> maybe a follow-up one would do, any advice?
> hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToSameLength
> .java
> hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToDifferentL
> ength.java

I won't accept this patch unless it is accompanied by test cases that properly exercise the code.

>>> 4. What are the thresholds you tested? 

> Firstly, the default threshold, the hardcoded 72 is my testing focus 
> since I would try best not to bring negative impacts to aarch64-port 
> normal state, especially other CPU vendors.

> Second, I tested two extreme thresholds: 24 and 255, which means more 
> shorter strings (24 to 71 chars) or only very long strings
> (>=255) could go to the stub code path, respectively. Function tests 
> passed (listed in the initial email), while performance test results 
> (with string-density-bench, StringCompareBench.java, and
> SPECjbb2015) could be varying with different systems (as well as 
> microarchitectures).

> Third, some other non-default thresholds, as sanity check, 
> particularly for ensuring correctness.

It's the extremes that really matter, I suspect.

>>> 5. But the more serious problem is the fact that we have different 
>>> code paths for different microarchitectures, and somehow this has to 
>>> be standard supportable software. In order to test this stuff we'll 
>>> need different test parameters for SoftwarePrefetchHintDistance, 
>>> CompareLongStringLimitLatin, CompareLongStringLimitUTF

> The STUB_THRESHOLD was introduced to control the stub code insertion, 
> tested on some aarch64 systems. I think making it tunable is the way 
> to let different microarchitectures be able to configure optimal ones 
> for their own.

Well, yes. The question is whether we go down this rabbit hole or try to find a compromise that is perhaps not quite optimal for anyone but good enough for everyone.

> I would like to have a common threshold too, or no threshold for all, 
> but lacking of full-coverage tests over all systems. Maybe I 
> misunderstood you points here with regards to "supportable", the two 
> new options can be kept as default if developers have no concerns on 
> string compare intrinsics.

I rather suspect that vendors will want to change the defaults sooner or later. And besides, we'll all have to support these options.

>>> 6. We already emit a great deal of in-line code in the 
>>> string_compare intrinsic, with the intention that this be as fast as 
>>> possible because we want to avoid having to call the intrinsic. So 
>>> why is the intrinsic actually faster in your case?

> Avoid having to call the intrinsic?

I meant "the stub".

> If you did NOT mean completely "avoiding intrinsic", but the strings 
> shorter than 72 chars, I would have to say, "it depends". The stub 
> functions try best to process every 16 chars, while the outer logic 
> processes every 8 bytes, which is the major diff. For example, I can 
> see consistent 1.5x faster with lengths 24-71 for LU/UL cases, maybe 
> others cannot, which can be reason why we need an option here.

I know that strings of length 24 - 30ish are very common, so this is an important case.

Do you have a theory that LU/UL cases are common? Why?

What is it like with LL/UU? I'd need to see real timings.

I'd either do all numbers < 256 or (to save time) a sequence like...

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 19, 21, 23, 25, 28, 30, 34, 37, 41, 45, 49, 54, 60, 66, 72, 80, 88, 97, 106, 117, 129, 142, 156, 171, 189, 207, 228, 251

The idea here is that we an plot a graph. The timings should ideally be monotonically increasing.

And then we could see how different processors behave, and hopefully find a decent solution for all.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671