[aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

Thu Nov 14 11:13:48 UTC 2019

>> Why do we care about out-of-boundary prefetching for LL/UU? I don't think we do if it requires any extra logic.
I was thinking out-of-boundary prefetching should be prevented, and UL/LU has the same condition, if no need, we could force set largeLoopExitCondition to be 64 so that more cases can freely stay in the large loop. I don't think so.
http://hg.openjdk.java.net/jdk/jdk/file/355f4f42dda5/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#l4194 
    if (SoftwarePrefetchHintDistance >= 0) {
      __ bind(LARGE_LOOP_PREFETCH);
        __ prfm(Address(str1, SoftwarePrefetchHintDistance));
        __ prfm(Address(str2, SoftwarePrefetchHintDistance));
        compare_string_16_bytes_same(DIFF, DIFF2);
        compare_string_16_bytes_same(DIFF, DIFF2);
        __ sub(cnt2, cnt2, isLL ? 64 : 32);
        compare_string_16_bytes_same(DIFF, DIFF2);
-        __ subs(rscratch2, cnt2, largeLoopExitCondition);
+        __ subs(rscratch2, cnt2, 64);
        compare_string_16_bytes_same(DIFF, DIFF2);
        __ br(__ GT, LARGE_LOOP_PREFETCH);
        __ cbz(cnt2, LAST_CHECK_AND_LENGTH_DIFF); // no more chars left?
    }

>> Do you have a theory that LU/UL cases are common? Why?
The only "theory" can be compare_string_16_x_LU (in the stub) is fater than the 8 bytes main loop (out of the stub) (http://hg.openjdk.java.net/jdk/jdk/file/355f4f42dda5/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#l4997), even not in the large loop of stub, the small loop can be faster as well since it is able to process more bytes within fewer instructions (http://hg.openjdk.java.net/jdk/jdk/file/355f4f42dda5/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#l4203). 

I can prepare a new patch with the updates to tests, and plot the timings soon latter.

Regards
Patrick

-----Original Message-----
From: Andrew Haley <aph at redhat.com> 
Sent: Thursday, November 14, 2019 6:33 PM
To: Patrick Zhang OS <patrick at os.amperecomputing.com>; aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

On 11/14/19 9:20 AM, Patrick Zhang OS wrote:
> Thanks for the comments, see my answers below please.
> 
>>> 1. This patch seems to do rather a lot.

> Yes, it enables tweaking the stub parameters (not really changed any 
> in this patch), fixed an out-of-boundary prefetching for LL/UU, and 
> fixed some redundant instructions in LU/UL code path.  The latter two 
> are code-quality-wise, if splitting the patch could make the changes 
> clearer, I'd like to do.

Why do we care about out-of-boundary prefetching for LL/UU? I don't think we do if it requires any extra logic.

>>> 2. Are the thresholds bytes or characters?

> All thresholds are (and should be) in characters. This was a little 
> bit misleading, for LL/LU/UL, the const STUB_THRESHOLD meant chars, 
> while for UU it could be explained as bytes. If specified 
> -XX:-CompactStrings, all code path going to UU would make the 
> threshold mean bytes, which might confuse developers. This patch can 
> clarify it, and the description of tunable options can provide further 
> guidance.

It must. Without some commentary both maintainers and developers are lost. Unless there is some very strong reason, all counts must specify units.

>>> 3. How are we supposed to test with these different thresholds?

> There are two jtreg tests for checking the impacts of 
> SoftwarePrefetchHintDistance over the intrinsics, I have locally added 
> non-default thresholds inside and tested with many lengths (took days 
> on a test system). This has not been included in the proposed patch, 
> maybe a follow-up one would do, any advice?
> hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToSameLength
> .java 
> hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToDifferentL
> ength.java

I won't accept this patch unless it is accompanied by test cases that properly exercise the code.

>>> 4. What are the thresholds you tested? 

> Firstly, the default threshold, the hardcoded 72 is my testing focus 
> since I would try best not to bring negative impacts to aarch64-port 
> normal state, especially other CPU vendors.

> Second, I tested two extreme thresholds: 24 and 255, which means more 
> shorter strings (24 to 71 chars) or only very long strings
> (>=255) could go to the stub code path, respectively. Function tests 
> passed (listed in the initial email), while performance test results 
> (with string-density-bench, StringCompareBench.java, and
> SPECjbb2015) could be varying with different systems (as well as 
> microarchitectures).

> Third, some other non-default thresholds, as sanity check, 
> particularly for ensuring correctness.

It's the extremes that really matter, I suspect.

>>> 5. But the more serious problem is the fact that we have different 
>>> code paths for different microarchitectures, and somehow this has to 
>>> be standard supportable software. In order to test this stuff we'll 
>>> need different test parameters for SoftwarePrefetchHintDistance, 
>>> CompareLongStringLimitLatin, CompareLongStringLimitUTF

> The STUB_THRESHOLD was introduced to control the stub code insertion, 
> tested on some aarch64 systems. I think making it tunable is the way 
> to let different microarchitectures be able to configure optimal ones 
> for their own.

Well, yes. The question is whether we go down this rabbit hole or try to find a compromise that is perhaps not quite optimal for anyone but good enough for everyone.

> I would like to have a common threshold too, or no threshold for all, 
> but lacking of full-coverage tests over all systems. Maybe I 
> misunderstood you points here with regards to "supportable", the two 
> new options can be kept as default if developers have no concerns on 
> string compare intrinsics.

I rather suspect that vendors will want to change the defaults sooner or later. And besides, we'll all have to support these options.

>>> 6. We already emit a great deal of in-line code in the 
>>> string_compare intrinsic, with the intention that this be as fast as 
>>> possible because we want to avoid having to call the intrinsic. So 
>>> why is the intrinsic actually faster in your case?

> Avoid having to call the intrinsic?

I meant "the stub".

> If you did NOT mean completely "avoiding intrinsic", but the strings 
> shorter than 72 chars, I would have to say, "it depends". The stub 
> functions try best to process every 16 chars, while the outer logic 
> processes every 8 bytes, which is the major diff. For example, I can 
> see consistent 1.5x faster with lengths 24-71 for LU/UL cases, maybe 
> others cannot, which can be reason why we need an option here.

I know that strings of length 24 - 30ish are very common, so this is an important case.

Do you have a theory that LU/UL cases are common? Why?

What is it like with LL/UU? I'd need to see real timings.

I'd either do all numbers < 256 or (to save time) a sequence like...

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 19, 21, 23, 25, 28, 30, 34, 37, 41, 45, 49, 54, 60, 66, 72, 80, 88, 97, 106, 117, 129, 142, 156, 171, 189, 207, 228, 251

The idea here is that we an plot a graph. The timings should ideally be monotonically increasing.

And then we could see how different processors behave, and hopefully find a decent solution for all.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671