[aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

Thu Nov 14 09:20:01 UTC 2019

Thanks for the comments, see my answers below please.

>> 1. This patch seems to do rather a lot.
Yes, it enables tweaking the stub parameters (not really changed any in this patch), fixed an out-of-boundary prefetching for LL/UU, and fixed some redundant instructions in LU/UL code path. 
The latter two are code-quality-wise, if splitting the patch could make the changes clearer, I'd like to do.

>> 2. Are the thresholds bytes or characters?
All thresholds are (and should be) in characters. This was a little bit misleading, for LL/LU/UL, the const STUB_THRESHOLD meant chars, while for UU it could be explained as bytes. If specified -XX:-CompactStrings, all code path going to UU would make the threshold mean bytes, which might confuse developers. This patch can clarify it, and the description of tunable options can provide further guidance.

>> 3. How are we supposed to test with these different thresholds?
There are two jtreg tests for checking the impacts of SoftwarePrefetchHintDistance over the intrinsics, I have locally added non-default thresholds inside and tested with many lengths (took days on a test system). This has not been included in the proposed patch, maybe a follow-up one would do, any advice?
hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToSameLength.java
hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToDifferentLength.java

>> 4. What are the thresholds you tested? 
Firstly, the default threshold, the hardcoded 72 is my testing focus since I would try best not to bring negative impacts to aarch64-port normal state, especially other CPU vendors.
Second, I tested two extreme thresholds: 24 and 255, which means more shorter strings (24 to 71 chars) or only very long strings (>=255) could go to the stub code path, respectively. Function tests passed (listed in the initial email), while performance test results (with string-density-bench, StringCompareBench.java, and SPECjbb2015) could be varying with different systems (as well as microarchitectures).
Third, some other non-default thresholds, as sanity check, particularly for ensuring correctness.

>> 5. But the more serious problem is the fact that we have different code paths for different microarchitectures, and somehow this has to be standard supportable software. In order to test this stuff we'll need different test parameters for SoftwarePrefetchHintDistance, CompareLongStringLimitLatin, CompareLongStringLimitUTF
The STUB_THRESHOLD was introduced to control the stub code insertion, tested on some aarch64 systems. I think making it tunable is the way to let different microarchitectures be able to configure optimal ones for their own. I would like to have a common threshold too, or no threshold for all, but lacking of full-coverage tests over all systems. Maybe I misunderstood you points here with regards to "supportable", the two new options can be kept as default if developers have no concerns on string compare intrinsics.

>> 6. We already emit a great deal of in-line code in the string_compare intrinsic, with the intention that this be as fast as possible because we want to avoid having to call the intrinsic. So why is the intrinsic actually faster in your case? 
Avoid having to call the intrinsic? Per my testing results with microbenchmarks like string-density-bench.jar, the LL cases can be up to 10x faster than the non-intrinsic path, while for some public benchmarks with SPECjbb, Renaissance, 99% string_compare inside are LL, the intrinsics definitely can help a lot as well. If you did NOT mean completely "avoiding intrinsic", but the strings shorter than 72 chars, I would have to say, "it depends". The stub functions try best to process every 16 chars, while the outer logic processes every 8 bytes, which is the major diff. For example, I can see consistent 1.5x faster with lengths 24-71 for LU/UL cases, maybe others cannot, which can be reason why we need an option here.

Regards
Patrick

-----Original Message-----
From: Andrew Haley <aph at redhat.com> 
Sent: Wednesday, November 13, 2019 8:27 PM
To: Patrick Zhang OS <patrick at os.amperecomputing.com>; aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

On 10/29/19 9:58 AM, Patrick Zhang OS wrote:

> 1.  Split the STUB_THRESHOLD from the hard-coded 72 to be 
> CompareLongStringLimitLatin and CompareLongStringLimitUTF as a more 
> flexible control over the stub thresholds for string_compare 
> intrinsics, especially for various uArchs.
> 
> 2.  MacroAssembler::string_compare LL and UU shared the same 
> threshold, actually UU may only require the half (length of chars) of 
> that of LL's, because one character has two-bytes for UU, while for 
> compacted LL strings, one character means one byte. In addition, LU/UL 
> may need a separated threshold, as the stub function is different from 
> the same encoding one, and the performance may vary as well.
> 
> 3.  In generate_compare_long_string_same_encoding, the hard-coded 72 
> was originally able to ensure that there can be always 64 bytes at 
> least for the prefetch code path. However once a smaller stub 
> threshold is set, a new condition is needed to tell if this would be 
> still valid, or has to go to the NO_PREFETCH branch. This change can 
> ensure the correctness.
> 
> 4.  In generate_compare_long_string_different_encoding, some temp vars 
> for handling the last 4 characters are not valid any longer, cleaned 
> up strU and strL, and related pointers initialization to the next U 
> (cnt1) and L (tmp2).
> 
> 5.  In compare_string_16_x_LU, the reference to r10 (tmp1) is not 
> needed, as tmpU or tmpL point to the same register.

Thank you for your patch, but I'm afraid that I have some reservations.

This patch seems to do rather a lot.

What are the thresholds you tested? How are we supposed to test with these different thresholds? Are the thresholds bytes or characters?
Why are the different thresholds not tested in this patch?

But the more serious problem is the fact that we have different code paths for different microarchitectures, and somehow this has to be standard supportable software. In order to test this stuff we'll need different test parameters for SoftwarePrefetchHintDistance, CompareLongStringLimitLatin, CompareLongStringLimitUTF.

Bear in mind that while manufacturers are (entirely reasonably) very keen to show their processors in the best light possible, they are not the people who will have to support this software and debug it when it goes wrong. So there is a fundamental conflict of interest between support people and CPU vendors.

We already emit a great deal of in-line code in the string_compare intrinsic, with the intention that this be as fast as possible because we want to avoid having to call the intrinsic. So why is the intrinsic actually faster in your case? Could we not concentrate on that?

I -- and I'm sure it's not just me -- would be tremendously grateful if all of the AArch64 developers would concentrate on improving code quality overall rather than tweaking stub parameters.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com> https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671