[aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

Wed Nov 13 12:27:26 UTC 2019

On 10/29/19 9:58 AM, Patrick Zhang OS wrote:

> 1.  Split the hard-coded STUB_THRESHOLD of 72 into
> CompareLongStringLimitLatin and CompareLongStringLimitUTF, giving
> more flexible control over the stub thresholds of the string_compare
> intrinsics, especially across different uArchs.
> 
> 2.  In MacroAssembler::string_compare, LL and UU shared the same
> threshold, but UU may need only half of LL's threshold (counted in
> characters), because one character takes two bytes for UU, while
> for compact LL strings one character is one byte. In addition,
> LU/UL may need a separate threshold, as the stub function differs
> from the same-encoding one and the performance may vary as
> well.
> 
> 3.  In generate_compare_long_string_same_encoding, the hard-coded 72
> originally ensured that there were always at least 64 bytes
> available for the prefetch code path. However, once a smaller stub
> threshold is set, a new condition is needed to decide whether this
> still holds or whether the code has to take the NO_PREFETCH branch.
> This change ensures correctness.
> 
> 4.  In generate_compare_long_string_different_encoding, some temp
> vars for handling the last 4 characters are no longer valid, so
> strU and strL were cleaned up, along with the related pointer
> initialization to the next U (cnt1) and L (tmp2).
> 
> 5.  In compare_string_16_x_LU, the reference to r10 (tmp1) is not
> needed, as tmpU or tmpL point to the same register.

Thank you for your patch, but I'm afraid that I have some reservations.

This patch seems to do rather a lot.

What are the thresholds you tested? How are we supposed to test with
these different thresholds? Are the thresholds bytes or characters?
Why are the different thresholds not tested in this patch?
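
To make the bytes-versus-characters question concrete, here is a
minimal sketch of how the same number reads under each convention.
The value 72 is the hard-coded STUB_THRESHOLD from the patch
description; the class and method names are hypothetical and are not
HotSpot code.

```java
// Illustrative only: one threshold value under two unit conventions.
// ThresholdUnits and its methods are hypothetical names, not HotSpot code.
public class ThresholdUnits {
    static final int LATIN1_BYTES_PER_CHAR = 1; // compact (LL) strings
    static final int UTF16_BYTES_PER_CHAR = 2;  // UU strings

    // Bytes covered by a threshold expressed in characters.
    static int bytesForChars(int chars, int bytesPerChar) {
        return chars * bytesPerChar;
    }

    // Characters needed to cover a threshold expressed in bytes (rounded up).
    static int charsForBytes(int bytes, int bytesPerChar) {
        return (bytes + bytesPerChar - 1) / bytesPerChar;
    }

    public static void main(String[] args) {
        int threshold = 72; // the hard-coded value quoted above

        // If the threshold counts characters:
        System.out.println("LL: " + threshold + " chars = "
                + bytesForChars(threshold, LATIN1_BYTES_PER_CHAR) + " bytes");
        System.out.println("UU: " + threshold + " chars = "
                + bytesForChars(threshold, UTF16_BYTES_PER_CHAR) + " bytes");

        // If the threshold counts bytes:
        System.out.println("UU: " + threshold + " bytes = "
                + charsForBytes(threshold, UTF16_BYTES_PER_CHAR) + " chars");
    }
}
```

Under the character reading, a UU string at the threshold is twice as
many bytes as an LL string at the same threshold, which is the
asymmetry point 2 of the patch description refers to.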

But the more serious problem is the fact that we have different code
paths for different microarchitectures, and somehow this has to be
standard supportable software. In order to test this stuff we'll need
different test parameters for SoftwarePrefetchHintDistance,
CompareLongStringLimitLatin, CompareLongStringLimitUTF.
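
As a rough sketch of what that implies for the test matrix, the
snippet below enumerates every combination of the three flags. The
flag names are the ones discussed in this thread; the value sets and
the FlagMatrix class are assumptions chosen purely for illustration,
not values anyone has proposed or tested.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: enumerates flag combinations a test run would need.
// FlagMatrix is a hypothetical helper; the value sets are made up.
public class FlagMatrix {
    // Builds one option string per (prefetch, latin, utf) combination.
    static List<String> combinations(int[] prefetch, int[] latin, int[] utf) {
        List<String> out = new ArrayList<>();
        for (int p : prefetch) {
            for (int l : latin) {
                for (int u : utf) {
                    out.add("-XX:SoftwarePrefetchHintDistance=" + p
                            + " -XX:CompareLongStringLimitLatin=" + l
                            + " -XX:CompareLongStringLimitUTF=" + u);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample values, three per flag.
        int[] prefetch = {-1, 64, 256};
        int[] latin = {16, 72, 128};
        int[] utf = {8, 36, 64};

        List<String> runs = combinations(prefetch, latin, utf);
        System.out.println(runs.size() + " configurations to test");
        runs.forEach(System.out::println);
    }
}
```

Even with only three sample values per flag, the number of
configurations is the product of the three set sizes, and each one
selects a potentially different code path.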

Bear in mind that while manufacturers are (entirely reasonably) very
keen to show their processors in the best light possible, they are not
the people who will have to support this software and debug it when it
goes wrong. So there is a fundamental conflict of interest between
support people and CPU vendors.

We already emit a great deal of in-line code in the string_compare
intrinsic, with the intention that this be as fast as possible because
we want to avoid having to call the stub. So why is the stub
actually faster in your case? Could we not concentrate on that?

I -- and I'm sure it's not just me -- would be tremendously grateful
if all of the AArch64 developers would concentrate on improving code
quality overall rather than tweaking stub parameters.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671