[aarch64-port-dev ] RFR: 8229351: AArch64: Make the stub threshold of string_compare intrinsic tunable

Thu Nov 14 10:33:00 UTC 2019

On 11/14/19 9:20 AM, Patrick Zhang OS wrote:
> Thanks for the comments, see my answers below please.
> 
>>> 1. This patch seems to do rather a lot.

> Yes, it enables tweaking the stub parameters (not really changed any
> in this patch), fixed an out-of-boundary prefetching for LL/UU, and
> fixed some redundant instructions in LU/UL code path.  The latter
> two are code-quality-wise, if splitting the patch could make the
> changes clearer, I'd like to do.

Why do we care about out-of-boundary prefetching for LL/UU? I don't
think we do if it requires any extra logic.

>>> 2. Are the thresholds bytes or characters?

> All thresholds are (and should be) in characters. This was a little
> bit misleading, for LL/LU/UL, the const STUB_THRESHOLD meant chars,
> while for UU it could be explained as bytes. If specified
> -XX:-CompactStrings, all code path going to UU would make the
> threshold mean bytes, which might confuse developers. This patch can
> clarify it, and the description of tunable options can provide
> further guidance.

It must. Without some commentary both maintainers and developers are
lost. Unless there is some very strong reason, all counts must specify
units.

>>> 3. How are we supposed to test with these different thresholds?

> There are two jtreg tests for checking the impacts of
> SoftwarePrefetchHintDistance over the intrinsics, I have locally
> added non-default thresholds inside and tested with many lengths
> (took days on a test system). This has not been included in the
> proposed patch, maybe a follow-up one would do, any advice?
> hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToSameLength.java
> hotspot/jtreg/compiler/intrinsics/string/TestStringCompareToDifferentLength.java

I won't accept this patch unless it is accompanied by test cases that
properly exercise the code.

>>> 4. What are the thresholds you tested? 

> Firstly, the default threshold, the hardcoded 72 is my testing focus
> since I would try best not to bring negative impacts to aarch64-port
> normal state, especially other CPU vendors.

> Second, I tested two extreme thresholds: 24 and 255, which means
> more shorter strings (24 to 71 chars) or only very long strings
> (>=255) could go to the stub code path, respectively. Function tests
> passed (listed in the initial email), while performance test results
> (with string-density-bench, StringCompareBench.java, and
> SPECjbb2015) could be varying with different systems (as well as
> microarchitectures).

> Third, some other non-default thresholds, as sanity check,
> particularly for ensuring correctness.

It's the extremes that really matter, I suspect.

>>> 5. But the more serious problem is the fact that we have different
>>> code paths for different microarchitectures, and somehow this has
>>> to be standard supportable software. In order to test this stuff
>>> we'll need different test parameters for
>>> SoftwarePrefetchHintDistance, CompareLongStringLimitLatin,
>>> CompareLongStringLimitUTF

> The STUB_THRESHOLD was introduced to control the stub code
> insertion, tested on some aarch64 systems. I think making it tunable
> is the way to let different microarchitectures be able to configure
> optimal ones for their own.

Well, yes. The question is whether we go down this rabbit hole or try
to find a compromise that is perhaps not quite optimal for anyone but
good enough for everyone.

> I would like to have a common threshold too, or no threshold for
> all, but lacking of full-coverage tests over all systems. Maybe I
> misunderstood you points here with regards to "supportable", the two
> new options can be kept as default if developers have no concerns on
> string compare intrinsics.

I rather suspect that vendors will want to change the defaults sooner
or later. And besides, we'll all have to support these options.

>>> 6. We already emit a great deal of in-line code in the
>>> string_compare intrinsic, with the intention that this be as fast
>>> as possible because we want to avoid having to call the
>>> intrinsic. So why is the intrinsic actually faster in your case?

> Avoid having to call the intrinsic?

I meant "the stub".

> If you did NOT mean completely "avoiding intrinsic", but the strings
> shorter than 72 chars, I would have to say, "it depends". The stub
> functions try best to process every 16 chars, while the outer logic
> processes every 8 bytes, which is the major diff. For example, I can
> see consistent 1.5x faster with lengths 24-71 for LU/UL cases, maybe
> others cannot, which can be reason why we need an option here.

I know that strings of length 24 - 30ish are very common, so this is
an important case.

Do you have a theory that LU/UL cases are common? Why?

What is it like with LL/UU? I'd need to see real timings.

I'd either do all numbers < 256 or (to save time) a sequence like...

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 19, 21, 23, 25, 28, 30, 34, 37, 41, 45, 49, 54, 60, 66, 72, 80, 88, 97, 106, 117, 129, 142, 156, 171, 189, 207, 228, 251

The idea here is that we an plot a graph. The timings should ideally
be monotonically increasing.

And then we could see how different processors behave, and hopefully
find a decent solution for all.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671