[aarch64-port-dev ] RFR: 8218748: AARCH64: String::compareTo intrinsic documentation and maintenance improvement

Mon Nov 18 12:02:13 UTC 2019

On 18/11/2019 7:03 AM, Patrick Zhang OS wrote:
>>> changing largeLoopExitCondition to 64 LL/UU and 128 for LU/UL will likely make some systems slower as I experimented a lot with this.
> Sorry my second paragraph was inaccurate, it seems you experimented that there were some cases ran well with the first iteration of the large-loop but would rather quit the loop and go to the small-loop immediately for better performance (?). Please correct me if I misunderstood this. Thanks.
>
> Regards
> Patrick

Yes. That's correct.

Thanks,
Dmitrij

>
> -----Original Message-----
> From: hotspot-compiler-dev <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Patrick Zhang OS
> Sent: Monday, November 18, 2019 11:52 AM
> To: Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com>
> Cc: hotspot-compiler-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net
> Subject: RE: [aarch64-port-dev ] RFR: 8218748: AARCH64: String::compareTo intrinsic documentation and maintenance improvement
>
> Thanks for the information.
> I am interested in the inconsistence between same_encoding and different_encoding functions, if "overprefetch" can be safe enough, why do we prevent it at the end of large-loop inside same_encoding, why do we protect it more strictly in different_encoding at both the beginning and ending of the large-loop?
> I did not mean globally updating largeLoopExitCondition to 64/128, merely the condition at the end of large-loop inside same_encoding. Suppose large-loop could be faster than small-loop (in theory), removing all "overprefetch" conditions would allow more strings go to the large-loop for better performance. Any other potential side-effects?
>
> Regards
> Patrick
>
> -----Original Message-----
> From: Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com>
> Sent: Friday, November 15, 2019 11:52 PM
> To: Patrick Zhang OS <patrick at os.amperecomputing.com>
> Cc: hotspot-compiler-dev at openjdk.java.net; aarch64-port-dev at openjdk.java.net; Dmitry Samersoff <dms at samersoff.net>; Andrew Haley <aph at redhat.com>; Pengfei Li (Arm Technology China) <Pengfei.Li at arm.com>
> Subject: Re: [aarch64-port-dev ] RFR: 8218748: AARCH64: String::compareTo intrinsic documentation and maintenance improvement
>
> Hi Patrick,
>
> My experiments back then showed that few platforms (some of Cortex A*
> series) behaves unexpectedly slow when dealing with overprefetch (probably CPU implementation specifics). So this code is some kind of compromise to run relatively well on all platforms I was able to test on (ThunderX, ThunderX2, Cortex A53, Cortex A73). That is the main reason for such code structure.
> It's good that you're willing to experiment and improve it, but I'm afraid changing largeLoopExitCondition to 64 LL/UU and 128 for LU/UL will likely make some systems slower as I experimented a lot with this.
> Let us see the performance results for several systems you've got to avoid a situation when one platform benefits by slowing down others. We could offer some help if you don't have some HW available.
>
> Thanks,
> Dmitrij
>
> On 15/11/2019 10:51 AM, Patrick Zhang OS wrote:
>> Hi Dmitrij,
>>
>> The inline document inside your this patch is nice in helping me understand the string_compare stub code generation in depth, although I don't know why it was not pushed.
>> http://cr.openjdk.java.net/~dpochepk/8218748/webrev.02/src/hotspot/cpu
>> /aarch64/stubGenerator_aarch64.cpp.sdiff.html
>>
>> There is one point I don't quite understand, could you please clarify it a little more? Thanks in advance!
>> The large loop with prefetching logic, in different_encoding function, it uses "cnt2 - prefetchLoopExitCondition" to tell whether the 1st iteration should be executed or not, while the same_encoding does not do this, why?
>>
>> I was thinking that "prefetching out of the string boundary" could be an invalid operation, but it seems a misunderstanding, isn't it? The AArch64 ISA document says that prfm instruction signals the memory system and expects "preloading the cache line containing the specified address into one or more caches" would be done, in order to speed up the memory accesses when they do occur. If this is just a hint for coming ldrs, and safe enough, could we not restrict the rest iterations (2 ~ n) with largeLoopExitCondition? instead use 64 LL/UU, and 128 for LU/UL. As such more iteration can say in the large loop (SoftwarePrefetchHintDistance=192 for example), for better performance. Any comments?
>>
>> Thanks
>>
>> 4327   address generate_compare_long_string_different_encoding(bool isLU) {
>> 4377     if (SoftwarePrefetchHintDistance >= 0) {
>> 4378       __ subs(rscratch2, cnt2, prefetchLoopExitCondition);
>> 4379       __ br(__ LT, NO_PREFETCH);
>> 4380       __ bind(LARGE_LOOP_PREFETCH);                  // 64-characters loop
>> ... ...
>> 4395           __ subs(rscratch2, cnt2, prefetchLoopExitCondition); // <-- could we use subs(rscratch2, cnt2, 128) instead?
>> 4396           __ br(__ GE, LARGE_LOOP_PREFETCH);
>> 4397     } // end of 64-characters loop
>>
>> 4616   address generate_compare_long_string_same_encoding(bool isLL) {
>> 4637     if (SoftwarePrefetchHintDistance >= 0) {
>> 4638       __ bind(LARGE_LOOP_PREFETCH);
>> 4639         __ prfm(Address(str1, SoftwarePrefetchHintDistance));
>> 4640         __ prfm(Address(str2, SoftwarePrefetchHintDistance));
>> 4641         compare_string_16_bytes_same(DIFF, DIFF2);
>> 4642         compare_string_16_bytes_same(DIFF, DIFF2);
>> 4643         __ sub(cnt2, cnt2, 8 * characters_in_word);
>> 4644         compare_string_16_bytes_same(DIFF, DIFF2);
>> 4645         __ subs(rscratch2, cnt2, largeLoopExitCondition); // rscratch2 is not used. Use subs instead of cmp in case of potentially large constants // <-- could we use subs(rscratch2, cnt2, 64) instead?
>> 4646         compare_string_16_bytes_same(DIFF, DIFF2);
>> 4647         __ br(__ GT, LARGE_LOOP_PREFETCH);
>> 4648         __ cbz(cnt2, LAST_CHECK);                    // no more loads left
>> 4649     }
>>
>> Regards
>> Patrick
>>
>> -----Original Message-----
>> From: hotspot-compiler-dev
>> <hotspot-compiler-dev-bounces at openjdk.java.net> On Behalf Of Dmitry
>> Samersoff
>> Sent: Sunday, May 19, 2019 11:42 PM
>> To: Dmitrij Pochepko <dmitrij.pochepko at bell-sw.com>; Andrew Haley
>> <aph at redhat.com>; Pengfei Li (Arm Technology China)
>> <Pengfei.Li at arm.com>
>> Cc: hotspot-compiler-dev at openjdk.java.net;
>> aarch64-port-dev at openjdk.java.net
>> Subject: Re: [aarch64-port-dev ] RFR: 8218748: AARCH64:
>> String::compareTo intrinsic documentation and maintenance improvement
>>
>> Dmitrij,
>>
>> The changes looks good to me.
>>
>> -Dmitry
>>
>> On 25.02.2019 19:52, Dmitrij Pochepko wrote:
>>> Hi Andrew, Pengfei,
>>>
>>> I created webrev.02 with all your suggestions implemented:
>>>
>>> webrev: http://cr.openjdk.java.net/~dpochepk/8218748/webrev.02/
>>>
>>> - comments are now both in separate section and inlined into code.
>>> - documentation mismatch mentioned by Pengfei is fixed:
>>> -- SHORT_LAST_INIT label name misprint changed to correct SHORT_LAST
>>> -- SHORT_LOOP_TAIL block now merged with last instruction.
>>> Documentation is updated respectively
>>> - minor other changes to layout and wording
>>>
>>> Newly developed tests were run as sanity and they passed.
>>>
>>> Thanks,
>>> Dmitrij
>>>
>>> On 22/02/2019 6:42 PM, Andrew Haley wrote:
>>>> On 2/22/19 10:31 AM, Pengfei Li (Arm Technology China) wrote:
>>>>
>>>>> So personally, I still prefer to inline the comments with the
>>>>> original code block to avoid this kind of inconsistencies. And it
>>>>> makes us easier to review or maintain the code together with the
>>>>> doc, as we don't need to scroll back and force. I don't know the
>>>>> benefit of making the code documentation as a separate part. What's
>>>>> your opinion, Andrew Haley?
>>>> I agree with you. There's no harm having both inline and separate.
>>>>