RFR: 8262519: AArch64: Unnecessary acquire semantics of memory-order-conservative atomics in C++ Hotspot code

Tue Mar 9 02:30:10 UTC 2021

On Mon, 8 Mar 2021 07:23:04 GMT, Dong Bo <dongbo at openjdk.org> wrote:

>> Correction:
>> 
>> I agree that the code will still be correct if you change the ldaxr to *ldxr*.
>
>> > Hm, from our point of view, ldaxr+stlxr+dmb and ldxr+stlxr+dmb provide the same order semantics.
>> > The acquire are used to ensure all loads/stores that are after an ldaxr (actually loads/stores after the dmb of atomic__default__impl in this case) in program order, while the dmb has already guaranteed this for us.
>> > Without the acquire, the loads/stores after the atomic operations still can not pass the dmb.
>> > Remove the acquire does not change the order between preceding loads/stores and stlxr.
>> 
>> I agree that the code will still be correct if you change the ldaxr to ldar. While this may make some difference on machines which do not support LSE I would not expect it to be significant for anything other than a very carefully crafted benchmark or an extremely specialized parallel algorithm. Is this change request motivated by an actual real-world use case?
> 
> Thanks for the comments.
> We only witnessed ~4% improvements with C code below on one of our platform.
> The load-acquire is implemented as a full-barrier by the core in fact.
> // highly-contended fetch_and_add test, %3.96 improvements with 8 threads
> unsigned long res = 0;
> unsigned long sum = 0;
> extern unsigned long aarch64_atomic_fetch_add_8_default_impl(void *ptr, unsigned long val);
> 
> void *executor (void *arg)
> {
>   for (int i = 0; i < ITER; i++) {
>     sum += aarch64_atomic_fetch_add_8_default_impl(&res, 1);
>   }
> }
> 
> int main(int argc, char **args)
> {
>   int i;
>   pthread_t exethreads[THREADS];
>   int threads = atoi(args[1]);
> 
>   for (i = 0; i < threads; i++)
>     pthread_create(&exethreads[i], NULL, executor, NULL);
>   for (i = 0; i < threads; i++)
>     pthread_join(exethreads[i], NULL);
> 
>   return 0;
> }
> While we didn't have any noticeable improvements with JAVA tests we tried, e.g. `test/micro/org/openjdk/bench/vm/gc/Alloc.java`.
> Seems the percentages of the atomics are too low, not to mention the actual real-world use case, e.g. Spark, Tomcat.
> 
> I guess we do not have to change the `ldaxr` to `ldxr` now, due to we haven't seen any significant performance enhancements yet. :-)
> But I feel a little inconsistent that we have code use the stronger semantics, while a weaker instruction can still provide the correct order semantics.

OK, withdrawing... Thank you all for the time.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2788