RFR: 8262519: AArch64: Unnecessary acquire semantics of memory-order-conservative atomics in C++ Hotspot code

Mon Mar 8 07:26:08 UTC 2021

On Fri, 5 Mar 2021 11:53:07 GMT, Andrew Dinn <adinn at openjdk.org> wrote:

>>> For us, we still have servers used by our customers that does not support LSE extension.
>>> Hm, from our point of view, ldaxr+stlxr+dmb and ldxr+stlxr+dmb provide the same order semantics.
>>> The acquire are used to ensure all loads/stores that are after an ldaxr (actually loads/stores after the dmb of atomic_*default*_impl in this case) in program order, while the dmb has already guaranteed this for us.
>>> Without the acquire, the loads/stores after the atomic operations still can not pass the dmb.
>> Remove the acquire does not change the order between preceding loads/stores and stlxr.
>> 
>> I agree that the code will still be correct if you change the ldaxr to ldar. While this may make some difference on machines which do not support LSE I would not expect it to be significant for anything other than a very carefully crafted benchmark or an extremely specialized parallel algorithm. Is this change request motivated by an actual real-world use case?
>
> Correction:
> 
> I agree that the code will still be correct if you change the ldaxr to *ldxr*.

> > Hm, from our point of view, ldaxr+stlxr+dmb and ldxr+stlxr+dmb provide the same order semantics.
> > The acquire are used to ensure all loads/stores that are after an ldaxr (actually loads/stores after the dmb of atomic__default__impl in this case) in program order, while the dmb has already guaranteed this for us.
> > Without the acquire, the loads/stores after the atomic operations still can not pass the dmb.
> > Remove the acquire does not change the order between preceding loads/stores and stlxr.
> 
> I agree that the code will still be correct if you change the ldaxr to ldar. While this may make some difference on machines which do not support LSE I would not expect it to be significant for anything other than a very carefully crafted benchmark or an extremely specialized parallel algorithm. Is this change request motivated by an actual real-world use case?

Thanks for the comments.
We only witnessed ~4% improvements with C code below on one of our platform.
The load-acquire is implemented as a full-barrier by the core in fact.
// highly-contended fetch_and_add test, %3.96 improvements with 8 threads
unsigned long res = 0;
unsigned long sum = 0;
extern unsigned long aarch64_atomic_fetch_add_8_default_impl(void *ptr, unsigned long val);

void *executor (void *arg)
{
  for (int i = 0; i < ITER; i++) {
    sum += aarch64_atomic_fetch_add_8_default_impl(&res, 1);
  }
}

int main(int argc, char **args)
{
  int i;
  pthread_t exethreads[THREADS];
  int threads = atoi(args[1]);

  for (i = 0; i < threads; i++)
    pthread_create(&exethreads[i], NULL, executor, NULL);
  for (i = 0; i < threads; i++)
    pthread_join(exethreads[i], NULL);

  return 0;
}
While we didn't have any noticeable improvements with JAVA tests we tried, e.g. `test/micro/org/openjdk/bench/vm/gc/Alloc.java`.
Seems the percentages of the atomics are too low, not to mention the actual real-world use case, e.g. Spark, Tomcat.

I guess we do not have to change the `ldaxr` to `ldxr` now, due to we haven't seen any significant performance enhancements yet. :-)
But I feel a little inconsistent that we have code use the stronger semantics, while a weaker instruction can still provide the correct order semantics.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2788