RFR: 8262519: AArch64: Unnecessary acquire semantics of memory-order-conservative atomics in C++ Hotspot code

Tue Mar 9 09:06:07 UTC 2021

On Tue, 9 Mar 2021 02:27:07 GMT, Dong Bo <dongbo at openjdk.org> wrote:

>>> > Hm, from our point of view, ldaxr+stlxr+dmb and ldxr+stlxr+dmb provide the same order semantics.
>>> > The acquire are used to ensure all loads/stores that are after an ldaxr (actually loads/stores after the dmb of atomic__default__impl in this case) in program order, while the dmb has already guaranteed this for us.
>>> > Without the acquire, the loads/stores after the atomic operations still can not pass the dmb.
>>> > Remove the acquire does not change the order between preceding loads/stores and stlxr.
>>> 
>>> I agree that the code will still be correct if you change the ldaxr to ldar. While this may make some difference on machines which do not support LSE I would not expect it to be significant for anything other than a very carefully crafted benchmark or an extremely specialized parallel algorithm. Is this change request motivated by an actual real-world use case?
>> 
>> Thanks for the comments.
>> We only witnessed ~4% improvements with C code below on one of our platform.
>> The load-acquire is implemented as a full-barrier by the core in fact.
>> // highly-contended fetch_and_add test, %3.96 improvements with 8 threads
>> unsigned long res = 0;
>> unsigned long sum = 0;
>> extern unsigned long aarch64_atomic_fetch_add_8_default_impl(void *ptr, unsigned long val);
>> 
>> void *executor (void *arg)
>> {
>>   for (int i = 0; i < ITER; i++) {
>>     sum += aarch64_atomic_fetch_add_8_default_impl(&res, 1);
>>   }
>> }
>> 
>> int main(int argc, char **args)
>> {
>>   int i;
>>   pthread_t exethreads[THREADS];
>>   int threads = atoi(args[1]);
>> 
>>   for (i = 0; i < threads; i++)
>>     pthread_create(&exethreads[i], NULL, executor, NULL);
>>   for (i = 0; i < threads; i++)
>>     pthread_join(exethreads[i], NULL);
>> 
>>   return 0;
>> }
>> While we didn't have any noticeable improvements with JAVA tests we tried, e.g. `test/micro/org/openjdk/bench/vm/gc/Alloc.java`.
>> Seems the percentages of the atomics are too low, not to mention the actual real-world use case, e.g. Spark, Tomcat.
>> 
>> I guess we do not have to change the `ldaxr` to `ldxr` now, due to we haven't seen any significant performance enhancements yet. :-)
>> But I feel a little inconsistent that we have code use the stronger semantics, while a weaker instruction can still provide the correct order semantics.
>
> OK, withdrawing... Thank you all for the time.

> _Mailing list message from [Andrew Dinn](mailto:adinn at redhat.com) on [hotspot-dev](mailto:hotspot-dev at openjdk.java.net):_
> 
> I believe it would be better in future to perform investigations when
> there is a real world use case indicating there is a problem to be fixed
> and that the fix will make a difference. Looking into this change has
> taken quite a lot of your time and a lot of reviewer time. That time
> could probably have been better spent looking into other things.

I don't completely agree. It seems to make sense to concentrate only on measurable things. However, efficient systems are composed from thousands of tiny optimizations, each one of which may be too small to measure on its own. The AArch64 HotSpot we have today would be less efficient if we'd insisted on strict performance justification for each little optimization.

My misgivings about this change are due to not being entirely convinced we've found all the corner cases, and a worry (born of experience!) that some AArch64 implementations may be pushed into marginal behaviour. Even though such misbehaviour is unlikely today, I'm not sure it's worth the risk.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2788