RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code [v2]

Tue Feb 9 17:07:38 UTC 2021

On Tue, 9 Feb 2021 13:49:45 GMT, Andrew Haley <aph at openjdk.org> wrote:

> > I'm happy to see this change after the [long](https://mail.openjdk.java.net/pipermail/hotspot-dev/2019-November/039930.html) and [tedious](https://mail.openjdk.java.net/pipermail/hotspot-dev/2019-November/039932.html) discussions we had about preferring C++ intrinsic over inline assembly :)
> > In general I'm fine with the change. Some of the previous C++ intrinsics (e.g. `__atomic_exchange_n` and `__atomic_add_fetch`) were called with `__ATOMIC_RELEASE` semantics which has now been dropped in the new implementation. But I think that's safe and a nice "optimization" because the instructions are followed by a full membar anyway.
> 
> None of these sequences is ideal, so I'll follow up with some higher-performance LSE versions in a new patch.
> 
> > One question I still have is why we need the default assembler implementations at all. As far as I can see, the MacroAssembler already dispatches based on LSE availability. So why can't we just use the generated stubs exclusively? This would also solve the platform problems with assembler code.
> 
> We'd need an instance of Assembler very early, before the JVM is initialized. It could be done, but it would also a page of memory to be allocated early too. I did try, but it was rather awful. That is why I ended up with these simple bootstrapping versions of the atomics.
> 

OK, I see. Bootstrapping is more complex than I thought :)

But nevertheless I think implementing the default versions in native assembly isn't really simple and putting that Linux/gcc specific assembly code into the generic aarch64 directory `src/hotspot/cpu/aarch64` will break other aarch64 platforms like Windows and Mac. Why don't you leave the default implementation as simple wrappers for the C++ compiler intrinsics somewhere under `src/hotspot/os_cpu/linux_aarch64` and remove the:
    if (! UseLSE) {
      return;
    }
in `generate_atomic_entry_points()`? In that case, the intrinsic versions would really only be used for bootstrapping until the stubs have been generated. I don't see a good reason why we should maintain two different assembler implementations of the non-LSE atomics for aarch64 - one in native assembler and on in generated assembler.

> > Finally, I didn't fully understand how you've measured the `call..ret` overhead and what's the "_simple stright-line test_" you've posted performance numbers for.
> 
> That was just a counter in a loop. It's not terribly important for this case, given that the current code is way from optimal, but I wanted to know if the call...ret overhead could be measured. It can't, because it's swamped by the cost of the barriers.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434