RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code [v2]

Tue Feb 9 11:15:31 UTC 2021

On Mon, 8 Feb 2021 18:50:09 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> Go back a few years, and there were simple atomic load/store exclusive
>> instructions on Arm. Say you want to do an atomic increment of a
>> counter. You'd do an atomic load to get the counter into your local cache
>> in exclusive state, increment that counter locally, then write that
>> incremented counter back to memory with an atomic store. All the time
>> that cache line was in exclusive state, so you're guaranteed that
>> no-one else changed anything on that cache line while you had it.
>> 
>> This is hard to scale on a very large system (e.g. Fugaku) because if
>> many processors are incrementing that counter you get a lot of cache
>> line ping-ponging between cores.
>> 
>> So, Arm decided to add a locked memory increment instruction that
>> works without needing to load an entire line into local cache. It's a
>> single instruction that loads, increments, and writes back. The secret
>> is to send a cache control message to whichever processor owns the
>> cache line containing the count, tell that processor to increment the
>> counter and return the incremented value. That way cache coherency
>> traffic is minimized. This new set of instructions is known as Large
>> System Extensions, or LSE.
>> 
>> Unfortunately, in recent processors, the "old" load/store exclusive
>> instructions sometimes perform very badly. Therefore, it's now
>> necessary for software to detect which version of Arm it's running
>> on, and use the "new" LSE instructions if they're available. Otherwise
>> performance can be very poor under heavy contention.
>> 
>> GCC's -moutline-atomics does this by providing library calls which use
>> LSE if it's available, but this option is only provided on newer
>> versions of GCC. This is particularly problematic with older versions
>> of OpenJDK, which build using old GCC versions.
>> 
>> Also, I suspect that some other operating systems could use this.
>> Perhaps not MacOS, given that all Apple CPUs support LSE, but
>> maybe Windows.
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Review changes

Hi Andrew,

I'm happy to see this change after the [long](https://mail.openjdk.java.net/pipermail/hotspot-dev/2019-November/039930.html) and [tedious](https://mail.openjdk.java.net/pipermail/hotspot-dev/2019-November/039932.html) discussions we had about preferring C++ intrinsics over inline assembly :)

In general I'm fine with the change. Some of the previous C++ intrinsics (e.g. `__atomic_exchange_n` and `__atomic_add_fetch`) were called with `__ATOMIC_RELEASE` semantics, which have now been dropped in the new implementation. But I think that's safe and a nice "optimization" because the instructions are followed by a full membar anyway.
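
For readers following along, the two shapes being compared can be sketched roughly as below. This is only an illustration, not the code from the PR: the function names are hypothetical, and `__atomic_thread_fence(__ATOMIC_SEQ_CST)` is used as an assumed stand-in for HotSpot's full membar.

```cpp
#include <cstdint>

// Hypothetical "before" shape: the intrinsic carries release semantics
// and is then followed by a full barrier.
inline uint64_t add_fetch_release_then_fence(volatile uint64_t* dest,
                                             uint64_t add_value) {
  uint64_t result = __atomic_add_fetch(dest, add_value, __ATOMIC_RELEASE);
  __atomic_thread_fence(__ATOMIC_SEQ_CST);  // assumed stand-in for the full membar
  return result;
}

// Hypothetical "after" shape: the read-modify-write itself is relaxed,
// and ordering is left entirely to the trailing full barrier.
inline uint64_t add_fetch_relaxed_then_fence(volatile uint64_t* dest,
                                             uint64_t add_value) {
  uint64_t result = __atomic_add_fetch(dest, add_value, __ATOMIC_RELAXED);
  __atomic_thread_fence(__ATOMIC_SEQ_CST);  // assumed stand-in for the full membar
  return result;
}
```

Both variants return the updated value; they differ only in the ordering attached to the read-modify-write itself, which is the point in question above.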

One question I still have is why we need the default assembler implementations at all. As far as I can see, the MacroAssembler already dispatches based on LSE availability. So why can't we just use the generated stubs exclusively? This would also solve the platform problems with assembler code.
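
To make the question concrete, here is a minimal sketch of dispatching through a function pointer that starts out at a default implementation and can later be redirected to a stub. All names (`fetch_add_8_impl`, `install_lse_stub_if_available`, and so on) are hypothetical and not taken from the PR, and the `getauxval(AT_HWCAP) & HWCAP_ATOMICS` check assumes Linux on AArch64; on any other platform the sketch simply keeps the default.

```cpp
#include <cstdint>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_ATOMICS
#endif

// Hypothetical signature shared by the default implementation and any stub.
typedef uint64_t (*fetch_add_8_fn)(volatile uint64_t* ptr, uint64_t value);

// Default implementation, usable before any stub has been generated.
static uint64_t fetch_add_8_default(volatile uint64_t* ptr, uint64_t value) {
  uint64_t old = __atomic_fetch_add(ptr, value, __ATOMIC_RELAXED);
  __atomic_thread_fence(__ATOMIC_SEQ_CST);  // assumed stand-in for a full membar
  return old;
}

// The dispatch pointer starts at the default and may be replaced later.
static fetch_add_8_fn fetch_add_8_impl = fetch_add_8_default;

// Assumption: LSE availability is read from the Linux auxiliary vector.
static bool cpu_has_lse() {
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_ATOMICS)
  return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
  return false;
#endif
}

// Redirect the pointer to a stub only when the CPU reports LSE support.
static void install_lse_stub_if_available(fetch_add_8_fn stub) {
  if (stub != nullptr && cpu_has_lse()) {
    fetch_add_8_impl = stub;
  }
}

// Callers always go through the pointer, whichever target is installed.
static inline uint64_t atomic_fetch_add_8(volatile uint64_t* ptr, uint64_t value) {
  return fetch_add_8_impl(ptr, value);
}
```

In this shape, any caller that runs before `install_lse_stub_if_available` has been invoked still gets a working atomic through the default target.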

Finally, I didn't fully understand how you measured the `call..ret` overhead, or what the "*simple straight-line test*" is that you posted performance numbers for.

Other than that, the change looks fine to me.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434