RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

Mon Feb 8 14:54:43 UTC 2021

On Fri, 5 Feb 2021 18:56:46 GMT, Andrew Haley <aph at openjdk.org> wrote:

> Go back a few years, and there were simple atomic load/store exclusive
> instructions on Arm. Say you want to do an atomic increment of a
> counter. You'd do an atomic load to get the counter into your local cache
> in exclusive state, increment that counter locally, then write that
> incremented counter back to memory with an atomic store. All the time
> that cache line was in exclusive state, so you're guaranteed that
> no-one else changed anything on that cache line while you had it.
> 
> This is hard to scale on a very large system (e.g. Fugaku) because if
> many processors are incrementing that counter you get a lot of cache
> line ping-ponging between cores.
> 
> So, Arm decided to add a locked memory increment instruction that
> works without needing to load an entire line into local cache. It's a
> single instruction that loads, increments, and writes back. The secret
> is to send a cache control message to whichever processor owns the
> cache line containing the count, tell that processor to increment the
> counter and return the incremented value. That way cache coherency
> traffic is minimized. This new set of instructions is known as Large
> System Extensions, or LSE.
> 
> Unfortunately, in recent processors, the "old" load/store exclusive
> instructions sometimes perform very badly. Therefore, it's now
> necessary for software to detect which version of Arm it's running
> on, and use the "new" LSE instructions if they're available. Otherwise
> performance can be very poor under heavy contention.
> 
> GCC's -moutline-atomics does this by providing library calls which use
> LSE if it's available, but this option is only provided on newer
> versions of GCC. This is particularly problematic with older versions
> of OpenJDK, which build using old GCC versions.
> 
> Also, I suspect that some other operating systems could use this.
> Perhaps not macOS, given that all Apple CPUs support LSE, but
> maybe Windows.
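
For anyone following along, the two ways of incrementing a counter described above can be sketched in portable C++. This is only an illustration using std::atomic, not code from the patch, and the function names are made up for the example. The retry loop stands in for a load/store exclusive sequence. The single fetch_add is the kind of operation a compiler can lower to one LSE instruction when it targets Armv8.1-A or uses outline atomics.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Retry-loop increment: a portable analogue of a load-exclusive /
// store-exclusive sequence. The loop repeats whenever another thread
// changed the counter between our load and our store attempt.
inline uint64_t increment_with_retry_loop(std::atomic<uint64_t>& counter) {
  uint64_t observed = counter.load(std::memory_order_relaxed);
  while (!counter.compare_exchange_weak(observed, observed + 1,
                                        std::memory_order_seq_cst,
                                        std::memory_order_relaxed)) {
    // compare_exchange_weak refreshed 'observed' with the current value;
    // try again with the new value.
  }
  return observed + 1;
}

// Single read-modify-write increment: one atomic operation with no
// retry loop visible in the source.
inline uint64_t increment_with_single_rmw(std::atomic<uint64_t>& counter) {
  return counter.fetch_add(1, std::memory_order_seq_cst) + 1;
}

// Usage check: both variants return the incremented value.
inline void increment_example() {
  std::atomic<uint64_t> counter{41};
  assert(increment_with_retry_loop(counter) == 42);
  assert(increment_with_single_rmw(counter) == 43);
}
```

Both functions produce the same result. The difference under contention lies in how the hardware executes them, which this sketch does not measure.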

It would help to change the name of old_value to value. Otherwise this is OK as is.

src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp line 88:

> 86:   template<typename D, typename I>
> 87:   D add_and_fetch(D volatile* dest, I add_value, atomic_memory_order order) const {
> 88:     D old_value = fetch_and_add(dest, add_value, order) + add_value;

I'm not sure this should be called old_value. It actually holds old_value + add_value, so it really ought to be called either new_value or just value.
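
To make the naming point concrete, here is a minimal standalone sketch. It uses std::atomic rather than the HotSpot Atomic classes, and both free functions are hypothetical stand-ins for the members in the patch.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for fetch_and_add: returns the value the
// destination held *before* the addition.
template <typename D, typename I>
inline D fetch_and_add(std::atomic<D>& dest, I add_value) {
  return dest.fetch_add(static_cast<D>(add_value));
}

// Hypothetical stand-in for add_and_fetch: returns the value the
// destination holds *after* the addition, so the local is named 'value'
// rather than 'old_value'.
template <typename D, typename I>
inline D add_and_fetch(std::atomic<D>& dest, I add_value) {
  D value = fetch_and_add(dest, add_value) + static_cast<D>(add_value);
  return value;
}

// Usage check: fetch_and_add yields the old value, add_and_fetch the new one.
inline void naming_example() {
  std::atomic<int> counter{10};
  assert(fetch_and_add(counter, 5) == 10);
  assert(add_and_fetch(counter, 5) == 20);
}
```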

src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp line 95:

> 93: template<>
> 94: template<typename D, typename I>
> 95: inline D Atomic::PlatformAdd<4>::fetch_and_add(D volatile* dest, I add_value,

It may be possible to avoid the cut-and-paste involved in declaring almost exactly the same template body for byte_size == 4 and 8. A generic template could take the stub function as a template parameter, with two auxiliary templates instantiating it with either aarch64_atomic_fetch_add_4_impl or aarch64_atomic_fetch_add_8_impl. That would mean more templates but less repetition. On the whole I prefer fewer templates, so perhaps this is best left as is.
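
Roughly what I have in mind is sketched below. Everything here is hypothetical: FetchAddVia and PlatformAddSketch are invented names, and the two stub functions are placeholders built on the GCC/Clang __atomic_fetch_add builtin, not the real aarch64_atomic_fetch_add_4_impl and aarch64_atomic_fetch_add_8_impl, whose signatures I have not tried to reproduce.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Placeholder stubs (assumption: GCC/Clang __atomic builtins are available).
inline uint32_t stub_fetch_add_4(volatile uint32_t* ptr, uint32_t v) {
  return __atomic_fetch_add(ptr, v, __ATOMIC_SEQ_CST);
}
inline uint64_t stub_fetch_add_8(volatile uint64_t* ptr, uint64_t v) {
  return __atomic_fetch_add(ptr, v, __ATOMIC_SEQ_CST);
}

// Generic body written once; the stub to call is a template parameter.
template <typename T, T (*FetchAddImpl)(volatile T*, T)>
struct FetchAddVia {
  template <typename D, typename I>
  static D fetch_and_add(D volatile* dest, I add_value) {
    static_assert(sizeof(D) == sizeof(T), "operand size must match the stub");
    T old = FetchAddImpl(reinterpret_cast<volatile T*>(dest),
                         static_cast<T>(add_value));
    return static_cast<D>(old);
  }

  template <typename D, typename I>
  static D add_and_fetch(D volatile* dest, I add_value) {
    D value = fetch_and_add(dest, add_value) + static_cast<D>(add_value);
    return value;
  }
};

// Two auxiliary specializations select the stub by operand size.
template <size_t byte_size> struct PlatformAddSketch;
template <> struct PlatformAddSketch<4>
    : FetchAddVia<uint32_t, stub_fetch_add_4> {};
template <> struct PlatformAddSketch<8>
    : FetchAddVia<uint64_t, stub_fetch_add_8> {};

// Usage check for both operand sizes.
inline void generic_add_example() {
  int32_t small = 7;
  int64_t large = 100;
  assert(PlatformAddSketch<4>::fetch_and_add(&small, 3) == 7);
  assert(PlatformAddSketch<8>::add_and_fetch(&large, 5) == 105);
}
```

This removes the duplicated body at the cost of an extra layer of templates, which is the trade-off mentioned above.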

-------------

Marked as reviewed by adinn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/2434