RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

Andrew Dinn adinn at openjdk.java.net
Mon Feb 8 14:54:44 UTC 2021


On Mon, 8 Feb 2021 12:21:22 GMT, Andrew Dinn <adinn at openjdk.org> wrote:

>> Go back a few years, and there were simple atomic load/store exclusive
>> instructions on Arm. Say you want to do an atomic increment of a
>> counter. You'd do an atomic load to get the counter into your local cache
>> in exclusive state, increment that counter locally, then write that
>> incremented counter back to memory with an atomic store. All the time
>> that cache line was in exclusive state, so you're guaranteed that
>> no-one else changed anything on that cache line while you had it.
>> 
>> This is hard to scale on a very large system (e.g. Fugaku) because if
>> many processors are incrementing that counter you get a lot of cache
>> line ping-ponging between cores.
>> 
>> So, Arm decided to add a locked memory increment instruction that
>> works without needing to load an entire line into local cache. It's a
>> single instruction that loads, increments, and writes back. The secret
>> is to send a cache control message to whichever processor owns the
>> cache line containing the count, tell that processor to increment the
>> counter and return the incremented value. That way cache coherency
>> traffic is minimized. This new set of instructions is known as Large
>> System Extensions, or LSE.
>> 
>> Unfortunately, on recent processors the "old" load/store exclusive
>> instructions sometimes perform very badly. Therefore, it's now
>> necessary for software to detect which version of Arm it's running
>> on, and use the "new" LSE instructions if they're available. Otherwise
>> performance can be very poor under heavy contention.
>> 
>> GCC's -moutline-atomics does this by providing library calls which use
>> LSE if it's available, but this option is only provided on newer
>> versions of GCC. This is particularly problematic with older versions
>> of OpenJDK, which build using old GCC versions.
>> 
>> Also, I suspect that some other operating systems could use this.
>> Perhaps not macOS, given that all Apple CPUs support LSE, but
>> maybe Windows.
>
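
As an aside, the run-time selection described above can be sketched portably. This is only an illustration, not the code in the PR: fetch_add_retry_loop, fetch_add_single_op, select_fetch_add and the cpu_has_lse flag are hypothetical names, and std::atomic stands in for what would really be hand-written assembly stubs. The retry loop mimics the load/store exclusive pattern, and the single fetch_add mimics an LSE read-modify-write instruction.

```cpp
#include <atomic>
#include <cstdint>

// Shared signature: both versions return the value held before the add.
using fetch_add_4_fn = uint32_t (*)(std::atomic<uint32_t>* dest,
                                    uint32_t add_value);

// Stand-in for the load/store exclusive style: read the value, compute
// the sum, try to publish it, and retry if another thread interfered.
inline uint32_t fetch_add_retry_loop(std::atomic<uint32_t>* dest,
                                     uint32_t add_value) {
  uint32_t old_value = dest->load(std::memory_order_relaxed);
  while (!dest->compare_exchange_weak(old_value, old_value + add_value,
                                      std::memory_order_seq_cst,
                                      std::memory_order_relaxed)) {
    // On failure old_value now holds the current contents; try again.
  }
  return old_value;
}

// Stand-in for the LSE style: a single read-modify-write operation.
inline uint32_t fetch_add_single_op(std::atomic<uint32_t>* dest,
                                    uint32_t add_value) {
  return dest->fetch_add(add_value, std::memory_order_seq_cst);
}

// Hypothetical selection step, run once at startup. How cpu_has_lse is
// obtained (e.g. from OS-reported CPU features) is left out on purpose.
inline fetch_add_4_fn select_fetch_add(bool cpu_has_lse) {
  return cpu_has_lse ? fetch_add_single_op : fetch_add_retry_loop;
}
```

Callers would store the result of select_fetch_add() in a function pointer once and call through it afterwards, so the feature check is not repeated on every atomic operation.
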
> src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp line 95:
> 
>> 93: template<>
>> 94: template<typename D, typename I>
>> 95: inline D Atomic::PlatformAdd<4>::fetch_and_add(D volatile* dest, I add_value,
> 
> It may be possible to avoid the cut-and-paste involved in declaring almost exactly the same template body for byte_size == 4 and 8. A generic template could include a function-typed element, supplemented with two auxiliary templates that instantiate that element with either aarch64_atomic_fetch_add_4_impl or aarch64_atomic_fetch_add_8_impl. That would mean more templates but less repetition. On the whole I prefer fewer templates, so perhaps this is best left as is.

n.b. the same comment applies to the cut-and-paste for PlatformXchg and PlatformCmpxchg.
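
For reference, a rough sketch of the alternative I have in mind, assuming GCC or Clang (for the __atomic_fetch_add builtin) and integer operands only. Everything here is hypothetical: FetchAddVia, PlatformAddSketch, fetch_add_stub_t and the two *_stub functions are stand-ins, not names from the patch. If the real _impl symbols are function pointer variables set at startup rather than plain functions, the template parameter would have to be a reference to that pointer instead; that detail is omitted.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stub signature: returns the old value widened to 64 bits.
typedef uint64_t (*fetch_add_stub_t)(volatile void* dest, uint64_t add_value);

// Stand-ins for the 4-byte and 8-byte stubs.
inline uint64_t fetch_add_4_stub(volatile void* dest, uint64_t add_value) {
  return __atomic_fetch_add(static_cast<volatile uint32_t*>(dest),
                            static_cast<uint32_t>(add_value),
                            __ATOMIC_SEQ_CST);
}

inline uint64_t fetch_add_8_stub(volatile void* dest, uint64_t add_value) {
  return __atomic_fetch_add(static_cast<volatile uint64_t*>(dest),
                            add_value, __ATOMIC_SEQ_CST);
}

// Generic body, written once. The stub is a non-type template parameter.
template<fetch_add_stub_t stub>
struct FetchAddVia {
  template<typename D, typename I>
  D fetch_and_add(D volatile* dest, I add_value) const {
    // Widen the operand, call the size-specific stub, narrow the result.
    return static_cast<D>(stub(dest, static_cast<uint64_t>(add_value)));
  }
};

// The two auxiliary templates: each size only names its stub.
template<size_t byte_size> struct PlatformAddSketch;
template<> struct PlatformAddSketch<4> : FetchAddVia<fetch_add_4_stub> {};
template<> struct PlatformAddSketch<8> : FetchAddVia<fetch_add_8_stub> {};
```

The body exists once, at the cost of an extra base template and two one-line specializations, which is the trade-off mentioned above.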

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434


More information about the hotspot-dev mailing list