RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

Sat Feb 6 09:27:41 UTC 2021

On Fri, 5 Feb 2021 18:56:46 GMT, Andrew Haley <aph at openjdk.org> wrote:

> Go back a few years, and there were simple atomic load/store exclusive
> instructions on Arm. Say you want to do an atomic increment of a
> counter. You'd do an atomic load to get the counter into your local cache
> in exclusive state, increment that counter locally, then write that
> incremented counter back to memory with an atomic store. All the time
> that cache line was in exclusive state, so you're guaranteed that
> no-one else changed anything on that cache line while you had it.
> 
> This is hard to scale on a very large system (e.g. Fugaku) because if
> many processors are incrementing that counter you get a lot of cache
> line ping-ponging between cores.
> 
> So, Arm decided to add a locked memory increment instruction that
> works without needing to load an entire line into local cache. It's a
> single instruction that loads, increments, and writes back. The secret
> is to send a cache control message to whichever processor owns the
> cache line containing the count, tell that processor to increment the
> counter and return the incremented value. That way cache coherency
> traffic is minimized. This new set of instructions is known as Large
> System Extensions, or LSE.
> 
> Unfortunately, in recent processors, the "old" load/store exclusive
> instructions sometimes perform very badly. Therefore, it's now
> necessary for software to detect which version of Arm it's running
> on, and use the "new" LSE instructions if they're available. Otherwise
> performance can be very poor under heavy contention.
> 
> GCC's -moutline-atomics does this by providing library calls which use
> LSE if it's available, but this option is only provided on newer
> versions of GCC. This is particularly problematic with older versions
> of OpenJDK, which build using old GCC versions.
> 
> Also, I suspect that some other operating systems could use this.
> Perhaps not MacOS, given that all Apple CPUs support LSE, but
> maybe Windows.
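
To make the detection step in the quoted text concrete, here is a minimal sketch for Linux on AArch64. It is not the code in the patch: `cpu_has_lse` is a hypothetical name, and the sketch assumes the kernel advertises LSE through the `HWCAP_ATOMICS` bit of `AT_HWCAP`. On any other platform it just answers false.

```cpp
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_ATOMICS
#endif

// Hypothetical helper: true when the kernel reports the LSE atomics.
// Assumption: only Linux/AArch64 is probed; everything else reports false.
bool cpu_has_lse() {
#if defined(__linux__) && defined(__aarch64__)
  return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
  return false;
#endif
}
```

Other operating systems would need their own query in place of `getauxval`.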

With regard to performance, the overhead of the `call ... ret` sequence seems to be almost negligible on the systems I've tested.
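
As a rough illustration of where that `call ... ret` comes from, the sketch below routes an atomic add through a function pointer that is chosen once at startup. All names here are hypothetical and this is portable C++, not the stubs in the patch: the CAS loop merely stands in for the load/store exclusive path, and `fetch_add` for the single-instruction path.

```cpp
#include <atomic>
#include <cstdint>

using fetch_add_fn = int64_t (*)(std::atomic<int64_t>*, int64_t);

// Stand-in for the "old" style: retry until our update wins.
static int64_t fetch_add_retry_loop(std::atomic<int64_t>* p, int64_t v) {
  int64_t old = p->load(std::memory_order_relaxed);
  while (!p->compare_exchange_weak(old, old + v,
                                   std::memory_order_seq_cst,
                                   std::memory_order_relaxed)) {
    // 'old' was refreshed by the failed exchange; try again.
  }
  return old;
}

// Stand-in for the "new" style: one read-modify-write operation.
static int64_t fetch_add_single_op(std::atomic<int64_t>* p, int64_t v) {
  return p->fetch_add(v, std::memory_order_seq_cst);
}

// Default to the conservative implementation.
static fetch_add_fn fetch_add_impl = fetch_add_retry_loop;

// Assumption: called once during startup, before other threads exist.
void select_fetch_add(bool use_lse) {
  fetch_add_impl = use_lse ? fetch_add_single_op : fetch_add_retry_loop;
}

// Every caller pays one indirect call and return, nothing more.
int64_t atomic_fetch_add(std::atomic<int64_t>* p, int64_t v) {
  return fetch_add_impl(p, v);
}
```

Both implementations return the value held before the add, so callers cannot tell which one was selected.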

On ThunderX2, there is little difference, whatever you do. A straight-line count and increment loop is about 5% slower with this patch.

On Neoverse N1 there is some 25% straight-line improvement for a simple count and increment loop with this patch. GCC's -moutline-atomics isn't quite as good as this patch, with only a 17% improvement.

But simple straight-line tests aren't really the point of LSE. The big performance hit with the "old" atomics happens at times of heavy contention, when fairness problems cause severe scaling issues. This is more likely to be a problem on large systems with many cores and large heaps.
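
For anyone who wants to look at the contended case, something along these lines would do as a starting point. It is a minimal sketch, not the test used for the timings below; `contended_count` is a hypothetical name, and which instructions the compiler emits for `fetch_add` depends on the target flags (for example whether LSE is enabled or `-moutline-atomics` is in effect).

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: num_threads threads all increment one shared
// counter, so every increment contends for the same cache line.
// Returns the final counter value.
uint64_t contended_count(unsigned num_threads, uint64_t iters_per_thread) {
  std::atomic<uint64_t> counter{0};
  std::vector<std::thread> workers;
  workers.reserve(num_threads);
  for (unsigned t = 0; t < num_threads; ++t) {
    workers.emplace_back([&counter, iters_per_thread] {
      for (uint64_t i = 0; i < iters_per_thread; ++i) {
        counter.fetch_add(1, std::memory_order_relaxed);
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return counter.load();
}
```

Timing this with an increasing thread count, once per build configuration, should show how each flavour of atomics scales on a given machine.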

**ThunderX2:**

Baseline:

real	0m24.001s

Patched:

-XX:+UseLSE

real	0m25.222s

-XX:-UseLSE

real	0m25.215s

Built with -moutline-atomics:

real	0m25.227s

**Neoverse N1:**

Baseline:

real	0m10.027s

Patched:

-XX:+UseLSE

real	0m8.027s

-XX:-UseLSE

real	0m10.429s

Built with -moutline-atomics:

real	0m8.538s

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434