RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

Mon Feb 8 09:55:00 UTC 2021

On Mon, 8 Feb 2021 08:35:49 GMT, Nick Gasson <ngasson at openjdk.org> wrote:

>> Go back a few years, and there were simple atomic load/store exclusive
>> instructions on Arm. Say you want to do an atomic increment of a
>> counter. You'd do an atomic load to get the counter into your local cache
>> in exclusive state, increment that counter locally, then write that
>> incremented counter back to memory with an atomic store. All the time
>> that cache line was in exclusive state, so you're guaranteed that
>> no-one else changed anything on that cache line while you had it.
>> 
>> This is hard to scale on a very large system (e.g. Fugaku) because if
>> many processors are incrementing that counter you get a lot of cache
>> line ping-ponging between cores.
>> 
>> So, Arm decided to add a locked memory increment instruction that
>> works without needing to load an entire line into local cache. It's a
>> single instruction that loads, increments, and writes back. The secret
>> is to send a cache control message to whichever processor owns the
>> cache line containing the count, tell that processor to increment the
>> counter and return the incremented value. That way cache coherency
>> traffic is minimized. This new set of instructions is known as Large
>> System Extensions, or LSE.
>> 
>> Unfortunately, in recent processors, the "old" load/store exclusive
>> instructions sometimes perform very badly. Therefore, it's now
>> necessary for software to detect which version of Arm it's running
>> on, and use the "new" LSE instructions if they're available. Otherwise
>> performance can be very poor under heavy contention.
>> 
>> GCC's -moutline-atomics does this by providing library calls which use
>> LSE if it's available, but this option is only provided on newer
>> versions of GCC. This is particularly problematic with older versions
>> of OpenJDK, which build using old GCC versions.
>> 
>> Also, I suspect that some other operating systems could use this.
>> Perhaps not macOS, given that all Apple CPUs support LSE, but
>> maybe Windows.
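
For readers following along, the difference between the two styles described above can be sketched in portable C++. This is only an illustration, not code from the patch: the function names are hypothetical, and the retry loop merely stands in for a load/store exclusive sequence. How each function is lowered depends on the compiler and its target flags; on AArch64 a compiler that is allowed to use LSE may turn the second form into a single atomic add instruction.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: increment by reading the value, computing the new
// value locally, and retrying if another thread changed the counter in the
// meantime. This has the same overall shape as a load-exclusive /
// store-exclusive loop.
inline uint32_t increment_with_retry(std::atomic<uint32_t>& counter) {
  uint32_t old_value = counter.load(std::memory_order_relaxed);
  while (!counter.compare_exchange_weak(old_value, old_value + 1,
                                        std::memory_order_seq_cst,
                                        std::memory_order_relaxed)) {
    // On failure, old_value has been refreshed with the current contents,
    // so the loop simply tries again.
  }
  return old_value;  // value before the increment
}

// Hypothetical helper: increment as one read-modify-write operation.
inline uint32_t increment_single_op(std::atomic<uint32_t>& counter) {
  return counter.fetch_add(1, std::memory_order_seq_cst);  // value before the increment
}
```

Both helpers return the previous value, so they are interchangeable from the caller's point of view; only the amount of retrying under contention differs.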
>
> src/hotspot/cpu/aarch64/atomic_aarch64.S line 35:
> 
>> 33:         ret
>> 34: 
>> 35:         .globl aarch64_atomic_fetch_add_4_default_impl
> 
> The N1 optimisation guide suggests aligning branch targets on 32 byte boundaries.

OK.

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5579:
> 
>> 5577:   //
>> 5578:   // If LSE is in use, generate LSE versions of all the stubs. The
>> 5579:   // non-LSE versions are in atomic_aarch64.S.
> 
> IMO it would be better for maintainability if the LSE versions were in atomic_aarch64.S too (with an explicit `.arch armv8-a+lse` directive). Is there any reason to generate them here, other than to support old toolchains? As far as I can tell GNU as supported LSE as far back as binutils 2.27.
> 
> https://sourceware.org/binutils/docs-2.27/as/AArch64-Extensions.html

I can't see any reason to do this. There'd be no benefit to moving this stuff, and it would be harder to change in the future. I'd do the whole lot as runtime stubs if I could, but they're needed before VM startup.
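
Roughly, the shape of the arrangement is a function pointer that starts out pointing at a default implementation that is always available, and is repointed once replacement stubs exist. The sketch below is a simplified illustration under assumptions, not the code in the patch: every name in it is hypothetical, the "replacement" is an ordinary function standing in for a generated stub, and it relies on the GCC/Clang `__atomic_fetch_add` builtin.

```cpp
#include <cstdint>

// Hypothetical signature for a 4-byte fetch-and-add entry point.
using fetch_add_4_fn = uint32_t (*)(volatile uint32_t* ptr, uint32_t delta);

// Default implementation, usable before any stubs have been generated.
// (Assumption: GCC/Clang builtin; the real default lives in assembly.)
static uint32_t fetch_add_4_default(volatile uint32_t* ptr, uint32_t delta) {
  return __atomic_fetch_add(ptr, delta, __ATOMIC_SEQ_CST);
}

// Stand-in for a stub generated at startup. In this sketch it behaves the
// same as the default; only the dispatch mechanism is being illustrated.
static uint32_t fetch_add_4_replacement(volatile uint32_t* ptr, uint32_t delta) {
  return __atomic_fetch_add(ptr, delta, __ATOMIC_SEQ_CST);
}

// Callers always go through this pointer, so early callers get the default.
static fetch_add_4_fn fetch_add_4_impl = fetch_add_4_default;

// Hypothetical hook: called once the CPU features are known.
void install_replacement_stubs(bool cpu_has_lse) {
  if (cpu_has_lse) {
    fetch_add_4_impl = fetch_add_4_replacement;
  }
}

// Entry point used by the rest of the program.
inline uint32_t atomic_fetch_add_4(volatile uint32_t* ptr, uint32_t delta) {
  return fetch_add_4_impl(ptr, delta);
}
```

The point of the indirection is that code running before startup has finished still has a working atomic operation, and nothing needs recompiling when the pointer is later switched.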

> src/hotspot/cpu/aarch64/atomic_aarch64.S line 1:
> 
>> 1: // Copyright (c) 2021, Red Hat Inc. All rights reserved.
> 
> Does this file work with the Windows assembler?

I have no idea. If it doesn't, please tell me; I have no Windows system.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434