RFC: linux-aarch64 and LSE support

Andrew Haley aph-open at littlepinkcloud.com
Tue Sep 20 11:15:27 UTC 2022


On 9/20/22 11:32, Kim Barrett wrote:
 >> On Sep 15, 2022, at 4:01 AM, Andrew Haley <aph-open at littlepinkcloud.com> wrote:
 >>
 >> On 9/15/22 01:16, Kim Barrett wrote:
 >>> I *think* it can be done as a sequence of commits to revert prior commits,
 >>> followed by a couple of newly developed commits to reinstate features like
 >>> support for relaxed atomics.
 >>
 >> I expect so.
 >
 > The reversions are indeed relatively straightforward.  There is one set of
 > conflicts for one of the commits, due to unrelated nearby changes in one file,
 > but that's relatively easy to deal with.
 >
 > But...
 >
 > Digging into this deeper, I've run across some problems.  Of course, it's
 > about memory_order_conservative.

No surprise.

Let's start with what we know with a high degree of confidence: the
assembly code in atomic_linux_aarch64.S is correct (with, as it turns
out below, one exception).

 > I know about the discussion here:
 > https://mail.openjdk.org/pipermail/hotspot-dev/2019-November/039912.html
 > (long discussion about implementing memory_order_conservative)
 >
 > which refers to this:
 > https://patchwork.kernel.org/patch/3575821/
 > This is the reference leading to the pre-JDK-8261027 implementation.
 >
 > To summarize, for memory_order_conservative with ll/sc-style atomics
 > - For cmpxchg, use <fence> <relaxed-cmpxchg> <fence>

I think you found a bug in the pre-LSE code.

aarch64_atomic_cmpxchg_8_default_impl:
#ifdef __ARM_FEATURE_ATOMICS
        mov     x3, x1
        casal   x3, x2, [x0]            // LSE path: acquire/release CAS
#else
        dmb     ish                     // leading full fence
        prfm    pstl1strm, [x0]
0:      ldxr    x3, [x0]                // relaxed load: should be ldaxr
        cmp     x3, x1
        b.ne    1f
        stxr    w8, x2, [x0]            // relaxed store: should be stlxr
        cbnz    w8, 0b
#endif
1:      dmb     ish                     // trailing full fence
        mov     x0, x3
        ret

This should be LDAXR;STLXR, i.e. the acquire/release ll/sc pair,
matching the ordering the CASAL path provides. CASAL is correct.

 > - For other ops, use <release-op> <fence>
 > That's what we were requesting before JDK-8261027, via __atomic intrinsics.
 > But -moutline-atomics changed things so we were potentially using the
 > corresponding LSE atomic instead of the ll/sc-style.
 >
 > For now I'm only going to discuss the other ops, not cmpxchg.  (cmpxchg is
 > its own ball of ugliness, but I think I know how to deal with it.)
 >
 > 8261027 introduced support for using LSE atomics.  There are some surprises:
 >
 > (A) The default (ll/sc) implementations (atomic_linux_aarch64.S) are all
 > "acq-rel" rather than "release".

Good.

 > (B) The generated LSE stubs for "add" variants are "acq-rel" rather than
 > "release".

Good.

 >  Meanwhile, the generated stubs for "xchg" variants are "release".

Really? I see

   void gen_swpal_entry(Assembler::operand_size size) {
     Register prev = r2, addr = c_rarg0, incr = c_rarg1;
     __ swpal(size, incr, prev, addr);                        // SWPAL: acq-rel swap
     __ membar(Assembler::StoreStore|Assembler::StoreLoad);   // trailing fence

That's acquire/release (SWPAL), followed by a trailing fence. What are
you looking at?
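
(For what it's worth, the add stubs have the same shape. This is from
memory, so treat it as a sketch rather than a quote; the real code is
next to gen_swpal_entry in stubGenerator_aarch64.cpp:

   void gen_ldaddal_entry(Assembler::operand_size size) {
     Register prev = r2, addr = c_rarg0, incr = c_rarg1;
     __ ldaddal(size, incr, prev, addr);                      // LDADDAL: acq-rel add
     __ membar(Assembler::StoreStore|Assembler::StoreLoad);   // trailing fence
     __ mov(r0, prev);                                        // return old value
     __ ret(lr);
   }

An acq-rel LSE op, no leading fence, a trailing full fence.)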

 > (A) seems like a mistake.  The inconsistency in (B) also seems like a
 > mistake. The question of which one is correct takes us to the next change.
 >
 > 8261649 is intended to optimize the use of LSE atomics, though the bug only
 > talks about cmpxchg.  (It also fixes a problem with the timing of stub
 > generation vs possible use, but that doesn't matter for this discussion.)
 >
 > There is a big comment in front of the new stub generation code, talking about
 > how an acq-rel operation doesn't need a preceding fence when using LSE
 > atomics.  I can see how that's very useful for cmpxchg.  (And the comment
 > mostly discusses cmpxchg.)  But I'm not certain of its relevance for other
 > operations.

It's the same for all ops. None of them need a preceding fence.

 > Can non-cmpxchg operations still be implemented as <release-op> <fence> when
 > (potentially) using LSE instructions?  It seems like an argument similar to
 > the one for ll/sc could be made.  If so, then we can use __atomic (and
 > -moutline-atomics) to easily implement them.  If not, that's a somewhat
 > unpleasant semantic change going from ll/sc to LSE.  (And probably breaks the
 > current bsd_aarch64 implementation.)  It also gives us a couple of choices for
 > implementation:
 >
 > (C) Use <acq-rel-op> <fence> with the op generated using __atomic (and using
 > -moutline-atomics), in case we end up using LSE.

Yes, for conservative, whether we're using LSE or not.

 > This accepts the unneeded
 > acquire if using ll/sc as a necessary evil for simplicity of implementation.
 > (It also matches the current code, though generated differently.)
 >
 > (D) Avoid -moutline-atomics, doing our own use-LSE dispatch to either ll/sc or
 > LSE implementations.

No, please use GCC -moutline-atomics. Its libgcc helpers test for LSE
support once at startup, so a single binary uses the LSE instructions
on hardware that has them and falls back to ll/sc everywhere else.

 > There are several ways to do this, including the approach taken by
 > the current code.
 >
 > So do we really need <acq-rel-op> <fence> when using LSE instructions?

For conservative, yes. Just use the C++ sequentially-consistent op,
followed by a full fence.
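
Concretely, something like this sketch (the function name is mine, not
HotSpot's). Compiled with -moutline-atomics, the builtin dispatches at
run time to LDADDAL on LSE hardware or an LDAXR/STLXR loop otherwise;
either way there is no leading fence:

  #include <stdint.h>

  // Sketch: conservative fetch-and-add = seq-cst builtin + trailing full fence.
  static inline int64_t fetch_add_conservative(volatile int64_t* dest,
                                               int64_t value) {
    int64_t old = __atomic_fetch_add(dest, value, __ATOMIC_SEQ_CST);
    __atomic_thread_fence(__ATOMIC_SEQ_CST);    // trailing dmb ish
    return old;
  }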

 > Or can we continue to use <release-op> <fence>?

No.

 >  I think there are also some problems with the big block comment for
 > the stub generation.  For example, I can't make sense of the
 > model-based test description.  For one thing, it talks about the
 > resulting X3 and X4, but those don't appear in the test code.

They do: W3 is the bottom half of X3. (Wn and Xn are the 32-bit and
64-bit views of the same register n.)

 > (I'm also not getting the point about a lack of
 > barrier-ordered-after, but maybe I'm just confused.)

The lack of barrier-ordered-after is why we need a fence after all ops
for conservative.

(In fact we don't: all Arm hardware in existence works without the
trailing fence. And probably always will. But the Arm spec allows
reordering, even though available hardware doesn't. This is annoying,
but that's life.)

------------------------------------------------------------------------

In summary, this is what you should do:

Use GCC's sequentially-consistent atomics, plus a trailing full fence,
for conservative. This includes add, cmpxchg, etc.

Use GCC atomics without a trailing fence for everything except
conservative: relaxed, etc.
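
In code, the same sketch style as above (names mine, not the final
patch):

  #include <stdint.h>

  // Conservative: seq-cst builtin followed by a full fence.
  static inline int64_t cmpxchg_conservative(volatile int64_t* dest,
                                             int64_t compare,
                                             int64_t exchange) {
    int64_t expected = compare;
    __atomic_compare_exchange_n(dest, &expected, exchange, /*weak*/false,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    __atomic_thread_fence(__ATOMIC_SEQ_CST);  // trailing dmb ish
    return expected;                          // previous value
  }

  // Everything except conservative: the matching builtin order, no fence.
  static inline int64_t xchg_relaxed(volatile int64_t* dest, int64_t value) {
    return __atomic_exchange_n(dest, value, __ATOMIC_RELAXED);
  }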

------------------------------------------------------------------------

That should be all.

Thanks,

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


