From thomas.stuefe at gmail.com  Sat Mar 11 10:18:32 2023
From: thomas.stuefe at gmail.com (Thomas Stüfe)
Date: Sat, 11 Mar 2023 11:18:32 +0100
Subject: Question about CAS via LDREX/STREX on 32-bit arm
Message-ID: <CAA-vtUwB+tS_gUxokZz2g2S6YF8mGu7=NyNS2u6kKd-xKxWQDw@mail.gmail.com>

Hi ARM experts,

I am trying to understand how CAS is implemented on arm; in particular,
"MacroAssembler::atomic_cas_bool":

MacroAssembler::atomic_cas_bool

```
    assert_different_registers(tmp_reg, oldval, newval, base);
    Label loop;
    bind(loop);
A   ldrex(tmp_reg, Address(base, offset));
B   subs(tmp_reg, tmp_reg, oldval);
C   strex(tmp_reg, newval, Address(base, offset), eq);
D   cmp(tmp_reg, 1, eq);
E   b(loop, eq);
F   cmp(tmp_reg, 0);
    if (tmpreg == noreg) {
      pop(tmp_reg);
    }
```

It uses LDREX and STREX to perform a CAS of *(base+offset) from oldval to
newval. It does so in a loop. The code distinguishes two failures: STREX
failing, and a "semantically failed" CAS.

Here is what I think this code does:

A) LDREX: tmp=*(base+offset)
B) tmp -= oldvalue
   If *(base+offset) still held oldval, tmp_reg is now 0 and Z is 1
C) If Z is 1: STREX the new value: *(base+offset)=newval. Otherwise, omit.
   After this, tmp_reg is 0 if the store succeeded and 1 if the store
failed.
D) Here, tmp_reg is: 0 if the store succeeded, 1 if it failed, 1...n if
*(base+offset) had been modified before LDREX.
   We now compare with 1 and ...
E) ...repeat the loop if tmp_reg was 1

So we loop until either *(base+offset) has been changed concurrently to some
other value before our LDREX, or our store has succeeded.

I wondered what the loop guards against, and why it would sometimes be okay
to omit it.

IIUC, STREX fails if the core has lost its exclusive access to the memory
location since the LDREX. This can be one of three things, right?
- another core slipped in an LDREX+STREX to the same location between our
LDREX and STREX
- Or we were context-switched to another thread or process. I assume the
kernel does a CLREX then, right? Otherwise, how could you prevent a sequence
like "LDREX(p1) -> switch -> LDREX(p2) -> switch back -> STREX(p1)"? If I
understand the ARM manual [1] correctly, a STREX to a different location
than the preceding LDREX is undefined.
- Or we had a signal after the LDREX and did a second LDREX in the signal
handler. Does the kernel do a CLREX when invoking a signal handler?
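A variant of the context-switch sequence, with both threads using the same address, can be sketched with a toy model of one core's local exclusive monitor. The names (`LocalMonitor`, `interleaving`, `clear_on_switch`) are hypothetical, not ARM, kernel or HotSpot APIs. The model marks one exact address, and a STREX to an unmarked address is modeled as failing, which the architecture does not guarantee. `clear_on_switch` stands for the assumption in the question, that the kernel clears the monitor on a context switch; the sketch does not establish whether Linux actually does so.

```cpp
#include <cassert>
#include <cstdint>

using std::uint32_t;

// Toy model of ONE core's local exclusive monitor (hypothetical names).
// Simplifications: the mark covers exactly one address, and a STREX to any
// other address fails.
struct LocalMonitor {
  const uint32_t* marked = nullptr;

  uint32_t ldrex(const uint32_t* p) { marked = p; return *p; }
  void clrex() { marked = nullptr; }
  // Returns 0 on success, 1 on failure; any STREX clears the mark.
  uint32_t strex(uint32_t v, uint32_t* p) {
    const bool ok = (marked == p);
    marked = nullptr;
    if (ok) *p = v;
    return ok ? 0u : 1u;
  }
};

// Thread A wants to increment *p but is preempted between LDREX and STREX.
// Meanwhile thread B, on the same core, stores 100 to *p and then starts a
// new LDREX on *p before being preempted itself.
// Returns the STREX status finally seen by thread A.
uint32_t interleaving(uint32_t* p, bool clear_on_switch) {
  assert(p != nullptr);
  LocalMonitor core;
  const uint32_t a_seen = core.ldrex(p);  // A: LDREX(p)
  if (clear_on_switch) core.clrex();      // switch A -> B
  core.ldrex(p);                          // B: LDREX(p)
  core.strex(100, p);                     // B: STREX(p) = 100, succeeds
  core.ldrex(p);                          // B: LDREX(p) again, then preempted
  if (clear_on_switch) core.clrex();      // switch B -> A
  return core.strex(a_seen + 1, p);       // A: STREX(p) = stale value + 1
}
```

Without the clear, A's STREX succeeds on a stale value and overwrites B's store (a start value of 5 ends up as 6, and B's 100 is lost). With the clear, A's STREX reports failure, which is the case a retry loop like the one in atomic_cas_bool handles by reloading.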

More questions:

- If I got it right, at (D), tmp_reg value "1" has two meanings: either
STREX failed or some thread increased the value concurrently by 1. We
repeat the loop either way. Is this just accepted behavior? An increment by
1 is probably not that rare.

- If I understood this correctly, the loop guards us mainly against context
switches. Without the loop a context switch would count as a "semantically
failed" CAS. Why would that be okay? Should we not do this loop always?

- Do we not need an explicit CLREX after the operation? Or does the STREX
also clear the hardware monitor? Or does it just not matter?

- We have VM_Version::supports_ldrex(). Code seems to me sometimes guarded
by this (e.g. MacroAssembler::atomic_cas_bool), sometimes code just executes
ldrex/strex (e.g. the one-shot path of
MacroAssembler::cas_for_lock_acquire). Am I mistaken? Or is LDREX now
generally available? Does ARMv6 mean STREX and LDREX are available?


Thanks a lot!

Cheers, Thomas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/porters-dev/attachments/20230311/1e5aaf64/attachment.htm>

From david.holmes at oracle.com  Mon Mar 13 02:06:55 2023
From: david.holmes at oracle.com (David Holmes)
Date: Mon, 13 Mar 2023 12:06:55 +1000
Subject: Question about CAS via LDREX/STREX on 32-bit arm
In-Reply-To: <CAA-vtUwB+tS_gUxokZz2g2S6YF8mGu7=NyNS2u6kKd-xKxWQDw@mail.gmail.com>
References: <CAA-vtUwB+tS_gUxokZz2g2S6YF8mGu7=NyNS2u6kKd-xKxWQDw@mail.gmail.com>
Message-ID: <455cb951-fbe4-c3dd-02f2-a0a9f058aa73@oracle.com>

Hi Thomas,

I'm far too rusty on the details to answer most of your questions, but:

 > - Do we not need an explicit CLREX after the operation? Or does the
 > STREX also clear the hardware monitor? Or does it just not matter?

STREX clears the reservation, so CLREX is not needed. From the Arm ARM:

"A Load-Exclusive instruction marks a small block of memory for 
exclusive access. The size of the marked block is IMPLEMENTATION 
DEFINED, see Marking and the size of the marked memory block on page 
B2-105. A Store-Exclusive instruction to any address in the marked block 
clears the marking."
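The quoted rule can be illustrated with a toy model of one core's local exclusive monitor. `ToyMonitor` and its methods are hypothetical names, not an ARM or HotSpot API, and the model simplifies the architecture: it marks a single exact address rather than an IMPLEMENTATION DEFINED block, and it also clears the mark when a STREX fails. It shows that once a STREX has consumed the mark, a further STREX fails until a new LDREX is issued, so a successful CAS needs no trailing CLREX. The model says nothing about the path where the comparison fails and the STREX is skipped, which leaves the mark in place.

```cpp
#include <cassert>
#include <cstdint>

using std::uint32_t;

// Toy local-monitor model (hypothetical, not ARM or HotSpot code).
class ToyMonitor {
 public:
  // Load-Exclusive: read the value and mark the address.
  uint32_t ldrex(const uint32_t* p) {
    assert(p != nullptr);
    marked_ = p;
    return *p;
  }
  // Store-Exclusive: status 0 = stored, 1 = not stored. The mark is
  // cleared in both cases.
  uint32_t strex(uint32_t v, uint32_t* p) {
    assert(p != nullptr);
    const bool ok = (marked_ == p);
    marked_ = nullptr;
    if (ok) *p = v;
    return ok ? 0u : 1u;
  }
  void clrex() { marked_ = nullptr; }
  bool is_marked() const { return marked_ != nullptr; }

 private:
  const uint32_t* marked_ = nullptr;
};
```

A LDREX/STREX pair leaves the monitor unmarked, and a second STREX on the same word without a fresh LDREX reports failure and does not store.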

 > - We have VM_Version::supports_ldrex(). Code seems to me sometimes
 > guarded by this (e.g MacroAssembler::atomic_cas_bool), sometimes code
 > just executes ldrex/strex (e.g. the one-shot path of
 > MacroAssembler::cas_for_lock_acquire). Am I mistaken? Or is LDREX now
 > generally available? Does ARMv6 mean STREX and LDREX are available?

I think you will find that where the ldrex guard is missing, the code is 
for C2 only and a C2 build is only possible if ldrex is available. (C2 
was only supported on ARMv7+.)

Also the ARM conditional instructions as used in atomic_cas_bool can 
cause confusion when trying to understand the logic. :)

Cheers,
David
-----



From thomas.stuefe at gmail.com  Mon Mar 13 05:35:23 2023
From: thomas.stuefe at gmail.com (Thomas Stüfe)
Date: Mon, 13 Mar 2023 06:35:23 +0100
Subject: Question about CAS via LDREX/STREX on 32-bit arm
In-Reply-To: <455cb951-fbe4-c3dd-02f2-a0a9f058aa73@oracle.com>
References: <CAA-vtUwB+tS_gUxokZz2g2S6YF8mGu7=NyNS2u6kKd-xKxWQDw@mail.gmail.com>
 <455cb951-fbe4-c3dd-02f2-a0a9f058aa73@oracle.com>
Message-ID: <CAA-vtUziXR6fbQGstvEMfKHCQy3Q2WJ4uwxNAKxiYbqavOC+xw@mail.gmail.com>

Hi David,


Thank you, that already helps! Not that rusty, it seems :-)

Cheers, Thomas

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/porters-dev/attachments/20230313/b724dbf7/attachment-0001.htm>

From richard.reingruber at sap.com  Mon Mar 13 09:02:09 2023
From: richard.reingruber at sap.com (Reingruber, Richard)
Date: Mon, 13 Mar 2023 09:02:09 +0000
Subject: Question about CAS via LDREX/STREX on 32-bit arm
In-Reply-To: <CAA-vtUwB+tS_gUxokZz2g2S6YF8mGu7=NyNS2u6kKd-xKxWQDw@mail.gmail.com>
References: <CAA-vtUwB+tS_gUxokZz2g2S6YF8mGu7=NyNS2u6kKd-xKxWQDw@mail.gmail.com>
Message-ID: <AS4PR02MB8504AFA684AFA2CB27131D8D9BB99@AS4PR02MB8504.eurprd02.prod.outlook.com>

> Hi ARM experts,

Hi Thomas, not at all an ARM expert... :)
but I think I understand the code.

> I wondered what the loop guards against. And why it would be okay sometimes to omit it.

The loop retries whenever the reservation was lost, until either the STREX succeeds or *(base+offset) != oldvalue.

So there are two cases. The loop is left iff

(1) *(base+offset) != oldvalue
(2) the STREX succeeded

First, it is important to understand that C, D and E are only executed if the subtraction at B sets the eq condition.
This is based on the "Conditional Execution" feature of ARM: execution of most instructions can be made dependent on a condition (see https://developer.arm.com/documentation/den0013/d/ARM-Thumb-Unified-Assembly-Language-Instructions/Instruction-set-basics/Conditional-execution?lang=en)

So in case (1) C, D, E are not executed because B indicates that *(base+offset) and oldvalue are not-eq and the loop is left.

In case (2) C, D, E are executed. At C, if the reservation from A still exists, tmp_reg will be set to 0, otherwise to 1. At E the branch is taken if D indicated tmp_reg == 1 (reservation was lost); otherwise the loop is left.
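This reading can be checked with a small single-threaded C++ model of the A-F sequence. It is a sketch, not HotSpot code: `Core`, `lose_reservations` and `atomic_cas_bool_model` are hypothetical names, the monitor marks one exact address, and lost reservations are injected by hand rather than caused by another core or a context switch. The `if (c.z)` block plays the role of the eq condition on C and D, and `while (c.z)` is the conditional branch at E.

```cpp
#include <cassert>
#include <cstdint>

using std::uint32_t;

// Hypothetical model of one core: exclusive monitor plus the Z (eq) flag.
// `lose_reservations` makes that many upcoming STREXes fail.
struct Core {
  const uint32_t* marked = nullptr;
  int lose_reservations = 0;
  bool z = false;

  uint32_t ldrex(const uint32_t* p) { marked = p; return *p; }
  uint32_t strex(uint32_t v, uint32_t* p) {  // 0 = stored, 1 = failed
    const bool ok = (marked == p) && lose_reservations == 0;
    if (lose_reservations > 0) --lose_reservations;
    marked = nullptr;
    if (ok) *p = v;
    return ok ? 0u : 1u;
  }
};

// Returns the Z flag after F: true iff *p held oldval and now holds newval.
bool atomic_cas_bool_model(Core& c, uint32_t* p, uint32_t oldval,
                           uint32_t newval, int* iterations) {
  assert(p != nullptr && iterations != nullptr);
  uint32_t tmp;
  *iterations = 0;
  do {
    ++*iterations;
    tmp = c.ldrex(p);            // A: ldrex tmp, [p]
    tmp -= oldval;               // B: subs tmp, tmp, oldval (sets Z)
    c.z = (tmp == 0);
    if (c.z) {                   // C and D only execute on eq
      tmp = c.strex(newval, p);  // C: strexeq tmp, newval, [p]
      c.z = (tmp == 1);          // D: cmpeq tmp, #1
    }
  } while (c.z);                 // E: branch back on eq
  c.z = (tmp == 0);              // F: cmp tmp, #0
  return c.z;
}
```

In a call where the loaded value (9) differs from oldval (8) by exactly 1, tmp is 1 after B, yet the loop is left after one iteration because C and D are skipped and Z stays clear. With two injected lost reservations, a matching CAS takes three iterations and then succeeds.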

Cheers, Richard.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/porters-dev/attachments/20230313/25855148/attachment-0001.htm>

From thomas.stuefe at gmail.com  Mon Mar 13 09:54:14 2023
From: thomas.stuefe at gmail.com (Thomas Stüfe)
Date: Mon, 13 Mar 2023 10:54:14 +0100
Subject: Question about CAS via LDREX/STREX on 32-bit arm
In-Reply-To: <AS4PR02MB8504AFA684AFA2CB27131D8D9BB99@AS4PR02MB8504.eurprd02.prod.outlook.com>
References: <CAA-vtUwB+tS_gUxokZz2g2S6YF8mGu7=NyNS2u6kKd-xKxWQDw@mail.gmail.com>
 <AS4PR02MB8504AFA684AFA2CB27131D8D9BB99@AS4PR02MB8504.eurprd02.prod.outlook.com>
Message-ID: <CAA-vtUxbZe+fG2-h=Kz8=1OhyQ+P1WLGb-yAYea=fRjt0sXKQQ@mail.gmail.com>

Hi Richard :)


On Mon, Mar 13, 2023 at 10:02 AM Reingruber, Richard <
richard.reingruber at sap.com> wrote:

> The loop is needed to try again if the reservation was lost until the
> STREX succeeds or *(base+offset) != oldvalue.
>
> So there are two cases. The loop is left iff
>
> (1) *(base+offset) != oldvalue
> (2) the STREX succeeded
>
> First it is important to understand that C, D, E are only executed if at
> B the eq-condition is set to true.
> This is based on the "Conditional Execution" feature of ARM: execution of
> most instructions can be made dependent on a condition (see
> https://developer.arm.com/documentation/den0013/d/ARM-Thumb-Unified-Assembly-Language-Instructions/Instruction-set-basics/Conditional-execution?lang=en
> )
>
> So in case (1) C, D, E are not executed because B indicates that
> *(base+offset) and oldvalue are not-eq and the loop is left.
>

Ah, thanks, that was my thinking error. I did not realize that the CMP was
also conditional. I assumed the "eq" in the CMP (D) was part of the
comparison itself, which makes no sense, as I now know, since CMP just does
a subtraction and only needs two operands. So the whole C-D-E sequence is
controlled by the result of the subtraction at B.

That also resolves the "1 has a double meaning" question: it doesn't have
one.


> In case (2) C, D, E are executed. At C, if the reservation from A still
> exists, tmp_reg will be set to 0 otherwise to 1. At E the branch is taken
> if D indicated tmp_reg == 1 (reservation was lost) otherwise the loop is
> left.
>

Yes, I read it the same way. So we repeat the CAS for lost reservation. I'm
interested in when this could happen and why it would be okay to sometimes
omit this loop and do the "raw" LDREX-STREX sequence. See my original mail.

I suspect it has something to do with context switches: that the kernel
does a CLREX when we switch, so if we are switched out between LDREX and
STREX, the reservation is lost. But why would it then be okay to ignore this
sometimes?
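One hedged analogy for when dropping the retry loop can be acceptable comes from standard C++ rather than from HotSpot. `std::atomic<T>::compare_exchange_weak` is allowed to fail spuriously (on 32-bit ARM, compilers typically emit a single LDREX/STREX attempt for it), while `compare_exchange_strong` is not, so it typically carries a retry loop like the one in atomic_cas_bool. A weak CAS is fine when the caller loops anyway, or when a spurious failure only sends it down a slower path. Whether that is the reasoning behind the one-shot path in MacroAssembler::cas_for_lock_acquire is an assumption the thread does not confirm; `fetch_increment` and `try_claim` are illustrative names.

```cpp
#include <atomic>
#include <cassert>

// The caller already loops, so compare_exchange_weak is enough: a spurious
// failure (e.g. a lost reservation) just costs one more iteration. On
// failure, `expected` is reloaded with the value currently stored.
int fetch_increment(std::atomic<int>& counter) {
  int expected = counter.load(std::memory_order_relaxed);
  while (!counter.compare_exchange_weak(expected, expected + 1)) {
    // retry with the freshly loaded `expected`
  }
  return expected;  // value before the increment
}

// A single attempt with no loop around it: compare_exchange_strong, so that
// `false` reliably means "owner was not 0", never "reservation lost".
bool try_claim(std::atomic<int>& owner, int self) {
  assert(self != 0);  // 0 means "unowned" in this sketch
  int expected = 0;
  return owner.compare_exchange_strong(expected, self);
}
```

With a strong CAS, a caller that takes a slow path on `false` only does so when the word really held a different value; with a weak CAS it may also do so after a lost reservation, which is harmless if the slow path is correct, just slower.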

Thanks!

Thomas


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/porters-dev/attachments/20230313/7adcf7de/attachment.htm>

From richard.reingruber at sap.com  Mon Mar 13 10:03:11 2023
From: richard.reingruber at sap.com (Reingruber, Richard)
Date: Mon, 13 Mar 2023 10:03:11 +0000
Subject: Question about CAS via LDREX/STREX on 32-bit arm
In-Reply-To: <CAA-vtUxbZe+fG2-h=Kz8=1OhyQ+P1WLGb-yAYea=fRjt0sXKQQ@mail.gmail.com>
References: <CAA-vtUwB+tS_gUxokZz2g2S6YF8mGu7=NyNS2u6kKd-xKxWQDw@mail.gmail.com>
 <AS4PR02MB8504AFA684AFA2CB27131D8D9BB99@AS4PR02MB8504.eurprd02.prod.outlook.com>
 <CAA-vtUxbZe+fG2-h=Kz8=1OhyQ+P1WLGb-yAYea=fRjt0sXKQQ@mail.gmail.com>
Message-ID: <AS4PR02MB8504D804BCE39E4FC24E08BF9BB99@AS4PR02MB8504.eurprd02.prod.outlook.com>

> Yes, I read it the same way. So we repeat the CAS for lost reservation. I'm
> interested in when this could happen and why it would be okay to sometimes
> omit this loop and do the "raw" LDREX-STREX sequence. See my original mail.

Hm, I don't understand. The loop in the sequence A-F is always there. How is it omitted?


From: Thomas St?fe <thomas.stuefe at gmail.com>
Date: Monday, 13. March 2023 at 10:54
To: Reingruber, Richard <richard.reingruber at sap.com>
Cc: porters-dev at openjdk.org <porters-dev at openjdk.org>, aarch32-port-dev at openjdk.org <aarch32-port-dev at openjdk.org>
Subject: Re: Question about CAS via LDREX/STREX on 32-bit arm
Hi Richard :)


On Mon, Mar 13, 2023 at 10:02?AM Reingruber, Richard <richard.reingruber at sap.com<mailto:richard.reingruber at sap.com>> wrote:
> Hi ARM experts,

Hi Thomas, not at all an ARM expert... :)
but I think I understand the code.

> I am trying to understand how CAS is implemented on arm; in particular, "MacroAssembler::atomic_cas_bool":

> MacroAssembler::atomic_cas_bool

> ```
>     assert_different_registers(tmp_reg, oldval, newval, base);
>     Label loop;
>     bind(loop);
> A   ldrex(tmp_reg, Address(base, offset));
> B   subs(tmp_reg, tmp_reg, oldval);
> C   strex(tmp_reg, newval, Address(base, offset), eq);
> D   cmp(tmp_reg, 1, eq);
> E   b(loop, eq);
> F   cmp(tmp_reg, 0);
>     if (tmpreg == noreg) {
>       pop(tmp_reg);
>     }
> ```

> It uses LDREX and STREX to perform a cas of *(base+offset) from oldval to newval. It does so in a loop. The code distinguishes two failures: STREX failing, and a "semantically failed" CAS.

> Here is what I think this code does:

> A) LDREX: tmp=*(base+offset)
> B) tmp -= oldvalue
>    If *(base+offset) was unchanged, tmp_reg is now 0 and Z is 1
> C) If Z is 1: STREX the new value: *(base+offset)=newval. Otherwise, omit.
>    After this, if the store succeeded, tmp_reg is 0, if the store failed its 1.
> D) Here, tmp_reg is: 0 if the store succeeded, 1 if it failed, 1...n if *(base+offset) had been modified before LDREX.
>    We now compare with 1 and ...
> E) ...repeat the loop if tmp_reg was 1

> So we loop until either *(base+offset) had been changed to some other value concurrently before out LDREX. Or until our store succeeded.

> I wondered what the loop guards against. And why it would be okay sometimes to omit it.

The loop is needed to try again if the reservation was lost until the STREX succeeds or *(base+offset) != oldvalue.

So there are two cases. The loop is left iff either

(1) *(base+offset) != oldvalue
(2) the STREX succeeded

First, it is important to understand that C, D and E are only executed if B set the eq condition (Z flag).
This is based on the "Conditional Execution" feature of ARM: execution of most instructions can be made dependent on a condition (see https://developer.arm.com/documentation/den0013/d/ARM-Thumb-Unified-Assembly-Language-Instructions/Instruction-set-basics/Conditional-execution?lang=en).

So in case (1), C, D and E are not executed because B indicates that *(base+offset) and oldvalue are not equal, and the loop is left.
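To make the conditional-execution argument concrete, here is a small C++ model of steps A-F. It is a hypothetical sketch, not HotSpot code: the Z flag is a plain bool, the exclusive monitor is reduced to a counter `lost_reservations` that says how often the simulated STREX fails, and the name `run_cas` is made up for illustration. Each `if (z)` stands for an instruction carrying the `eq` condition.

```cpp
#include <cstdint>

// Hypothetical model of the A-F sequence (not HotSpot code).
// 'lost_reservations': how many times the simulated STREX fails before succeeding.
// 'iterations': reports how often the loop body ran.
// Returns the Z flag after F, i.e. true iff the CAS succeeded.
bool run_cas(uint32_t& mem, uint32_t oldval, uint32_t newval,
             int lost_reservations, int& iterations) {
  bool z = false;     // simulated Z flag
  uint32_t tmp = 0;   // simulated tmp_reg
  iterations = 0;
  for (;;) {
    ++iterations;
    tmp = mem;                        // A: ldrex tmp, [mem]
    tmp = tmp - oldval;               // B: subs tmp, tmp, oldval (unsigned wraparound)
    z = (tmp == 0);
    if (z) {                          // C: strex tmp, newval, [mem] (eq)
      if (lost_reservations > 0) {
        --lost_reservations;
        tmp = 1;                      // reservation lost: nothing written
      } else {
        mem = newval;
        tmp = 0;                      // store succeeded
      }
    }
    if (z) {                          // D: cmp tmp, #1 (eq)
      z = (tmp == 1);
    }
    if (!z) break;                    // E: b loop (eq), taken only while Z is set
  }
  z = (tmp == 0);                     // F: cmp tmp, #0
  return z;
}
```

With `mem == oldval + 1`, B leaves tmp at 1 but clears Z, so C, D and E are all skipped and the loop runs exactly once; the value 1 has no second meaning at D.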


Ah, thanks, that was my thinking error. I did not realize that CMP was also conditional. I assumed the "eq" in the CMP (D) was an argument of the comparison itself, which makes no sense, as I know now, since CMP just does a subtraction and only needs two operands. So the whole sequence C, D, E is controlled by the subtraction result at B.

That also resolves the "1 has a double meaning" question. It doesn't.

In case (2), C, D and E are executed. At C, if the reservation from A still exists, tmp_reg is set to 0, otherwise to 1. At E the branch is taken if D indicated tmp_reg == 1 (the reservation was lost); otherwise the loop is left.
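The same exit conditions can be sketched in portable C++ with `std::atomic::compare_exchange_weak`, which, like STREX, is allowed to fail spuriously even when the value matches. This is an illustration of the semantics only, not how HotSpot emits the code; `cas_bool` is a hypothetical name, and it uses the default seq_cst ordering for simplicity (the A-F snippet itself contains no barrier instructions). It retries only when the failure was spurious, i.e. the observed value still equals `oldval`.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch with the same exit conditions as the A-F loop:
// leave iff (1) the current value != oldval, or (2) the store succeeded.
bool cas_bool(std::atomic<uint32_t>& word, uint32_t oldval, uint32_t newval) {
  uint32_t expected = oldval;
  while (!word.compare_exchange_weak(expected, newval)) {
    // On failure, 'expected' now holds the value that was observed.
    if (expected != oldval) {
      return false;  // case (1): semantic failure, no retry
    }
    // Otherwise the failure was spurious (e.g. a lost reservation): retry.
  }
  return true;       // case (2): the store succeeded
}
```

Note that a value of `oldval + 1` is just an ordinary mismatch here and returns false without retrying.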

Yes, I read it the same way. So we repeat the CAS if the reservation was lost. I'm interested in when this can happen, and why it would be okay to sometimes omit this loop and do the "raw" LDREX-STREX sequence. See my original mail.

I suspect it has something to do with context switches: the kernel does a CLREX when we switch, so if we get switched out between LDREX and STREX, the reservation could be lost. But why would it then be okay to ignore this sometimes?
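The interleaving from the original mail (LDREX(p1), switch, LDREX(p2), switch back, STREX(p1)) can be illustrated with a deliberately simplified single-core monitor model. Assumption for illustration only: the monitor merely records "exclusive access pending" and does not compare addresses. `Monitor`, `ldrex`, `strex`, `clrex` and `interleaving` are hypothetical C++ names, not kernel or HotSpot APIs, and whether the kernel actually clears the monitor on a switch is exactly the open question above.

```cpp
#include <cstdint>

// Simplified model of a per-core exclusive monitor (illustration only).
// Assumption: the monitor does not remember the address of the LDREX.
struct Monitor {
  bool open = false;

  uint32_t ldrex(const uint32_t* p) { open = true; return *p; }

  // Returns 0 on success, 1 on failure, like STREX's status register.
  int strex(uint32_t* p, uint32_t v) {
    if (!open) return 1;
    open = false;
    *p = v;
    return 0;
  }

  void clrex() { open = false; }
};

// Thread 1 does LDREX(p1) and is switched out; thread 2 does LDREX(p2);
// thread 1 resumes and does STREX(p1). Returns thread 1's STREX status.
int interleaving(bool clear_on_switch, uint32_t& p1, uint32_t& p2) {
  Monitor m;
  (void)m.ldrex(&p1);                // thread 1
  if (clear_on_switch) m.clrex();    // switch to thread 2
  (void)m.ldrex(&p2);                // thread 2 (its STREX never happens here)
  if (clear_on_switch) m.clrex();    // switch back to thread 1
  return m.strex(&p1, 42);           // thread 1
}
```

Without the clear, thread 1's STREX to p1 succeeds even though the most recent LDREX was to p2. With the clear, it returns 1: the A-F loop simply retries, and a one-shot sequence reports a failed CAS.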

Thanks!

Thomas


Cheers, Richard.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/porters-dev/attachments/20230313/f39d6f12/attachment-0001.htm>

From thomas.stuefe at gmail.com  Mon Mar 13 10:07:31 2023
From: thomas.stuefe at gmail.com (=?UTF-8?Q?Thomas_St=C3=BCfe?=)
Date: Mon, 13 Mar 2023 11:07:31 +0100
Subject: Question about CAS via LDREX/STREX on 32-bit arm
In-Reply-To: <AS4PR02MB8504D804BCE39E4FC24E08BF9BB99@AS4PR02MB8504.eurprd02.prod.outlook.com>
References: <CAA-vtUwB+tS_gUxokZz2g2S6YF8mGu7=NyNS2u6kKd-xKxWQDw@mail.gmail.com>
 <AS4PR02MB8504AFA684AFA2CB27131D8D9BB99@AS4PR02MB8504.eurprd02.prod.outlook.com>
 <CAA-vtUxbZe+fG2-h=Kz8=1OhyQ+P1WLGb-yAYea=fRjt0sXKQQ@mail.gmail.com>
 <AS4PR02MB8504D804BCE39E4FC24E08BF9BB99@AS4PR02MB8504.eurprd02.prod.outlook.com>
Message-ID: <CAA-vtUyf1LGjYDhEOg=SB94n-Tq7zTDr2qYAZUt-pFAgJ+-HNA@mail.gmail.com>

All of this code lives in MacroAssembler::atomic_cas_bool, but that
function is not always called. There are plenty of direct usages of
ldrex+strex. There is even a condition argument for
MacroAssembler::cas_for_lock_acquire to either do the strex directly or to
dive down into atomic_cas_bool.

On Mon, Mar 13, 2023 at 11:03 AM Reingruber, Richard <richard.reingruber at sap.com> wrote:

> > Yes, I read it the same way. So we repeat the CAS for lost reservation. I'm
> > interested in when this could happen and why it would be okay to sometimes
> > omit this loop and do the "raw" LDREX-STREX sequence. See my original mail.
>
> Hm, I don't understand. The loop in the sequence A-F is always there. How
> is it omitted?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/porters-dev/attachments/20230313/b4ed4721/attachment-0001.htm>

From richard.reingruber at sap.com  Mon Mar 13 10:25:13 2023
From: richard.reingruber at sap.com (Reingruber, Richard)
Date: Mon, 13 Mar 2023 10:25:13 +0000
Subject: Question about CAS via LDREX/STREX on 32-bit arm
In-Reply-To: <CAA-vtUyf1LGjYDhEOg=SB94n-Tq7zTDr2qYAZUt-pFAgJ+-HNA@mail.gmail.com>
References: <CAA-vtUwB+tS_gUxokZz2g2S6YF8mGu7=NyNS2u6kKd-xKxWQDw@mail.gmail.com>
 <AS4PR02MB8504AFA684AFA2CB27131D8D9BB99@AS4PR02MB8504.eurprd02.prod.outlook.com>
 <CAA-vtUxbZe+fG2-h=Kz8=1OhyQ+P1WLGb-yAYea=fRjt0sXKQQ@mail.gmail.com>
 <AS4PR02MB8504D804BCE39E4FC24E08BF9BB99@AS4PR02MB8504.eurprd02.prod.outlook.com>
 <CAA-vtUyf1LGjYDhEOg=SB94n-Tq7zTDr2qYAZUt-pFAgJ+-HNA@mail.gmail.com>
Message-ID: <AS4PR02MB8504C2AF110C1E342DB4B2D19BB99@AS4PR02MB8504.eurprd02.prod.outlook.com>

> This whole coding lives in MacroAssembler::atomic_cas_bool, but that function
> is not always called. There are plenty of direct usages of ldrex+strex. There
> is even a condition argument for MacroAssembler::cas_for_lock_acquire to
> either do strex directly or to dive down into atomic_cas_bool.

In these cases the STREX result is not checked locally.
MacroAssembler::cas_for_lock_acquire checks it here: https://github.com/openjdk/jdk/blob/25e7ac226a3be9c064c0a65c398a8165596150f7/src/hotspot/cpu/arm/macroAssembler_arm.cpp#L1195
Failure handling is done in the slow_case.
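The reasoning is that a one-shot LDREX/STREX only has to be correct, not to succeed: any failure, spurious or not, is routed to a slow path that handles it. A minimal C++ sketch of that shape, assuming a hypothetical lock word where 0 means unlocked and hypothetical functions `slow_lock` and `lock_acquire_model`; this is not the actual cas_for_lock_acquire logic. `compare_exchange_strong` stands in for the looping atomic_cas_bool, since it never fails spuriously.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical lock word: 0 == unlocked, otherwise the owner id.
using LockWord = std::atomic<uint32_t>;

// Stand-in for HotSpot's slow_case. Here it just spins until the word is
// free; a real slow path would block, inflate the lock, etc.
static void slow_lock(LockWord& w, uint32_t self) {
  uint32_t expected = 0;
  while (!w.compare_exchange_weak(expected, self)) {
    expected = 0;  // reset: on failure 'expected' was overwritten
  }
}

// one_shot == true: a single attempt that may fail spuriously (like a raw
// LDREX/STREX); one_shot == false: a CAS that retries lost reservations.
// Either way, every failure ends up in the slow path, so a spurious
// failure costs time but does not change the result.
void lock_acquire_model(LockWord& w, uint32_t self, bool one_shot) {
  uint32_t expected = 0;
  bool ok = one_shot ? w.compare_exchange_weak(expected, self)
                     : w.compare_exchange_strong(expected, self);
  if (!ok) {
    slow_lock(w, self);
  }
}
```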

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/porters-dev/attachments/20230313/3b69fdcb/attachment-0001.htm>

From thomas.stuefe at gmail.com  Mon Mar 13 10:34:53 2023
From: thomas.stuefe at gmail.com (=?UTF-8?Q?Thomas_St=C3=BCfe?=)
Date: Mon, 13 Mar 2023 11:34:53 +0100
Subject: Question about CAS via LDREX/STREX on 32-bit arm
In-Reply-To: <AS4PR02MB8504C2AF110C1E342DB4B2D19BB99@AS4PR02MB8504.eurprd02.prod.outlook.com>
References: <CAA-vtUwB+tS_gUxokZz2g2S6YF8mGu7=NyNS2u6kKd-xKxWQDw@mail.gmail.com>
 <AS4PR02MB8504AFA684AFA2CB27131D8D9BB99@AS4PR02MB8504.eurprd02.prod.outlook.com>
 <CAA-vtUxbZe+fG2-h=Kz8=1OhyQ+P1WLGb-yAYea=fRjt0sXKQQ@mail.gmail.com>
 <AS4PR02MB8504D804BCE39E4FC24E08BF9BB99@AS4PR02MB8504.eurprd02.prod.outlook.com>
 <CAA-vtUyf1LGjYDhEOg=SB94n-Tq7zTDr2qYAZUt-pFAgJ+-HNA@mail.gmail.com>
 <AS4PR02MB8504C2AF110C1E342DB4B2D19BB99@AS4PR02MB8504.eurprd02.prod.outlook.com>
Message-ID: <CAA-vtUyyRkuU4NYpALbQ3jACOC0BrFYO6t_bmPh5vnReW1fscQ@mail.gmail.com>

Got it, thanks. In the one-shot case, if STREX failed because we lost
exclusive access, we would fall through to the slow case. So there should
be nothing semantically different happening.

On Mon, Mar 13, 2023 at 11:25 AM Reingruber, Richard <richard.reingruber at sap.com> wrote:

> > This whole coding lives in MacroAssembler::atomic_cas_bool, but that function
> > is not always called. There are plenty of direct usages of ldrex+strex. There
> > is even a condition argument for MacroAssembler::cas_for_lock_acquire to
> > either do strex directly or to dive down into atomic_cas_bool.
>
> The check of the strex result is not local in these cases.
> MacroAssembler::cas_for_lock_acquire checks here:
> https://github.com/openjdk/jdk/blob/25e7ac226a3be9c064c0a65c398a8165596150f7/src/hotspot/cpu/arm/macroAssembler_arm.cpp#L1195
> Failure handling is done in the slow_case.
>
>
>
> *From: *Thomas St?fe <thomas.stuefe at gmail.com>
> *Date: *Monday, 13. March 2023 at 11:07
> *To: *Reingruber, Richard <richard.reingruber at sap.com>
> *Cc: *porters-dev at openjdk.org <porters-dev at openjdk.org>,
> aarch32-port-dev at openjdk.org <aarch32-port-dev at openjdk.org>
> *Subject: *Re: Question about CAS via LDREX/STREX on 32-bit arm
>
> This whole coding lives in MacroAssembler::atomic_cas_bool, but that
> function is not always called. There are plenty of direct usages of
> ldrex+strex. There is even a condition argument for
> MacroAssembler::cas_for_lock_acquire to either do strex directly or to dive
> down into atomic_cas_bool.
>
>
>
> On Mon, Mar 13, 2023 at 11:03?AM Reingruber, Richard <
> richard.reingruber at sap.com> wrote:
>
> > Yes, I read it the same way. So we repeat the CAS for lost reservation.
> I'm
>
> > interested in when this could happen and why it would be okay to
> sometimes
>
> > omit this loop and do the "raw" LDREX-STREX sequence. See my original
> mail.
>
>
>
> Hm, I don't understand. The loop in the sequence A-F is always there. How
> is it omitted?
>
>
>
>
>
> *From: *Thomas St?fe <thomas.stuefe at gmail.com>
> *Date: *Monday, 13. March 2023 at 10:54
> *To: *Reingruber, Richard <richard.reingruber at sap.com>
> *Cc: *porters-dev at openjdk.org <porters-dev at openjdk.org>,
> aarch32-port-dev at openjdk.org <aarch32-port-dev at openjdk.org>
> *Subject: *Re: Question about CAS via LDREX/STREX on 32-bit arm
>
> Hi Richard :)
>
>
>
>
>
> On Mon, Mar 13, 2023 at 10:02?AM Reingruber, Richard <
> richard.reingruber at sap.com> wrote:
>
> > Hi ARM experts,
>
>
>
> Hi Thomas, not at all an ARM expert... :)
>
> but I think I understand the code.
>
>
>
> > I am trying to understand how CAS is implemented on arm; in particular,
> "MacroAssembler::atomic_cas_bool":
>
>
>
> > MacroAssembler::atomic_cas_bool
>
>
>
> > ```
>
> >     assert_different_registers(tmp_reg, oldval, newval, base);
>
> >     Label loop;
>
> >     bind(loop);
>
> > A   ldrex(tmp_reg, Address(base, offset));
>
> > B   subs(tmp_reg, tmp_reg, oldval);
>
> > C   strex(tmp_reg, newval, Address(base, offset), eq);
>
> > D   cmp(tmp_reg, 1, eq);
>
> > E   b(loop, eq);
>
> > F   cmp(tmp_reg, 0);
>
> >     if (tmpreg == noreg) {
>
> >       pop(tmp_reg);
>
> >     }
>
> > ```
>
>
>
> > It uses LDREX and STREX to perform a cas of *(base+offset) from oldval
> to newval. It does so in a loop. The code distinguishes two failures: STREX
> failing, and a "semantically failed" CAS.
>
>
>
> > Here is what I think this code does:
>
>
>
> > A) LDREX: tmp=*(base+offset)
>
> > B) tmp -= oldvalue
>
> >    If *(base+offset) was unchanged, tmp_reg is now 0 and Z is 1
>
> > C) If Z is 1: STREX the new value: *(base+offset)=newval. Otherwise,
> omit.
>
> >    After this, if the store succeeded, tmp_reg is 0, if the store failed
> its 1.
>
> > D) Here, tmp_reg is: 0 if the store succeeded, 1 if it failed, 1...n if
> *(base+offset) had been modified before LDREX.
>
> >    We now compare with 1 and ...
>
> > E) ...repeat the loop if tmp_reg was 1
>
>
>
> > So we loop until either *(base+offset) had been changed to some other
> value concurrently before out LDREX. Or until our store succeeded.
>
> > I wondered what the loop guards against. And why it would be okay
> > sometimes to omit it.
>
> The loop retries whenever the reservation was lost, until either the
> STREX succeeds or *(base+offset) != oldvalue.
>
> So there are two cases. The loop is left iff
>
> (1) *(base+offset) != oldvalue, or
>
> (2) the STREX succeeded.
>
> First it is important to understand that C, D and E only take effect if
> B set the eq condition (Z flag).
>
> This is based on the "Conditional Execution" feature of ARM: execution of
> most instructions can be made dependent on a condition (see
> https://developer.arm.com/documentation/den0013/d/ARM-Thumb-Unified-Assembly-Language-Instructions/Instruction-set-basics/Conditional-execution?lang=en)
>
> So in case (1) C, D, E are not executed because B indicates that
> *(base+offset) and oldvalue are not equal, and the loop is left.
>
> Ah, thanks, that was my thinking error. I did not realize that CMP was
> also conditional. I assumed the "eq" in the CMP (D) was part of the
> comparison itself, which makes no sense, as I know now, since CMP just does
> a SUB and only needs two operands. So the whole C-D-E sequence is
> controlled by the subtraction result at B.
>
> That also resolves the "1 has a double meaning" question. It doesn't.
>
> In case (2) C, D, E are executed. At C, if the reservation from A still
> exists, tmp_reg is set to 0, otherwise to 1. At E the branch is taken
> if D indicated tmp_reg == 1 (the reservation was lost); otherwise the
> loop is left.
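To make the two exit conditions concrete, here is a minimal C++ sketch that simulates the A..F sequence in plain code. It is not HotSpot code: `ExclusiveMonitor`, `model_atomic_cas_bool` and the `spurious_failures` hook are hypothetical names, the exclusive monitor is reduced to a single "reservation held" bit (no address tracking, no global monitor, no real concurrency), and each conditional ARM instruction becomes an `if` on a modeled Z flag.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, single-threaded model of a core's local exclusive monitor.
// Real hardware also tracks the reserved address and a global monitor.
struct ExclusiveMonitor {
  bool reserved = false;      // true between LDREX and the next STREX/CLREX
  int spurious_failures = 0;  // test hook: lose the reservation this many times
};

// Models the A..F sequence of atomic_cas_bool; returns true iff newval was
// stored. *iterations receives the number of passes through the loop.
bool model_atomic_cas_bool(uint32_t* addr, uint32_t oldval, uint32_t newval,
                           ExclusiveMonitor& mon, int* iterations) {
  assert(addr != nullptr && iterations != nullptr);
  *iterations = 0;
  uint32_t tmp = 0;
  bool z = false;  // the Z (equal) condition flag
  do {
    ++*iterations;
    // A: LDREX tmp, [addr]. Load the value and take a reservation.
    tmp = *addr;
    mon.reserved = true;
    // B: SUBS tmp, tmp, oldval. Z is set iff *addr == oldval.
    tmp -= oldval;
    z = (tmp == 0);
    if (z) {
      // Simulate losing the reservation, e.g. to a context switch.
      if (mon.spurious_failures > 0) {
        --mon.spurious_failures;
        mon.reserved = false;
      }
      // C: STREXEQ tmp, newval, [addr]. Status 0 = stored, 1 = not stored.
      if (mon.reserved) {
        *addr = newval;
        tmp = 0;
      } else {
        tmp = 1;
      }
      mon.reserved = false;  // this model drops the reservation after every STREX
      // D: CMPEQ tmp, #1. Z is set iff the STREX failed.
      z = (tmp == 1);
    }
    // E: BEQ loop. Taken only when both B and D left Z set.
  } while (z);
  // F: CMP tmp, #0. Z set (CAS succeeded) iff the store happened.
  return tmp == 0;
}
```

With `*addr == oldval + 1`, SUBS leaves 1 in `tmp` but clears Z, so the conditional STREX and CMP never run and the loop is not repeated; only a failed STREX can cause a retry. Forcing two spurious failures makes the loop run three times before the store lands.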
>
> Yes, I read it the same way. So we repeat the CAS if the reservation was
> lost. I'm interested in when this could happen and why it would be okay to
> sometimes omit this loop and do the "raw" LDREX-STREX sequence. See my
> original mail.
>
> I suspect it has something to do with context switches: perhaps the kernel
> does a CLREX when we switch, so if we are switched out between LDREX and
> STREX, the reservation is lost. But why would it then be okay to ignore
> this sometimes?
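The question of when a single LDREX/STREX attempt is acceptable has a close analogue in C++11's `compare_exchange_strong` versus `compare_exchange_weak`. The standard allows the weak form to fail spuriously (on LL/SC hardware, a lost reservation), while the strong form fails only if the value really differed; compilers for 32-bit ARM typically lower the strong form to a retry loop and the weak form to one attempt. A one-shot CAS is therefore only safe where a spurious failure is harmless to the caller, for example because it retries anyway or falls back to a slow path. Whether that is the reasoning behind the one-shot HotSpot paths is an assumption, not something the thread confirms; `cas_strong` and `fetch_increment` are hypothetical helpers.

```cpp
#include <atomic>
#include <cstdint>

// Strong CAS: reports failure only if the value really differed from
// `expected`. On LL/SC targets this is typically an LDREX/STREX retry loop,
// much like atomic_cas_bool.
bool cas_strong(std::atomic<uint32_t>& a, uint32_t expected, uint32_t desired) {
  return a.compare_exchange_strong(expected, desired);
}

// Weak CAS: may fail spuriously even when the value equals `expected`
// (e.g. because the reservation was lost), so it is typically a single
// LDREX/STREX attempt. That is fine when the caller loops anyway, as in
// this hypothetical fetch-and-increment helper.
uint32_t fetch_increment(std::atomic<uint32_t>& a) {
  uint32_t cur = a.load(std::memory_order_relaxed);
  // On failure, compare_exchange_weak stores the value it observed in `cur`.
  while (!a.compare_exchange_weak(cur, cur + 1)) {
    // retry with the refreshed `cur`
  }
  return cur;  // the value before the increment
}
```

With the weak form, a failure caused by a lost reservation just costs one more iteration of the caller's own loop, so a retry loop inside the CAS itself would be redundant there.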
>
> Thanks!
>
> Thomas
>
> Cheers, Richard.
>
> *From: *porters-dev <porters-dev-retn at openjdk.org> on behalf of Thomas
> Stüfe <thomas.stuefe at gmail.com>
> *Date: *Saturday, 11. March 2023 at 11:19
> *To: *porters-dev at openjdk.org <porters-dev at openjdk.org>,
> aarch32-port-dev at openjdk.org <aarch32-port-dev at openjdk.org>
> *Subject: *Question about CAS via LDREX/STREX on 32-bit arm
>
> Hi ARM experts,
>
> I am trying to understand how CAS is implemented on arm; in particular,
> "MacroAssembler::atomic_cas_bool":
>
> MacroAssembler::atomic_cas_bool
>
> ```
>     assert_different_registers(tmp_reg, oldval, newval, base);
>     Label loop;
>     bind(loop);
> A   ldrex(tmp_reg, Address(base, offset));
> B   subs(tmp_reg, tmp_reg, oldval);
> C   strex(tmp_reg, newval, Address(base, offset), eq);
> D   cmp(tmp_reg, 1, eq);
> E   b(loop, eq);
> F   cmp(tmp_reg, 0);
>     if (tmpreg == noreg) {
>       pop(tmp_reg);
>     }
> ```
>
> It uses LDREX and STREX to perform a cas of *(base+offset) from oldval to
> newval. It does so in a loop. The code distinguishes two failures: STREX
> failing, and a "semantically failed" CAS.
>
> Here is what I think this code does:
>
> A) LDREX: tmp=*(base+offset)
> B) tmp -= oldvalue
>    If *(base+offset) was unchanged, tmp_reg is now 0 and Z is 1
> C) If Z is 1: STREX the new value: *(base+offset)=newval. Otherwise, omit.
>    After this, if the store succeeded, tmp_reg is 0, if the store failed
> it's 1.
> D) Here, tmp_reg is: 0 if the store succeeded, 1 if it failed, 1...n if
> *(base+offset) had been modified before LDREX.
>    We now compare with 1 and ...
> E) ...repeat the loop if tmp_reg was 1
>
> So we loop until either *(base+offset) had been changed to some other
> value concurrently before our LDREX. Or until our store succeeded.
>
> I wondered what the loop guards against. And why it would be okay
> sometimes to omit it.
>
> IIUC, STREX fails if the core has lost its exclusive access to the memory
> location since the LDREX. This can be one of three things, right?
> - another core slipped in an LDREX+STREX to the same location between our
> LDREX and STREX
> - Or we context switched to another thread or process. I assume it does a
> CLREX then, right? Because how could you prevent a sequence like "LDREX(p1)
> -> switch -> LDREX(p2) -> switch back -> STREX(p1)" - if I understand the ARM
> manual [1] correctly, a STREX to a different location than the preceding
> LDREX is undefined.
> - Or we had a signal after LDREX and did a second LDREX in the signal
> handler. Does the kernel do a CLREX when invoking a signal handler?
>
> More questions:
>
> - If I got it right, at (D), tmp_reg value "1" has two meanings: either
> STREX failed or some thread increased the value concurrently by 1. We
> repeat the loop either way. Is this just accepted behavior? Increasing by 1
> is maybe not that rare.
>
> - If I understood this correctly, the loop guards us mainly against
> context switches. Without the loop a context switch would count as a
> "semantically failed" CAS. Why would that be okay? Should we not do this
> loop always?
>
> - Do we not need an explicit CLREX after the operation? Or does the STREX
> also clear the hardware monitor? Or does it just not matter?
>
> - We have VM_Version::supports_ldrex(). Code seems to me sometimes guarded
> by this (e.g. MacroAssembler::atomic_cas_bool), sometimes code just executes
> ldrex/strex (e.g. the one-shot path of
> MacroAssembler::cas_for_lock_acquire). Am I mistaken? Or is LDREX now
> generally available? Does ARMv6 mean STREX and LDREX are available?
>
>
> Thanks a lot!
>
> Cheers, Thomas
>
>

From jorn.vernee at oracle.com  Fri Mar 31 16:56:14 2023
From: jorn.vernee at oracle.com (Jorn Vernee)
Date: Fri, 31 Mar 2023 18:56:14 +0200
Subject: Making the foreign liner a required API
Message-ID: <3b599687-5563-65d2-e221-5ae94d65e0be@oracle.com>

Hello,

We (the panama-foreign team) want to make the foreign linker, which is 
returned by the java.lang.foreign.Linker::nativeLinker() method, a 
required API for Java 21. This move is needed in order to be able to 
start rewriting parts of the JDK itself to use the linker API.

In Java 20, this API is specified to throw an 
UnsupportedOperationException on unsupported platforms [1]. Making the 
API required on all platforms means removing this exception 
specification, so implementations that continue to throw it 
would not be compliant with the spec.

There are several ports already:
- RISC-V: https://github.com/openjdk/jdk/pull/11004
- Windows/AArch64: https://github.com/openjdk/jdk/pull/12773
- PPC64le: https://github.com/openjdk/jdk/pull/12708 (not yet integrated)
- Zero VM: the zero VM configuration works with the fallback linker (see 
below)

For these ports nothing would need to be done.

For other ports, an implementation of the native linker would need to be 
provided. To facilitate this, we are adding a 'fallback' linker 
implementation as part of JEP 442 [2]. This allows a JDK to use libffi 
in order to implement (most of) the functionality of a linker. This 
approach is similar to the alternative implementation provided for 
VMContinuations, but would require ports to also bundle a working 
version of libffi.

The fallback linker can be enabled with the --enable-fallback-linker 
configure flag. When doing that, the build system will also require 
libffi to be available. I've tested this on Linux/x64 using libffi 
version 3.4.2. It is noteworthy that I had to build libffi from source, 
as the version that came with the particular distribution of Ubuntu I 
was using didn't work (resulting in SEGVs at runtime). I also had to use 
the same toolchain for building libffi as was used to build the JDK 
itself. The JEP 442 patch also contains a build script, 
make/devkit/createLibffiBundle.sh [3], which describes the steps needed 
to create a working libffi package on Linux/x64.

I would like to ask maintainers of other ports to try out the JEP patch 
together with the fallback linker/libffi, so that we can 'flip the 
switch' and make the linker a required API later in the 21 release cycle 
without a hitch.

Thanks,
Jorn Vernee

[1]: https://docs.oracle.com/en/java/javase/20/docs/api/java.base/java/lang/foreign/Linker.html#nativeLinker()
[2]: https://github.com/openjdk/jdk/pull/13079
[3]: https://github.com/openjdk/jdk/blob/928ad35e570abaac9103989d603e3ef278568354/make/devkit/createLibffiBundle.sh