[aarch64-port-dev ] RFR: aarch64: minor improvements of atomic operations
Erik Österlund
erik.osterlund at oracle.com
Tue Nov 12 17:38:11 UTC 2019
Hi Felix,
I was hoping to stay out of this conversation, but couldn't resist
butting in unfortunately.
I have to agree with you - you are absolutely right. We have a mix of
the JMM, the C++ memory model, and HotSpot's own memory model, which
predates both. The JMM and the C++ memory model are indeed quite similar
now in terms of semantics (though there is some choice in how to
implement them), but the old memory model used in HotSpot is not. Ideally
we would have fewer memory models and just go with the one used by
C++/the JMM, and then we would only have to convince ourselves that the
compiler's implementation of seq_cst is compatible with the one we use
to implement the JMM in our JIT-compiled code. But it seems to me that
we are not there.
Last time I discussed this with Andrew Haley, we disagreed and didn't
really get anywhere. Andrew wanted to use the GCC intrinsics, and I was
arguing that we should use inline assembly, because a) the memory model
we are supporting is not the same as the one the intrinsics provide, and
b) we are relying on the implementation of the intrinsics to emit very
specific instruction sequences that are compatible with our memory
model, and it would be clearer if the inline assembly showed that we
used exactly the instructions we expected and not something unexpected,
which we would otherwise only find out by disassembling the code (ahem).
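To make that concrete, here is the kind of thing I mean - purely an
illustrative sketch, not the actual HotSpot code, with made-up operand
assignments (x1 = dest, x2 = expected value, x4 = new value):

```asm
// Sketch only: a "conservative" cmpxchg written so that the exact
// instruction sequence is visible in the source, rather than hidden
// behind an intrinsic.
1:      ldaxr   x0, [x1]        // exclusive load-acquire of *dest
        cmp     x0, x2          // compare with the expected value
        b.ne    2f              // mismatch: fail without storing
        stlxr   w3, x4, [x1]    // exclusive store-release of new value
        cbnz    w3, 1b          // lost the exclusive monitor: retry
2:      dmb     ish             // trailing full barrier
```

With something like this in the source, a reviewer can check the
sequence against the memory model directly instead of trusting the
compiler's expansion of the intrinsic.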
Now it looks like you have discovered that we sometimes emit a double
trailing dmb ish, and sometimes lack a leading dmb ish, if I am reading
this right. That makes the case stronger: by looking at the intrinsic
calls, it is not obvious what instruction sequence will be emitted, nor
whether that sequence is compatible with the memory model it is
implementing - you really have to disassemble it to find out what we
actually got. And it looks like what we got is not at all what we
wanted.
My hope is that the AArch64 port should use inline assembly as you
suggest, so we can see that the generated code is correct, as we wait
for the glorious future where all HotSpot code has been rewritten to
work with seq_cst (and we are *not* there now).
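For the exchange case in particular, an inline-assembly version could
also avoid the redundant second barrier that the intrinsic-based code
produces. Again only a sketch with hypothetical operands (x1 = dest,
x2 = new value, old value returned in x0) - whether the acquire/release
pairing plus a single trailing dmb is sufficient for our conservative
semantics is exactly the kind of thing that would need careful review:

```asm
// Hypothetical xchg sketch, not the actual HotSpot implementation.
1:      ldaxr   x0, [x1]        // exclusive load-acquire (old value)
        stlxr   w3, x2, [x1]    // exclusive store-release (new value)
        cbnz    w3, 1b          // retry if the exclusive store failed
        dmb     ish             // single trailing full barrier
```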
Having said that, now I will try to go and hide in a corner again...
Thanks,
/Erik
On 2019-11-12 13:14, Yangfei (Felix) wrote:
>> -----Original Message-----
>> From: Yangfei (Felix)
>> Sent: Tuesday, November 12, 2019 8:03 PM
>> To: 'Andrew Dinn' <adinn at redhat.com>; Andrew Haley <aph at redhat.com>;
>> aarch64-port-dev at openjdk.java.net
>> Cc: hotspot-dev at openjdk.java.net
>> Subject: RE: [aarch64-port-dev ] RFR: aarch64: minor improvements of atomic
>> operations
>>
>>> -----Original Message-----
>>> From: Andrew Dinn [mailto:adinn at redhat.com]
>>> Sent: Tuesday, November 12, 2019 5:42 PM
>>> To: Andrew Haley <aph at redhat.com>; Yangfei (Felix)
>>> <felix.yang at huawei.com>; aarch64-port-dev at openjdk.java.net
>>> Cc: hotspot-dev at openjdk.java.net
>>> Subject: Re: [aarch64-port-dev ] RFR: aarch64: minor improvements of
>>> atomic operations
>>>
>>> On 12/11/2019 09:25, Andrew Haley wrote:
>>>> On 11/12/19 8:37 AM, Yangfei (Felix) wrote:
>>>>> This has been discussed somewhere before:
>>>>> https://patchwork.kernel.org/patch/3575821/
>>>>> Let's keep the current status, to be safe.
>>>> Yes.
>>>>
>>>> It's been interesting to see the progress of this patch. I don't
>>>> think it's the first time that someone has been tempted to change
>>>> this code to make it "more efficient".
>>>>
>>>> I wonder if we could perhaps add a comment to that code so that it
>>>> doesn't happen again. I'm not sure exactly what the patch should say
>>>> beyond "do not touch". Perhaps something along the lines of "Do not
>>>> touch this code unless you have at least Black Belt, 4th Dan in
>>>> memory ordering." :-)
>>>>
>>>> More seriously, maybe simply "Note that memory_order_conservative
>>>> requires a full barrier after atomic stores. See
>>>> https://patchwork.kernel.org/patch/3575821/"
>>> Yes, that would be a help. It's particularly easy to get confused here
>>> because we happily omit the ordering of an stlr store wrt subsequent
>>> stores when the stlr is implementing a Java volatile write or a Java cmpxchg.
>>>
>>> So, it might be worth adding a rider that implementing the full
>>> memory_order_conservative semantics is necessary because VM code
>>> relies on the strong ordering wrt writes that the cmpxchg is required to
>> provide.
>> I also suggest we implement these functions with inline assembly here.
>> For Atomic::PlatformXchg, we may issue two consecutive full memory barriers
>> with the current status.
>> I used GCC 7.3.0 to compile the following function:
>>
>> $ cat test.c
>> long foo(long add_value, long volatile* dest, long exchange_value) {
>>     long val = __sync_lock_test_and_set(dest, exchange_value);
>>
>>     __sync_synchronize();
>>
>>     return val;
>> }
>>
>> $ cat test.s
>>         .arch   armv8-a
>>         .file   "test.c"
>>         .text
>>         .align  2
>>         .p2align 3,,7
>>         .global foo
>>         .type   foo, %function
>> foo:
>> .L2:
>>         ldxr    x0, [x1]
>>         stxr    w3, x2, [x1]
>>         cbnz    w3, .L2
>>         dmb     ish     < ========
>>         dmb     ish     < ========
>>         ret
>>         .size   foo, .-foo
>>         .ident  "GCC: (GNU) 7.3.0"
>>         .section        .note.GNU-stack,"",@progbits
> Also, the generated code differs from the following sequence (stxr instead of stlxr).
>
> <Access [A]>
>
> // atomic_op (B)
> 1: ldxr x0, [B] // Exclusive load
> <op(B)>
> stlxr w1, x0, [B] // Exclusive store with release
> cbnz w1, 1b
> dmb ish // Full barrier
>
> <Access [C]>
>
> I think the two-way (full barrier) ordering may not be ensured in this case.
>
> Felix