RFR: 8189596: AArch64: implementation for Thread-local handshakes

Fri Nov 24 15:10:51 UTC 2017

Hi Andrew,

On 2017-11-24 15:39, Andrew Dinn wrote:
> On 24/11/17 13:36, Erik Österlund wrote:
>> On 2017-11-24 13:07, Andrew Dinn wrote:
>>> On 24/11/17 10:36, Erik Österlund wrote:
>>>> By placing loading the local poll value with ldar *only* in the native
>>>> wrapper wakeup function, you avoid these issues.
>>>> Another approach you may elect to use is to only in the native wrapper
>>>> load both the thread-local poll value and the
>>>> SafepointSynchronize::_state, when filtering slow paths to avoid this
>>>> unfortunate race.
>>> I can see why an ldar (actually ldarw) is needed when safepoint_poll is
>>> called from the nativewrapper. Can you explain why ldar is not needed
>>> for *all* calls to safepoint_poll?
>> That is a long story. :) But since you asked, here we go...
>> . . .
>> I hope this sheds some light on the important races you need to be aware
>> of.
> Well, that's a good story and maybe needs to be included somewhere in
> the code even if only in precis. The asymmetry between start and end
> makes clear why you want an ldar in the native wrapper.

Thank you. And yes, you are right - this should probably be documented.

> The one detail I am still not sure of is how this design ensures that
> the benign races in JIT/interpreter ever come to an end. What guarantees
> that JITted code or the interpreter actually performs an acquiring load
> and thereby notices a change to the armed poll value? That might take a
> very long time (if it happens at all?).

The JMM defines opaque stores as stores that, among other properties,
guarantee progress: the store will eventually become observable to other
threads. The stores that arm the thread-local polls use
release_store(), which is strictly stronger than opaque on every platform
(and hence carries the guarantee that the store will "eventually" become
observable to opaque loads). Therefore, the fast paths in JIT-compiled
code and interpreter code are guaranteed to "eventually" observe those
stores with opaque loads (and ldr is an opaque load). When they do, they
will jump into the VM (directly or through a trampoline) and perform
the acquiring load on the thread-local poll before loading the
global state.

As for how long it takes for the stores to become observable by
remote loads, I imagine it should be no worse than the delay of a
TLB shootdown event. And just for completeness for this discussion:
the arming of the local polls is followed by a fence(), which
performs a dmb ish in your GCC intrinsics.
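
To make the pattern above concrete, here is a minimal sketch in portable
C++ rather than HotSpot code. All names in it (ThreadPoll, arm_poll,
fast_path_poll, slow_path_poll, global_state) are hypothetical stand-ins.
It assumes the global state is written before the poll is armed, and it
uses a relaxed atomic load as the closest C++ analogue of an opaque load.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-ins; these are NOT the real HotSpot types or names.
constexpr std::uintptr_t kDisarmed = 0;
constexpr std::uintptr_t kArmed = 1;
constexpr int kNotSynchronized = 0;

struct ThreadPoll {
    std::atomic<std::uintptr_t> poll_word{kDisarmed};
};

// Stand-in for SafepointSynchronize::_state.
std::atomic<int> global_state{kNotSynchronized};

// VM side. Assumption: the global state is published first, then the
// thread-local poll is armed with a release store, followed by a full
// fence (the step described above as fence() / dmb ish).
void arm_poll(ThreadPoll& t, int new_state) {
    global_state.store(new_state, std::memory_order_relaxed);
    t.poll_word.store(kArmed, std::memory_order_release);
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Fast path (JIT-compiled code or interpreter): a relaxed load, which
// corresponds to a plain ldr. It may observe the armed value late, but
// it will observe it eventually.
bool fast_path_poll(const ThreadPoll& t) {
    return t.poll_word.load(std::memory_order_relaxed) != kDisarmed;
}

// Slow path (inside the VM): re-read the poll with an acquiring load
// (ldar on AArch64) and only then read the global state, so the state
// read cannot be older than the one published before arming.
int slow_path_poll(const ThreadPoll& t) {
    if (t.poll_word.load(std::memory_order_acquire) == kDisarmed) {
        return kNotSynchronized;  // poll was not armed after all
    }
    return global_state.load(std::memory_order_relaxed);
}
```

A thread would call fast_path_poll() at each poll site and enter
slow_path_poll() only when it returns true, so the cost of the acquiring
load is paid only once the poll has been seen as armed.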

Thanks,
/Erik

> regards,
>
>
> Andrew Dinn
> -----------