RFR: 8189596: AArch64: implementation for Thread-local handshakes
Erik Österlund
erik.osterlund at oracle.com
Fri Nov 24 16:19:10 UTC 2017
Hi Andrew,
On 2017-11-24 16:55, Andrew Dinn wrote:
> On 24/11/17 15:10, Erik Österlund wrote:
>> On 2017-11-24 15:39, Andrew Dinn wrote:
>>> The one detail I am still not sure of is how this design ensures that
>>> the benign races in JIT/interpreter ever come to an end. What guarantees
>>> that JITted code or the interpreter actually performs an acquiring load
>>> and thereby notices a change to the armed poll value? That might take a
>>> very long time (if it happens at all?).
>> The JMM defines opaque stores as stores that among other properties
>> guarantee progress: the store will eventually become observable to other
>> threads. The stores that arm the thread-local polls use a
>> release_store(), which is strictly stronger than opaque on each platform
>> (and hence they are guaranteed to "eventually" become
>> observable to opaque loads). Therefore, the fast paths in JIT-compiled
>> code and interpreter code are guaranteed to "eventually" observe those
>> stores with opaque loads (and ldr is an opaque load). When they do, they
>> will jump into the VM (directly or through a trampoline), and perform
>> the acquiring load on the thread-local poll, used before loading the
>> global state.
>>
>> As for how long it takes for the stores to eventually become
>> observable by remote loads, I imagine it should be no worse than the
>> delay of the TLB shootdown event. And just for completeness for this
>> discussion: the arming of the local polls is followed by a fence() which
>> performs a dmb ish in your GCC intrinsics.
> Well, that's really what I was thinking about with the original
> question. I understand that writes will become visible "eventually". I'm
> just concerned to know what sort of upper bound is implied by that word
> (theoretical might be interesting but I'm more interested in practical).
> This is the mechanism intended to halt individual threads or all
> threads. If there is any possibility of some extended delay --
> especially in the latter case -- then that would merit quantifying.
>
> I assume "no worse than the delay of the TLB shootdown event" means "TLB
> shootdown relies on the same sort of benign race"? If so that doesn't
> help quantify anything except to say that prior experience has not shown
> any significant delay (so far as we know).
Naturally, the actual delay for a single store to make it through the store
buffer and then the cache coherency protocol, and eventually show up in a
load on another CPU, depends on a bunch of things that are bound to look
different on every machine. I can't say I know what typical delays look
like on typical AArch64 machines, but I would be *very* surprised if they
got anywhere close to millisecond levels, which is typically when a delay
starts being noticeable at all to a safepointing operation (compared to the
time spent on the crazy things we do in safepoints). My gut feeling is that
it should be comparable to the time it takes to perform a dmb sy (which by
contract makes preceding stores globally observable in the whole system).
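Just to make the ordering concrete, here is a minimal sketch of the scheme
as described above, written with std::atomic rather than HotSpot's
Atomic/OrderAccess API; the names (thread_local_poll, global_state and the
functions) are made up for illustration, and a relaxed load is used as a
rough stand-in for the opaque ldr in the fast path:

    #include <atomic>

    // Illustrative stand-ins for the per-thread poll word and the global
    // safepoint state; the names and the use of std::atomic are mine,
    // not the actual VM code.
    std::atomic<int> thread_local_poll{0};  // 0 = disarmed, 1 = armed
    std::atomic<int> global_state{0};

    // Arming side (VM thread): a release store to the per-thread poll,
    // followed by a full fence (dmb ish in the GCC intrinsics on AArch64).
    void arm_poll() {
      thread_local_poll.store(1, std::memory_order_release);
      std::atomic_thread_fence(std::memory_order_seq_cst);
    }

    // Fast path in JIT-compiled/interpreter code: a plain ldr, i.e. an
    // opaque load (modelled here as relaxed), which is only guaranteed to
    // observe the armed value "eventually".
    bool poll_is_armed() {
      return thread_local_poll.load(std::memory_order_relaxed) != 0;
    }

    // Slow path, entered once the fast path has seen the armed poll: the
    // acquiring load of the thread-local poll orders the subsequent load
    // of the global state.
    int read_global_state() {
      (void) thread_local_poll.load(std::memory_order_acquire);
      return global_state.load(std::memory_order_relaxed);
    }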
Based on what Andrew had observed, the delays appear to have been dominated
by the sparse polling of the interpreter, which suggests to me that
propagating the state change is not a large issue in practice.
Thanks,
/Erik
> regards,
>
>
> Andrew Dinn
> -----------
> Senior Principal Software Engineer
> Red Hat UK Ltd
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander