RFR (XS): 8188877: Improper synchronization in offer_termination

Tue Nov 21 21:53:02 UTC 2017

Hi Kim, Thomas,

So Thomas, I think I agree with my original self too 😊.

Kim's concern:
> Consider the case where there is no further work to do, and all 
> threads are heading toward offer_termination.  Until the last thread 
> makes the offer, the others will be in the spin1...spinN/yield/sleep 
> cycle, as expected.  But because there appear to be no memory barriers 
> in the spin part of that cycle, a thread in the spin part might not 
> see the final offer until it completes the full spin cycle and then 
> performs the yield (assuming yield does have sufficient memory 
> barriers to acquire the published value).

My understanding is that the "acquire" semantics are entirely about memory ordering, within a CPU. In particular it prevents "following loads" from executing before the "load acquire". 

There is nothing in the "load acquire" that causes it to synchronize with the memory system more or less quickly than a naked load. Either kind of load will eventually notice that its local cached value has been invalidated and will fetch the updated value. 

In the context of a spin-loop, there could be some large number of loads (from the same variable), comparisons, and branches in flight. Without a load-acquire, it's not clear which load will see the final value, but it doesn't matter. As long as one iteration of the loop sees that value, the loop will terminate.

BTW, I have seen several explanations of what the Intel PAUSE instruction does, in its earliest forms it seemed to be about avoiding having to clean up either: mis-predicted instructions in flight, or carefully ordering the memory loads. I might be misremembering and I can't find my sources for this though.

 - Derek

> -----Original Message-----
> From: Thomas Schatzl [mailto:thomas.schatzl at oracle.com]
> Sent: Thursday, November 16, 2017 3:42 AM
> To: Kim Barrett <kim.barrett at oracle.com>; White, Derek
> <Derek.White at cavium.com>
> Cc: hotspot-gc-dev at openjdk.java.net
> Subject: Re: RFR (XS): 8188877: Improper synchronization in
> offer_termination
> 
> On Mon, 2017-10-09 at 21:44 -0400, Kim Barrett wrote:
> > > On Oct 9, 2017, at 8:20 PM, White, Derek <Derek.White at cavium.com>
> > > wrote:
> > > > src/hotspot/share/gc/shared/taskqueue.cpp
> > > >
> > > > The change to make _offered_termination volatile prevents the
> > > > compiler from hoisting it out of the loop, but I'm not sure it is
> > > > enough to prevent unnecessary delays in recognizing the
> > > > termination condition.
> > > > In particular, while offer_termination is spinning, it's not clear
> > > > there's anything to ensure an update by some other thread will
> > > > become visible during the repeated spin cycles.  yield and sleep
> > > > seem likely to involve fences that will make such updates visible,
> > > > but not so much for SpinPause.
> > >
> > > [Derek]
> > > The concurrent access around _offered_termination is (semi*)-
> > > independent of other data accesses. So I'm not sure if more ordering
> > > constraints will help much here?
> > >
> > > The atomic incr/dec operations provide a fence to publish the
> > > change. On aarch64 this is a DMB instruction. A full sync (DSB) will
> > > also publish the change, but blocks until the publish has been
> > > acked. But it doesn't publish any faster.
> > >
> > > The only extra thing we could do is read _offered_termination using
> > > OrderAccess::load_acquire(). I think the only thing this adds in
> > > this case is ordering between the loads of _offered_termination in
> > > different  loop iterations. Is this required or helpful?
> >
> > Consider the case where there is no further work to do, and all
> > threads are heading toward offer_termination.  Until the last thread
> > makes the offer, the others will be in the spin1...spinN/yield/sleep
> > cycle, as expected.  But because there appear to be no memory barriers
> > in the spin part of that cycle, a thread in the spin part might not
> > see the final offer until it completes the full spin cycle and then
> > performs the yield (assuming yield does have sufficient memory
> > barriers to acquire the published value).
> >
> > So I think the read of _offered_termination for comparison with
> > _n_threads probably ought to be an acquire.
> 
>   I agree with Derek that adding an acquire for the load of
> _offered_termination does not seem to improve termination properties, and
> is imho unnecessary (but the volatile declaration of course is).
> 
> Thanks,
>   Thomas