RFR (XS): 8188877: Improper synchronization in offer_termination

Kim Barrett kim.barrett at oracle.com
Tue Oct 10 01:44:37 UTC 2017


> On Oct 9, 2017, at 8:20 PM, White, Derek <Derek.White at cavium.com> wrote:
>> src/hotspot/share/gc/shared/taskqueue.cpp
>> 
>> The change to make _offered_termination volatile prevents the compiler
>> from hoisting it out of the loop, but I'm not sure it is enough to prevent
>> unnecessary delays in recognizing the termination condition.
>> In particular, while offer_termination is spinning, it's not clear that
>> anything ensures an update by some other thread becomes visible during
>> the repeated spin cycles.  yield and sleep seem likely to involve fences
>> that will make such updates visible, but not so much SpinPause.
> 
> [Derek] 
> The concurrent access to _offered_termination is (semi-)independent of other data accesses, so I'm not sure that more ordering constraints will help much here.
> 
> The atomic incr/dec operations provide a fence that publishes the change; on aarch64 this is a DMB instruction. A full sync (DSB) would also publish the change, but it blocks until the publish has been acknowledged, and it doesn't publish any faster.
> 
> The only extra thing we could do is read _offered_termination using OrderAccess::load_acquire(). I think the only thing that adds in this case is ordering between the loads of _offered_termination in different loop iterations. Is this required or helpful?

Consider the case where there is no further work to do, and all
threads are heading toward offer_termination.  Until the last thread
makes the offer, the others will be in the spin1...spinN/yield/sleep
cycle, as expected.  But because there appear to be no memory barriers
in the spin part of that cycle, a thread that is still spinning might not
see the final offer until it completes the full spin cycle and then
performs the yield (assuming yield does have sufficient memory
barriers to acquire the published value).

So I think the read of _offered_termination for comparison with
_n_threads probably ought to be an acquire.
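
To make that concrete, here is a minimal standalone sketch of the
acquire-load idea (using std::atomic and hypothetical names in place of
HotSpot's volatile field and OrderAccess machinery; a sketch, not the
actual patch):

  #include <atomic>

  // Hypothetical stand-ins for _offered_termination and _n_threads.
  std::atomic<unsigned> offered_termination{0};
  const unsigned n_threads = 4;

  bool offer_termination_sketch() {
    // Publish this thread's offer; the atomic RMW is the release side.
    offered_termination.fetch_add(1u, std::memory_order_acq_rel);
    for (;;) {
      // Acquire load inside the spin loop: a final offer published by
      // another thread can be observed here, without relying on fences
      // buried in a later yield()/sleep().
      if (offered_termination.load(std::memory_order_acquire) == n_threads) {
        return true;  // everyone has offered; terminate
      }
      // ... SpinPause()/yield()/sleep() back-off elided ...
    }
  }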

>> […]
>> 
>> src/hotspot/share/gc/shared/taskqueue.cpp
>> 
>> [pre-existing]
>> ParallelTaskTerminator::offer_termination computes the hard_spin_limit as
>>  hard_spin_limit = WorkStealingHardSpins >> WorkStealingSpinToYieldRatio;
>>  hard_spin_limit = MAX2(hard_spin_limit, 1U);
>> 
>> The use of a right shift will result in undefined behavior for seemingly
>> reasonable non-default choices of the ratio parameter, e.g. any value >=
>> the number of bits in the type of WorkStealingHardSpins.
>> 
>> Perhaps this should be addressed by limiting the permitted range for the ratio
>> parameter.

Thinking about this some more, WorkStealingSpinToYieldRatio seems
*really* poorly named; that hardly looks like a ratio at all.
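
As a standalone illustration of the range-limiting suggestion above
(hypothetical names and values, not a proposed patch), the point is just
to keep the shift amount strictly below the bit width of the shifted
type:

  #include <climits>

  // Stand-ins for the real flags.  Shifting an unsigned value by an
  // amount >= its bit width is undefined behavior in C++, so the
  // "ratio" needs to be constrained.
  const unsigned kHardSpins = 4096;
  const unsigned kBitsInType = sizeof(kHardSpins) * CHAR_BIT;

  unsigned compute_hard_spin_limit(unsigned spin_to_yield_ratio) {
    // A flag range constraint would reject out-of-range values at VM
    // startup; clamping here keeps the sketch self-contained.
    unsigned shift = spin_to_yield_ratio < kBitsInType ? spin_to_yield_ratio
                                                       : kBitsInType - 1;
    unsigned limit = kHardSpins >> shift;
    return limit > 1 ? limit : 1;  // mirrors MAX2(hard_spin_limit, 1U)
  }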



