RFR (XS): 8188877: Improper synchronization in offer_termination

Tue Oct 10 00:20:27 UTC 2017

Hi Kim,

Thanks for the review. More comments below...

> -----Original Message-----
> From: Kim Barrett [mailto:kim.barrett at oracle.com]
> Sent: Sunday, October 08, 2017 5:32 PM
> To: White, Derek <Derek.White at cavium.com>
> Cc: hotspot-gc-dev at openjdk.java.net
> Subject: Re: RFR (XS): 8188877: Improper synchronization in
> offer_termination
> 
> > On Oct 6, 2017, at 7:13 PM, White, Derek <Derek.White at cavium.com>
> wrote:
> >
> > Hi All,
> >
> > The field ParallelTaskTerminator::_offered_termination is modified
> atomically, but is read as a normal field access.
> >
> > In ParallelTaskTerminator::offer_termination(), this field is read within a
> loop, but there's nothing stopping a compiler from hoisting the load out of
> the loop. See CR for examples of possible load hoisting.
> >
> > Trivial fix: The _offered_termination field should be declared volatile.
> >
> > This was found by code inspection. I haven't seen an actual error, but
> compiler optimizations or unrelated code changes could make an error occur
> in the future.
> >
> > CR: https://bugs.openjdk.java.net/browse/JDK-8188877
> > Webrev: http://cr.openjdk.java.net/~drwhite/8188877/webrev.01/
> >
> > Minor build and testing done, but inspected the generated code for
> offer_termination() and found no differences.
> >
> > 	• Derek
> 
> ------------------------------------------------------------------------------
> src/hotspot/share/gc/shared/taskqueue.cpp
> 
> The change to make _offered_termination volatile prevents the compiler
> from hoisting it out of the loop, but I'm not sure it is enough to prevent
> unnecessary delays in recognizing the termination condition.
> In particular, while offer_termination is spinning, it's not clear there's
> anything to ensure an update by some other thread will become visible
> during the repeated spin cycles.  yield and sleep seem likely to involve fences
> that will make such updates visible, but not so much for SpinPause.

[Derek] 
The concurrent access around _offered_termination is (semi*)-independent of other data accesses. So I'm not sure if more ordering constraints will help much here?

The atomic incr/dec operations provide a fence to publish the change. On aarch64 this is a DMB instruction. A full sync (DSB) will also publish the change, but blocks until the publish has been acked. But it doesn't publish any faster.

The only extra thing we could do is read _offered_termination using OrderAccess::load_acquire(). I think the only thing this adds in this case is ordering between the loads of _offered_termination in different  loop iterations. Is this required or helpful? 

If so, we should probably encapsulate all accesses to _offered_termination and provide correct  access order constraints.

*SEMI_DEPENDENT DATA ACCESSES
Ok, offered_termination accesses are related to TaskQueue operations, esp. peek(). Those operations seem to involve volatile accesses also, and seem safe to me, but I didn't try to understand the whole implementation of TaskQueues.

> I also noticed the SpinPause implementations for linux_arm(64) and
> linux_aarch64 are different, with the latter perhaps being lacking.

[Derek] 
Hah! That's why I was looking at offer_termination() in the first place*.  There is more discussion about what to do with SpinPause on aarch64 than you'd first expect.

*And in the second place, offer_termination() shows up in the top 10 on perf, when benchmarking hadoop in a default GC config: 63+ GC threads don't have much to do when there's a only a 4 GB heap.

> ------------------------------------------------------------------------------
> 
> In addition, some pre-existing issues:
> 
> ------------------------------------------------------------------------------
> src/hotspot/share/gc/shared/taskqueue.cpp
> 
> [pre-existing]
> The casts in ParallelTaskTerminator::offer_termination's
> Atomic::inc/dec of _offered_termination are no longer needed.

[Derek] 
OK.

> ------------------------------------------------------------------------------
> src/hotspot/share/gc/shared/taskqueue.cpp
> 
> [pre-existing]
> In ParallelTaskTerminator::offer_termination, a reference to
> WorkStealingHardSpins is cast to uint.  The only reason for that cast is that
> the variable is declared uintx, when it really ought to be uint.  Probably all
> three of the WorkStealingMUMBLE experimental parameters ought to be of
> type uint rather than uintx.

[Derek] 
OK

> ------------------------------------------------------------------------------
> src/hotspot/share/gc/shared/taskqueue.cpp
> 
> [pre-existing]
> ParallelTaskTerminator::offer_termination computes the hard_spin_limit as
>   hard_spin_limit = WorkStealingHardSpins >> WorkStealingSpinToYieldRatio;
>   hard_spin_limit = MAX2(hard_spin_limit, 1U);
> 
> The use of a left shift will result in undefined behavior for seemingly
> reasonable non-default choices for the ratio parameter, e.g. any value >=
> number of bits in type of WorkStealingHardSpins.
> 
> Perhaps this should be addressed by limiting the permitted range for the ratio
> parameter.

[Derek] 
OK.

> ------------------------------------------------------------------------------

[Derek] I'll get a new webrev out tomorrow.