RFR(S): 8040803: G1: Concurrent mark hangs when mark stack overflows

Mon May 5 11:57:52 UTC 2014

Hi Jon,

On 04/30/2014 07:52 PM, Jon Masamitsu wrote:
> Per,
>
> Adding a new flag sometimes is like adding a new degree
> of freedom and sometimes can make a complicated situation
> more complicated.
>
> Before I review this, can you help me understand the
> problem? Is the window for the race condition this
> code in do_marking_step()?
>
>    if (_cm->has_overflown()) {
>      // This can happen if the mark stack overflows during a GC pause
>      // and this task, after a yield point, restarts. We have to abort
>      // as we need to get into the overflow protocol which happens
>      // right at the end of this task.
>      set_has_aborted();
>    }
>
> The window being between the time _has_overflown is set and when
> _has_aborted is set?

The race is between checking _cm->has_overflown() and checking
_cm->has_aborted(). Both are checked in a few places during marking
(typically in regular_clock_call() and a few other places). Since this
code is executed by several threads in parallel, without
synchronization, different threads can see one state or the other first,
depending on where a particular thread happens to be executing when the
abort and the overflow happen.

Note that the set_has_aborted() in the code above sets the CMTask-local
abort state, which is not part of the race here. _cm->has_aborted()
reports the global abort state, which is set when a Full GC happens.
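
To make the hang and the direction of the fix concrete, here is a minimal
standalone sketch of a barrier with an abort state, in the spirit of the
summary quoted below. It is not the WorkGangBarrierSync code from the
webrev: the class AbortableBarrier, its methods, and the helper
waiter_released_by_abort() are hypothetical names, it uses the C++ standard
library rather than HotSpot's Monitor, and it is single-use (no reset).

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a barrier with an abort state.
// Assumption: single-use, no reset between rounds.
class AbortableBarrier {
 public:
  explicit AbortableBarrier(int n_workers) : _n_workers(n_workers) {}

  // Blocks until all workers have entered or abort() has been called.
  // Returns true if the barrier completed, false if it was aborted.
  bool enter() {
    std::unique_lock<std::mutex> lock(_mutex);
    if (_aborted) {
      return false;  // Abort already signalled, do not wait at all.
    }
    if (++_n_entered == _n_workers) {
      _completed = true;  // Last worker in releases everyone.
      _cv.notify_all();
      return true;
    }
    // Wake up on either completion or abort.
    _cv.wait(lock, [this] { return _completed || _aborted; });
    return _completed;
  }

  // Releases any waiting workers; later enter() calls return immediately.
  void abort() {
    std::lock_guard<std::mutex> lock(_mutex);
    _aborted = true;
    _cv.notify_all();
  }

 private:
  std::mutex _mutex;
  std::condition_variable _cv;
  const int _n_workers;
  int _n_entered = 0;
  bool _completed = false;
  bool _aborted = false;
};

// Usage sketch: the barrier expects two workers, but only one enters it
// (it saw the overflow first). The other worker saw the abort first and
// never arrives, so the aborting thread has to release the waiter.
bool waiter_released_by_abort() {
  AbortableBarrier barrier(2);
  bool completed = true;
  std::thread waiter([&] { completed = barrier.enter(); });
  barrier.abort();  // Without this call, join() below would block forever.
  waiter.join();
  return !completed;
}
```

Without the abort() call the waiter would block forever, which is the hang
described in the bug. A worker whose enter() returns false would then be
expected to stop marking rather than continue with the overflow protocol.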

/Per

>
> Jon
>
> On 4/30/2014 6:04 AM, Per Liden wrote:
>> Hi,
>>
>> Could I please have a couple of reviews of this bug fix:
>>
>> Summary: G1's concurrent marking can potentially hang forever if the
>> global mark stack overflows and immediately after that a Full GC
>> happens, which tries to abort the marking. The reason is that there's
>> a race between detecting the overflow situation and detecting the
>> abort signal. Threads detecting the overflow situation first will go
>> into the overflow protocol and wait on a barrier for all threads to
>> reach this state. However, threads detecting the abort signal first
>> will terminate and never participate in the barrier.
>>
>> This patch introduces an abort state and function on the
>> WorkGangBarrierSync class, to unblock any threads waiting for the
>> barrier to complete when the concurrent mark is aborted.
>>
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8040803
>> Webrev: http://cr.openjdk.java.net/~pliden/8040803/webrev.0/
>>
>> /Per
>