RFR(S): 7147724: G1: hang in SurrogateLockerThread::manipulatePLL

Tue Mar 20 16:52:42 UTC 2012

Hi Everyone,

Here's a new webrev based upon the comments below by Tony: 
http://cr.openjdk.java.net/~johnc/7147724/webrev.3/

Testing: The failing test case in a loop overnight (the deadlock 
normally happens after an hour or so); the GC test suite with a low 
marking threshold.

Thanks,

JohnC

On 03/19/12 14:47, Tony Printezis wrote:
> Hi all,
>
> Quick heads-up: I chatted with John about this earlier today. Here are 
> the conclusions of our discussion.
>
> It is possible for a VM op / safepoint to happen while a mutator 
> thread is waiting on the SecondaryFreeList_lock and the wait operation 
> does the safepoint check. However, this can only happen with a non-GC 
> VM op, as GC VM ops first take the Heap_lock before proceeding and 
> said mutator thread is already holding the Heap_lock when it starts 
> waiting on the SecondaryFreeList_lock (i.e., and GC VM op is excluded 
> during that critical section). We think that the critical section is 
> generally OK to be interrupted by a non-GC VM op (since such VM ops 
> should not manipulate the same data structures the mutator thread is 
> manipulating). However, we are a little worried about this and we 
> think it will be error-prone in the future (and such bugs can be very 
> hard to diagnose). Note also that it was a non-GC VM op that was 
> causing the deadlock in the scenarios John came across.
>
> The deadlock that John tracked down involves no less than four threads 
> (!!!). We think that another way to break it is to ensure that the 
> cleanup VM op does not take the pending list lock (PLL). The PLL is 
> only needed if the pause does ref processing. We need it for 
> evacuation pauses, Full GCs, and remark pauses (as we process 
> references during all of those) but not during cleanup pauses. We (OK, 
> let's be honest here: it was my suggestion) did this "to be safe" but 
> unfortunately it turned out to be one of the causes of the deadlock. 
> We think that not taking the PLL during cleanup pauses will resolve 
> the issue. Also, it's simpler to ensure that we don't do ref 
> processing during the cleanup pause (and we're planning to eliminate 
> the cleanup pause anyway and do the cleanup operation entirely 
> concurrently in the future) than to keep track of what data structures 
> are manipulated during non-GC VM ops.
>
> Thanks to John for describing the deadlock in more detail earlier and 
> for his patience in trying to find a better solution to this problem!
>
> Tony
>
> On 03/16/2012 04:57 PM, Tony Printezis wrote:
>> John,
>>
>> Hi. The changes related to the VM operations (i.e. the introduction 
>> and use of the should_retry_gc flag) are fine.
>>
>> I have some concerns about taking the locks with the safepoint check 
>> though.
>>
>> new_region_try_secondary_free_list()
>>
>> is called from:
>> new_region()
>>
>> which (for mutator threads that are allocating) is called from:
>> new_mutator_alloc_region()
>>
>> which in turn is called from the MutatorAllocRegion instance when the 
>> current mutator alloc region needs to be replaced. This is done while 
>> holding the Heap_lock (to make sure the operation is atomic) and I 
>> don't think a safepoint should happen during it. So, will doing the 
>> safepoint check potentially allow a safepoint during this atomic 
>> operation?
>>
>> On the other hard, VM ops (the GC VM ops at least) take the Heap_lock 
>> first before initiating the safepoint. So, doesn't this mean that 
>> whether a mutator thread does or does not the safepoint check when 
>> it's allocating a new region really does not matter given that a 
>> safepoint cannot happen while it's holding the Heap_lock?
>>
>> I might be missing something here...
>>
>> Tony
>>
>> On 03/09/2012 06:54 PM, John Cuthbertson wrote:
>>> Hi Bengt,
>>>
>>> Thanks for looking again. Replies inline....
>>>
>>> On 03/09/12 02:25, Bengt Rutisson wrote:
>>>>
>>>> Hi John,
>>>>
>>>> I think I like this change, but either I am confused or you forgot 
>>>> to remove the "bool no_safepoint_check" from:
>>>>
>>>> G1CollectedHeap::humongous_obj_allocate_find_first()
>>>> G1CollectedHeap::humongous_obj_allocate()
>>>>
>>>> Did you mean to remove it?
>>>
>>> Yes I did mean to remove it. Done.
>>>
>>>>
>>>> Just a style comment on this piece of code:
>>>>
>>>>     bool safepoint_check_flag = Mutex::_no_safepoint_check_flag;
>>>>     if (Thread::current()->is_Java_thread()) {
>>>>       safepoint_check_flag = !Mutex::_no_safepoint_check_flag;
>>>>     }
>>>>
>>>> I understand that we want to use Mutex::_no_safepoint_check_flag 
>>>> since it is defined. But since there is no 
>>>> Mutex::_safepoint_check_flag or similar I think we break the 
>>>> abstraction by doing "safepoint_check_flag = 
>>>> !Mutex::_no_safepoint_check_flag". This kind of exposes that we are 
>>>> dealing with a bool. So, given this I think I would prefer to 
>>>> change the code to just:
>>>>
>>>> safepoint_check_flag = Thread::current()->is_Java_thread();
>>>
>>> OK. Done, but with a small difference. For Java threads we wish to 
>>> pass false so that we don't skip the safepoint check so I used "bool 
>>> no_safepoint_check_flag = !Thread::current()->is_Java_thread();" and 
>>> can then just pass in the value.
>>>
>>> New webrev: http://cr.openjdk.java.net/~johnc/7147724/webrev.2/
>>>
>>> Thanks,
>>>
>>> JohnC