RFR(S): 7147724: G1: hang in SurrogateLockerThread::manipulatePLL

Tue Mar 20 18:35:10 UTC 2012

John,

The new version looks good, thanks for making the changes!!!

Tony

On 03/20/2012 12:52 PM, John Cuthbertson wrote:
> Hi Everyone,
>
> Here's a new webrev based upon the comments below by Tony: 
> http://cr.openjdk.java.net/~johnc/7147724/webrev.3/
>
> Testing: The failing test case in a loop overnight (the deadlock 
> normally happens after an hour or so); the GC test suite with a low 
> marking threshold.
>
> Thanks,
>
> JohnC
>
> On 03/19/12 14:47, Tony Printezis wrote:
>> Hi all,
>>
>> Quick heads-up: I chatted with John about this earlier today. Here 
>> are the conclusions of our discussion.
>>
>> It is possible for a VM op / safepoint to happen while a mutator 
>> thread is waiting on the SecondaryFreeList_lock and the wait 
>> operation does the safepoint check. However, this can only happen 
>> with a non-GC VM op, as GC VM ops first take the Heap_lock before 
>> proceeding and said mutator thread is already holding the Heap_lock 
>> when it starts waiting on the SecondaryFreeList_lock (i.e., and GC VM 
>> op is excluded during that critical section). We think that the 
>> critical section is generally OK to be interrupted by a non-GC VM op 
>> (since such VM ops should not manipulate the same data structures the 
>> mutator thread is manipulating). However, we are a little worried 
>> about this and we think it will be error-prone in the future (and 
>> such bugs can be very hard to diagnose). Note also that it was a 
>> non-GC VM op that was causing the deadlock in the scenarios John came 
>> across.
>>
>> The deadlock that John tracked down involves no less than four 
>> threads (!!!). We think that another way to break it is to ensure 
>> that the cleanup VM op does not take the pending list lock (PLL). The 
>> PLL is only needed if the pause does ref processing. We need it for 
>> evacuation pauses, Full GCs, and remark pauses (as we process 
>> references during all of those) but not during cleanup pauses. We 
>> (OK, let's be honest here: it was my suggestion) did this "to be 
>> safe" but unfortunately it turned out to be one of the causes of the 
>> deadlock. We think that not taking the PLL during cleanup pauses will 
>> resolve the issue. Also, it's simpler to ensure that we don't do ref 
>> processing during the cleanup pause (and we're planning to eliminate 
>> the cleanup pause anyway and do the cleanup operation entirely 
>> concurrently in the future) than to keep track of what data 
>> structures are manipulated during non-GC VM ops.
>>
>> Thanks to John for describing the deadlock in more detail earlier and 
>> for his patience in trying to find a better solution to this problem!
>>
>> Tony
>>
>> On 03/16/2012 04:57 PM, Tony Printezis wrote:
>>> John,
>>>
>>> Hi. The changes related to the VM operations (i.e. the introduction 
>>> and use of the should_retry_gc flag) are fine.
>>>
>>> I have some concerns about taking the locks with the safepoint check 
>>> though.
>>>
>>> new_region_try_secondary_free_list()
>>>
>>> is called from:
>>> new_region()
>>>
>>> which (for mutator threads that are allocating) is called from:
>>> new_mutator_alloc_region()
>>>
>>> which in turn is called from the MutatorAllocRegion instance when 
>>> the current mutator alloc region needs to be replaced. This is done 
>>> while holding the Heap_lock (to make sure the operation is atomic) 
>>> and I don't think a safepoint should happen during it. So, will 
>>> doing the safepoint check potentially allow a safepoint during this 
>>> atomic operation?
>>>
>>> On the other hard, VM ops (the GC VM ops at least) take the 
>>> Heap_lock first before initiating the safepoint. So, doesn't this 
>>> mean that whether a mutator thread does or does not the safepoint 
>>> check when it's allocating a new region really does not matter given 
>>> that a safepoint cannot happen while it's holding the Heap_lock?
>>>
>>> I might be missing something here...
>>>
>>> Tony
>>>
>>> On 03/09/2012 06:54 PM, John Cuthbertson wrote:
>>>> Hi Bengt,
>>>>
>>>> Thanks for looking again. Replies inline....
>>>>
>>>> On 03/09/12 02:25, Bengt Rutisson wrote:
>>>>>
>>>>> Hi John,
>>>>>
>>>>> I think I like this change, but either I am confused or you forgot 
>>>>> to remove the "bool no_safepoint_check" from:
>>>>>
>>>>> G1CollectedHeap::humongous_obj_allocate_find_first()
>>>>> G1CollectedHeap::humongous_obj_allocate()
>>>>>
>>>>> Did you mean to remove it?
>>>>
>>>> Yes I did mean to remove it. Done.
>>>>
>>>>>
>>>>> Just a style comment on this piece of code:
>>>>>
>>>>>     bool safepoint_check_flag = Mutex::_no_safepoint_check_flag;
>>>>>     if (Thread::current()->is_Java_thread()) {
>>>>>       safepoint_check_flag = !Mutex::_no_safepoint_check_flag;
>>>>>     }
>>>>>
>>>>> I understand that we want to use Mutex::_no_safepoint_check_flag 
>>>>> since it is defined. But since there is no 
>>>>> Mutex::_safepoint_check_flag or similar I think we break the 
>>>>> abstraction by doing "safepoint_check_flag = 
>>>>> !Mutex::_no_safepoint_check_flag". This kind of exposes that we 
>>>>> are dealing with a bool. So, given this I think I would prefer to 
>>>>> change the code to just:
>>>>>
>>>>> safepoint_check_flag = Thread::current()->is_Java_thread();
>>>>
>>>> OK. Done, but with a small difference. For Java threads we wish to 
>>>> pass false so that we don't skip the safepoint check so I used 
>>>> "bool no_safepoint_check_flag = 
>>>> !Thread::current()->is_Java_thread();" and can then just pass in 
>>>> the value.
>>>>
>>>> New webrev: http://cr.openjdk.java.net/~johnc/7147724/webrev.2/
>>>>
>>>> Thanks,
>>>>
>>>> JohnC
>