RFR(S): 7147724: G1: hang in SurrogateLockerThread::manipulatePLL

Mon Mar 19 21:47:36 UTC 2012

Hi all,

Quick heads-up: I chatted with John about this earlier today. Here are 
the conclusions of our discussion.

It is possible for a VM op / safepoint to happen while a mutator thread 
is waiting on the SecondaryFreeList_lock and the wait operation does the 
safepoint check. However, this can only happen with a non-GC VM op, as 
GC VM ops first take the Heap_lock before proceeding and said mutator 
thread is already holding the Heap_lock when it starts waiting on the 
SecondaryFreeList_lock (i.e., and GC VM op is excluded during that 
critical section). We think that the critical section is generally OK to 
be interrupted by a non-GC VM op (since such VM ops should not 
manipulate the same data structures the mutator thread is manipulating). 
However, we are a little worried about this and we think it will be 
error-prone in the future (and such bugs can be very hard to diagnose). 
Note also that it was a non-GC VM op that was causing the deadlock in 
the scenarios John came across.

The deadlock that John tracked down involves no less than four threads 
(!!!). We think that another way to break it is to ensure that the 
cleanup VM op does not take the pending list lock (PLL). The PLL is only 
needed if the pause does ref processing. We need it for evacuation 
pauses, Full GCs, and remark pauses (as we process references during all 
of those) but not during cleanup pauses. We (OK, let's be honest here: 
it was my suggestion) did this "to be safe" but unfortunately it turned 
out to be one of the causes of the deadlock. We think that not taking 
the PLL during cleanup pauses will resolve the issue. Also, it's simpler 
to ensure that we don't do ref processing during the cleanup pause (and 
we're planning to eliminate the cleanup pause anyway and do the cleanup 
operation entirely concurrently in the future) than to keep track of 
what data structures are manipulated during non-GC VM ops.

Thanks to John for describing the deadlock in more detail earlier and 
for his patience in trying to find a better solution to this problem!

Tony

On 03/16/2012 04:57 PM, Tony Printezis wrote:
> John,
>
> Hi. The changes related to the VM operations (i.e. the introduction 
> and use of the should_retry_gc flag) are fine.
>
> I have some concerns about taking the locks with the safepoint check 
> though.
>
> new_region_try_secondary_free_list()
>
> is called from:
> new_region()
>
> which (for mutator threads that are allocating) is called from:
> new_mutator_alloc_region()
>
> which in turn is called from the MutatorAllocRegion instance when the 
> current mutator alloc region needs to be replaced. This is done while 
> holding the Heap_lock (to make sure the operation is atomic) and I 
> don't think a safepoint should happen during it. So, will doing the 
> safepoint check potentially allow a safepoint during this atomic 
> operation?
>
> On the other hard, VM ops (the GC VM ops at least) take the Heap_lock 
> first before initiating the safepoint. So, doesn't this mean that 
> whether a mutator thread does or does not the safepoint check when 
> it's allocating a new region really does not matter given that a 
> safepoint cannot happen while it's holding the Heap_lock?
>
> I might be missing something here...
>
> Tony
>
> On 03/09/2012 06:54 PM, John Cuthbertson wrote:
>> Hi Bengt,
>>
>> Thanks for looking again. Replies inline....
>>
>> On 03/09/12 02:25, Bengt Rutisson wrote:
>>>
>>> Hi John,
>>>
>>> I think I like this change, but either I am confused or you forgot 
>>> to remove the "bool no_safepoint_check" from:
>>>
>>> G1CollectedHeap::humongous_obj_allocate_find_first()
>>> G1CollectedHeap::humongous_obj_allocate()
>>>
>>> Did you mean to remove it?
>>
>> Yes I did mean to remove it. Done.
>>
>>>
>>> Just a style comment on this piece of code:
>>>
>>>     bool safepoint_check_flag = Mutex::_no_safepoint_check_flag;
>>>     if (Thread::current()->is_Java_thread()) {
>>>       safepoint_check_flag = !Mutex::_no_safepoint_check_flag;
>>>     }
>>>
>>> I understand that we want to use Mutex::_no_safepoint_check_flag 
>>> since it is defined. But since there is no 
>>> Mutex::_safepoint_check_flag or similar I think we break the 
>>> abstraction by doing "safepoint_check_flag = 
>>> !Mutex::_no_safepoint_check_flag". This kind of exposes that we are 
>>> dealing with a bool. So, given this I think I would prefer to change 
>>> the code to just:
>>>
>>> safepoint_check_flag = Thread::current()->is_Java_thread();
>>
>> OK. Done, but with a small difference. For Java threads we wish to 
>> pass false so that we don't skip the safepoint check so I used "bool 
>> no_safepoint_check_flag = !Thread::current()->is_Java_thread();" and 
>> can then just pass in the value.
>>
>> New webrev: http://cr.openjdk.java.net/~johnc/7147724/webrev.2/
>>
>> Thanks,
>>
>> JohnC