RFR(S): 7147724: G1: hang in SurrogateLockerThread::manipulatePLL
John Cuthbertson
john.cuthbertson at oracle.com
Tue Mar 20 16:52:42 UTC 2012
Hi Everyone,
Here's a new webrev based upon the comments below by Tony:
http://cr.openjdk.java.net/~johnc/7147724/webrev.3/
Testing: The failing test case in a loop overnight (the deadlock
normally happens after an hour or so); the GC test suite with a low
marking threshold.
Thanks,
JohnC
On 03/19/12 14:47, Tony Printezis wrote:
> Hi all,
>
> Quick heads-up: I chatted with John about this earlier today. Here are
> the conclusions of our discussion.
>
> It is possible for a VM op / safepoint to happen while a mutator
> thread is waiting on the SecondaryFreeList_lock and the wait operation
> does the safepoint check. However, this can only happen with a non-GC
> VM op, as GC VM ops first take the Heap_lock before proceeding and
> said mutator thread is already holding the Heap_lock when it starts
> waiting on the SecondaryFreeList_lock (i.e., and GC VM op is excluded
> during that critical section). We think that the critical section is
> generally OK to be interrupted by a non-GC VM op (since such VM ops
> should not manipulate the same data structures the mutator thread is
> manipulating). However, we are a little worried about this and we
> think it will be error-prone in the future (and such bugs can be very
> hard to diagnose). Note also that it was a non-GC VM op that was
> causing the deadlock in the scenarios John came across.
>
> The deadlock that John tracked down involves no less than four threads
> (!!!). We think that another way to break it is to ensure that the
> cleanup VM op does not take the pending list lock (PLL). The PLL is
> only needed if the pause does ref processing. We need it for
> evacuation pauses, Full GCs, and remark pauses (as we process
> references during all of those) but not during cleanup pauses. We (OK,
> let's be honest here: it was my suggestion) did this "to be safe" but
> unfortunately it turned out to be one of the causes of the deadlock.
> We think that not taking the PLL during cleanup pauses will resolve
> the issue. Also, it's simpler to ensure that we don't do ref
> processing during the cleanup pause (and we're planning to eliminate
> the cleanup pause anyway and do the cleanup operation entirely
> concurrently in the future) than to keep track of what data structures
> are manipulated during non-GC VM ops.
>
> Thanks to John for describing the deadlock in more detail earlier and
> for his patience in trying to find a better solution to this problem!
>
> Tony
>
> On 03/16/2012 04:57 PM, Tony Printezis wrote:
>> John,
>>
>> Hi. The changes related to the VM operations (i.e. the introduction
>> and use of the should_retry_gc flag) are fine.
>>
>> I have some concerns about taking the locks with the safepoint check
>> though.
>>
>> new_region_try_secondary_free_list()
>>
>> is called from:
>> new_region()
>>
>> which (for mutator threads that are allocating) is called from:
>> new_mutator_alloc_region()
>>
>> which in turn is called from the MutatorAllocRegion instance when the
>> current mutator alloc region needs to be replaced. This is done while
>> holding the Heap_lock (to make sure the operation is atomic) and I
>> don't think a safepoint should happen during it. So, will doing the
>> safepoint check potentially allow a safepoint during this atomic
>> operation?
>>
>> On the other hard, VM ops (the GC VM ops at least) take the Heap_lock
>> first before initiating the safepoint. So, doesn't this mean that
>> whether a mutator thread does or does not the safepoint check when
>> it's allocating a new region really does not matter given that a
>> safepoint cannot happen while it's holding the Heap_lock?
>>
>> I might be missing something here...
>>
>> Tony
>>
>> On 03/09/2012 06:54 PM, John Cuthbertson wrote:
>>> Hi Bengt,
>>>
>>> Thanks for looking again. Replies inline....
>>>
>>> On 03/09/12 02:25, Bengt Rutisson wrote:
>>>>
>>>> Hi John,
>>>>
>>>> I think I like this change, but either I am confused or you forgot
>>>> to remove the "bool no_safepoint_check" from:
>>>>
>>>> G1CollectedHeap::humongous_obj_allocate_find_first()
>>>> G1CollectedHeap::humongous_obj_allocate()
>>>>
>>>> Did you mean to remove it?
>>>
>>> Yes I did mean to remove it. Done.
>>>
>>>>
>>>> Just a style comment on this piece of code:
>>>>
>>>> bool safepoint_check_flag = Mutex::_no_safepoint_check_flag;
>>>> if (Thread::current()->is_Java_thread()) {
>>>> safepoint_check_flag = !Mutex::_no_safepoint_check_flag;
>>>> }
>>>>
>>>> I understand that we want to use Mutex::_no_safepoint_check_flag
>>>> since it is defined. But since there is no
>>>> Mutex::_safepoint_check_flag or similar I think we break the
>>>> abstraction by doing "safepoint_check_flag =
>>>> !Mutex::_no_safepoint_check_flag". This kind of exposes that we are
>>>> dealing with a bool. So, given this I think I would prefer to
>>>> change the code to just:
>>>>
>>>> safepoint_check_flag = Thread::current()->is_Java_thread();
>>>
>>> OK. Done, but with a small difference. For Java threads we wish to
>>> pass false so that we don't skip the safepoint check so I used "bool
>>> no_safepoint_check_flag = !Thread::current()->is_Java_thread();" and
>>> can then just pass in the value.
>>>
>>> New webrev: http://cr.openjdk.java.net/~johnc/7147724/webrev.2/
>>>
>>> Thanks,
>>>
>>> JohnC
More information about the hotspot-gc-dev
mailing list