RFR(S): 7147724: G1: hang in SurrogateLockerThread::manipulatePLL
Tony Printezis
tony.printezis at oracle.com
Tue Mar 20 18:35:10 UTC 2012
John,
The new version looks good, thanks for making the changes!!!
Tony
On 03/20/2012 12:52 PM, John Cuthbertson wrote:
> Hi Everyone,
>
> Here's a new webrev based upon the comments below by Tony:
> http://cr.openjdk.java.net/~johnc/7147724/webrev.3/
>
> Testing: The failing test case in a loop overnight (the deadlock
> normally happens after an hour or so); the GC test suite with a low
> marking threshold.
>
> Thanks,
>
> JohnC
>
> On 03/19/12 14:47, Tony Printezis wrote:
>> Hi all,
>>
>> Quick heads-up: I chatted with John about this earlier today. Here
>> are the conclusions of our discussion.
>>
>> It is possible for a VM op / safepoint to happen while a mutator
>> thread is waiting on the SecondaryFreeList_lock and the wait
>> operation does the safepoint check. However, this can only happen
>> with a non-GC VM op, as GC VM ops first take the Heap_lock before
>> proceeding and said mutator thread is already holding the Heap_lock
>> when it starts waiting on the SecondaryFreeList_lock (i.e., and GC VM
>> op is excluded during that critical section). We think that the
>> critical section is generally OK to be interrupted by a non-GC VM op
>> (since such VM ops should not manipulate the same data structures the
>> mutator thread is manipulating). However, we are a little worried
>> about this and we think it will be error-prone in the future (and
>> such bugs can be very hard to diagnose). Note also that it was a
>> non-GC VM op that was causing the deadlock in the scenarios John came
>> across.
>>
>> The deadlock that John tracked down involves no less than four
>> threads (!!!). We think that another way to break it is to ensure
>> that the cleanup VM op does not take the pending list lock (PLL). The
>> PLL is only needed if the pause does ref processing. We need it for
>> evacuation pauses, Full GCs, and remark pauses (as we process
>> references during all of those) but not during cleanup pauses. We
>> (OK, let's be honest here: it was my suggestion) did this "to be
>> safe" but unfortunately it turned out to be one of the causes of the
>> deadlock. We think that not taking the PLL during cleanup pauses will
>> resolve the issue. Also, it's simpler to ensure that we don't do ref
>> processing during the cleanup pause (and we're planning to eliminate
>> the cleanup pause anyway and do the cleanup operation entirely
>> concurrently in the future) than to keep track of what data
>> structures are manipulated during non-GC VM ops.
>>
>> Thanks to John for describing the deadlock in more detail earlier and
>> for his patience in trying to find a better solution to this problem!
>>
>> Tony
>>
>> On 03/16/2012 04:57 PM, Tony Printezis wrote:
>>> John,
>>>
>>> Hi. The changes related to the VM operations (i.e. the introduction
>>> and use of the should_retry_gc flag) are fine.
>>>
>>> I have some concerns about taking the locks with the safepoint check
>>> though.
>>>
>>> new_region_try_secondary_free_list()
>>>
>>> is called from:
>>> new_region()
>>>
>>> which (for mutator threads that are allocating) is called from:
>>> new_mutator_alloc_region()
>>>
>>> which in turn is called from the MutatorAllocRegion instance when
>>> the current mutator alloc region needs to be replaced. This is done
>>> while holding the Heap_lock (to make sure the operation is atomic)
>>> and I don't think a safepoint should happen during it. So, will
>>> doing the safepoint check potentially allow a safepoint during this
>>> atomic operation?
>>>
>>> On the other hard, VM ops (the GC VM ops at least) take the
>>> Heap_lock first before initiating the safepoint. So, doesn't this
>>> mean that whether a mutator thread does or does not the safepoint
>>> check when it's allocating a new region really does not matter given
>>> that a safepoint cannot happen while it's holding the Heap_lock?
>>>
>>> I might be missing something here...
>>>
>>> Tony
>>>
>>> On 03/09/2012 06:54 PM, John Cuthbertson wrote:
>>>> Hi Bengt,
>>>>
>>>> Thanks for looking again. Replies inline....
>>>>
>>>> On 03/09/12 02:25, Bengt Rutisson wrote:
>>>>>
>>>>> Hi John,
>>>>>
>>>>> I think I like this change, but either I am confused or you forgot
>>>>> to remove the "bool no_safepoint_check" from:
>>>>>
>>>>> G1CollectedHeap::humongous_obj_allocate_find_first()
>>>>> G1CollectedHeap::humongous_obj_allocate()
>>>>>
>>>>> Did you mean to remove it?
>>>>
>>>> Yes I did mean to remove it. Done.
>>>>
>>>>>
>>>>> Just a style comment on this piece of code:
>>>>>
>>>>> bool safepoint_check_flag = Mutex::_no_safepoint_check_flag;
>>>>> if (Thread::current()->is_Java_thread()) {
>>>>> safepoint_check_flag = !Mutex::_no_safepoint_check_flag;
>>>>> }
>>>>>
>>>>> I understand that we want to use Mutex::_no_safepoint_check_flag
>>>>> since it is defined. But since there is no
>>>>> Mutex::_safepoint_check_flag or similar I think we break the
>>>>> abstraction by doing "safepoint_check_flag =
>>>>> !Mutex::_no_safepoint_check_flag". This kind of exposes that we
>>>>> are dealing with a bool. So, given this I think I would prefer to
>>>>> change the code to just:
>>>>>
>>>>> safepoint_check_flag = Thread::current()->is_Java_thread();
>>>>
>>>> OK. Done, but with a small difference. For Java threads we wish to
>>>> pass false so that we don't skip the safepoint check so I used
>>>> "bool no_safepoint_check_flag =
>>>> !Thread::current()->is_Java_thread();" and can then just pass in
>>>> the value.
>>>>
>>>> New webrev: http://cr.openjdk.java.net/~johnc/7147724/webrev.2/
>>>>
>>>> Thanks,
>>>>
>>>> JohnC
>
More information about the hotspot-gc-dev
mailing list