RFR(S): 7147724: G1: hang in SurrogateLockerThread::manipulatePLL

Fri Mar 9 10:25:34 UTC 2012

Hi John,

I think I like this change, but either I am confused or you forgot to 
remove the "bool no_safepoint_check" from:

G1CollectedHeap::humongous_obj_allocate_find_first()
G1CollectedHeap::humongous_obj_allocate()

Did you mean to remove it?

Just a style comment on this piece of code:

     bool safepoint_check_flag = Mutex::_no_safepoint_check_flag;
     if (Thread::current()->is_Java_thread()) {
       safepoint_check_flag = !Mutex::_no_safepoint_check_flag;
     }

I understand that we want to use Mutex::_no_safepoint_check_flag since 
it is defined. But since there is no Mutex::_safepoint_check_flag or 
similar I think we break the abstraction by doing "safepoint_check_flag 
= !Mutex::_no_safepoint_check_flag". This kind of exposes that we are 
dealing with a bool. So, given this I think I would prefer to change the 
code to just:

safepoint_check_flag = Thread::current()->is_Java_thread();

Bengt

On 2012-03-08 19:13, John Cuthbertson wrote:
> Hi Everyone,
>
> A new version of theses changes, incorporating changes based upon the 
> suggestions and questions from Bengt, can be found at: 
> http://cr.openjdk.java.net/~johnc/7147724/webrev.1/
>
> Changes in this version include:
> * Identification of another path where a Java thread could lock and 
> wait on the SecondaryFreeList_lock without checking for a safepoint.
> * The determination of whether to perform a safepoint check when 
> locking/waitin on the SecondaryFreeList_lock is made based upon the 
> type of thread - for Java threads a safepoint check is required.
>
> Thanks,
>
> JohnC
>
> On 03/05/12 10:37, John Cuthbertson wrote:
>> Hi Everyone,
>>
>> Can I have a couple of volunteers to review the changes for this CR? 
>> The webrev can be found at: 
>> http://cr.openjdk.java.net/~johnc/7147724/webrev.0/
>>
>> Summary:
>> There are a couple of issues, which look like hangs, that the changes 
>> in this CR address.
>>
>> The first issue is that a thread, while attempting to allocate a 
>> humongous object, would have the initial mark pause not succeed. It 
>> would then continuously retry the pause (which would continously 
>> fail). There are a couple of reasons for this. When several threads, 
>> while attempting to allocate a humongous object, would determine that 
>> a marking cycle was to be initiated - they would race to initiate the 
>> initial mark pause. One thread would win and the losers would end up 
>> failing the VM_G1IncCollectionPause::doit_prologue(). The losers 
>> would then keep retrying to schedule the initial mark pause, and keep 
>> failing in the prologue, while marking was in progress. Similarly the 
>> initial mark pause itself could fail because the GC locker had just 
>> become active. This also had the effect making the requesting thread 
>> continuously retrying to schedule the pause and having it fail while 
>> the GC locker was active. Instrumentation showed that the initial 
>> mark pause was retried several million times.
>>
>> The solution to this issue were to not retry scheduling the initial 
>> mark pause for a humongous allocation if a marking cycle was already 
>> in progress, and check if the GC locker was active before retrying to 
>> schdule the initial mark pause.
>>
>> The other issue is that humongous object allocation would check 
>> whether a marking cycle was going to be placing free regions on to 
>> the secondary free list. If so then it would wait on the 
>> SecondaryFreeList_lock until the marking cycle had completed freeing 
>> the regions. Unfortunately the thread allocating the humongous object 
>> did not perform a safepoint check when locking and waiting on the 
>> SecondaryFreeList_lock. As a result a safepoint could be delayed 
>> indefinitely: if the SurrogateLockerThread was already blocked for 
>> the safepoint then the concurrent mark cycle may not be able to 
>> complete and so finish the freeing of the regions, which the 
>> allocating thread is waiting on.
>>
>> The solution for this issue is to perform the safepoint check when 
>> locking/waiting on the SecondaryFreeList_lock during humongous object 
>> allocation.
>>
>> Testing:
>> * The hanging nightly tests (6) executing in a loop.
>> * The GC test suite with G1 and with and without 
>> ExplicitGCInvokesConcurrent on several machines (including a 2-cpu).
>> * jprt.
>