RFR(S): 7147724: G1: hang in SurrogateLockerThread::manipulatePLL

Mon Mar 5 18:37:42 UTC 2012

Hi Everyone,

Can I have a couple of volunteers to review the changes for this CR? The 
webrev can be found at: http://cr.openjdk.java.net/~johnc/7147724/webrev.0/

Summary:
There are a couple of issues, which look like hangs, that the changes in 
this CR address.

The first issue is that a thread attempting to allocate a humongous 
object would have its initial mark pause fail. It would then 
continuously retry the pause (which would continuously fail). 
There are a couple of reasons for this. When several threads attempting 
to allocate humongous objects determined that a marking cycle was to be 
initiated, they would race to initiate the initial mark pause. One 
thread would win and the losers would end up 
failing the VM_G1IncCollectionPause::doit_prologue(). The losers would 
then keep retrying to schedule the initial mark pause, and keep failing 
in the prologue, while marking was in progress. Similarly the initial 
mark pause itself could fail because the GC locker had just become 
active. This also caused the requesting thread to continuously retry 
scheduling the pause, and have it fail, while the GC locker was 
active. Instrumentation showed that the initial mark pause 
was retried several million times.

The solution to this issue was to not retry scheduling the initial mark 
pause for a humongous allocation if a marking cycle was already in 
progress, and to check whether the GC locker was active before retrying 
to schedule the initial mark pause.
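
To illustrate the retry decision described above, here is a minimal 
standalone sketch. It is not the code in the webrev: GcState, NextAction 
and after_failed_initial_mark are hypothetical names invented for this 
example, and what the thread does while the GC locker is active (holding 
off before retrying) is an assumption.

```cpp
#include <cassert>

// Hypothetical snapshot of what the allocating thread can observe after a
// failed attempt to schedule the initial mark pause. Not HotSpot types.
struct GcState {
    bool marking_in_progress;  // another thread already started the cycle
    bool gc_locker_active;     // some thread is inside a JNI critical region
};

enum class NextAction {
    StopRetrying,     // a marking cycle is running, nothing left to initiate
    WaitForGcLocker,  // assumption: hold off until the locker clears
    RetryPause        // nothing blocks the pause, so scheduling it again is useful
};

// Decide what to do after the initial mark pause failed to be scheduled.
NextAction after_failed_initial_mark(const GcState& s) {
    if (s.marking_in_progress) {
        return NextAction::StopRetrying;
    }
    if (s.gc_locker_active) {
        return NextAction::WaitForGcLocker;
    }
    return NextAction::RetryPause;
}
```

The point of the sketch is only that both failure causes are checked 
before another pause is scheduled, so neither can drive a tight retry 
loop.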

The other issue is that humongous object allocation would check whether 
a marking cycle was going to be placing free regions on to the secondary 
free list. If so then it would wait on the SecondaryFreeList_lock until 
the marking cycle had completed freeing the regions. Unfortunately the 
thread allocating the humongous object did not perform a safepoint check 
when locking and waiting on the SecondaryFreeList_lock. As a result a 
safepoint could be delayed indefinitely: if the SurrogateLockerThread 
was already blocked for the safepoint then the concurrent mark cycle 
might not be able to complete, and so would never finish freeing the 
regions that the allocating thread is waiting on.

The solution for this issue is to perform the safepoint check when 
locking/waiting on the SecondaryFreeList_lock during humongous object 
allocation.
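
The dependency cycle above can be shown with a toy, single-threaded 
model. This is only a sketch under simplifying assumptions, not HotSpot 
code: Model, step and allocation_completes are hypothetical names, and 
the safepoint protocol is reduced to a few booleans.

```cpp
#include <cassert>

// Toy model of the three parties involved. Not HotSpot types.
struct Model {
    bool safepoint_requested = true;      // VM thread wants all threads stopped
    bool allocator_at_safepoint = false;  // allocating thread has stopped
    bool slt_blocked = true;              // SurrogateLockerThread blocked for the safepoint
    bool regions_freed = false;           // what the allocating thread waits for
};

// Run one round of the model. Returns false if nothing can make progress.
bool step(Model& m, bool waiter_checks_safepoint) {
    // The waiting allocator only notices the safepoint if it checks for it.
    if (m.safepoint_requested && !m.allocator_at_safepoint &&
        waiter_checks_safepoint) {
        m.allocator_at_safepoint = true;
        return true;
    }
    // The safepoint can be reached, and then ended, once the allocator stopped.
    if (m.safepoint_requested && m.allocator_at_safepoint) {
        m.safepoint_requested = false;
        m.allocator_at_safepoint = false;
        m.slt_blocked = false;
        return true;
    }
    // Marking finishes freeing regions only when the SLT is not blocked.
    if (!m.slt_blocked && !m.regions_freed) {
        m.regions_freed = true;
        return true;
    }
    return false;
}

// True if the humongous allocation eventually gets its regions.
bool allocation_completes(bool waiter_checks_safepoint, int max_rounds = 100) {
    Model m;
    for (int i = 0; i < max_rounds && !m.regions_freed; ++i) {
        if (!step(m, waiter_checks_safepoint)) {
            break;  // stuck: every party is waiting on another
        }
    }
    return m.regions_freed;
}
```

In this model, allocation_completes(false) gets stuck on the first 
round, while allocation_completes(true) lets the safepoint proceed, 
releases the SurrogateLockerThread, and the regions get freed.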

Testing:
* The hanging nightly tests (6) executing in a loop.
* The GC test suite with G1, both with and without 
ExplicitGCInvokesConcurrent, on several machines (including a 2-cpu).
* jprt.