RFR(S): 7147724: G1: hang in SurrogateLockerThread::manipulatePLL
John Cuthbertson
john.cuthbertson at oracle.com
Mon Mar 5 18:37:42 UTC 2012
Hi Everyone,
Can I have a couple of volunteers to review the changes for this CR? The
webrev can be found at: http://cr.openjdk.java.net/~johnc/7147724/webrev.0/
Summary:
There are a couple of issues, which look like hangs, that the changes in
this CR address.
The first issue is that a thread attempting to allocate a humongous
object could have its initial mark pause fail. It would then
continuously retry the pause (which would continuously fail).
There are a couple of reasons for this. When several threads, each
attempting to allocate a humongous object, determined that a marking
cycle was to be initiated, they would race to initiate the initial mark
pause. One thread would win and the losers would end up failing the
VM_G1IncCollectionPause::doit_prologue(). The losers would then keep
retrying to schedule the initial mark pause, and keep failing in the
prologue, while marking was in progress. Similarly, the initial mark
pause itself could fail because the GC locker had just become active.
This also had the effect of making the requesting thread continuously
retry scheduling the pause, and having it fail, while the GC locker was
active. Instrumentation showed that the initial mark pause was retried
several million times.
The solution to this issue is to not retry scheduling the initial mark
pause for a humongous allocation if a marking cycle is already in
progress, and to check whether the GC locker is active before retrying
to schedule the initial mark pause.
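To make the intended retry policy concrete, here is a minimal standalone
sketch of the decision described above. It is not the code in the webrev:
HeapState, NextStep and after_failed_initial_mark are hypothetical names
invented for illustration, and what the thread does while the GC locker is
active (here, wait for it to clear) is an assumption.

```cpp
// Hypothetical snapshot of the state a requesting thread can observe.
// These names are illustrative and are not HotSpot's real fields.
struct HeapState {
    bool marking_in_progress;  // a marking cycle has already been started
    bool gc_locker_active;     // the GC locker is currently active
};

enum class NextStep {
    RetryPause,       // schedule the initial mark pause again
    SkipPause,        // marking is already running: do not reschedule
    WaitForGCLocker   // assumed behaviour: stall until the GC locker clears
};

// Decide what to do after a request for the initial mark pause has failed.
NextStep after_failed_initial_mark(const HeapState& s) {
    if (s.marking_in_progress) {
        // Another thread won the race; retrying would only fail in the
        // prologue again.
        return NextStep::SkipPause;
    }
    if (s.gc_locker_active) {
        // The pause cannot succeed while the GC locker is active, so
        // retrying immediately would spin.
        return NextStep::WaitForGCLocker;
    }
    return NextStep::RetryPause;
}
```

The point of the sketch is only that both failure causes are tested before
another attempt is made, so neither can produce an unbounded retry loop.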
The other issue is that humongous object allocation would check whether
a marking cycle was going to be placing free regions on to the secondary
free list. If so then it would wait on the SecondaryFreeList_lock until
the marking cycle had completed freeing the regions. Unfortunately the
thread allocating the humongous object did not perform a safepoint check
when locking and waiting on the SecondaryFreeList_lock. As a result a
safepoint could be delayed indefinitely: if the SurrogateLockerThread
was already blocked for the safepoint then the concurrent mark cycle may
not be able to complete and so finish the freeing of the regions, which
the allocating thread is waiting on.
The solution for this issue is to perform the safepoint check when
locking/waiting on the SecondaryFreeList_lock during humongous object
allocation.
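The dependency cycle can be illustrated with a small toy model. This is a
deliberately simplified, single-threaded simulation, not HotSpot code:
allocation_completes and its parameters are hypothetical, and the model
assumes a safepoint has already been requested when the allocating thread
starts waiting.

```cpp
// Toy model of the wait on the secondary free list.
//
// wait_checks_safepoint: whether the allocating thread honours a pending
//                        safepoint while locking/waiting.
// max_steps:             bound on simulated steps, standing in for
//                        "waits indefinitely".
//
// Returns true if the regions are eventually freed, i.e. the humongous
// allocation can make progress.
bool allocation_completes(bool wait_checks_safepoint, int max_steps) {
    bool safepoint_pending = true;  // assumed: a safepoint was requested
    bool regions_freed = false;     // the marking cycle is still freeing

    for (int step = 0; step < max_steps; ++step) {
        if (regions_freed) {
            return true;  // the waiter is notified and can allocate
        }
        if (safepoint_pending) {
            // The safepoint completes only once the allocating thread
            // stops. Without a safepoint check it never does, so the
            // SurrogateLockerThread stays blocked.
            if (wait_checks_safepoint) {
                safepoint_pending = false;
            }
        } else {
            // The SurrogateLockerThread is unblocked, so the marking
            // cycle can finish freeing the regions.
            regions_freed = true;
        }
    }
    return regions_freed;
}
```

In this model allocation_completes(false, n) stays false for any n, which
corresponds to the hang, while allocation_completes(true, n) becomes true
after a couple of steps.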
Testing:
* The hanging nightly tests (6) executing in a loop.
* The GC test suite with G1, both with and without
ExplicitGCInvokesConcurrent, on several machines (including a 2-cpu).
* jprt.