RFR(S): 7147724: G1: hang in SurrogateLockerThread::manipulatePLL
John Cuthbertson
john.cuthbertson at oracle.com
Tue Mar 6 18:45:43 UTC 2012
Hi Bengt,
Thanks for looking at the code changes. Tracking the issues down was
very frustrating - especially dealing with hundreds of millions of lines
of instrumentation output. Responses inline...
On 03/06/12 02:43, Bengt Rutisson wrote:
> I find the name "no_safepoint_check" a bit strange. To me it would be
> more natural to avoid the negation and call it something like
> "do_safepoint_check". But I realize that it is coming from the
> MutexLockerEx naming, so I am ok with leaving it as it is.
I agree about the name. I was going to go with skip_safepoint_check but, as
you can see, went with the name from the MutexLocker class for consistency.
I'm not sure about changing the parameter to the positive
"do_safepoint_check" - we would then have to negate it when passing it to
MutexLockerEx and wait():

  MutexLockerEx x(SecondaryFreeList_lock, !do_safepoint_check);

when we do indeed want to perform a safepoint check. That doesn't seem any
clearer, IMO.
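To make the comparison concrete, here is a minimal side-by-side sketch of the
two spellings at the wait site (illustrative only - the positive-named variant
is hypothetical and not part of the webrev):

  // Negative form, matching the MutexLockerEx convention used in the webrev.
  void G1CollectedHeap::wait_while_free_regions_coming(bool no_safepoint_check) {
    MutexLockerEx x(SecondaryFreeList_lock, no_safepoint_check);
    while (free_regions_coming()) {
      SecondaryFreeList_lock->wait(no_safepoint_check);
    }
  }

  // Positive form - the flag has to be negated at every use.
  void G1CollectedHeap::wait_while_free_regions_coming(bool do_safepoint_check) {
    MutexLockerEx x(SecondaryFreeList_lock, !do_safepoint_check);
    while (free_regions_coming()) {
      SecondaryFreeList_lock->wait(!do_safepoint_check);
    }
  }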
>
> A question, given that we always pass true for no_safepoint_check when
> we are at a safepoint and false at the only place where we are not at
> a safepoint, would it be possible to avoid passing this value around
> and instead let wait_while_free_regions_coming() look it up?
>
> Something like:
>
> void G1CollectedHeap::wait_while_free_regions_coming() {
>   // Most of the time we won't have to wait, so let's do a quick test
>   // first before we take the lock.
>   if (!free_regions_coming()) {
>     return;
>   }
>
>   if (G1ConcRegionFreeingVerbose) {
>     gclog_or_tty->print_cr("G1ConcRegionFreeing [other] : "
>                            "waiting for free regions");
>   }
>
>   {
>     bool no_safepoint_check = SafepointSynchronize::is_at_safepoint();
>     MutexLockerEx x(SecondaryFreeList_lock, no_safepoint_check);
>     while (free_regions_coming()) {
>       SecondaryFreeList_lock->wait(no_safepoint_check);
>     }
>   }
>
>   if (G1ConcRegionFreeingVerbose) {
>     gclog_or_tty->print_cr("G1ConcRegionFreeing [other] : "
>                            "done waiting for free regions");
>   }
> }
>
My inclination would be not to rely on global state. A better (more
accurate) check could be:

  bool no_safepoint_check = !Thread::current()->is_Java_thread();

Basically, if the allocating thread is a Java thread then do the safepoint
check. I went with the parameter because the calling paths were well
defined. I don't feel strongly about using a parameter, so if you are
strongly against it I'll make the change.
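For concreteness, here is a sketch of what your suggestion would look like
with that check substituted (same shape as the code you quoted above; only
the derivation of the flag changes, and I've left out the verbose logging):

  void G1CollectedHeap::wait_while_free_regions_coming() {
    // Quick test before we take the lock.
    if (!free_regions_coming()) {
      return;
    }

    // Only Java threads should honor the safepoint protocol here; the VM
    // thread (at a safepoint) and the concurrent GC threads must not.
    bool no_safepoint_check = !Thread::current()->is_Java_thread();

    MutexLockerEx x(SecondaryFreeList_lock, no_safepoint_check);
    while (free_regions_coming()) {
      SecondaryFreeList_lock->wait(no_safepoint_check);
    }
  }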
> Regarding, VM_G1IncCollectionPause::doit_prologue(). I am not sure I
> understand the code correctly. If we have a humongous object
> allocation _should_initiate_conc_mark will be true. So, if
> VM_GC_Operation::doit_prologue() fails we will set _should_retry_gc =
> true even for humongous allocation.
>
> I don't see _should_retry_gc ever being set to false, except in the
> constructor of VM_G1IncCollectionPause. If the above happens, will not
> _should_retry_gc always be true for the operation that had
> VM_GC_Operation::doit_prologue() fail once? In that case, couldn't we
> get back to the same issue as before. Several threads constantly
> retrying a collection?
>
In the case(s) where the prologue fails, _should_initiate_conc_mark is
true for two GC causes: _humongous_obj_allocation and
_java_lang_System_gc. I believe that in both of these cases we should retry
the GC. If the prologue fails because the GC counts were different, then
another incremental GC was scheduled in the meantime. In a typical
application there's a reasonable chance that this incremental GC was not an
initial mark pause, and since the current thread explicitly wants to
initiate a marking cycle, it should retry the GC.
Now suppose the previously scheduled GC was an initial mark pause. Then,
when the current thread successfully executes the prologue, the doit()
routine will see that a marking cycle is already in progress and fail the
pause. In that case, if the cause was a humongous object allocation, we do
not retry the GC; with the other cause, the current thread does retry the
GC, but only after waiting in the doit_epilogue() routine for the
in-progress marking cycle to be over.
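For reference, that doit_epilogue() wait is roughly of this shape (a sketch
only; names such as FullGCCount_lock and full_collections_completed() are my
assumptions about the surrounding code, not quoted from the webrev):

  void VM_G1IncCollectionPause::doit_epilogue() {
    VM_GC_Operation::doit_epilogue();

    // For the explicit-concurrent-GC case, don't return to the caller (and
    // thereby allow a retry) until the in-progress marking cycle is over.
    if (_gc_cause == GCCause::_java_lang_system_gc && ExplicitGCInvokesConcurrent) {
      G1CollectedHeap* g1h = G1CollectedHeap::heap();
      MutexLockerEx x(FullGCCount_lock, Mutex::_no_safepoint_check_flag);
      while (g1h->full_collections_completed() <= _full_collections_completed_before) {
        FullGCCount_lock->wait(Mutex::_no_safepoint_check_flag);
      }
    }
  }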
In the case where the prologue or the pause is unsuccessful because the GC
locker is active, the current thread will retry the GC - but only after the
GC locker is no longer active.
Basically you are correct - with the changes, the current thread will
spin until it sees a successful prologue - which I think is reasonable.
With the old changes it would spin until the marking cycle was complete.
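From the caller's side, that spin looks roughly like the following (a sketch
of the retry loop; the accessor names pause_succeeded() and should_retry_gc()
and the exact pause arguments are assumptions on my part, not copied from the
webrev):

  bool retry_gc;
  do {
    retry_gc = false;

    VM_G1IncCollectionPause op(gc_count_before,
                               0,     /* word_size */
                               true,  /* should_initiate_conc_mark */
                               g1_policy()->max_pause_time_ms(),
                               cause);
    VMThread::execute(&op);

    if (!op.pause_succeeded()) {
      // Either the prologue failed or doit() declined to start an initial
      // mark pause; ask the operation whether we should schedule it again.
      retry_gc = op.should_retry_gc();

      if (retry_gc && GC_locker::is_active_and_needs_gc()) {
        // Don't spin while the GC locker is held - wait for it to clear.
        GC_locker::stall_until_clear();
      }
    }
  } while (retry_gc);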
An alternative would be to treat the IM pauses triggered by humongous
object allocations similarly to the explicit-concurrent-gc case and wait in
the doit_epilogue() routine until the marking cycle was complete. Doing
this (rather than spinning) would at least give the marking thread(s) a
chance to run, but I don't think we want to penalize humongous object
allocations that much.
Hopefully that explains the policy. The way it's implemented:
_should_retry_gc defaults to false. If the prologue fails (and we're
attempting to schedule an IM pause), it gets set to true. If the pause is
unsuccessful because a marking cycle is already in progress, I look at the
cause; if the cause was not a humongous allocation, it gets set to true.
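In code form, the two decision points amount to something like this (an
illustrative sketch of the policy, not the literal webrev; the enum constant
for the humongous-allocation cause and the "marking already in progress"
predicate are written with assumed names):

  bool VM_G1IncCollectionPause::doit_prologue() {
    bool res = VM_GC_Operation::doit_prologue();
    if (!res && _should_initiate_conc_mark) {
      // Another GC was scheduled first, or the GC locker is active. Either
      // way we still want an initial mark pause, so ask the caller to retry.
      _should_retry_gc = true;
    }
    return res;
  }

  void VM_G1IncCollectionPause::doit() {
    // ...
    if (_should_initiate_conc_mark) {
      // marking_cycle_already_in_progress() is a stand-in for the real check.
      if (marking_cycle_already_in_progress()) {
        // Only retry if the request did not come from a humongous allocation.
        _should_retry_gc = (_gc_cause != GCCause::_g1_humongous_allocation);
        return;  // fail the pause
      }
    }
    // ... schedule the evacuation pause as usual ...
  }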
Does that answer your questions?
Thanks,
JohnC