RFR: JDK-8061259: ParNew promotion failed is serialized on a lock

Thu Nov 13 22:13:15 UTC 2014

On Nov 12, 2014, at 11:01 PM, Jon Masamitsu <jon.masamitsu at oracle.com> wrote:
> 
> When any promotion failure occurs the concurrent CMS collections is
> abandoned and a full collection is done instead.   There is some
> cleanup that is done to get the heap back into a state where the
> full collection can be used.

When any promotion failure occurs the concurrent CMS collections is
abandoned and a full collection is done instead.   There is some
cleanup that is done to get the heap back into a state where the
full collection can be used.

That's what I suspected.  Is the same true for other collectors that
use par_promote()? (I think that's just ParallelOld?)  I'm going to
guess so for now.

In that case I wonder what is really going on and whether the proposed
change is really the best solution.

I've not run the reproducer program from the bug report and examined
its behavior, so my current understanding of the problem is based on
the proposed change and the discussion surrounding it.

At a high level, when promotion fails for some object it may take a
long time for the survivor processing to come to a halt and allow
follow-on work to occur.

Based on the proposed change, part of the delay is that there continue
to be promotion attempts, and those attempts are serialized while
each, in turn, re-attempts work that will (probably?)  fail. This is
addressed through the introduction of the new flag, setting it at
promotion failure, and checking it on entry to the lock context to
fail fast and avoid the re-attempted work.

However, during discussion of the proposed change it was stated that,
while checking the new flag on lock context entry is an improvement,
there is still a significant delay with just that change, and that
this is due to contention for the lock. This is the rationale for
introducing flag checks outside the lock context.

This indicates that even after promotion fails the threads involved
continue to make *many* promotion attempts. That leads me to wonder
why we continue slamming par_promote() when we're just going to later
abandon the collection in progress and start over with a full GC. Why
don't we just stop calling par_promote() at all once we've had a
promotion failure?

The obvious first place to look is the caller(s) of par_promote(),
e.g. ParNewGeneration::copy_to_survivor_space[_XXX_undo]().  Promotion
failure is handled there by setting an already existing flag and doing
some additional work.  It seems like it might work to just change the
calls to par_promote() there to be conditional on that existing flag.
This would make the checks of the proposed new flag outside the lock
context redundant.  If it is acceptable to have a few lingering wasted
calls occur (which seems likely to me) then this conditionalization of
calls to par_promote() would entirely eliminate the need for the
proposed new flag.

It seems like an even better approach would be to arrange to terminate
the iteration over surviving objects early once promotion failure has
occurred, rather than continuing to perform work that will ultimately
be discarded. That looks to me like a much harder change to make
though. (For one thing, I'm not sure all the work being done would
actually be discarded.  And there is cleanup to be done to restore the
heap to a state ready for the full GC, and that might be complicated
by having the earlier iteration terminate early.)

Summing up, I currently think conditionalizing the calls to
par_promote() on the existing _promotion_failed flag is a simpler and
better solution than the proposed introduction of a new flag.  (There
may need to be some barriers added to support this; if so, similar
barriers would be needed for the proposed new flag if it is to be
accessed outside the lock context.)