RFR: JDK-8061259: ParNew promotion failed is serialized on a lock

Fri Nov 14 09:55:07 UTC 2014

Hi Kim,

On 2014-11-13 23:13, Kim Barrett wrote:
> On Nov 12, 2014, at 11:01 PM, Jon Masamitsu <jon.masamitsu at oracle.com> wrote:
>> When any promotion failure occurs the concurrent CMS collections is
>> abandoned and a full collection is done instead.   There is some
>> cleanup that is done to get the heap back into a state where the
>> full collection can be used.
> When any promotion failure occurs the concurrent CMS collections is
> abandoned and a full collection is done instead.   There is some
> cleanup that is done to get the heap back into a state where the
> full collection can be used.
>
> That's what I suspected.  Is the same true for other collectors that
> use par_promote()? (I think that's just ParallelOld?)  I'm going to
> guess so for now.
>
> In that case I wonder what is really going on and whether the proposed
> change is really the best solution.
>
> I've not run the reproducer program from the bug report and examined
> its behavior, so my current understanding of the problem is based on
> the proposed change and the discussion surrounding it.
>
> At a high level, when promotion fails for some object it may take a
> long time for the survivor processing to come to a halt and allow
> follow-on work to occur.
>
> Based on the proposed change, part of the delay is that there continue
> to be promotion attempts, and those attempts are serialized while
> each, in turn, re-attempts work that will (probably?)  fail. This is
> addressed through the introduction of the new flag, setting it at
> promotion failure, and checking it on entry to the lock context to
> fail fast and avoid the re-attempted work.
>
> However, during discussion of the proposed change it was stated that,
> while checking the new flag on lock context entry is an improvement,
> there is still a significant delay with just that change, and that
> this is due to contention for the lock. This is the rationale for
> introducing flag checks outside the lock context.
>
> This indicates that even after promotion fails the threads involved
> continue to make *many* promotion attempts. That leads me to wonder
> why we continue slamming par_promote() when we're just going to later
> abandon the collection in progress and start over with a full GC. Why
> don't we just stop calling par_promote() at all once we've had a
> promotion failure?
>
> The obvious first place to look is the caller(s) of par_promote(),
> e.g. ParNewGeneration::copy_to_survivor_space[_XXX_undo]().  Promotion
> failure is handled there by setting an already existing flag and doing
> some additional work.  It seems like it might work to just change the
> calls to par_promote() there to be conditional on that existing flag.
> This would make the checks of the proposed new flag outside the lock
> context redundant.  If it is acceptable to have a few lingering wasted
> calls occur (which seems likely to me) then this conditionalization of
> calls to par_promote() would entirely eliminate the need for the
> proposed new flag.

Do you mean something like this?

diff --git a/src/share/vm/gc_implementation/parNew/parNewGeneration.cpp 
b/src/share/vm/gc_implementation/parNew/parNewGeneration.cpp
--- a/src/share/vm/gc_implementation/parNew/parNewGeneration.cpp
+++ b/src/share/vm/gc_implementation/parNew/parNewGeneration.cpp
@@ -1196,8 +1196,10 @@
          return real_forwardee(old);
      }

+    if (!_promotion_failed) {
      new_obj = _next_gen->par_promote(par_scan_state->thread_num(),
                                         old, m, sz);
+    }

      if (new_obj == NULL) {
        // promotion failed, forward to self


I tried this with the reproducer that Jungwoo has. It gives the same 
good results as the originally suggested patch.

Without my patch above I get a 113 seconds long pause when we hit the 
promotion failure.

#7: [GC (Allocation Failure) #7: [ParNew (promotion failed): 
838912K->943744K(943744K), 110,3634457 secs]#8: [CMS#6: 
[CMS-concurrent-sweep: 0,666/111,030 secs] [Times: user=164,88 sys=33,27 
real=111,04 secs]
  (concurrent mode failure): 1048335K->1048575K(1048576K), 2,6529365 
secs] 1567474K->1170279K(1992320K), [Metaspace: 2454K->2454K(1056768K)], 
113,0165630 secs] [Times: user=166,87 sys=33,24 real=113,02 secs]


With the above patch the pause is 5 seconds:

#7: [GC (Allocation Failure) #7: [ParNew#6: 
[CMS-concurrent-abortable-preclean: 0,006/0,306 secs] [Times: user=0,53 
sys=0,07 real=0,31 secs]
  (promotion failed): 838912K->943744K(943744K), 2,5426524 secs]#8: [CMS 
(concurrent mode failure): 1048381K->1048575K(1048576K), 2,3803190 secs] 
1609214K->1170277K(1992320K), [Metaspace: 2454K->2454K(1056768K)], 
4,9231005 secs] [Times: user=11,31 sys=0,91 real=4,92 secs]


So, I think it is fair to say that it is unnecessary to add the new 
state variable.


I also agree that the existing flag has the same race behavior as the 
new one that was proposed. While I agree with Kim that racing on a 
variable is always scary I think it is pretty safe in this case. We only 
race on setting it from false to true. And if anybody sees a stale false 
value they will just take a more expensive path in the code. Nothing 
will break.

>
> It seems like an even better approach would be to arrange to terminate
> the iteration over surviving objects early once promotion failure has
> occurred, rather than continuing to perform work that will ultimately
> be discarded. That looks to me like a much harder change to make
> though. (For one thing, I'm not sure all the work being done would
> actually be discarded.  And there is cleanup to be done to restore the
> heap to a state ready for the full GC, and that might be complicated
> by having the earlier iteration terminate early.)

Unfortunately it is not possible to stop iteration over all objects in 
the young gen. We still need to scan them all to update their 
references. Even objects we have not yet scanned may have references to 
objects that we moved before we got the promotion failure. So we can not 
immediately abandon scanning and go to to a full GC when we hit a 
promotion failure. First we need to make sure that all objects have 
their references updated.

Thanks,
Bengt

>
> Summing up, I currently think conditionalizing the calls to
> par_promote() on the existing _promotion_failed flag is a simpler and
> better solution than the proposed introduction of a new flag.  (There
> may need to be some barriers added to support this; if so, similar
> barriers would be needed for the proposed new flag if it is to be
> accessed outside the lock context.)
>
>