RFR: JDK-8061259: ParNew promotion failed is serialized on a lock

Fri Nov 14 22:40:17 UTC 2014

On Nov 14, 2014, at 4:55 AM, Bengt Rutisson <bengt.rutisson at oracle.com> wrote:
> 
> On 2014-11-13 23:13, Kim Barrett wrote:
>> On Nov 12, 2014, at 11:01 PM, Jon Masamitsu <jon.masamitsu at oracle.com> wrote:
>>> When any promotion failure occurs the concurrent CMS collections is
>>> abandoned and a full collection is done instead.   There is some
>>> cleanup that is done to get the heap back into a state where the
>>> full collection can be used.
>> […]
>> The obvious first place to look is the caller(s) of par_promote(),
>> e.g. ParNewGeneration::copy_to_survivor_space[_XXX_undo]().  Promotion
>> failure is handled there by setting an already existing flag and doing
>> some additional work.  It seems like it might work to just change the
>> calls to par_promote() there to be conditional on that existing flag.
>> This would make the checks of the proposed new flag outside the lock
>> context redundant.  If it is acceptable to have a few lingering wasted
>> calls occur (which seems likely to me) then this conditionalization of
>> calls to par_promote() would entirely eliminate the need for the
>> proposed new flag.
> 
> Do you mean something like this?
> 
> diff --git a/src/share/vm/gc_implementation/parNew/parNewGeneration.cpp b/src/share/vm/gc_implementation/parNew/parNewGeneration.cpp
> --- a/src/share/vm/gc_implementation/parNew/parNewGeneration.cpp
> +++ b/src/share/vm/gc_implementation/parNew/parNewGeneration.cpp
> @@ -1196,8 +1196,10 @@
>         return real_forwardee(old);
>     }
> 
> +    if (!_promotion_failed) {
>     new_obj = _next_gen->par_promote(par_scan_state->thread_num(),
>                                        old, m, sz);
> +    }
> 
>     if (new_obj == NULL) {
>       // promotion failed, forward to self

Yes, but remember there are two of these places, as there are two
variants of copy_to_survivor_space_XXX_undo().

> I tried this with the reproducer that Jungwoo has. It gives the same good results as the originally suggested patch.
> 
> [… snipped timing results …]
> So, I think it is fair to say that it is unnecessary to add the new state variable.

Thanks for obtaining experimental support for the theory.

> I also agree that the existing flag has the same race behavior as the new one that was proposed. While I agree with Kim that racing on a variable is always scary I think it is pretty safe in this case. We only race on setting it from false to true. And if anybody sees a stale false value they will just take a more expensive path in the code. Nothing will break.

I'm more inclined to agree now that I've read a lot more code.  A comment
in the code would be good though.

>> It seems like an even better approach would be to arrange to terminate
>> the iteration over surviving objects early once promotion failure has
>> occurred, rather than continuing to perform work that will ultimately
>> be discarded. That looks to me like a much harder change to make
>> though.
> 
> Unfortunately it is not possible to stop iteration over all objects in the young gen. We still need to scan them all to update their references. Even objects we have not yet scanned may have references to objects that we moved before we got the promotion failure. So we can not immediately abandon scanning and go to to a full GC when we hit a promotion failure. First we need to make sure that all objects have their references updated.

Yes, that's the kind of thing I was worried about, though I obviously
hadn't thought it through very carefully.

I think this does leave open the possibility of moving the
_promotion_failed flag check earlier in copy_to_survivor_space_XXX(),
to also bypass the tenuring query and the possible allocation in
to_space, instead just always treating as a promotion failure once the
_promotion_failure flag is true.

The rationale would be that, once promotion failure has occurred, it
might be better to leave all the remaining unprocessed objects in
place as promotion failures than (attempt to) copy any of them to
to_space. If nothing else, it might reduce the number of to_space
pages being dirtied.

I'm not sure it would be overall beneficial though, as it looks like
treating more objects as promotion failures could result in a lot more
mark preservation (which looks to be relatively expensive to do in
some cases), and processing the preserved marks appears to be
single-threaded at present. So while I think moving the
_promotion_failed flag check earlier would work from a correctness
standpoint, it's not yet clear to me even what the sign of the
resulting performance change might be.