RFR: JDK-8061259: ParNew promotion failed is serialized on a lock

Mon Nov 17 10:11:13 UTC 2014

On 2014-11-14 23:40, Kim Barrett wrote:
> On Nov 14, 2014, at 4:55 AM, Bengt Rutisson <bengt.rutisson at oracle.com> wrote:
>> On 2014-11-13 23:13, Kim Barrett wrote:
>>> On Nov 12, 2014, at 11:01 PM, Jon Masamitsu <jon.masamitsu at oracle.com> wrote:
>>>> When any promotion failure occurs the concurrent CMS collections is
>>>> abandoned and a full collection is done instead.   There is some
>>>> cleanup that is done to get the heap back into a state where the
>>>> full collection can be used.
>>> […]
>>> The obvious first place to look is the caller(s) of par_promote(),
>>> e.g. ParNewGeneration::copy_to_survivor_space[_XXX_undo]().  Promotion
>>> failure is handled there by setting an already existing flag and doing
>>> some additional work.  It seems like it might work to just change the
>>> calls to par_promote() there to be conditional on that existing flag.
>>> This would make the checks of the proposed new flag outside the lock
>>> context redundant.  If it is acceptable to have a few lingering wasted
>>> calls occur (which seems likely to me) then this conditionalization of
>>> calls to par_promote() would entirely eliminate the need for the
>>> proposed new flag.
>> Do you mean something like this?
>>
>> diff --git a/src/share/vm/gc_implementation/parNew/parNewGeneration.cpp b/src/share/vm/gc_implementation/parNew/parNewGeneration.cpp
>> --- a/src/share/vm/gc_implementation/parNew/parNewGeneration.cpp
>> +++ b/src/share/vm/gc_implementation/parNew/parNewGeneration.cpp
>> @@ -1196,8 +1196,10 @@
>>          return real_forwardee(old);
>>      }
>>
>> +    if (!_promotion_failed) {
>>      new_obj = _next_gen->par_promote(par_scan_state->thread_num(),
>>                                         old, m, sz);
>> +    }
>>
>>      if (new_obj == NULL) {
>>        // promotion failed, forward to self
> Yes, but remember there are two of these places, as there are two
> variants of copy_to_survivor_space_XXX_undo().

Right. I only changed the one used for CMS since that's what we were 
running the reproducer with. If Jungwoo decides to go with this approach 
we should of course make it a complete patch. Possibly also adding a 
flag to control it as Jungwoo did in the original patch.

>
>> I tried this with the reproducer that Jungwoo has. It gives the same good results as the originally suggested patch.
>>
>> [… snipped timing results …]
>> So, I think it is fair to say that it is unnecessary to add the new state variable.
> Thanks for obtaining experimental support for the theory.

The reproducer from Jungwoo is really nice. It is easy to run and 
produce stable results.

>
>> I also agree that the existing flag has the same race behavior as the new one that was proposed. While I agree with Kim that racing on a variable is always scary I think it is pretty safe in this case. We only race on setting it from false to true. And if anybody sees a stale false value they will just take a more expensive path in the code. Nothing will break.
> I'm more inclined to agree now that I've read a lot more code.  A comment
> in the code would be good though.

Agreed.

>
>>> It seems like an even better approach would be to arrange to terminate
>>> the iteration over surviving objects early once promotion failure has
>>> occurred, rather than continuing to perform work that will ultimately
>>> be discarded. That looks to me like a much harder change to make
>>> though.
>> Unfortunately it is not possible to stop iteration over all objects in the young gen. We still need to scan them all to update their references. Even objects we have not yet scanned may have references to objects that we moved before we got the promotion failure. So we can not immediately abandon scanning and go to to a full GC when we hit a promotion failure. First we need to make sure that all objects have their references updated.
> Yes, that's the kind of thing I was worried about, though I obviously
> hadn't thought it through very carefully.
>
> I think this does leave open the possibility of moving the
> _promotion_failed flag check earlier in copy_to_survivor_space_XXX(),
> to also bypass the tenuring query and the possible allocation in
> to_space, instead just always treating as a promotion failure once the
> _promotion_failure flag is true.
>
> The rationale would be that, once promotion failure has occurred, it
> might be better to leave all the remaining unprocessed objects in
> place as promotion failures than (attempt to) copy any of them to
> to_space. If nothing else, it might reduce the number of to_space
> pages being dirtied.
>
> I'm not sure it would be overall beneficial though, as it looks like
> treating more objects as promotion failures could result in a lot more
> mark preservation (which looks to be relatively expensive to do in
> some cases), and processing the preserved marks appears to be
> single-threaded at present. So while I think moving the
> _promotion_failed flag check earlier would work from a correctness
> standpoint, it's not yet clear to me even what the sign of the
> resulting performance change might be.

Yes, there is probably room for more improvement here. I think adding 
the check in the copy_to_survivor_space_XXX_undo() methods is a clear 
improvement in many cases. So, doing that as a first step is a good 
start. We can consider moving it earlier as a future enhancement.

Bengt

>