Very slow promotion failures in ParNew / ParallelGC

Thomas Schatzl thomas.schatzl at oracle.com
Wed Jan 13 10:07:56 UTC 2016


Hi,

On Tue, 2016-01-12 at 13:15 -0500, Tony Printezis wrote:
> Thomas,
> 
> Inline.
> 
> On January 12, 2016 at 7:00:45 AM, Thomas Schatzl (
> thomas.schatzl at oracle.com) wrote:
> > 
[...]
> > 
> > > The fix is to use a different default mark value when biased 
> > > locking is enabled (0x5) or disabled (0x1, as it is now). During
> > > promotion failures, marks are not preserved if they are equal to
> > > the default value, and the mark of forwarded objects is set to the
> > > default value post promotion failure and before the preserved 
> > > marks are re-instated. 
> > 
> > You mean the value of the mark as it is set during promotion 
> > failure for the new objects? 
> Not sure what you mean by “for new objects”.
> Current state: When we encounter promotion failures, we check whether
> the mark is the default (0x1). If it is, we don’t preserve it. If it
> is not, we preserve it. After promotion failure, we iterate over the
> young gen and set the mark of all objects (ParNew) or all forwarded
> objects (ParallelGC) to the default (0x1), then apply all preserved
> marks.
> What I’m proposing is that in the process I just described, the
> default mark will be 0x5, if biased locking is enabled (as most
> objects will be expected to have a 0x5 mark) and 0x1, if biased
> locking is disabled (as it is the case right now).

As you mentioned, the default value for new objects is typically not
0x1 when biased locking is enabled, but klass()->prototype_header().

Then (as we agree) the promotion failure code only needs to remember
the non-default mark values for later restoring.
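
To illustrate, a minimal standalone sketch of that test (hypothetical
types and names, not the actual markOop code; the flag merely stands
in for UseBiasedLocking):

    #include <cstdint>

    typedef uintptr_t MarkWord;

    const MarkWord kNeutralMark = 0x1;  // unlocked, unbiased header
    const MarkWord kBiasedMark  = 0x5;  // biasable prototype header

    bool use_biased_locking = true;     // stands in for UseBiasedLocking

    // Preserve a mark only if it differs from the expected default;
    // the default depends on whether biased locking is enabled.
    bool should_preserve_mark(MarkWord m) {
      MarkWord default_mark = use_biased_locking ? kBiasedMark
                                                 : kNeutralMark;
      return m != default_mark;
    }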

One other "problem" seems to be that some evacuation failure recovery
code unconditionally sets the header of the objects that failed
promotion but are not in the preserved headers list to 0x1....
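
A sketch of what that recovery step could do instead (hypothetical
types again): restore the klass's prototype header rather than a
hard-coded 0x1.

    #include <cstdint>

    typedef uintptr_t MarkWord;

    struct Klass  { MarkWord prototype_header; };  // 0x5 or 0x1
    struct Object { MarkWord mark; Klass* klass; };

    // For a failed-promotion object that is not on any preserved-marks
    // list, reset to the klass default instead of unconditionally 0x1.
    void reset_mark_after_promotion_failure(Object* obj) {
      obj->mark = obj->klass->prototype_header;
    }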

> > When running without biased locking, the amount of preserved marks
> > is even lower.
> Of course, because the most populous mark will be 0x1 when biased
> locking is disabled, not 0x5. The decision of whether to preserve a
> mark or not was made before biased locking was introduced, when most
> objects would have a 0x1 mark. Biased locking changed this behavior
> and most objects now have a 0x5 mark, which invalidated the original
> assumptions.

Yes.
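
For reference, the bit layout behind the two constants, as I
understand it (simplified sketch, not the real markOopDesc code):

    #include <cstdint>

    // Bits 0-1 of the mark word are the lock bits (01 = unlocked) and
    // bit 2 is the biased-lock bit, so a neutral header is 0b001 = 0x1
    // and a biasable header is 0b101 = 0x5.
    bool has_biased_pattern(uintptr_t mark) {
      return (mark & 0x7) == 0x5;
    }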

> >  That may be an option in some cases in addition to these suggested
> > changes. 
> Not sure what you mean.

In some cases, a "fix" to long promotion failure times might be to
disable biased locking - because biased locking may not even be
advantageous in some cases due to its own overhead.
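
For completeness, that is just the standard switch:

    java -XX:-UseBiasedLocking ...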

> > > - Even though the per-worker preserved mark stacks eliminate the 
> > > big scalability bottleneck, reducing (potentially dramatically) 
> > > the number of marks that are preserved helps in a couple of ways:
> > > a) avoids allocating a lot of memory for the preserved mark stacks
> > > (which can get very, very large in some cases) and b) avoids
> > > having to scan / reclaim the preserved mark stacks post promotion
> > > failure, which reduces the overall GC time further. Even the 
> > > parallel time in ParNew improves by a bit because there are a 
> > > lot fewer stack pushes and malloc calls.
> > 
> > ... during promotion failure. 
> Yes, I’m sorry I was not clear. ParNew times improve a bit when they
> encounter promotion failures.
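
To make the per-worker structure concrete, a sketch using standard C++
containers (the real collectors use their own segmented stacks):

    #include <cstdint>
    #include <vector>

    typedef uintptr_t MarkWord;

    struct PreservedMark { void* obj; MarkWord mark; };

    // Each GC worker owns a private stack, so pushes during promotion
    // failure need no shared lock or CAS.
    struct GCWorker {
      std::vector<PreservedMark> preserved_marks;

      void preserve(void* obj, MarkWord m) {
        preserved_marks.push_back(PreservedMark{obj, m});
      }
    };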
> 
> > 
> > > 3) In the case where lots of marks need to be preserved, we found
> > > that using 64K stack segments, instead of 4K segments, speeds up
> > > the preserved mark stack reclamation by a non-trivial amount
> > > (it's 3x/4x faster). 
> > 
> > In my tests some time ago, increasing stack segment size only 
> > helped a little, not 3x/4x times though as reported after 
> > implementing the per-thread preserved stacks. 
>
> To be clear: it’s only the reclamation of the preserved mark stacks
> I’ve seen improve by 3x/4x. Given all the extra work we have to do
> (remove forwarding references, apply preserved marks, etc.) this is a
> very small part of the GC when a promotion failure happens. But,
> still...

Okay, my fault, I was reading this as a 3x/4x improvement of the entire
promotion failure recovery. Makes sense now.
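
The arithmetic fits, too: reclamation is a walk over the segment list
with one free per segment, so 64K segments mean roughly 16x fewer
segments (and free calls) than 4K ones. A hypothetical sketch of that
loop:

    #include <cstdlib>

    struct Segment { Segment* next; /* payload follows */ };

    // One free() per segment: fewer, larger segments reclaim faster.
    void reclaim(Segment* head) {
      while (head != nullptr) {
        Segment* next = head->next;
        std::free(head);
        head = next;
      }
    }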

> > A larger segment size may be a better trade-off for current, larger
> > applications though.
> Is there any way to auto-tune the segment size? So, the larger the
> stack grows, the larger the segment size?

Could be done; however, it is not implemented yet. And of course the
basic promotion failure handling code is very different between the
collectors. Volunteers welcome :]

If this is done, I would also consider allocating these per-thread
segments from even larger memory areas that can be disposed of even
more quickly.
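
If somebody picks this up, one simple shape for the segment-size
auto-tuning (purely hypothetical, nothing like this is implemented):

    #include <algorithm>
    #include <cstddef>

    // Grow the per-thread segment size geometrically as the stack
    // grows, e.g. start at 4K entries and double per new segment,
    // capped at 64K.
    std::size_t next_segment_size(std::size_t current_size) {
      const std::size_t kMaxSegmentSize = 64 * 1024;
      return std::min(current_size * 2, kMaxSegmentSize);
    }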

> > > We have fixes for all three issues above for ParNew. We're also
> > > going to implement them for ParallelGC. For JDK 9, 1) is already
> > > implemented, but 2) or 3) might also be worth doing.
> > > 
> > > Is there interest in these changes? 
> OK, as I said to Jon, I’ll have the ParNew changes ported to JDK 9
> soon. Should I create a new CR per GC (ParNew and ParallelGC) for the
> per-worker preserved mark stacks and we’ll take it from there?

Please do.

Thanks,
  Thomas
