Very slow promotion failures in ParNew / ParallelGC

Tony Printezis tprintezis at twitter.com
Wed Jan 13 16:11:26 UTC 2016


Thomas,

Thanks for the reply. Inline.

On January 13, 2016 at 5:08:04 AM, Thomas Schatzl (thomas.schatzl at oracle.com) wrote:

Hi, 

On Tue, 2016-01-12 at 13:15 -0500, Tony Printezis wrote: 
> Thomas, 
> 
> Inline. 
> 
> On January 12, 2016 at 7:00:45 AM, Thomas Schatzl ( 
> thomas.schatzl at oracle.com) wrote: 
>[...] 
> > > The fix is to use a different default mark value when biased 
> > > locking is enabled (0x5) or disabled (0x1, as it is now). During 
> > > promotion failures, marks are not preserved if they are equal to 
> > > the default value and the mark of forwarded objects is set to the 
> > > default value post promotion failure and before the preserved 
> > > marks are re-instated. 
> > You mean the value of the mark as it is set during promotion 
> > failure for the new objects? 
> Not sure what you mean by “for new objects”. 
> Current state: When we encounter promotion failures, we check whether 
> the mark is the default (0x1). If it is, we don’t preserve it. If it 
> is not, we preserve it. After promotion failure, we iterate over the 
> young gen and set the mark of all objects (ParNew) or all forwarded 
> objects (ParallelGC) to the default (0x1), then apply all preserved 
> marks. 
> What I’m proposing is that in the process I just described, the 
> default mark will be 0x5, if biased locking is enabled (as most 
> objects will be expected to have a 0x5 mark) and 0x1, if biased 
> locking is disabled (as it is the case right now). 

As you mentioned, the default value for new objects is typically not 
0x1 when biased locking is enabled, but klass()->prototype_header(). 


(OK, I now understand what you meant by “new objects”.) Indeed. But that’s not only the case for new objects. I’d guess that most objects will retain their initial mark? Maybe?
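
(To make the proposal concrete, here is a minimal sketch of the check, in plain C++ with made-up names and constants -- not the actual HotSpot code, which also has to deal with klass()->prototype_header() per class:)

    #include <cstdint>

    typedef uintptr_t mark_word_t;

    const mark_word_t UNLOCKED_MARK = 0x1;   // unlocked, unbiased prototype
    const mark_word_t BIASED_MARK   = 0x5;   // assumed prototype when biased locking is on

    bool g_use_biased_locking = true;        // stand-in for the UseBiasedLocking flag

    // The "default" mark that forwarded objects are reset to after a
    // promotion failure; marks equal to it do not need to be preserved.
    mark_word_t default_mark() {
      return g_use_biased_locking ? BIASED_MARK : UNLOCKED_MARK;
    }

    bool must_preserve_mark(mark_word_t m) {
      return m != default_mark();
    }

With biased locking on, most objects carry 0x5, so a check like the above would preserve far fewer marks than the current 0x1-only comparison.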




Then (as we agree) the promotion failure code only needs to remember 
the non-default mark values for later restoring. 


Indeed.




One other "problem" seems to be that some evacuation failure recovery 
code unconditionally sets the header of the objects that failed 
promotion but are not in the preserved headers list to 0x1.... 


It’d be hard to do otherwise? You’d have to do a look-up in a table to see whether the object’s mark should be set to the default or a stored value. I think that, assuming most objects have a default mark word, setting the mark word of all (forwarded?) objects in the young gen to the default and then applying the (hopefully small number of) preserved marks afterwards is not unreasonable.
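
(A rough sketch of the two-pass recovery I mean, with hypothetical types -- the real ParNew/ParallelGC code walks the heap rather than a vector, of course:)

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    typedef uintptr_t mark_word_t;

    struct Obj {
      mark_word_t mark;
      bool        forwarded;   // was forwarded (possibly to itself) during the failed GC
    };

    void restore_marks_after_promotion_failure(
        std::vector<Obj*>& young_gen_objects,
        const std::vector<std::pair<Obj*, mark_word_t> >& preserved,
        mark_word_t default_mark) {
      // Pass 1: blanket-reset the marks of forwarded objects to the default.
      for (size_t i = 0; i < young_gen_objects.size(); ++i) {
        if (young_gen_objects[i]->forwarded) {
          young_gen_objects[i]->mark = default_mark;
        }
      }
      // Pass 2: re-instate the (hopefully few) preserved non-default marks.
      for (size_t i = 0; i < preserved.size(); ++i) {
        preserved[i].first->mark = preserved[i].second;
      }
    }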

FWIW, it’d be nice if we could completely avoid self-forwarding (and a lot of those problems will just go away…).




> > When running without biased locking, the amount of preserved marks 
> > is even lower. 
> Of course, because the most populous mark will be 0x1 when biased 
> locking is disabled, not 0x5. The decision of whether to preserve a mark 
> or not was made before biased locking was introduced, when most 
> objects would have a 0x1 mark. Biased locking changed this behavior 
> and most objects now have a 0x5 mark, which invalidated the original 
> assumptions. 

Yes. 

> > That may be an option in some cases in addition to these suggested 
> > changes. 
> Not sure what you mean. 

In some cases, a "fix" to long promotion failure times might be to 
disable biased locking - because biased locking may not even be 
advantageous in some cases due to its own overhead. 


Well, if biased locking doesn’t pay off for an application (and we do have evidence that biased locking might not pay off for our services), then I assume a lot of classes will end up being unbiased and their prototype header set to 0x1, which might avoid the issue of a large number of marks being preserved.




> > > - Even though the per-worker preserved mark stacks eliminate the 
> > > big scalability bottleneck, reducing (potentially dramatically) 
> > > the number of marks that are preserved helps in a couple of ways: 
> > > a) avoids allocating a lot of memory for the preserved mark stacks 
> > > (which can get very, very large in some cases) and b) avoids 
> > > having to scan / reclaim the preserved mark stacks post promotion 
> > > failure, which reduces the overall GC time further. Even the 
> > > parallel time in ParNew improves by a bit because there are a 
> > > lot fewer stack pushes and malloc calls. 
> > ... during promotion failure. 
> Yes, I’m sorry I was not clear. ParNew times improve a bit when they 
> encounter promotion failures. 
> 
> > > 3) In the case where lots of marks need to be preserved, we found 
> > > that using 64K stack segments, instead of 4K segments, speeds up 
> > > the preserved mark stack reclamation by a non-trivial amount 
> > > (it's 3x/4x faster). 
> > In my tests some time ago, increasing stack segment size only 
> > helped a little, not by 3x/4x though, as reported after 
> > implementing the per-thread preserved stacks. 
> 
> To be clear: it’s only the reclamation of the preserved mark stacks 
> I’ve seen improve by 3x/4x. Given all the extra work we have to do 
> (remove forwarding references, apply preserved marks, etc.) this is a 
> very small part of the GC when a promotion failure happens. But, 
> still... 

Okay, my fault, I was reading this as 3x/4x improvement of the entire 
promotion failure recovery. Makes sense now. 


No problem!




> > A larger segment size may be a better trade-off for current, larger 
> > applications though. 
> Is there any way to auto-tune the segment size? So, the larger the 
> stack grows, the larger the segment size? 

Could be done, however it is not implemented yet. And of course the basic 
promotion failure handling code is very different between the 
collectors. Volunteers welcome :] 


I factored out some of the logic into a PreservedMarks class which can be re-used by all GCs, to cut down somewhat on the code duplication...
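
(For illustration only, a rough, hypothetical sketch of the shape of such a per-worker class, including the segment-size auto-tuning idea from above -- names, sizes and the growth policy are made up and do not reflect the actual PreservedMarks class in the webrev:)

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    typedef uintptr_t mark_word_t;

    class PreservedMarksSketch {   // one instance per GC worker, so pushes need no locking
     public:
      explicit PreservedMarksSketch(size_t initial_segment_entries = 4 * 1024)
          : _segment_entries(initial_segment_entries), _current_limit(0) {}

      void push(void* obj, mark_word_t mark) {
        if (_segments.empty() || _segments.back().size() == _current_limit) {
          if (!_segments.empty()) {
            // Grow the segment size as the stack grows, capped at 64K entries.
            _segment_entries = std::min(_segment_entries * 2, (size_t)(64 * 1024));
          }
          _current_limit = _segment_entries;
          _segments.push_back(std::vector<Entry>());
          _segments.back().reserve(_current_limit);
        }
        Entry e = { obj, mark };
        _segments.back().push_back(e);
      }

      // Re-apply every preserved mark via apply(obj, mark), then release the
      // storage; fewer, larger segments make this teardown cheaper.
      template <typename Fn>
      void restore_and_reclaim(Fn apply) {
        for (size_t s = 0; s < _segments.size(); ++s) {
          for (size_t i = 0; i < _segments[s].size(); ++i) {
            apply(_segments[s][i].obj, _segments[s][i].mark);
          }
        }
        _segments.clear();
      }

     private:
      struct Entry { void* obj; mark_word_t mark; };

      size_t _segment_entries;
      size_t _current_limit;
      std::vector<std::vector<Entry> > _segments;
    };

Each worker would own one of these and only push the marks that fail the default-mark check, which keeps both the memory footprint and the post-failure reclamation small.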




If this is done, I would also consider trying to allocate 
these per-thread blocks from even larger memory areas that can be 
disposed of even more quickly. 


Sure! 
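
(Something along these lines, I guess -- a purely hypothetical sketch of carving segments out of larger chunks that can be thrown away with a handful of free() calls, rather than one malloc/free per segment:)

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    class SegmentArenaSketch {
     public:
      explicit SegmentArenaSketch(size_t chunk_bytes = 8u * 1024 * 1024)
          : _chunk_bytes(chunk_bytes), _current(NULL), _used(0) {}

      ~SegmentArenaSketch() { dispose(); }

      // Hand out a segment from the current chunk; assumes bytes <= chunk_bytes.
      void* allocate_segment(size_t bytes) {
        if (_current == NULL || _used + bytes > _chunk_bytes) {
          _current = (char*) ::malloc(_chunk_bytes);
          _chunks.push_back(_current);
          _used = 0;
        }
        void* result = _current + _used;
        _used += bytes;
        return result;
      }

      // Disposing of all segments is just freeing the few underlying chunks.
      void dispose() {
        for (size_t i = 0; i < _chunks.size(); ++i) {
          ::free(_chunks[i]);
        }
        _chunks.clear();
        _current = NULL;
        _used = 0;
      }

     private:
      size_t             _chunk_bytes;
      char*              _current;
      size_t             _used;
      std::vector<char*> _chunks;
    };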




> > > We have fixes for all three issues above for ParNew. We're also 
> > > going to implement them for ParallelGC. For JDK 9, 1) is already 
> > > implemented, but 2) or 3) might also be worth doing. 
> > > 
> > > Is there interest in these changes? 
> OK, as I said to Jon, I’ll have the ParNew changes ported to JDK 9 
> soon. Should I create a new CR per GC (ParNew and ParallelGC) for the 
> per-worker preserved mark stacks and we’ll take it from there? 

Please do. 


JDK-8146989 and JDK-8146991. I’ll post a webrev for the first one later today.

Tony




Thanks, 
Thomas 

-----

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis
tprintezis at twitter.com
