Very slow promotion failures in ParNew / ParallelGC

Tue Jan 12 12:00:41 UTC 2016

Hi,

On Mon, 2016-01-11 at 12:59 -0500, Tony Printezis wrote:
> Hi all,
> 
> We have been recently investigating some very lengthy (several
> minutes) promotion failures in ParNew, which also appear in
> ParallelGC. We have identified a few issues and have some fixes to
> address them. Here's a quick summary:
> 
> 1) There's a scalability bottleneck when adding marks to the
> preserved mark stack as there is only one stack, shared by all
> workers, and pushes to it are protected by a mutex. This essentially
> serializes all workers if there is a non-trivial amount of marks to
> be preserved. The fix is similar to what's been implemented in G1 in
> JDK 9, which is to introduce per-worker preserved mark stacks.
>
> 2) (More interestingly) I was perplexed by the huge number of marks
> that I see getting preserved during promotion failure. I did a small
> study with a test I can reproduce the issue with. The majority of the
> preserved marks were 0x5 (i.e. "anonymously biased"). According to
> the current logic, a mark is preserved if it's biased, presumably
> because it's assumed that the object is biased towards a specific
> thread and we want to preserve that mark as it contains the thread
> pointer.

I think the reason is that nobody ever really measured the impact of
biased locking on promotion failures, and so never considered it.

> The fix is to use a different default mark value when biased locking
> is enabled (0x5) or disabled (0x1, as it is now). During promotion
> failures, marks are not preserved if they are equal to the default
> value and the mark of forwarded objects is set to the default value
> post promotion failure and before the preserved marks are
> re-instated.

You mean the value of the mark as it is set during promotion failure
for the new objects?

I did some very quick measurements of the distribution of marks on a
few (certainly also non-representative) workloads and can see your
point.

When running without biased locking, the number of preserved marks is
even lower. Disabling biased locking may therefore be an option in some
cases, in addition to these suggested changes.
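
As a rough illustration of the rule proposed above (preserve a mark only
if it differs from the default for the current biased locking setting),
here is a minimal Python sketch. Only the two values 0x1 and 0x5 come
from this thread; the function names and the sample value 0x9 are made
up for the example, and real mark words carry more state (age, hash,
lock bits) than this models.

```python
# Simplified model of the proposed "preserve only non-default marks" rule.
# Assumption: a mark is just an integer; real mark words encode more state.

UNLOCKED_MARK = 0x1            # default mark with biased locking disabled
ANONYMOUSLY_BIASED_MARK = 0x5  # default mark with biased locking enabled


def default_mark(biased_locking_enabled: bool) -> int:
    """Return the default mark for the given biased locking setting."""
    return ANONYMOUSLY_BIASED_MARK if biased_locking_enabled else UNLOCKED_MARK


def must_preserve(mark: int, biased_locking_enabled: bool) -> bool:
    """A mark needs preserving only if it is not the default value."""
    return mark != default_mark(biased_locking_enabled)


def count_preserved(marks, biased_locking_enabled: bool) -> int:
    """Count how many marks would end up on the preserved mark stack."""
    return sum(1 for m in marks if must_preserve(m, biased_locking_enabled))


# Hypothetical sample: mostly anonymously biased objects, as in the study.
sample = [ANONYMOUSLY_BIASED_MARK] * 98 + [UNLOCKED_MARK, 0x9]
print(count_preserved(sample, biased_locking_enabled=True))
print(count_preserved(sample, biased_locking_enabled=False))
```

With 0x5 as the default, only the two non-default marks in the sample
are preserved; if 0x1 is always treated as the default, every 0x5 mark
has to be preserved as well.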

> A few extra observations on this:
> 
> - I don't know if the majority of objects we'll come across during
> promotion failures will be anonymously biased (it is the case for
> synthetic benchmarks). So, the above might pay off in certain cases
> but not all. But I think it's still worth doing.

I tend to agree: after looking through the biased locking code a bit,
it seems that by default new objects are anonymously biased when biased
locking is on, so this will most likely help decrease the number of
marks to be preserved.

> - Even though the per-worker preserved mark stacks eliminate the big
> scalability bottleneck, reducing (potentially dramatically) the
> number of marks that are preserved helps in a couple of ways: a)
> avoids allocating a lot of memory for the preserved mark stacks
> (which can get very, very large in some cases) and b) avoids having
> to scan / reclaim the preserved mark stacks post promotion failure,
> which reduces the overall GC time further. Even the parallel time in
> ParNew improves by a bit because there are a lot fewer stack pushes
> and malloc calls.

... during promotion failure.
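
To make the difference between the single shared stack and the
per-worker stacks from 1) concrete, here is a small structural sketch in
Python. The class and function names are hypothetical and this is not
the HotSpot code; Python threads also do not show the real speed-up, so
the sketch only illustrates where the lock is taken and where it is not
needed.

```python
import threading


class SharedPreservedMarks:
    """One stack shared by all workers; every push takes the same lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stack = []

    def push(self, worker_id, obj, mark):
        with self._lock:  # all workers serialize here
            self._stack.append((obj, mark))

    def size(self):
        return len(self._stack)


class PerWorkerPreservedMarks:
    """One stack per worker; a worker only ever pushes to its own stack."""

    def __init__(self, num_workers):
        self._stacks = [[] for _ in range(num_workers)]

    def push(self, worker_id, obj, mark):
        self._stacks[worker_id].append((obj, mark))  # no lock needed

    def size(self):
        return sum(len(s) for s in self._stacks)


def run_workers(marks, num_workers, pushes_per_worker):
    """Have each worker push its marks, then return the total preserved."""

    def work(worker_id):
        for i in range(pushes_per_worker):
            marks.push(worker_id, (worker_id, i), 0x9)  # 0x9: made-up mark

    threads = [threading.Thread(target=work, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return marks.size()


print(run_workers(SharedPreservedMarks(), 4, 1000))
print(run_workers(PerWorkerPreservedMarks(4), 4, 1000))
```

Both variants end up holding the same number of entries; the difference
is that the per-worker variant has no shared lock on the push path, and
the stacks are only walked together once the workers are done.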

> 3) In the case where lots of marks need to be preserved, we found
> that using 64K stack segments, instead of 4K segments, speeds up the
> preserved mark stack reclamation by a non-trivial amount (it's 3x/4x
> faster).

In my tests some time ago, increasing the stack segment size only
helped a little, though not by the 3x/4x reported, after implementing
the per-thread preserved stacks.

A larger segment size may be a better trade-off for current, larger
applications though.
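
For a sense of scale, the number of segments that have to be allocated
and later reclaimed can be estimated as below. This is a back-of-the-
envelope sketch that assumes "4K" and "64K" mean entries per segment and
uses a made-up count of 10 million preserved marks; it says nothing
about the actual reclamation time, which was measured, not computed.

```python
def segments_needed(num_marks: int, segment_entries: int) -> int:
    """Number of fixed-size segments needed to hold num_marks entries."""
    return -(-num_marks // segment_entries)  # ceiling division


num_marks = 10_000_000  # hypothetical number of preserved marks

small = segments_needed(num_marks, 4 * 1024)   # 4K entries per segment
large = segments_needed(num_marks, 64 * 1024)  # 64K entries per segment
print(small, large)  # 2442 153
```

Under these assumptions the larger segments cut the segment count by
roughly a factor of 16, at the cost of more unused space in the last,
partially filled segment of each stack.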

> We have fixes for all three issues above for ParNew. We're also going
> to implement them for ParallelGC. For JDK 9, 1) is already
> implemented, but 2) or 3) might also be worth doing.
> 
> Is there interest in these changes?

Yes.

Thanks,
  Thomas