Very slow promotion failures in ParNew / ParallelGC

Mikael Gerdin mikael.gerdin at oracle.com
Thu Jan 14 08:29:43 UTC 2016


Tony,

On 2016-01-13 16:58, Tony Printezis wrote:
> Mikael,
>
> This is how I had implemented it initially (checking the prototype
> header on the klass). But I somehow got cold feet about it. I’m not 100%
> sure when the prototype header of a klass changes (I assume during bias
> revocation, but still…). All the biased locking code is a bit of a
> mystery to me… Anyway, if you want to give that approach a go and you
> can do some extensive testing on the change, I’d be willing to give it a
> go… :-)

I'm fairly certain that it should work ;)
Unfortunately it's difficult to test the correctness of promotion 
failure handling because promotion failures are so rare. Maybe we could 
devise a way to verify this assumption in an instrumented build and run 
that through a bunch of GC testing?
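
For illustration, a minimal sketch of what such instrumentation might look
like, assuming a hypothetical hook that runs when preserved marks are
re-instated (the function name and the recorded flag are made up for this
example):

    #ifdef ASSERT
    // Debug-build check, assuming the instrumented build records whether a
    // preserved mark matched its klass's prototype header at preserve time.
    // At restore time, assert that the prototype header has not changed in
    // the meantime; if this never fires across a bunch of GC testing,
    // skipping such marks should be safe.
    void verify_preserved_mark(oop obj, markOop mark, bool matched_prototype) {
      assert(!matched_prototype || mark == obj->klass()->prototype_header(),
             "prototype header changed while the mark was preserved");
    }
    #endif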

/Mikael

>
> Tony
>
> On January 13, 2016 at 4:51:07 AM, Mikael Gerdin
> (mikael.gerdin at oracle.com) wrote:
>
>> Hi Tony,
>>
>> On 2016-01-12 19:15, Tony Printezis wrote:
>> > Thomas,
>> >
>> > Inline.
>> >
>> > On January 12, 2016 at 7:00:45 AM, Thomas Schatzl
>> > (thomas.schatzl at oracle.com) wrote:
>> >
>> >> Hi,
>> >>
>> >> On Mon, 2016-01-11 at 12:59 -0500, Tony Printezis wrote:
>> >> > Hi all,
>> >> >
>> >> > We have been recently investigating some very lengthy (several
>> >> > minutes) promotion failures in ParNew, which also appear in
>> >> > ParallelGC. We have identified a few issues and have some fixes to
>> >> > address them. Here's a quick summary:
>> >> >
>> >> > 1) There's a scalability bottleneck when adding marks to the
>> >> > preserved mark stack as there is only one stack, shared by all
>> >> > workers, and pushes to it are protected by a mutex. This essentially
>> >> > serializes all workers if there is a non-trivial number of marks to
>> >> > be preserved. The fix is similar to what's been implemented in G1 in
>> >> > JDK 9, which is to introduce per-worker preserved mark stacks.
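
For illustration only, a sketch of the per-worker idea (the class and member
names here are hypothetical, not the actual JDK 9 G1 code):

    #include "oops/oop.inline.hpp"
    #include "utilities/stack.inline.hpp"

    // An (object, mark) pair saved during promotion failure.
    class PreservedMark {
      oop     _obj;
      markOop _mark;
     public:
      PreservedMark(oop obj, markOop mark) : _obj(obj), _mark(mark) { }
      void restore() { _obj->set_mark(_mark); }
    };

    // One stack per GC worker, so pushes need no mutex.
    class PerWorkerPreservedMarks {
      Stack<PreservedMark, mtGC>* _stacks;  // array, one entry per worker
     public:
      void push(uint worker_id, oop obj, markOop mark) {
        // No locking: each worker only ever pushes to its own stack.
        _stacks[worker_id].push(PreservedMark(obj, mark));
      }
    };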
>> >> >
>> >> > 2) (More interestingly) I was perplexed by the huge number of marks
>> >> > that I see getting preserved during promotion failure. I did a small
>> >> > study with a test I can reproduce the issue with. The majority of the
>> >> > preserved marks were 0x5 (i.e. "anonymously biased"). According to
>> >> > the current logic, a mark is always preserved if it's biased, presumably
>> >> > because it's assumed that the object is biased towards a specific
>> >> > thread and we want to preserve that mark as it contains the thread
>> >> > pointer.
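
(For reference, the two mark values under discussion, in terms of the
markOop constants from HotSpot's markOop.hpp; the helper is just an
illustration and ignores the age/epoch bits:)

    #include "oops/markOop.hpp"

    // Low bits of the mark word:
    //   0x1 (0b001): neutral "unlocked, no hash" mark, markOopDesc::prototype()
    //   0x5 (0b101): biased pattern with a NULL thread pointer, i.e.
    //                "anonymously biased", markOopDesc::biased_locking_prototype()
    static bool is_anonymously_biased(markOop m) {
      return m == markOopDesc::biased_locking_prototype();
    }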
>> >>
>> >> I think the reason is that nobody ever really measured the impact of
>> >> biased locking on promotion failures, and so never considered it.
>> >
>> >
>> > I bet. :-)
>> >
>> >
>> >>
>> >> > The fix is to use a different default mark value when biased locking
>> >> > is enabled (0x5) or disabled (0x1, as it is now). During promotion
>> >> > failures, marks are not preserved if they are equal to the default
>> >> > value, and the mark of forwarded objects is set to the default value
>> >> > post promotion failure, before the preserved marks are reinstated.
>> >>
>> >> You mean the value of the mark as it is set during promotion failure
>> >> for the new objects?
>> >
>> >
>> > Not sure what you mean by “for new objects”.
>> >
>> > Current state: When we encounter promotion failures, we check whether
>> > the mark is the default (0x1). If it is, we don’t preserve it. If it is
>> > not, we preserve it. After promotion failure, we iterate over the young
>> > gen and set the mark of all objects (ParNew) or all forwarded objects
>> > (ParallelGC) to the default (0x1), then apply all preserved marks.
>> >
>> > What I’m proposing is that in the process I just described, the default
>> > mark will be 0x5, if biased locking is enabled (as most objects will be
>> > expected to have a 0x5 mark) and 0x1, if biased locking is disabled (as
>> > it is the case right now).
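
(A minimal sketch of that proposal; the helper names are hypothetical, and
the real change would have to use this value both when deciding what to
preserve and when resetting marks after the failure:)

    // The "default" mark that promotion-failure handling compares against
    // and later re-installs: 0x5 with biased locking on, 0x1 otherwise.
    static markOop default_mark() {
      return UseBiasedLocking ? markOopDesc::biased_locking_prototype()
                              : markOopDesc::prototype();
    }

    static void maybe_preserve(oop obj, markOop mark) {
      if (mark != default_mark()) {
        preserve_mark(obj, mark);  // hypothetical: only non-default marks kept
      }
    }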
>>
>> I'm thinking that if the current header matches the prototype header
>> for the class, then we would not need to preserve it.
>>
>> This would hopefully let us avoid saving/restoring anonymously biased
>> marks at least.
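
(Sketching that check: Klass::prototype_header() is the mark that new
instances of a class start out with, so a mark equal to it carries no
per-object state, modulo the bias-revocation caveat Tony raised above;
preserve_mark() is again a hypothetical helper:)

    static void maybe_preserve(oop obj, markOop mark) {
      // Skip marks equal to the klass's prototype header: the object still
      // has its "freshly allocated" mark (0x5 if the class is biasable, 0x1
      // otherwise) and can simply be given that mark again afterwards.
      if (mark != obj->klass()->prototype_header()) {
        preserve_mark(obj, mark);
      }
    }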
>>
>> /Mikael
>>
>> >
>> >
>> >>
>> >> I did some very quick measurements on the distribution of marks on a
>> >> few (certainly also non-representative) workloads and I can see your point.
>> >
>> >
>> > I also did that for synthetic tests and I see the same. I’ll try to get
>> > some data from production.
>> >
>> >
>> >>
>> >> When running without biased locking, the amount of preserved marks is
>> >> even lower.
>> >
>> >
>> > Of course, because the most populous mark will be 0x1 when biased
>> > locking is disabled, not 0x5. The decision of whether to preserve a mark
>> > or not was made before biased locking was introduced, when most objects
>> > would have a 0x1 mark. Biased locking changed this behavior and most
>> > objects have a 0x5 mark, which invalidated the original assumptions.
>> >
>> >
>> >> That may be an option in some cases in addition to these suggested
>> >> changes.
>> >
>> >
>> > Not sure what you mean.
>> >
>> >
>> >>
>> >> > A few extra observations on this:
>> >> >
>> >> > - I don't know if the majority of objects we'll come across during
>> >> > promotion failures will be anonymously biased (it is the case for
>> >> > synthetic benchmarks). So, the above might pay off in certain cases
>> >> > but not all. But I think it's still worth doing.
>> >>
>> >> I tend to agree since after looking through the biased locking code a
>> >> bit, it seems that by default new objects are anonymously biased with
>> >> biased locking on, so this will most likely help decrease the number
>> >> of marks to be preserved.
>> >
>> >
>> > Yes, I agree with this.
>> >
>> >
>> >>
>> >> > - Even though the per-worker preserved mark stacks eliminate the big
>> >> > scalability bottleneck, reducing (potentially dramatically) the
>> >> > number of marks that are preserved helps in a couple of ways: a)
>> >> > avoids allocating a lot of memory for the preserved mark stacks
>> >> > (which can get very, very large in some cases) and b) avoids having
>> >> > to scan / reclaim the preserved mark stacks post promotion failure,
>> >> > which reduces the overall GC time further. Even the parallel time in
>> >> > ParNew improves by a bit because there are a lot fewer stack pushes
>> >> > and malloc calls.
>> >>
>> >> ... during promotion failure.
>> >
>> >
>> > Yes, I’m sorry I was not clear. ParNew times improve a bit when they
>> > encounter promotion failures.
>> >
>> >
>> >>
>> >> > 3) In the case where lots of marks need to be preserved, we found
>> >> > that using 64K stack segments, instead of 4K segments, speeds up the
>> >> > preserved mark stack reclamation by a non-trivial amount (it's 3x/4x
>> >> > faster).
>> >>
>> >> In my tests some time ago, after implementing the per-thread preserved
>> >> stacks, increasing the stack segment size only helped a little, not the
>> >> 3x/4x reported here.
>> >
>> >
>> > To be clear: it’s only the reclamation of the preserved mark stacks I’ve
>> > seen improve by 3x/4x. Given all the extra work we have to do (remove
>> > forwarding references, apply preserved marks, etc.), this is a very small
>> > part of the GC when a promotion failure happens. But, still...
>> >
>> >
>> >>
>> >> A larger segment size may be a better trade-off for current, larger
>> >> applications though.
>> >
>> >
>> > Is there any way to auto-tune the segment size? So, the larger the stack
>> > grows, the larger the segment size?
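
(One way to get that effect, sketched as a hypothetical growth policy:
each newly allocated segment doubles in size up to a cap, so stacks that
stay small keep the small-segment cost while stacks that grow huge
amortize their malloc/free calls:)

    // Hypothetical geometric growth of the segment size, bounded by the
    // 4K/64K segment sizes discussed above.
    static size_t next_segment_size(size_t current_segment_size) {
      const size_t min_size = 4 * K;
      const size_t max_size = 64 * K;
      if (current_segment_size == 0) return min_size;
      return MIN2(current_segment_size * 2, max_size);
    }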
>> >
>> >
>> >>
>> >> > We have fixes for all three issues above for ParNew. We're also going
>> >> > to implement them for ParallelGC. For JDK 9, 1) is already
>> >> > implemented, but 2) or 3) might also be worth doing.
>> >> >
>> >> > Is there interest in these changes?
>> >
>> >
>> > OK, as I said to Jon, I’ll have the ParNew changes ported to JDK 9 soon.
>> > Should I create a new CR per GC (ParNew and ParallelGC) for the
>> > per-worker preserved mark stacks and we’ll take it from there?
>> >
>> > Tony
>> >
>> >
>> >>
>> >> Yes.
>> >>
>> >> Thanks,
>> >> Thomas
>> >>
>> >
>> > -----
>> >
>> > Tony Printezis | JVM/GC Engineer / VM Team | Twitter
>> >
>> > @TonyPrintezis
>> > tprintezis at twitter.com
>> >
>>
> -----
>
> Tony Printezis | JVM/GC Engineer / VM Team | Twitter
>
> @TonyPrintezis
> tprintezis at twitter.com
>



