Mikael,

This is how I had implemented it initially (checking the prototype header on the klass). But I somehow got cold feet about it. I'm not 100% sure when the prototype header of a klass changes (I assume during bias revocation, but still...). All the biased locking code is a bit of a mystery to me... Anyway, if you want to give that approach a go and you can do some extensive testing on the change, I'd be willing to try it. :-)

Tony

On January 13, 2016 at 4:51:07 AM, Mikael Gerdin (mikael.gerdin@oracle.com) wrote:

Hi Tony,

On 2016-01-12 19:15, Tony Printezis wrote:
> Thomas,
>
> Inline.
>
> On January 12, 2016 at 7:00:45 AM, Thomas Schatzl
> (thomas.schatzl@oracle.com) wrote:
>
>> Hi,
>>
>> On Mon, 2016-01-11 at 12:59 -0500, Tony Printezis wrote:
>> > Hi all,
>> >
>> > We have been recently investigating some very lengthy (several
>> > minutes) promotion failures in ParNew, which also appear in
>> > ParallelGC. We have identified a few issues and have some fixes to
>> > address them. Here's a quick summary:
>> >
>> > 1) There's a scalability bottleneck when adding marks to the
>> > preserved mark stack, as there is only one stack, shared by all
>> > workers, and pushes to it are protected by a mutex. This essentially
>> > serializes all workers if there is a non-trivial number of marks to
>> > be preserved. The fix is similar to what's been implemented in G1 in
>> > JDK 9, which is to introduce per-worker preserved mark stacks.
>> >
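>> > A minimal sketch of the per-worker scheme (loosely in the spirit of
>> > the PreservedMarks/PreservedMarksSet classes that went into JDK 9's
>> > G1; all names and types here are illustrative, not the actual code):
>> >
>> >   #include <cstddef>
>> >   #include <cstdint>
>> >   #include <vector>
>> >
>> >   struct PreservedMark {
>> >     void*     _obj;   // object whose mark is being preserved
>> >     uintptr_t _mark;  // its original mark word
>> >   };
>> >
>> >   // One stack per worker: each worker pushes to its own stack with no
>> >   // synchronization, removing the mutex on the single shared stack.
>> >   class PreservedMarksSet {
>> >     std::vector<std::vector<PreservedMark> > _stacks; // one per worker
>> >   public:
>> >     explicit PreservedMarksSet(size_t num_workers) : _stacks(num_workers) { }
>> >     void push(size_t worker_id, void* obj, uintptr_t mark) {
>> >       _stacks[worker_id].push_back(PreservedMark{obj, mark});
>> >     }
>> >   };
>> >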
>> > 2) (More interestingly) I was perplexed by the huge number of marks
>> > that I see getting preserved during promotion failure. I did a small
>> > study with a test I can reproduce the issue with. The majority of the
>> > preserved marks were 0x5 (i.e., "anonymously biased"). According to
>> > the current logic, a mark is always preserved if it's biased,
>> > presumably because it's assumed that the object is biased towards a
>> > specific thread and we want to preserve the mark as it contains the
>> > thread pointer.
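>> >
>> > For reference, the two values in question, per HotSpot's markOop.hpp
>> > (the low mark word bits are [biased_lock:1 | lock:2]):
>> >
>> >   const uintptr_t unlocked_value      = 0x1;  // 0b001: unlocked, unbiased
>> >   const uintptr_t biased_lock_pattern = 0x5;  // 0b101: biased_lock bit
>> >                                               // plus unlocked lock bits
>> >   // 0x1 is markOopDesc::prototype(): unlocked, no hash, age 0.
>> >   // 0x5 with a NULL thread in the upper bits is "anonymously biased":
>> >   // the object is biasable but not yet biased toward any thread.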
>>
>> I think the reason is that nobody ever really measured the impact of
>> biased locking on promotion failures, and so never considered it.
>
> I bet. :-)
>
>>
>> > The fix is to use a different default mark value when biased locking
>> > is enabled (0x5) or disabled (0x1, as it is now). During promotion
>> > failures, marks are not preserved if they are equal to the default
>> > value, and the mark of forwarded objects is set to the default value
>> > post promotion failure, before the preserved marks are reinstated.
>>
>> You mean the value of the mark as it is set during promotion failure
>> for the new objects?
>
> Not sure what you mean by "for new objects".
>
> Current state: When we encounter promotion failures, we check whether
> the mark is the default (0x1). If it is, we don't preserve it. If it is
> not, we preserve it. After promotion failure, we iterate over the young
> gen and set the mark of all objects (ParNew) or all forwarded objects
> (ParallelGC) to the default (0x1), then apply all preserved marks.
>
> What I'm proposing is that, in the process I just described, the default
> mark will be 0x5 if biased locking is enabled (as most objects will be
> expected to have a 0x5 mark) and 0x1 if biased locking is disabled (as
> is the case right now).
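>
> In code, the proposed check would look roughly like this (a sketch;
> UseBiasedLocking and the markOopDesc accessors are real HotSpot names,
> everything else is illustrative):
>
>   markOop default_mark = UseBiasedLocking
>       ? markOopDesc::biased_locking_prototype()  // 0x5
>       : markOopDesc::prototype();                // 0x1
>
>   // While copying: only preserve non-default marks.
>   if (obj->mark() != default_mark) {
>     preserved_marks->push(obj, obj->mark());     // hypothetical helper
>   }
>
>   // Post promotion failure: set the marks of all (forwarded) objects
>   // to default_mark, then re-apply the preserved marks on top.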

I'm thinking that we could check whether the current header matches the
prototype header for the class; if it does, we would not need to
preserve it.

This would hopefully let us avoid saving/restoring anonymously biased
marks, at least.
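
Roughly, something like the following (oopDesc::mark() and
Klass::prototype_header() are the real JDK 8/9 accessors; the helper
name is just illustrative):

  // A mark that equals the klass's prototype header carries no
  // per-object information, so it would not need to be preserved.
  bool mark_needs_preserving(oop obj) {
    return obj->mark() != obj->klass()->prototype_header();
  }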

/Mikael

>>
>> Did some very quick measurements on the distribution of marks on a few
>> (certainly also non-representative) workloads and can see your point.
>
> I also did that for synthetic tests and I see the same. I'll try to get
> some data from production.
>
>> When running without biased locking, the number of preserved marks is
>> even lower.
>
> Of course, because the most populous mark will be 0x1 when biased
> locking is disabled, not 0x5. The logic for whether to preserve a mark
> was designed before biased locking was introduced, when most objects
> had a 0x1 mark. Biased locking changed this: most objects now have a
> 0x5 mark, which invalidated the original assumptions.
>
>> That may be an option in some cases in addition to these suggested
>> changes.
>
> Not sure what you mean.
>
>> > A few extra observations on this:
>> >
>> > - I don't know if the majority of objects we'll come across during
>> > promotion failures will be anonymously biased (it is the case for
>> > synthetic benchmarks). So, the above might pay off in certain cases
>> > but not all. But I think it's still worth doing.
>>
>> I tend to agree: after looking through the biased locking code a bit,
>> it seems that by default new objects are anonymously biased when
>> biased locking is on, so this will most likely help decrease the
>> number of marks to be preserved.
>
> Yes, I agree with this.
>
>> > - Even though the per-worker preserved mark stacks eliminate the big
>> > scalability bottleneck, reducing (potentially dramatically) the
>> > number of marks that are preserved helps in a couple of ways: a) it
>> > avoids allocating a lot of memory for the preserved mark stacks
>> > (which can get very, very large in some cases) and b) it avoids
>> > having to scan / reclaim the preserved mark stacks post promotion
>> > failure, which reduces the overall GC time further. Even the parallel
>> > time in ParNew improves a bit because there are a lot fewer stack
>> > pushes and malloc calls.
>>
>> ... during promotion failure.
>
> Yes, I'm sorry I was not clear. ParNew times improve a bit when they
> encounter promotion failures.
>
>> > 3) In the case where lots of marks need to be preserved, we found
>> > that using 64K stack segments, instead of 4K segments, speeds up the
>> > preserved mark stack reclamation by a non-trivial amount (it's 3x/4x
>> > faster).
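>> >
>> > A back-of-the-envelope view of why segment size matters here
>> > (illustrative arithmetic, assuming segment sizes count entries):
>> > reclaiming a segmented stack frees one malloc'd chunk per segment, so:
>> >
>> >   // Hypothetical helper: chunks a segmented stack must free.
>> >   size_t segments_to_free(size_t num_marks, size_t marks_per_segment) {
>> >     return (num_marks + marks_per_segment - 1) / marks_per_segment;
>> >   }
>> >   // e.g. 10M preserved marks: 4K-entry segments -> 2442 frees;
>> >   //      64K-entry segments -> 153 frees, i.e. 16x fewer.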
>>
>> In my tests some time ago, increasing the stack segment size only
>> helped a little, not by 3x/4x as reported, after implementing the
>> per-thread preserved stacks.
>
> To be clear: it's only the reclamation of the preserved mark stacks that
> I've seen improve by 3x/4x. Given all the extra work we have to do
> (remove forwarding references, apply preserved marks, etc.) this is a
> very small part of the GC when a promotion failure happens. But, still...
>
>> A larger segment size may be a better trade-off for current, larger
>> applications though.
>
> Is there any way to auto-tune the segment size? So, the larger the stack
> grows, the larger the segment size?
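>
> A purely hypothetical sketch of such a policy: double the segment size
> each time a new segment is allocated, capped at a maximum, so small
> stacks stay cheap while large ones converge to big segments:
>
>   size_t next_segment_size(size_t current_size) {
>     const size_t max_size = 64 * 1024;  // cap, e.g. 64K entries
>     return (current_size * 2 < max_size) ? current_size * 2 : max_size;
>   }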
>
>>
>> > We have fixes for all three issues above for ParNew. We're also going
>> > to implement them for ParallelGC. For JDK 9, 1) is already
>> > implemented, but 2) or 3) might also be worth doing.
>> >
>> > Is there interest in these changes?
>
> OK, as I said to Jon, I'll have the ParNew changes ported to JDK 9 soon.
> Should I create a new CR per GC (ParNew and ParallelGC) for the
> per-worker preserved mark stacks, and we'll take it from there?
>
> Tony
>
>> Yes.
>>
>> Thanks,
>> Thomas
>
> -----
>
> Tony Printezis | JVM/GC Engineer / VM Team | Twitter
>
> @TonyPrintezis
> tprintezis@twitter.com

-----

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis
tprintezis@twitter.com