<div dir="ltr"><div><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <span dir="ltr"><<a href="mailto:tprintezis@twitter.com" target="_blank">tprintezis@twitter.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word"><div style="font-family:Helvetica,Arial;font-size:13px;color:rgb(0,0,0);margin:0px">Hi Jeremy,</div><div style="font-family:Helvetica,Arial;font-size:13px;color:rgb(0,0,0);margin:0px"><br></div><div style="font-family:Helvetica,Arial;font-size:13px;color:rgb(0,0,0);margin:0px">Please see inline.</div><span> <br><p style="color:rgb(0,0,0)">On June 23, 2015 at 7:22:13 PM, Jeremy Manson (<a href="mailto:jeremymanson@google.com" target="_blank">jeremymanson@google.com</a>) wrote:</p> <div><blockquote type="cite" style="color:rgb(0,0,0);font-family:Helvetica,Arial;font-size:13px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><span><div><div></div><div><div dir="ltr">I don't want the size of the TLAB, which is ergonomically adjusted, to be tied to the sampling rate.  There is no reason to do that.  I want reasonable statistical sampling of the allocations.  </div></div></div></span></blockquote></div><p><br></p></span><p>As I said explicitly in my e-mail, I totally agree with this. Which is why I never suggested to resize TLABs in order to vary the sampling rate. (Apologies if my e-mail was not clear.)</p></div></blockquote><div><br></div><div>My fault - I misread it.  Doesn't your proposal miss out of TLAB allocs entirely (and, in fact, not work if TLAB support is turned off)?  I might be missing something obvious (and see my response below).</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word"><div><span><div><blockquote type="cite" style="color:rgb(0,0,0);font-family:Helvetica,Arial;font-size:13px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><span><div><div><div dir="ltr"><div>All this requires is a separate counter that is set to the next sampling interval, and decremented when an allocation happens, which goes into a slow path when the decrement hits 0.  Doing a subtraction and a pointer bump in allocation instead of just a pointer bump is basically free. <span> </span></div></div></div></div></span></blockquote></div><p><br></p></span><p>Maybe on intel is cheap, but maybe it’s not on other platforms that other folks care about.</p></div></div></blockquote><div>Really?  A memory read and a subtraction?  Which architectures care about that?</div><div><br></div><div>Again, notice that no one has complained about the addition that was added for total bytes allocated per thread.  I note that was actually added in the 6u20 timeframe.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div><span><div><blockquote type="cite" style="color:rgb(0,0,0);font-family:Helvetica,Arial;font-size:13px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><span><div><div><div dir="ltr"><div>Note that it has been doing an additional addition (to keep track of per thread allocation) as part of allocation since Java 7,<span> </span></div></div></div></div></span></blockquote></div><p><br></p></span><p>Interesting. I hadn’t realized that. Does that keep track of total size allocated per thread or number of allocated objects per thread? If it’s the former, why isn’t it possible to calculate that from the TLABs information?</p></div></div></div></blockquote><div><br></div><div>Total size allocated per thread.  It isn't possible to calculate that from the TLAB because of out-of-TLAB allocation (and hypothetically disabled TLABs).</div><div><br></div><div>For some reason, they never included it in the ThreadMXBean interface, but it is in com.sun.management.ThreadMXBean, so you can cast your ThreadMXBean to a com.sun.management.ThreadMXBean and call getThreadAllocatedBytes() on it.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div><div><span><div><blockquote type="cite" style="color:rgb(0,0,0);font-family:Helvetica,Arial;font-size:13px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><span><div><div><div dir="ltr"><div>and no one has complained.</div><div><br></div><div>I'm not worried about the ease of implementation here, because we've already implemented it. <span> </span></div></div></div></div></span></blockquote></div><p><br></p></span><p>Yeah, but someone will have to maintain it moving forward.</p></div></div></div></div></blockquote><div><br></div><div>I've been maintaining it internally to Google for 5 years.  It's actually pretty self-contained.  The only work involved is when they refactor something (so I've had to move it), or when a bug in the existing implementation is discovered.  It is very closely parallel to the TLAB code, which doesn't change much / at all.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div><div><p><span></span></p><div><blockquote type="cite" style="color:rgb(0,0,0);font-family:Helvetica,Arial;font-size:13px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><span><div><div><div dir="ltr"><div>It hasn't even been hard for us to do the forward port, except when the relevant Hotspot code is significantly refactored.<br><div><div><div><br></div><div>We can also turn the sampling off, if we want.  We can set the sampling rate to 2^32, have the sampling code do nothing, and no one will ever notice. <span> </span></div></div></div></div></div></div></div></span></blockquote></div><p><br></p><p>You still have extra instructions in the allocation path, so it’s not turned off (i.e., you have the tax without any benefit).</p><p></p></div></div></div></div></blockquote><div><br></div><div>Hey, you have a counter in your allocation path you've never noticed, which none of your code uses.  Pipelining is a wonderful thing.  :)</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div><div><p></p><div><span><div><blockquote type="cite" style="color:rgb(0,0,0);font-family:Helvetica,Arial;font-size:13px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><span><div><div><div dir="ltr"><div><div><div><div>In fact, we could just have the sampling code do nothing, and no one would ever notice.</div><div><br></div><div>Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8 made it more expensive to grab a stack trace (the cost became proportional to the number of loaded classes), but we have a patch that mitigates that, which we would also be happy to upstream.</div><div><br></div><div>As for the other concern: my concern about *just* having the callback mechanism is that there is quite a lot you can't do from user code during an allocation, because of lack of access to JNI.</div></div></div></div></div></div></div></span></blockquote></div><p><br></p></span><p>Maybe I missed something. Are the callbacks in Java? I.e., do you call them using JNI from the slow path you call directly from the allocation code?</p></div><p></p></div></div></div></div></blockquote><div>(For context: this referred to the hypothetical feature where we can provide a callback that invokes some code from allocation.)</div><div><br></div><div>(It's not actually hypothetical, because we've already implemented it, but let's call it hypothetical for the moment.)</div><div><br></div><div>We invoke native code.  You can't invoke any Java code during allocation, including calling JNI methods, because that would make allocation potentially reentrant, which doesn't work for all sorts of reasons.  The native code doesn't even get passed a JNIEnv * - there is nothing it can do with it without making the VM crash a lot.</div><div><br></div><div>Or, rather, you might be able to do that, but it would take a lot of Hotspot rearchitecting.  When I tried to do it, I realized it would be an extremely deep dive.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div><div><div><div><span><div><blockquote type="cite" style="color:rgb(0,0,0);font-family:Helvetica,Arial;font-size:13px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><span><div><div><div dir="ltr"><div><div><div><div>  However, you can do pretty much anything from the VM itself.  Crucially (for us), we don't just log the stack traces, we also keep track of which are live and which aren't.  We can't do this in a callback, if the callback can't create weak refs to the object.</div><br>What we do at Google is to have two methods: one that you pass a callback to (the callback gets invoked with a StackTraceData object, as I've defined above), and another that just tells you which sampled objects are still live.  We could also add a third, which allowed a callback to set the sampling interval (basically, the VM would call it to get the integer number of bytes to be allocated before the next sample).  </div><div><br></div><div>Would people be amenable to that?  It makes the code more complex, but, as I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB object come from?").</div></div></div></div></div></div></span></blockquote></div><p><br></p></span><p>Well, that 1GB object would have most likely been allocated outside a TLAB and you could have identified it by instrumenting the “outside-of-TLAB allocation path” (just saying…).</p></div></div></div></div></div></div></blockquote><div><br></div><div>That's orthogonal to the point I was making in the quote above - the point I was making there was that we want to be able to detect what sampled objects are live.  We can do that regardless of how we implement the sampling (although it did involve my making a new kind of weak oop processing mechanism inside the VM).</div><div><br></div><div>But to the question of whether we can just instrument the outside-of-tlab allocation path...  There are a few weirdnesses here.  The first one that jumps to mind is that there's also a fast path for allocating in the YG outside of TLABs, if an object is too large to fit in the current TLAB.  Those objects would never get sampled.  So "outside of tlab" doesn't always mean "slow path".</div><div><br></div><div>Another one that jumps to mind is that we don't know whether the outside-of-TLAB path actually passes the sampling threshold, especially if we let users configure the sampling threshold.  So how would we know whether to sample it?</div><div><br></div><div>You also have to keep track of the sampling interval in the code where we allocate new TLABs, in case the sampling threshold is larger than the TLAB size.  That's not a big deal, of course.</div><div><br></div><div>And, every time the TLAB code changes, we have to consider whether / how those changes affect this sampling mechanism.</div><div><br></div><div>I guess my larger point is that there are so many little corner cases with TLAB allocation, including whether it even happens, that basing the sampling strategy around it seems like a cop-out.  And my belief is that the arguments against our strategy don't really hold water, especially given the presence of the per-thread allocation counter that no one noticed.  </div><div><br></div><div>Heck, I've already had it reviewed internally by a Hotspot reviewer (Chuck Rasbold).  All we really need is to write an acceptable JEP, to adjust the code based on the changes the community wants, and someone from Oracle willing to say "yes".</div><div><br></div><div>For reference, to keep track of sampling, the delta to C2 is about 150 LOC (much of which is newlines-because-of-formatting for methods that take a lot of parameters), the delta to C1 is about 60 LOC, the delta to each x86 template interpreter is about 20 LOC, and the delta for the assembler is about 40 LOC.      It's not completely trivial, but the code hasn't changed substantially in the 5 years since I wrote it (other than a couple of bugfixes).</div><div><br></div><div>Obviously, assembler/template interpreter would have to be dup'd across platforms - we can do that for PPC and aarch64, on which we do active development, at least.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div><div><div><div><p>But, seriously, why didn’t you like my proposal? It can do anything your scheme can with fewer and simpler code changes. The only thing that it cannot do is to sample based on object count (i.e., every 100 objects) instead of based on object size (i.e., every 1MB of allocations). But I think doing sampling based on size is the right approach here (IMHO).</p></div></div></div></div></div></div></blockquote><div>I agree that sampling based on size is the right approach.  </div><div><br></div><div>(And your approach is definitely simpler - I don't mean to discount it.  And if that's what it takes to get this feature accepted, we'll do it, but I'll grumble about it.)</div><div><br></div><div>Jeremy <br></div></div></div></div>