JDK-8171119: Low-Overhead Heap Profiling
Jeremy Manson
jeremymanson at google.com
Wed Apr 18 04:51:08 UTC 2018
Great, thanks!
Jeremy
On Tue, Apr 17, 2018 at 2:01 PM serguei.spitsyn at oracle.com <
serguei.spitsyn at oracle.com> wrote:
> Hi Jeremy,
>
> We had a private discussion with Jc on this and decided the same.
> We would like to sample all allocations if possible, and at a low level.
>
> Thanks,
> Serguei
>
>
> On 4/17/18 12:38, Jeremy Manson wrote:
>
> +1 to sampling anything the thread is allocating. With the
> bytecode-rewriting-based version of this, I had complaints about missing
> allocations from JNI and from APIs like Class.newInstance. (I don't know how
> the placement of the collectors would affect this, but it did matter.)
>
> Jeremy
>
> On Thu, Apr 12, 2018 at 2:23 PM JC Beyler <jcbeyler at google.com> wrote:
>
>> Hi Karen,
>>
>> I apologize for sending too many webrevs. I tend to iterate quickly, and I
>> try to address most, if not all, of the currently requested items in one go.
>> Perhaps I fell short of that recently; I apologize for that.
>>
>> So I promise not to send a new webrev in this email, or until I'm reasonably
>> sure I have handled all the current (and any incoming) comments and reviews :-)
>>
>> For the points you brought up:
>> a) What are we sampling? In my mind, I'd rather have the sampler sample
>> anything the thread allocates, not only bytecode allocations. I focused on
>> bytecode allocations first just to get things working, and I was stuck for a
>> while figuring out how to get the VM collector and the sampling collector to
>> co-exist, which caused a few issues.
>>    - That has been solved by delaying the posting of a sampled object when
>> a VM collector is present. Now that I better understand the interactions
>> between collectors and when an event can be posted, I'm in a much better
>> position to discuss the feasibility and validity of the next item about
>> bigger objects.
>>
>> b) You bring up an excellent point: if we have a multi-dimensional array
>> or a more complex object (a cloned object, for example), and the sampler
>> trips on an internal allocation, should we report that smaller allocation or
>> the enclosing object?
>>    - Because we get the stack trace and, in our JVMTI agent use case, only
>> use the oop to determine GC liveness information for the object, this
>> changes very little in practice. I do see value, though, in reporting the
>> multi-dimensional array as a whole to a user.
>>    - If that is what you think is best, I can work on supporting it, and
>> the multi-array test would then verify that if part of a multi-dimensional
>> array is sampled, the sampler reports the whole array.
>>
>> Hopefully that addresses your concern about my sending too many webrevs,
>> for which I sincerely apologize; it is probably a learning curve with
>> different review approaches. I hope my other answers show the direction you
>> were hoping to see.
>>
>> Thanks again for all your help,
>> Jc
>>
>> On Thu, Apr 12, 2018 at 8:15 AM Karen Kinnear <karen.kinnear at oracle.com>
>> wrote:
>>
>>> JC,
>>>
>>>
>>> On Apr 11, 2018, at 8:17 PM, JC Beyler <jcbeyler at google.com> wrote:
>>>
>>> Hi Karen,
>>>
>>> I put up a new webrev that is, in my mind, feature complete in terms of
>>> implementation. There may be a few small optimizations left here and there,
>>> but I believe all the features are now in place; the remaining question is
>>> the placement of the collectors (I've now put them lower than what you
>>> discuss in this email thread).
>>>
>>> I believe that the primary goal of your JEP is to catch samples of
>>> allocation due to bytecodes. Given that, it makes sense to
>>> put the collectors in the code generators, so I am ok with your leaving
>>> them where they are.
>>>
>>> And it would save us all a lot of cycles if, rather than sending frequent
>>> webrev updates with subsets of the requested changes, you could wait until
>>> you’ve added all the changes people requested - that would increase the
>>> signal-to-noise ratio.
>>>
>>>
>>> The incremental webrev is here:
>>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.12_13/
>>> and the full webrev is here:
>>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.13/
>>>
>>> The incremental webrev contains the renaming of the TLAB fields, as per
>>> the conversation going on in the GC thread, and the Graal changes needed so
>>> that this webrev does not break anything. It also contains a test for when
>>> VM and sampled events are enabled at the same time.
>>>
>>> I'll answer your questions inline below:
>>>
>>>
>>>>
>>>> I have a couple of design questions before getting to the detailed code
>>>> review:
>>>>
>>>> 1. jvmtiExport.cpp
>>>> You have simplified the JvmtiObjectAllocEventCollector to assume there
>>>> is only a single object.
>>>> Do you have a test that allocates a multi-dimensional array?
>>>> I would expect that to have multiple subarrays - and need the logic
>>>> that you removed.
>>>>
>>>
>>> I "think" you misread this. I did not change the implementation of JvmtiObjectAllocEventCollector
>>> to assume only one object. Actually the implementation is still doing what
>>> it was doing initially but now JvmtiObjectAllocEventCollector is a main
>>> class with two inherited behaviors:
>>> - JvmtiVMObjectAllocEventCollector is following the same logic as
>>> before
>>> - JvmtiSampledObjectAllocEventCollector has the logic of a single
>>> allocation during collection since it is placed that low. I'm thinking of
>>> perhaps separating the two classes now that the sampled case is so low and
>>> does not require the extra logic of handling a growable array.
>>>
>>> I don't have a test for multi-dimensional arrays. I have not looked at
>>> exactly what VMObjectAlloc events track, but I did try an example to see:
>>>
>>> - On a clean JDK, if I allocate:
>>> int[][][] myAwesomeTable = new int[10][10][10];
>>>
>>> I get a single VMObjectAlloc callback.
>>>
>>>
>>> - With my change, I get the same behavior.
>>>
>>> So instead, I tried an object containing an array and cloned it; I get two
>>> VM callbacks both with and without my change.
>>>
>>> I'll add a test in the next webrev to ensure correctness.
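>>>
>>> Something along these lines is what I have in mind (a minimal sketch only;
>>> the class and helper names are illustrative, not the actual test code):
>>>
>>>     static class ArrayHolder implements Cloneable {
>>>         int[] data = new int[1024];
>>>         @Override
>>>         public ArrayHolder clone() {
>>>             try {
>>>                 return (ArrayHolder) super.clone();
>>>             } catch (CloneNotSupportedException e) {
>>>                 throw new AssertionError(e);
>>>             }
>>>         }
>>>     }
>>>
>>>     // The test would count the callbacks the agent reports for the
>>>     // original allocation and for the clone, and check that the counts
>>>     // match with and without the change.
>>>     ArrayHolder holder = new ArrayHolder();
>>>     ArrayHolder copy = holder.clone();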
>>>
>>>
>>>> *** Please add a test in which you allocate a multi-dimensional array -
>>>> and try it with your earlier version which
>>>> will show the full set of allocated objects
>>>>
>>>
>>> As I said, this works because it is so low in the system. I don't know
>>> the internals of allocating a three-dimensional array, but:
>>>     - If I try to collect all the allocations required for the new
>>> int[10][10][10], I get more than a hundred allocation callbacks when the
>>> sample rate is 0, which is what I'd expect (a callback for each
>>> allocation).
>>>
>>> So although I only collect one object at a time, I get a callback for each
>>> allocation if that is what the sampling rate dictates.
>>>
>>> I’ll let you work this one out with Serguei - if he is ok with your
>>> change to not have a growable array and to report only one object, then
>>> given that you are sampling, it sounds like that is not a functional loss
>>> for you.
>>>
>>> And yes, please do add the test.
>>>
>>>
>>>
>>>> 2. Tests - I didn’t read them all - I just ran into a hardcoded “10% error
>>>> ensures a sanity test without becoming flaky”.
>>>> Do check with Serguei on this - that looks like a potential future test
>>>> failure.
>>>>
>>>
>>> Because the sampling interval is a geometric random variable around the
>>> mean sampling rate, any meaningful test has to be statistical. Therefore, I
>>> accept a bit of error so we can test the real behavior rather than hack the
>>> code into less "real" tests. This is what we do internally; let me know if
>>> you want me to do it differently.
>>>
>>> Flaky is probably the wrong term, or perhaps I need to explain this better.
>>> I'll change the comment in the tests to explain that the potential
>>> flakiness comes from the statistical nature of the geometric sampling.
>>> Because we don't want overly long-running tests, it makes sense to me to
>>> have this error percentage.
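>>>
>>> As a sketch of the kind of acceptance check I mean (the variable and
>>> helper names are illustrative, not the actual test code):
>>>
>>>     // Illustrative numbers only; the real test gets these from the agent.
>>>     long bytesAllocated = 512L * 1024 * 1024;
>>>     long samplingIntervalBytes = 512 * 1024;
>>>     long observedSamples = getSampledEventCount();   // hypothetical helper
>>>
>>>     // Accept the observed sample count if it is within 10% of the count
>>>     // implied by the mean sampling interval.
>>>     long expectedSamples = bytesAllocated / samplingIntervalBytes;
>>>     double error = Math.abs(observedSamples - expectedSamples)
>>>             / (double) expectedSamples;
>>>     if (error > 0.10) {
>>>         throw new RuntimeException("Sample count outside 10% band: " + error);
>>>     }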
>>>
>>> Let me now answer the comments from the other email here as well so we
>>> have all answers and conversations in a single thread:
>>>
>>>> - Note there is one current caveat: if the agent requests the
>>>> VMObjectAlloc event, the sampler defers to that event due to a limitation
>>>> in its implementation (i.e., I am not convinced I can safely send out that
>>>> event with the VM collector enabled; I'll happily whiteboard that).
>>>>
>>>> Please work that one out with Serguei and the serviceability folks.
>>>>
>>>>
>>> Agreed. I'll follow up with Serguei on whether this is a real problem; I
>>> have to double-check and make sure I'm right that there is an issue. I see
>>> it as a fact of life rather than a problem for now. If you do want both
>>> events, dropping a sample due to this limitation does not invalidate the
>>> system in my mind. I could be wrong about it, though, and would happily go
>>> over what I saw.
>>>
>>>
>>>
>>>>> So the Heap Sampling Monitoring System used to have more methods, and it
>>>>> made sense to have them in a separate category. I have now moved it into
>>>>> the memory category so everything is grouped there consistently. I also
>>>>> removed that link, by the way.
>>>>>
>>>> Thanks.
>>>>
>>>
>>> Actually, it did seem odd to put it there, since that category only has
>>> Allocate/Deallocate, so for now the method is once again in its own
>>> category. If someone has a better spot, let me know.
>>>
>>>
>>>>>
>>>>>>
>>>>>> I was trying to figure out a way to put the collectors farther down
>>>>>> the call stack so as to both catch more
>>>>>> cases and to reduce the maintenance burden - i.e. if you were to add
>>>>>> a new code generator, e.g. Graal -
>>>>>> if it were to go through an existing interface, that might be a place
>>>>>> to already have a collector.
>>>>>>
>>>>>> I do not know the Graal sources - I did look at
>>>>>> jvmci/jvmciRuntime.cpp - and it appears that there
>>>>>> are calls to instanceKlass::new_instance,
>>>>>> oopFactory::new_typeArray/new_objArray and ArrayKlass::multi_allocate,
>>>>>> so one possibility would be to put hooks in those calls, which would
>>>>>> catch many (I did not do a thorough search) of the slow-path calls
>>>>>> for the bytecodes, and then check the fast paths in detail.
>>>>>>
>>>>>
>>>>> I'll come to a major issue with the collector and its placement in the
>>>>> next paragraph.
>>>>>
>>>> Still not clear on why you did not move the collectors into
>>>> instanceKlass::new_instance, oopFactory::new_typeArray/new_objArray,
>>>> and ArrayKlass::multi_allocate.
>>>>
>>> As I said above, I am ok with your leaving the collectors in the code
>>> generators since that is your focus.
>>>
>>>
>>> I think what was happening is that the collectors would wrap the objects
>>> in handles, but references to the originally allocated object would still
>>> be on the stack and would not get updated when required. I believe that
>>> issue was causing the odd bugs I was getting.
>>>
>>> Because of this, it seems that any enabled VM collector has to guarantee
>>> that either:
>>>     - its path to destruction (and thus to posting events) has no means of
>>> triggering a GC that would move things around (as long as you are in VM
>>> code you should be fine, I believe), or
>>>
>>>     - if a GC does occur, the objects in its internal array are not sitting
>>> somewhere on the stack without a handle around them, so that they can be
>>> moved if needed by the GC operation.
>>>
>>>
>>> I'm not convinced this holds in the multithreaded case with sampling and
>>> VM collection.
>>>
>>> I will let you and Serguei work out whether you have sufficient test
>>> coverage for the multithreaded cases.
>>>
>>> thanks,
>>> Karen
>>>
>>>
>>>
>>>>>
>>>>>>
>>>>>> I had wondered if it made sense to move the hooks even farther down,
>>>>>> into CollectedHeap::obj_allocate and array_allocate.
>>>>>> I do not think so. The first reason is that for multidimensional arrays,
>>>>>> in ArrayKlass::multi_allocate the outer-dimension array would
>>>>>> get an event before the inner sub-arrays are stored, and I don’t think
>>>>>> we want that exposed, so that won’t work for arrays.
>>>>>>
>>>>>
>>>>> So the major difficulty is that the collection steps go like this:
>>>>>
>>>>>  - An object is allocated and selected for sampling
>>>>>  - The original pointer (where the object initially resides in
>>>>> memory) is passed to the collector
>>>>>  - Now, two important points:
>>>>>       (a) In the VM code, until the point where the oop is about to be
>>>>> returned, GC is not yet aware of it
>>>>>       (b) so the collector can't yet hand it to the user via JVMTI;
>>>>> otherwise the agent could, for example, put a weak reference on it
>>>>>
>>>>> I'm a bit fuzzy on this, and maybe it just requires more heavy lifting to
>>>>> make it possible, but my initial tests seemed to show problems when
>>>>> attempting this in the obj_allocate area.
>>>>>
>>>> Not sure what you are seeing here -
>>>>
>>>> Let me state it the way I understand it.
>>>> 1) You can collect the object into internal metadata at allocation
>>>> point - which you already had
>>>> Note: see comment in JvmtiExport.cpp:
>>>> // In the case of the sampled object collector, we don’t want to
>>>> perform the
>>>> // oops_do because the object in the collector is still stored in
>>>> registers
>>>> // on the VM stack
>>>> - so GC will find these objects as roots, once we allow GC to run,
>>>> which should be after the header is initialized
>>>>
>>>> Totally agree with you that you can not post the event in the source
>>>> code in which you allocate the memory - keep reading
>>>>
>>>> 2) event posting:
>>>> - you want to ensure that the object has been fully initialized
>>>> - the object needs to have the header set up - not just the memory
>>>> allocated - so that applies to all objects
>>>> (and that is done in a caller of the allocation code - so it can’t
>>>> be done at the location which does the memory allocation)
>>>> - the example I pointed out was the multianewarray case - all the
>>>> subarrays need to be allocated
>>>>
>>>> - please add a test case for multi-array, so as you experiment with
>>>> where to post the event, you ensure that you can
>>>> access the subarrays (e.g. a 3D array of length 5 has 5 2D arrays as
>>>> subarrays)
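>>>>
>>>> For concreteness, the allocation side of such a test might look like this
>>>> (illustrative only - the check of the reported object and its subarrays
>>>> would live in the native agent/test code):
>>>>
>>>>     int[][][] arr = new int[5][4][3];
>>>>     // A 3D array of length 5 should expose 5 fully initialized 2D
>>>>     // subarrays by the time the event is posted.
>>>>     for (int[][] sub : arr) {
>>>>         if (sub == null || sub.length != 4) {
>>>>             throw new RuntimeException("Subarray not initialized");
>>>>         }
>>>>     }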
>>>>
>>>
>>> Technically it's more than just being initialized: you cannot perform a
>>> callback for an object if any object of that thread is being held by a
>>> collector while also sitting in a register or on the stack without
>>> protection.
>>>
>>>
>>>
>>>> - prior to setting up the object header information, GC would not
>>>> know about the object
>>>> - was this by chance the issue you ran into?
>>>>
>>>
>>> No, I believe the issue I was running into was the one above, where a
>>> reference on the stack was pointing to an oop that got moved during a GC
>>> triggered while that thread was doing an event callback.
>>>
>>> Thanks for your help,
>>> Jc
>>>
>>>
>>>> 3) event posting
>>>> - when you post the event to JVMTI
>>>> - in JvmtiObjectAllocEventMark: _jobj (the object) is set to
>>>> to_jobject(obj), which creates a local handle via
>>>> JNIHandles::make_local(_thread, obj)
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> The second reason is that I strongly suspect the scope you want is
>>>>>> bytecodes only. I think once you have added hooks
>>>>>> to all the fast paths and slow paths, this will be pushing the
>>>>>> performance-overhead constraints you proposed, and
>>>>>> you won’t want to see, e.g., internal allocations.
>>>>>>
>>>>>
>>>>> Yes, agreed - allocations from bytecodes are generally our main
>>>>> concern :)
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>> But I think you need to experiment with the set of allocations (or
>>>>>> possible alternative sets of allocations) you want recorded.
>>>>>>
>>>>>> The hooks I see today include:
>>>>>> Interpreter (looking at x86 as an example):
>>>>>> - slow path in InterpreterRuntime
>>>>>> - fast path TLAB allocation - your new threshold check handles that
>>>>>>
>>>>>
>>>>> Agreed
>>>>>
>>>>>
>>>>>> - allow_shared_alloc (GC specific): for _new isn’t handled
>>>>>>
>>>>>
>>>>> Where is that exactly? I can check why we are not catching it.
>>>>>
>>>>>
>>>>>>
>>>>>> C1
>>>>>> I don’t see changes in c1_Runtime1.cpp
>>>>>> note: you also want to look for the fast path
>>>>>>
>>>>>
>>>>> I added the calls to c1_Runtime1 in the latest webrev, but was still
>>>>> going through testing before pushing it out; I had waited on this one a
>>>>> bit. Wouldn't the fast path be handled by the threshold check?
>>>>>
>>>>>
>>>>>> C2: changes in opto/runtime.cpp for slow path
>>>>>> did you also catch the fast path?
>>>>>>
>>>>>
>>>>> Fast path gets handled by the same threshold check, no? Perhaps I've
>>>>> missed something (very likely)?
>>>>>
>>>>>
>>>>>>
>>>>>> 3. Performance -
>>>>>> After you get all the collectors added - you need to rerun the
>>>>>> performance numbers.
>>>>>>
>>>>>
>>>>> Agreed :)
>>>>>
>>>>>
>>>>>>
>>>>>> thanks,
>>>>>> Karen
>>>>>>
>>>>>> On Apr 5, 2018, at 2:15 PM, JC Beyler <jcbeyler at google.com> wrote:
>>>>>>
>>>>>> Thanks Boris and Derek for testing it.
>>>>>>
>>>>>> Yes, I was trying to get a new version out that had the tests ported as
>>>>>> well, but I got sidetracked while adding tests and two new features.
>>>>>>
>>>>>> Here is the incremental webrev:
>>>>>>
>>>>>> Here is the full webrev:
>>>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.11/
>>>>>>
>>>>>> Basically, the new tests assert the following:
>>>>>> - Only one agent can currently request the sampling; I'm looking into
>>>>>> pushing multi-agent support to a later webrev so we can start a code
>>>>>> freeze on this one
>>>>>> - The event is not thread-enabled, meaning that, like the VMObjectAlloc
>>>>>> event, it's an all-or-nothing event; as with multi-agent support, I'm
>>>>>> going to see whether adding it in a future webrev is a better way to
>>>>>> freeze this webrev a bit
>>>>>>
>>>>>> There was another item that I added here, and I'm unsure whether this
>>>>>> webrev is stable in debug mode: I added an assertion system to ascertain
>>>>>> that all paths leading to a TLAB slow path (and hence a sampling point)
>>>>>> have a sampling collector ready to post the event if a user wants it.
>>>>>> This might break a few things in debug mode, as I'm still working
>>>>>> through the kinks of that. However, in release mode, this new webrev
>>>>>> passes all the tests in hotspot/jtreg/serviceability/jvmti/HeapMonitor.
>>>>>>
>>>>>> Let me know what you think,
>>>>>> Jc
>>>>>>
>>>>>> On Thu, Apr 5, 2018 at 4:56 AM Boris Ulasevich <
>>>>>> boris.ulasevich at bell-sw.com> wrote:
>>>>>>
>>>>>>> Hi JC,
>>>>>>>
>>>>>>> I have just checked on arm32: your patch compiles and runs ok.
>>>>>>>
>>>>>>> As far as I can see, the jtreg agentlib name "-agentlib:HeapMonitor"
>>>>>>> does not correspond to the actual library name: libHeapMonitorTest.c ->
>>>>>>> libHeapMonitorTest.so
>>>>>>>
>>>>>>> Boris
>>>>>>>
>>>>>>> On 04.04.2018 01:54, White, Derek wrote:
>>>>>>> > Thanks JC,
>>>>>>> >
>>>>>>> > New patch applies cleanly. Compiles and runs (simple test programs)
>>>>>>> > on aarch64.
>>>>>>> >
>>>>>>> > * Derek
>>>>>>> >
>>>>>>> > *From:* JC Beyler [mailto:jcbeyler at google.com]
>>>>>>> > *Sent:* Monday, April 02, 2018 1:17 PM
>>>>>>> > *To:* White, Derek <Derek.White at cavium.com>
>>>>>>> > *Cc:* Erik Österlund <erik.osterlund at oracle.com>;
>>>>>>> > serviceability-dev at openjdk.java.net; hotspot-compiler-dev
>>>>>>> > <hotspot-compiler-dev at openjdk.java.net>
>>>>>>> > *Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling
>>>>>>> >
>>>>>>> > Hi Derek,
>>>>>>> >
>>>>>>> > I know there were a few things that went in that caused a merge
>>>>>>> > conflict. I worked on it and got it up to date. Sadly, my lack of
>>>>>>> > knowledge made it a full rebase instead of keeping all the history.
>>>>>>> > However, with a newly cloned jdk/hs you should now be able to use:
>>>>>>> >
>>>>>>> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/
>>>>>>> >
>>>>>>> > The change you are referring to was done along with the others, so
>>>>>>> > perhaps you were unlucky and I omitted it in one webrev and fixed it
>>>>>>> > in another? I don't know, but it's been there; I checked, and it is
>>>>>>> > here:
>>>>>>> >
>>>>>>> >
>>>>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp.udiff.html
>>>>>>> >
>>>>>>> > I double checked that tlab_end_offset no longer appears in any
>>>>>>> > architecture (as far as I can tell :)).
>>>>>>> >
>>>>>>> > Thanks for testing, and let me know if you run into any other
>>>>>>> > issues!
>>>>>>> >
>>>>>>> > Jc
>>>>>>> >
>>>>>>> > On Fri, Mar 30, 2018 at 4:24 PM White, Derek <
>>>>>>> Derek.White at cavium.com
>>>>>>> > <mailto:Derek.White at cavium.com>> wrote:
>>>>>>> >
>>>>>>> > Hi Jc,
>>>>>>> >
>>>>>>> > I’ve been having trouble getting your patch to apply correctly. I
>>>>>>> > may have based it on the wrong version.
>>>>>>> >
>>>>>>> > In any case, I think there’s a missing update to
>>>>>>> > macroAssembler_aarch64.cpp, in MacroAssembler::tlab_allocate(),
>>>>>>> > where “JavaThread::tlab_end_offset()” should become
>>>>>>> > “JavaThread::tlab_current_end_offset()”.
>>>>>>> >
>>>>>>> > This should correspond to the other ports’ changes in their
>>>>>>> > templateTable_<cpu>.cpp files.
>>>>>>> >
>>>>>>> > Thanks!
>>>>>>> > - Derek
>>>>>>> >
>>>>>>> > *From:* hotspot-compiler-dev
>>>>>>> > [mailto:hotspot-compiler-dev-bounces at openjdk.java.net
>>>>>>> > <mailto:hotspot-compiler-dev-bounces at openjdk.java.net>] *On
>>>>>>> Behalf
>>>>>>> > Of *JC Beyler
>>>>>>> > *Sent:* Wednesday, March 28, 2018 11:43 AM
>>>>>>> > *To:* Erik Österlund <erik.osterlund at oracle.com
>>>>>>> > <mailto:erik.osterlund at oracle.com>>
>>>>>>> > *Cc:* serviceability-dev at openjdk.java.net
>>>>>>> > <mailto:serviceability-dev at openjdk.java.net>;
>>>>>>> hotspot-compiler-dev
>>>>>>> > <hotspot-compiler-dev at openjdk.java.net
>>>>>>> > <mailto: <hotspot-compiler-dev at openjdk.java.net>
>>>>>>
>>>>>>