JDK-8171119: Low-Overhead Heap Profiling

Wed Mar 28 15:43:28 UTC 2018

Hi all,

I've been working on deflaking the tests mostly and the wording in the
JVMTI spec.

Here is the two incremental webrevs:
http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.5_6/
http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.06_07/

Here is the total webrev:
http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.07/

Here are the notes of this change:
  - Currently the tests pass 100 times in a row, I am working on checking
if they pass 1000 times in a row.
  - The default sampling rate is set to 512k, this is what we use
internally and having a default means that to enable the sampling with the
default, the user only has to do a enable event/disable event via JVMTI
(instead of enable + set sample rate).
  - I deprecated the code that was handling the fast path tlab refill if it
happened since this is now deprecated
      - Though I saw that Graal is still using it so I have to see what
needs to be done there exactly

Finally, using the Dacapo benchmark suite, I noted a 1% overhead for when
the event system is turned on and the callback to the native agent is just
empty. I got a 3% overhead with a 512k sampling rate with the code I put in
the native side of my tests.

Thanks and comments are appreciated,
Jc

On Mon, Mar 19, 2018 at 2:06 PM JC Beyler <jcbeyler at google.com> wrote:

> Hi all,
>
> The incremental webrev update is here:
> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event4_5/
>
> The full webrev is here:
> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/
>
> Major change here is:
>   - I've removed the heapMonitoring.cpp code in favor of just having the
> sampling events as per Serguei's request; I still have to do some overhead
> measurements but the tests prove the concept can work
>        - Most of the tlab code is unchanged, the only major part is that
> now things get sent off to event collectors when used and enabled.
>   - Added the interpreter collectors to handle interpreter execution
>   - Updated the name from SetTlabHeapSampling to SetHeapSampling to be
> more generic
>   - Added a mutex for the thread sampling so that we can initialize an
> internal static array safely
>   - Ported the tests from the old system to this new one
>
> I've also updated the JEP and CSR to reflect these changes:
>  https://bugs.openjdk.java.net/browse/JDK-8194905
>  https://bugs.openjdk.java.net/browse/JDK-8171119
>
> In order to make this have some forward progress, I've removed the heap
> sampling code entirely and now rely entirely on the event sampling system.
> The tests reflect this by using a simplified implementation of what an
> agent could do:
>
> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/libHeapMonitor.c
> (Search for anything mentioning event_storage).
>
> I have not taken the time to port the whole code we had originally in
> heapMonitoring to this. I hesitate only because that code was in C++, I'd
> have to port it to C and this is for tests so perhaps what I have now is
> good enough?
>
> As far as testing goes, I've ported all the relevant tests and then added
> a few:
>    - Turning the system on/off
>    - Testing using various GCs
>    - Testing using the interpreter
>    - Testing the sampling rate
>    - Testing with objects and arrays
>    - Testing with various threads
>
> Finally, as overhead goes, I have the numbers of the system off vs a clean
> build and I have 0% overhead, which is what we'd want. This was using the
> Dacapo benchmarks. I am now preparing to run a version with the events on
> using dacapo and will report back here.
>
> Any comments are welcome :)
> Jc
>
>
>
>
> On Thu, Mar 8, 2018 at 4:00 PM JC Beyler <jcbeyler at google.com> wrote:
>
>> Hi all,
>>
>> I apologize for the delay but I wanted to add an event system and that
>> took a bit longer than expected and I also reworked the code to take into
>> account the deprecation of FastTLABRefill.
>>
>> This update has four parts:
>>
>> A) I moved the implementation from Thread to ThreadHeapSampler inside of
>> Thread. Would you prefer it as a pointer inside of Thread or like this
>> works for you? Second question would be would you rather have an
>> association outside of Thread altogether that tries to remember when
>> threads are live and then we would have something like:
>> ThreadHeapSampler::get_sampling_size(this_thread);
>>
>> I worry about the overhead of this but perhaps it is not too too bad?
>>
>> B) I also have been working on the Allocation event system that sends out
>> a notification at each sampled event. This will be practical when wanting
>> to do something at the allocation point. I'm also looking at if the whole
>> heapMonitoring code could not reside in the agent code and not in the JDK.
>> I'm not convinced but I'm talking to Serguei about it to see/assess :)
>>    - Also added two tests for the new event subsystem
>>
>> C) Removed the slow_path fields inside the TLAB code since now
>> FastTLABRefill is deprecated
>>
>> D) Updated the JVMTI documentation and specification for the methods.
>>
>> So the incremental webrev is here:
>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.09_10/
>>
>> and the full webrev is here:
>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.10
>>
>> I believe I have updated the various JIRA issues that track this :)
>>
>> Thanks for your input,
>> Jc
>>
>>
>> On Wed, Feb 14, 2018 at 10:34 PM, JC Beyler <jcbeyler at google.com> wrote:
>>
>>> Hi Erik,
>>>
>>> I inlined my answers, which the last one seems to answer Robbin's
>>> concerns about the same thing (adding things to Thread).
>>>
>>> On Wed, Feb 14, 2018 at 2:51 AM, Erik Österlund <
>>> erik.osterlund at oracle.com> wrote:
>>>
>>>> Hi JC,
>>>>
>>>> Comments are inlined below.
>>>>
>>>>
>>>> On 2018-02-13 06:18, JC Beyler wrote:
>>>>
>>>> Hi Erik,
>>>>
>>>> Thanks for your answers, I've now inlined my own answers/comments.
>>>>
>>>> I've done a new webrev here:
>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.08/
>>>>
>>>> The incremental is here:
>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/
>>>>
>>>> Note to all:
>>>>   - I've been integrating changes from Erin/Serguei/David comments so
>>>> this webrev incremental is a bit an answer to all comments in one. I
>>>> apologize for that :)
>>>>
>>>>
>>>> On Mon, Feb 12, 2018 at 6:05 AM, Erik Österlund <
>>>> erik.osterlund at oracle.com> wrote:
>>>>
>>>>> Hi JC,
>>>>>
>>>>> Sorry for the delayed reply.
>>>>>
>>>>> Inlined answers:
>>>>>
>>>>>
>>>>> On 2018-02-06 00:04, JC Beyler wrote:
>>>>>
>>>>>> Hi Erik,
>>>>>>
>>>>>> (Renaming this to be folded into the newly renamed thread :))
>>>>>>
>>>>>> First off, thanks a lot for reviewing the webrev! I appreciate it!
>>>>>>
>>>>>> I updated the webrev to:
>>>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/
>>>>>>
>>>>>> And the incremental one is here:
>>>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.04_05a/
>>>>>>
>>>>>> It contains:
>>>>>> - The change for since from 9 to 11 for the jvmti.xml
>>>>>> - The use of the OrderAccess for initialized
>>>>>> - Clearing the oop
>>>>>>
>>>>>> I also have inlined my answers to your comments. The biggest question
>>>>>> will come from the multiple *_end variables. A bit of the logic there
>>>>>> is due to handling the slow path refill vs fast path refill and
>>>>>> checking that the rug was not pulled underneath the slowpath. I
>>>>>> believe that a previous comment was that TlabFastRefill was going to
>>>>>> be deprecated.
>>>>>>
>>>>>> If this is true, we could revert this code a bit and just do a : if
>>>>>> TlabFastRefill is enabled, disable this. And then deprecate that when
>>>>>> TlabFastRefill is deprecated.
>>>>>>
>>>>>> This might simplify this webrev and I can work on a follow-up that
>>>>>> either: removes TlabFastRefill if Robbin does not have the time to do
>>>>>> it or add the support to the assembly side to handle this correctly.
>>>>>> What do you think?
>>>>>>
>>>>>
>>>>> I support removing TlabFastRefill, but I think it is good to not
>>>>> depend on that happening first.
>>>>>
>>>>>
>>>>
>>>> I'm slowly pushing on the FastTLABRefill (
>>>> <https://bugs.openjdk.java.net/browse/JDK-8194084>
>>>> https://bugs.openjdk.java.net/browse/JDK-8194084), I agree on keeping
>>>> both separate for now though so that we can think of both differently
>>>>
>>>>
>>>>
>>>>> Now, below, inlined are my answers:
>>>>>>
>>>>>> On Fri, Feb 2, 2018 at 8:44 AM, Erik Österlund
>>>>>> <erik.osterlund at oracle.com> wrote:
>>>>>>
>>>>>>> Hi JC,
>>>>>>>
>>>>>>> Hope I am reviewing the right version of your work. Here goes...
>>>>>>>
>>>>>>> src/hotspot/share/gc/shared/collectedHeap.inline.hpp:
>>>>>>>
>>>>>>>   159     AllocTracer::send_allocation_outside_tlab(klass, result,
>>>>>>> size *
>>>>>>> HeapWordSize, THREAD);
>>>>>>>   160
>>>>>>>   161     THREAD->tlab().handle_sample(THREAD, result, size);
>>>>>>>   162     return result;
>>>>>>>   163   }
>>>>>>>
>>>>>>> Should not call tlab()->X without checking if (UseTLAB) IMO.
>>>>>>>
>>>>>>> Done!
>>>>>>
>>>>>
>>>>> More about this later.
>>>>>
>>>>>
>>>>>
>>>>>> src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp:
>>>>>>>
>>>>>>> So first of all, there seems to quite a few ends. There is an "end",
>>>>>>> a "hard
>>>>>>> end", a "slow path end", and an "actual end". Moreover, it seems
>>>>>>> like the
>>>>>>> "hard end" is actually further away than the "actual end". So the
>>>>>>> "hard end"
>>>>>>> seems like more of a "really definitely actual end" or something. I
>>>>>>> don't
>>>>>>> know about you, but I think it looks kind of messy. In particular, I
>>>>>>> don't
>>>>>>> feel like the name "actual end" reflects what it represents,
>>>>>>> especially when
>>>>>>> there is another end that is behind the "actual end".
>>>>>>>
>>>>>>>   413 HeapWord* ThreadLocalAllocBuffer::hard_end() {
>>>>>>>   414   // Did a fast TLAB refill occur?
>>>>>>>   415   if (_slow_path_end != _end) {
>>>>>>>   416     // Fix up the actual end to be now the end of this TLAB.
>>>>>>>   417     _slow_path_end = _end;
>>>>>>>   418     _actual_end = _end;
>>>>>>>   419   }
>>>>>>>   420
>>>>>>>   421   return _actual_end + alignment_reserve();
>>>>>>>   422 }
>>>>>>>
>>>>>>> I really do not like making getters unexpectedly have these kind of
>>>>>>> side
>>>>>>> effects. It is not expected that when you ask for the "hard end", you
>>>>>>> implicitly update the "slow path end" and "actual end" to new values.
>>>>>>>
>>>>>>> As I said, a lot of this is due to the FastTlabRefill. If I make this
>>>>>> not supporting FastTlabRefill, this goes away. The reason the system
>>>>>> needs to update itself at the get is that you only know at that get if
>>>>>> things have shifted underneath the tlab slow path. I am not sure of
>>>>>> really better names (naming is hard!), perhaps we could do these
>>>>>> names:
>>>>>>
>>>>>> - current_tlab_end       // Either the allocated tlab end or a
>>>>>> sampling point
>>>>>> - last_allocation_address  // The end of the tlab allocation
>>>>>> - last_slowpath_allocated_end  // In case a fast refill occurred the
>>>>>> end might have changed, this is to remember slow vs fast past refills
>>>>>>
>>>>>> the hard_end method can be renamed to something like:
>>>>>> tlab_end_pointer()        // The end of the lab including a bit of
>>>>>> alignment reserved bytes
>>>>>>
>>>>>
>>>>> Those names sound better to me. Could you please provide a mapping
>>>>> from the old names to the new names so I understand which one is which
>>>>> please?
>>>>>
>>>>> This is my current guess of what you are proposing:
>>>>>
>>>>> end -> current_tlab_end
>>>>> actual_end -> last_allocation_address
>>>>> slow_path_end -> last_slowpath_allocated_end
>>>>> hard_end -> tlab_end_pointer
>>>>>
>>>>>
>>>> Yes that is correct, that was what I was proposing.
>>>>
>>>>
>>>>> I would prefer this naming:
>>>>>
>>>>> end -> slow_path_end // the end for taking a slow path; either due to
>>>>> sampling or refilling
>>>>> actual_end -> allocation_end // the end for allocations
>>>>> slow_path_end -> last_slow_path_end // last address for slow_path_end
>>>>> (as opposed to allocation_end)
>>>>> hard_end -> reserved_end // the end of the reserved space of the TLAB
>>>>>
>>>>> About setting things in the getter... that still seems like a very
>>>>> unpleasant thing to me. It would be better to inspect the call hierarchy
>>>>> and explicitly update the ends where they need updating, and assert in the
>>>>> getter that they are in sync, rather than implicitly setting various ends
>>>>> as a surprising side effect in a getter. It looks like the call hierarchy
>>>>> is very small. With my new naming convention, reserved_end() would
>>>>> presumably return _allocation_end + alignment_reserve(), and have an assert
>>>>> checking that _allocation_end == _last_slow_path_allocation_end,
>>>>> complaining that this invariant must hold, and that a caller to this
>>>>> function, such as make_parsable(), must first explicitly synchronize the
>>>>> ends as required, to honor that invariant.
>>>>>
>>>>>
>>>>
>>>> I've renamed the variables to how you preferred it except for the _end
>>>> one. I did:
>>>> current_end
>>>> last_allocation_address
>>>> tlab_end_ptr
>>>>
>>>> The reason is that the architecture dependent code use the thread.hpp
>>>> API and it already has tlab included into the name so it becomes
>>>> tlab_current_end (which is better that tlab_current_tlab_end in my opinion).
>>>>
>>>> I also moved the update into a separate method with a TODO that says to
>>>> remove it when FastTLABRefill is deprecated
>>>>
>>>>
>>>> This looks a lot better now. Thanks.
>>>>
>>>> Note that the following comment now needs updating accordingly in
>>>> threadLocalAllocBuffer.hpp:
>>>>
>>>>   41 //            Heap sampling is performed via the end/actual_end fields.  42 //            actual_end contains the real end of the tlab allocation,  43 //            whereas end can be set to an arbitrary spot in the tlab to  44 //            trip the return and sample the allocation.  45 //            slow_path_end is used to track if a fast tlab refill occured  46 //            between slowpath calls.
>>>>
>>>> There might be other comments too, I have not looked in detail.
>>>>
>>>
>>> This was the only spot that still had an actual_end, I fixed it now.
>>> I'll do a sweep to double check other comments.
>>>
>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> Not sure it's better but before updating the webrev, I wanted to try
>>>>>> to get input/consensus :)
>>>>>>
>>>>>> (Note hard_end was always further off than end).
>>>>>>
>>>>>> src/hotspot/share/prims/jvmti.xml:
>>>>>>>
>>>>>>> 10357       <capabilityfield id="can_sample_heap" since="9">
>>>>>>> 10358         <description>
>>>>>>> 10359           Can sample the heap.
>>>>>>> 10360           If this capability is enabled then the heap sampling
>>>>>>> methods
>>>>>>> can be called.
>>>>>>> 10361         </description>
>>>>>>> 10362       </capabilityfield>
>>>>>>>
>>>>>>> Looks like this capability should not be "since 9" if it gets
>>>>>>> integrated
>>>>>>> now.
>>>>>>>
>>>>>> Updated now to 11, crossing my fingers :)
>>>>>>
>>>>>>
>>>>>> src/hotspot/share/runtime/heapMonitoring.cpp:
>>>>>>>
>>>>>>>   448       if (is_alive->do_object_b(value)) {
>>>>>>>   449         // Update the oop to point to the new object if it is
>>>>>>> still
>>>>>>> alive.
>>>>>>>   450         f->do_oop(&(trace.obj));
>>>>>>>   451
>>>>>>>   452         // Copy the old trace, if it is still live.
>>>>>>>   453         _allocated_traces->at_put(curr_pos++, trace);
>>>>>>>   454
>>>>>>>   455         // Store the live trace in a cache, to be served up on
>>>>>>> /heapz.
>>>>>>>   456         _traces_on_last_full_gc->append(trace);
>>>>>>>   457
>>>>>>>   458         count++;
>>>>>>>   459       } else {
>>>>>>>   460         // If the old trace is no longer live, add it to the
>>>>>>> list of
>>>>>>>   461         // recently collected garbage.
>>>>>>>   462         store_garbage_trace(trace);
>>>>>>>   463       }
>>>>>>>
>>>>>>> In the case where the oop was not live, I would like it to be
>>>>>>> explicitly
>>>>>>> cleared.
>>>>>>>
>>>>>> Done I think how you wanted it. Let me know because I'm not familiar
>>>>>> with the RootAccess API. I'm unclear if I'm doing this right or not so
>>>>>> reviews of these parts are highly appreciated. Robbin had talked of
>>>>>> perhaps later pushing this all into a OopStorage, should I do this now
>>>>>> do you think? Or can that wait a second webrev later down the road?
>>>>>>
>>>>>
>>>>> I think using handles can and should be done later. You can use the
>>>>> Access API now.
>>>>> I noticed that you are missing an #include "oops/access.inline.hpp" in
>>>>> your heapMonitoring.cpp file.
>>>>>
>>>>>
>>>> The missing header is there for me so I don't know, I made sure it is
>>>> present in the latest webrev. Sorry about that.
>>>>
>>>>
>>>>
>>>>> + Did I clear it the way you wanted me to or were you thinking of
>>>>>> something else?
>>>>>>
>>>>>
>>>>> That is precisely how I wanted it to be cleared. Thanks.
>>>>>
>>>>> + Final question here, seems like if I were to want to not do the
>>>>>> f->do_oop directly on the trace.obj, I'd need to do something like:
>>>>>>
>>>>>>     f->do_oop(&value);
>>>>>>     ...
>>>>>>     trace->store_oop(value);
>>>>>>
>>>>>> to update the oop internally. Is that right/is that one of the
>>>>>> advantages of going to the Oopstorage sooner than later?
>>>>>>
>>>>>
>>>>> I think you really want to do the do_oop on the root directly. Is
>>>>> there a particular reason why you would not want to do that?
>>>>> Otherwise, yes - the benefit with using the handle approach is that
>>>>> you do not need to call do_oop explicitly in your code.
>>>>>
>>>>>
>>>> There is no reason except that now we have a load_oop and a
>>>> get_oop_addr, I was not sure what you would think of that.
>>>>
>>>>
>>>> That's fine.
>>>>
>>>>
>>>>
>>>>>
>>>>>> Also I see a lot of concurrent-looking use of the following field:
>>>>>>>   267   volatile bool _initialized;
>>>>>>>
>>>>>>> Please note that the "volatile" qualifier does not help with
>>>>>>> reordering
>>>>>>> here. Reordering between volatile and non-volatile fields is
>>>>>>> completely free
>>>>>>> for both compiler and hardware, except for windows with MSVC, where
>>>>>>> volatile
>>>>>>> semantics is defined to use acquire/release semantics, and the
>>>>>>> hardware is
>>>>>>> TSO. But for the general case, I would expect this field to be
>>>>>>> stored with
>>>>>>> OrderAccess::release_store and loaded with OrderAccess::load_acquire.
>>>>>>> Otherwise it is not thread safe.
>>>>>>>
>>>>>> Because everything is behind a mutex, I wasn't really worried about
>>>>>> this. I have a test that has multiple threads trying to hit this
>>>>>> corner case and it passes.
>>>>>>
>>>>>> However, to be paranoid, I updated it to using the OrderAccess API
>>>>>> now, thanks! Let me know what you think there too!
>>>>>>
>>>>>
>>>>> If it is indeed always supposed to be read and written under a mutex,
>>>>> then I would strongly prefer to have it accessed as a normal non-volatile
>>>>> member, and have an assertion that given lock is held or we are in a
>>>>> safepoint, as we do in many other places. Something like this:
>>>>>
>>>>> assert(HeapMonitorStorage_lock->owned_by_self() ||
>>>>> (SafepointSynchronize::is_at_safepoint() &&
>>>>> Thread::current()->is_VM_thread()), "this should not be accessed
>>>>> concurrently");
>>>>>
>>>>> It would be confusing to people reading the code if there are uses of
>>>>> OrderAccess that are actually always protected under a mutex.
>>>>>
>>>>>
>>>> Thank you for the exact example to be put in the code! I put it around
>>>> each access/assignment of the _initialized method and found one case where
>>>> yes you can touch it and not have the lock. It actually is "ok" because you
>>>> don't act on the storage until later and only when you really want to
>>>> modify the storage (see the object_alloc_do_sample method which calls the
>>>> add_trace method).
>>>>
>>>> But, because of this, I'm going to put the OrderAccess here, I'll do
>>>> some performance numbers later and if there are issues, I might add a
>>>> "unsafe" read and a "safe" one to make it explicit to the reader. But I
>>>> don't think it will come to that.
>>>>
>>>>
>>>> Okay. This double return in heapMonitoring.cpp looks wrong:
>>>>
>>>>  283   bool initialized() {
>>>>  284     return OrderAccess::load_acquire(&_initialized) != 0;
>>>>  285     return _initialized;
>>>>  286   }
>>>>
>>>> Since you said object_alloc_do_sample() is the only place where you do
>>>> not hold the mutex while reading initialized(), I had a closer look at
>>>> that. It looks like in its current shape, the lack of a mutex may lead to a
>>>> memory leak. In particular, it first checks if (initialized()). Let's
>>>> assume this is now true. It then allocates a bunch of stuff, and checks if
>>>> the number of frames were over 0. If they were, it calls
>>>> StackTraceStorage::storage()->add_trace() seemingly hoping that after
>>>> grabbing the lock in there, initialized() will still return true. But it
>>>> could now return false and skip doing anything, in which case the allocated
>>>> stuff will never be freed.
>>>>
>>>
>>> I fixed this now by making add_trace return a boolean and checking for
>>> that. It will be in the next webrev. Thanks, the truth is that in our
>>> implementation the system is always on or off, so this never really occurs
>>> :). In this version though, that is not true and it's important to handle
>>> so thanks again!
>>>
>>>
>>>
>>>>
>>>> So the analysis seems to be that _initialized is only used outside of
>>>> the mutex in once instance, where it is used to perform double-checked
>>>> locking, that actually causes a memory leak.
>>>>
>>>> I am not proposing how to fix that, just raising the issue. If you
>>>> still want to perform this double-checked locking somehow, then the use of
>>>> acquire/release still seems odd. Because the memory ordering restrictions
>>>> of it never comes into play in this particular case. If it ever did, then
>>>> the use of destroy_stuff(); release_store(_initialized, 0) would be broken
>>>> anyway as that would imply that whatever concurrent reader there ever was
>>>> would after reading _initialized with load_acquire() could *never* read the
>>>> data that is concurrently destroyed anyway. I would be biased to think that
>>>> RawAccess<MO_RELAXED>::load/store looks like a more appropriate solution,
>>>> given that the memory leak issue is resolved. I do not know how painful it
>>>> would be to not perform this double-checked locking.
>>>>
>>>
>>> So I agree with this entirely. I looked also a bit more and the
>>> difference and code really stems from our internal version. In this version
>>> however, there are actually a lot of things going on that I did not go
>>> entirely through in my head but this comment made me ponder a bit more on
>>> it.
>>>
>>> Since every object_alloc_do_sample is protected by a check to
>>> HeapMonitoring::enabled(), there is only a small chance that the call is
>>> happening when things have been disabled. So there is no real need to do a
>>> first check on the initialized, it is a rare occurence that a call happens
>>> to object_alloc_do_sample and the initialized of the storage returns false.
>>>
>>> (By the way, even if you did call object_alloc_do_sample without looking
>>> at HeapMonitoring::enabled(), that would be ok too. You would gather the
>>> stacktrace and get nowhere at the add_trace call, which would return false;
>>> so though not optimal performance wise, nothing would break).
>>>
>>> Furthermore, the add_trace is really the moment of no return and we have
>>> the mutex lock and then the initialized check. So, in the end, I did two
>>> things: I removed that first check and then I removed the OrderAccess for
>>> the storage initialized. I think now I have a better grasp and
>>> understanding why it was done in our code and why it is not needed here.
>>> Thanks for pointing it out :). This now still passes my JTREG tests,
>>> especially the threaded one.
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> As a kind of meta comment, I wonder if it would make sense to add
>>>>>>> sampling
>>>>>>> for non-TLAB allocations. Seems like if someone is rapidly
>>>>>>> allocating a
>>>>>>> whole bunch of 1 MB objects that never fit in a TLAB, I might still
>>>>>>> be
>>>>>>> interested in seeing that in my traces, and not get surprised that
>>>>>>> the
>>>>>>> allocation rate is very high yet not showing up in any profiles.
>>>>>>>
>>>>>>> That is handled by the handle_sample where you wanted me to put a
>>>>>> UseTlab because you hit that case if the allocation is too big.
>>>>>>
>>>>>
>>>>> I see. It was not obvious to me that non-TLAB sampling is done in the
>>>>> TLAB class. That seems like an abstraction crime.
>>>>> What I wanted in my previous comment was that we do not call into the
>>>>> TLAB when we are not using TLABs. If there is sampling logic in the TLAB
>>>>> that is used for something else than TLABs, then it seems like that logic
>>>>> simply does not belong inside of the TLAB. It should be moved out of the
>>>>> TLAB, and instead have the TLAB call this common abstraction that makes
>>>>> sense.
>>>>>
>>>>>
>>>> So in the incremental version:
>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/, this is
>>>> still a "crime". The reason is that the system has to have the
>>>> bytes_until_sample on a per-thread level and it made "sense" to have it
>>>> with the TLAB implementation. Also, I was not sure how people felt about
>>>> adding something to the thread instance instead.
>>>>
>>>> Do you think it fits better at the Thread level? I can see how
>>>> difficult it is to make it happen there and add some logic there. Let me
>>>> know what you think.
>>>>
>>>>
>>>> We have an unfortunate situation where everyone that has some fields
>>>> that are thread local tend to dump them right into Thread, making the size
>>>> and complexity of Thread grow as it becomes tightly coupled with various
>>>> unrelated subsystems. It would be desirable to have a separate class for
>>>> this instead that encapsulates the sampling logic. That class could
>>>> possibly reside in Thread though as a value object of Thread.
>>>>
>>>
>>> I imagined that would be the case but was not sure. I will look at the
>>> example that Robbin is talking about (ThreadSMR) and will see how to
>>> refactor my code to use that.
>>>
>>> Thanks again for your help,
>>> Jc
>>>
>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> Hope I have answered your questions and that my feedback makes sense
>>>>> to you.
>>>>>
>>>>>
>>>> You have and thank you for them, I think we are getting to a cleaner
>>>> implementation and things are getting better and more readable :)
>>>>
>>>>
>>>> Yes it is getting better.
>>>>
>>>> Thanks,
>>>> /Erik
>>>>
>>>>
>>>> Thanks for your help!
>>>> Jc
>>>>
>>>>
>>>>
>>>>> Thanks,
>>>>> /Erik
>>>>>
>>>>>
>>>>> I double checked by changing the test
>>>>>>
>>>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatObjectCorrectnessTest.java
>>>>>>
>>>>>> to use a smaller Tlab (2048) and made the object bigger and it goes
>>>>>> through that and passes.
>>>>>>
>>>>>> Thanks again for your review and I look forward to your pointers for
>>>>>> the questions I now have raised!
>>>>>> Jc
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>> /Erik
>>>>>>>
>>>>>>>
>>>>>>> On 2018-01-26 06:45, JC Beyler wrote:
>>>>>>>
>>>>>>>> Thanks Robbin for the reviews :)
>>>>>>>>
>>>>>>>> The new full webrev is here:
>>>>>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.03/
>>>>>>>> The incremental webrev is here:
>>>>>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.02_03/
>>>>>>>>
>>>>>>>> I inlined my answers:
>>>>>>>>
>>>>>>>> On Thu, Jan 25, 2018 at 1:15 AM, Robbin Ehn <
>>>>>>>> <robbin.ehn at oracle.com>robbin.ehn at oracle.com> wrote:
>>>>>>>>
>>>>>>>>> Hi JC, great to see another revision!
>>>>>>>>>
>>>>>>>>> ####
>>>>>>>>> heapMonitoring.cpp
>>>>>>>>>
>>>>>>>>> StackTraceData should not contain the oop for 'safety' reasons.
>>>>>>>>> When StackTraceData is moved from _allocated_traces:
>>>>>>>>> L452 store_garbage_trace(trace);
>>>>>>>>> it contains a dead oop.
>>>>>>>>> _allocated_traces could instead be a tupel of oop and
>>>>>>>>> StackTraceData thus
>>>>>>>>> dead oops are not kept.
>>>>>>>>>
>>>>>>>> Done I used inheritance to make the copier work regardless but the
>>>>>>>> idea is the same.
>>>>>>>>
>>>>>>>> You should use the new Access API for loading the oop, something
>>>>>>>>> like
>>>>>>>>> this:
>>>>>>>>> RootAccess<ON_PHANTOM_OOP_REF | AS_NO_KEEPALIVE>::load(...)
>>>>>>>>> I don't think you need to use Access API for clearing the oop, but
>>>>>>>>> it
>>>>>>>>> would
>>>>>>>>> look nicer. And you shouldn't probably be using:
>>>>>>>>> Universe::heap()->is_in_reserved(value)
>>>>>>>>>
>>>>>>>> I am unfamiliar with this but I think I did do it like you wanted me
>>>>>>>> to (all tests pass so that's a start). I'm not sure how to clear the
>>>>>>>> oop exactly, is there somewhere that does that, which I can use to
>>>>>>>> do
>>>>>>>> the same?
>>>>>>>>
>>>>>>>> I removed the is_in_reserved, this came from our internal version, I
>>>>>>>> don't know why it was there but my tests work without so I removed
>>>>>>>> it
>>>>>>>> :)
>>>>>>>>
>>>>>>>>
>>>>>>>> The lock:
>>>>>>>>> L424   MutexLocker mu(HeapMonitorStorage_lock);
>>>>>>>>> Is not needed as far as I can see.
>>>>>>>>> weak_oops_do is called in a safepoint, no TLAB allocation can
>>>>>>>>> happen and
>>>>>>>>> JVMTI thread can't access these data-structures. Is there
>>>>>>>>> something more
>>>>>>>>> to
>>>>>>>>> this lock that I'm missing?
>>>>>>>>>
>>>>>>>> Since a thread can call the JVMTI getLiveTraces (or any of the other
>>>>>>>> ones), it can get to the point of trying to copying the
>>>>>>>> _allocated_traces. I imagine it is possible that this is happening
>>>>>>>> during a GC or that it can be started and a GC happens afterwards.
>>>>>>>> Therefore, it seems to me that you want this protected, no?
>>>>>>>>
>>>>>>>>
>>>>>>>> ####
>>>>>>>>> You have 6 files without any changes in them (any more):
>>>>>>>>> g1CollectedHeap.cpp
>>>>>>>>> psMarkSweep.cpp
>>>>>>>>> psParallelCompact.cpp
>>>>>>>>> genCollectedHeap.cpp
>>>>>>>>> referenceProcessor.cpp
>>>>>>>>> thread.hpp
>>>>>>>>>
>>>>>>>>> Done.
>>>>>>>>
>>>>>>>> ####
>>>>>>>>> I have not looked closely, but is it possible to hide heap
>>>>>>>>> sampling in
>>>>>>>>> AllocTracer ? (with some minor changes to the AllocTracer API)
>>>>>>>>>
>>>>>>>>> I am imagining that you are saying to move the code that does the
>>>>>>>> sampling code (change the tlab end, do the call to HeapMonitoring,
>>>>>>>> etc.) into the AllocTracer code itself? I think that is right and
>>>>>>>> I'll
>>>>>>>> look if that is possible and prepare a webrev to show what would be
>>>>>>>> needed to make that happen.
>>>>>>>>
>>>>>>>> ####
>>>>>>>>> Minor nit, when declaring pointer there is a little mix of having
>>>>>>>>> the
>>>>>>>>> pointer adjacent by type name and data name. (Most hotspot code is
>>>>>>>>> by
>>>>>>>>> type
>>>>>>>>> name)
>>>>>>>>> E.g.
>>>>>>>>> heapMonitoring.cpp:711     jvmtiStackTrace *trace = ....
>>>>>>>>> heapMonitoring.cpp:733         Method* m = vfst.method();
>>>>>>>>> (not just this file)
>>>>>>>>>
>>>>>>>>> Done!
>>>>>>>>
>>>>>>>> ####
>>>>>>>>> HeapMonitorThreadOnOffTest.java:77
>>>>>>>>> I would make g_tmp volatile, otherwise the assignment in loop may
>>>>>>>>> theoretical be skipped.
>>>>>>>>>
>>>>>>>>> Also done!
>>>>>>>>
>>>>>>>> Thanks again!
>>>>>>>> Jc
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20180328/74744343/attachment-0001.html>