JDK-8171119: Low-Overhead Heap Profiling

Thu Apr 5 11:54:01 UTC 2018

Hi JC,

   I have just checked on arm32: your patch compiles and runs ok.

   As I can see, jtreg agentlib name "-agentlib:HeapMonitor" does not 
correspond to actual library name: libHeapMonitorTest.c -> 
libHeapMonitorTest.so

Boris

On 04.04.2018 01:54, White, Derek wrote:
> Thanks JC,
> 
> New patch applies cleanly. Compiles and runs (simple test programs) on 
> aarch64.
> 
>   * Derek
> 
> *From:* JC Beyler [mailto:jcbeyler at google.com]
> *Sent:* Monday, April 02, 2018 1:17 PM
> *To:* White, Derek <Derek.White at cavium.com>
> *Cc:* Erik Österlund <erik.osterlund at oracle.com>; 
> serviceability-dev at openjdk.java.net; hotspot-compiler-dev 
> <hotspot-compiler-dev at openjdk.java.net>
> *Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling
> 
> Hi Derek,
> 
> I know there were a few things that went in that provoked a merge 
> conflict. I worked on it and got it up to date. Sadly my lack of 
> knowledge makes it a full rebase instead of keeping all the history. 
> However, with a newly cloned jdk/hs you should now be able to use:
> 
> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/
> 
> The change you are referring to was done with the others so perhaps you 
> were unlucky and I forgot it in a webrev and fixed it in another? I 
> don't know but it's been there and I checked, it is here:
> 
> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp.udiff.html
> 
> I double checked that tlab_end_offset no longer appears in any 
> architecture (as far as I can tell :)).
> 
> Thanks for testing and let me know if you run into any other issues!
> 
> Jc
> 
> On Fri, Mar 30, 2018 at 4:24 PM White, Derek <Derek.White at cavium.com 
> <mailto:Derek.White at cavium.com>> wrote:
> 
>     Hi Jc,
> 
>     I’ve been having trouble getting your patch to apply correctly. I
>     may have based it on the wrong version.
> 
>     In any case, I think there’s a missing update to
>     macroAssembler_aarch64.cpp, in MacroAssembler::tlab_allocate(),
>     where “JavaThread::tlab_end_offset()” should become
>     “JavaThread::tlab_current_end_offset()”.
> 
>     This should correspond to the other port’s changes in
>     templateTable_<cpu>.cpp files.
> 
>     Thanks!
>     - Derek
> 
>     *From:* hotspot-compiler-dev
>     [mailto:hotspot-compiler-dev-bounces at openjdk.java.net
>     <mailto:hotspot-compiler-dev-bounces at openjdk.java.net>] *On Behalf
>     Of *JC Beyler
>     *Sent:* Wednesday, March 28, 2018 11:43 AM
>     *To:* Erik Österlund <erik.osterlund at oracle.com
>     <mailto:erik.osterlund at oracle.com>>
>     *Cc:* serviceability-dev at openjdk.java.net
>     <mailto:serviceability-dev at openjdk.java.net>; hotspot-compiler-dev
>     <hotspot-compiler-dev at openjdk.java.net
>     <mailto:hotspot-compiler-dev at openjdk.java.net>>
>     *Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling
> 
>     Hi all,
> 
>     I've been working on deflaking the tests mostly and the wording in
>     the JVMTI spec.
> 
>     Here is the two incremental webrevs:
> 
>     http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.5_6/
> 
>     http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.06_07/
> 
>     Here is the total webrev:
> 
>     http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.07/
> 
>     Here are the notes of this change:
> 
>        - Currently the tests pass 100 times in a row, I am working on
>     checking if they pass 1000 times in a row.
> 
>        - The default sampling rate is set to 512k, this is what we use
>     internally and having a default means that to enable the sampling
>     with the default, the user only has to do a enable event/disable
>     event via JVMTI (instead of enable + set sample rate).
> 
>        - I deprecated the code that was handling the fast path tlab
>     refill if it happened since this is now deprecated
> 
>            - Though I saw that Graal is still using it so I have to see
>     what needs to be done there exactly
> 
>     Finally, using the Dacapo benchmark suite, I noted a 1% overhead for
>     when the event system is turned on and the callback to the native
>     agent is just empty. I got a 3% overhead with a 512k sampling rate
>     with the code I put in the native side of my tests.
> 
>     Thanks and comments are appreciated,
> 
>     Jc
> 
>     On Mon, Mar 19, 2018 at 2:06 PM JC Beyler <jcbeyler at google.com
>     <mailto:jcbeyler at google.com>> wrote:
> 
>         Hi all,
> 
>         The incremental webrev update is here:
> 
>         http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event4_5/
> 
>         The full webrev is here:
> 
>         http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/
> 
>         Major change here is:
> 
>            - I've removed the heapMonitoring.cpp code in favor of just
>         having the sampling events as per Serguei's request; I still
>         have to do some overhead measurements but the tests prove the
>         concept can work
> 
>                 - Most of the tlab code is unchanged, the only major
>         part is that now things get sent off to event collectors when
>         used and enabled.
> 
>            - Added the interpreter collectors to handle interpreter
>         execution
> 
>            - Updated the name from SetTlabHeapSampling to
>         SetHeapSampling to be more generic
> 
>            - Added a mutex for the thread sampling so that we can
>         initialize an internal static array safely
> 
>            - Ported the tests from the old system to this new one
> 
>         I've also updated the JEP and CSR to reflect these changes:
> 
>         https://bugs.openjdk.java.net/browse/JDK-8194905
> 
>         https://bugs.openjdk.java.net/browse/JDK-8171119
> 
>         In order to make this have some forward progress, I've removed
>         the heap sampling code entirely and now rely entirely on the
>         event sampling system. The tests reflect this by using a
>         simplified implementation of what an agent could do:
> 
>         http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/libHeapMonitor.c
> 
>         (Search for anything mentioning event_storage).
> 
>         I have not taken the time to port the whole code we had
>         originally in heapMonitoring to this. I hesitate only because
>         that code was in C++, I'd have to port it to C and this is for
>         tests so perhaps what I have now is good enough?
> 
>         As far as testing goes, I've ported all the relevant tests and
>         then added a few:
> 
>             - Turning the system on/off
> 
>             - Testing using various GCs
> 
>             - Testing using the interpreter
> 
>             - Testing the sampling rate
> 
>             - Testing with objects and arrays
> 
>             - Testing with various threads
> 
>         Finally, as overhead goes, I have the numbers of the system off
>         vs a clean build and I have 0% overhead, which is what we'd
>         want. This was using the Dacapo benchmarks. I am now preparing
>         to run a version with the events on using dacapo and will report
>         back here.
> 
>         Any comments are welcome :)
> 
>         Jc
> 
>         On Thu, Mar 8, 2018 at 4:00 PM JC Beyler <jcbeyler at google.com
>         <mailto:jcbeyler at google.com>> wrote:
> 
>             Hi all,
> 
>             I apologize for the delay but I wanted to add an event
>             system and that took a bit longer than expected and I also
>             reworked the code to take into account the deprecation of
>             FastTLABRefill.
> 
>             This update has four parts:
> 
>             A) I moved the implementation from Thread to
>             ThreadHeapSampler inside of Thread. Would you prefer it as a
>             pointer inside of Thread or like this works for you? Second
>             question would be would you rather have an association
>             outside of Thread altogether that tries to remember when
>             threads are live and then we would have something like:
> 
>             ThreadHeapSampler::get_sampling_size(this_thread);
> 
>             I worry about the overhead of this but perhaps it is not too
>             too bad?
> 
>             B) I also have been working on the Allocation event system
>             that sends out a notification at each sampled event. This
>             will be practical when wanting to do something at the
>             allocation point. I'm also looking at if the whole
>             heapMonitoring code could not reside in the agent code and
>             not in the JDK. I'm not convinced but I'm talking to Serguei
>             about it to see/assess :)
> 
>                 - Also added two tests for the new event subsystem
> 
>             C) Removed the slow_path fields inside the TLAB code since
>             now FastTLABRefill is deprecated
> 
>             D) Updated the JVMTI documentation and specification for the
>             methods.
> 
>             So the incremental webrev is here:
> 
>             http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.09_10/
> 
>             and the full webrev is here:
> 
>             http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.10
> 
>             I believe I have updated the various JIRA issues that track
>             this :)
> 
>             Thanks for your input,
> 
>             Jc
> 
>             On Wed, Feb 14, 2018 at 10:34 PM, JC Beyler
>             <jcbeyler at google.com <mailto:jcbeyler at google.com>> wrote:
> 
>                 Hi Erik,
> 
>                 I inlined my answers, which the last one seems to answer
>                 Robbin's concerns about the same thing (adding things to
>                 Thread).
> 
>                 On Wed, Feb 14, 2018 at 2:51 AM, Erik Österlund
>                 <erik.osterlund at oracle.com
>                 <mailto:erik.osterlund at oracle.com>> wrote:
> 
>                     Hi JC,
> 
>                     Comments are inlined below.
> 
>                     On 2018-02-13 06:18, JC Beyler wrote:
> 
>                         Hi Erik,
> 
>                         Thanks for your answers, I've now inlined my own
>                         answers/comments.
> 
>                         I've done a new webrev here:
> 
>                         http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.08/
>                         <http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.08/>
> 
>                         The incremental is here:
> 
>                         http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/
>                         <http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.07_08/>
> 
>                         Note to all:
> 
>                            - I've been integrating changes from
>                         Erin/Serguei/David comments so this webrev
>                         incremental is a bit an answer to all comments
>                         in one. I apologize for that :)
> 
>                         On Mon, Feb 12, 2018 at 6:05 AM, Erik Österlund
>                         <erik.osterlund at oracle.com
>                         <mailto:erik.osterlund at oracle.com>> wrote:
> 
>                             Hi JC,
> 
>                             Sorry for the delayed reply.
> 
>                             Inlined answers:
> 
> 
> 
>                             On 2018-02-06 00:04, JC Beyler wrote:
> 
>                                 Hi Erik,
> 
>                                 (Renaming this to be folded into the
>                                 newly renamed thread :))
> 
>                                 First off, thanks a lot for reviewing
>                                 the webrev! I appreciate it!
> 
>                                 I updated the webrev to:
>                                 http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/
>                                 <http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.05a/>
> 
>                                 And the incremental one is here:
>                                 http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.04_05a/
>                                 <http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.04_05a/>
> 
>                                 It contains:
>                                 - The change for since from 9 to 11 for
>                                 the jvmti.xml
>                                 - The use of the OrderAccess for initialized
>                                 - Clearing the oop
> 
>                                 I also have inlined my answers to your
>                                 comments. The biggest question
>                                 will come from the multiple *_end
>                                 variables. A bit of the logic there
>                                 is due to handling the slow path refill
>                                 vs fast path refill and
>                                 checking that the rug was not pulled
>                                 underneath the slowpath. I
>                                 believe that a previous comment was that
>                                 TlabFastRefill was going to
>                                 be deprecated.
> 
>                                 If this is true, we could revert this
>                                 code a bit and just do a : if
>                                 TlabFastRefill is enabled, disable this.
>                                 And then deprecate that when
>                                 TlabFastRefill is deprecated.
> 
>                                 This might simplify this webrev and I
>                                 can work on a follow-up that
>                                 either: removes TlabFastRefill if Robbin
>                                 does not have the time to do
>                                 it or add the support to the assembly
>                                 side to handle this correctly.
>                                 What do you think?
> 
>                             I support removing TlabFastRefill, but I
>                             think it is good to not depend on that
>                             happening first.
> 
> 
>                         I'm slowly pushing on the FastTLABRefill
>                         (https://bugs.openjdk.java.net/browse/JDK-8194084),
>                         I agree on keeping both separate for now though
>                         so that we can think of both differently
> 
>                                 Now, below, inlined are my answers:
> 
>                                 On Fri, Feb 2, 2018 at 8:44 AM, Erik
>                                 Österlund
>                                 <erik.osterlund at oracle.com
>                                 <mailto:erik.osterlund at oracle.com>> wrote:
> 
>                                     Hi JC,
> 
>                                     Hope I am reviewing the right
>                                     version of your work. Here goes...
> 
>                                     src/hotspot/share/gc/shared/collectedHeap.inline.hpp:
> 
>                                        159   
>                                       AllocTracer::send_allocation_outside_tlab(klass, result, size *
>                                     HeapWordSize, THREAD);
>                                        160
>                                        161   
>                                       THREAD->tlab().handle_sample(THREAD, result, size);
>                                        162     return result;
>                                        163   }
> 
>                                     Should not call tlab()->X without
>                                     checking if (UseTLAB) IMO.
> 
>                                 Done!
> 
> 
>                             More about this later.
> 
>                                     src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp:
> 
>                                     So first of all, there seems to
>                                     quite a few ends. There is an "end",
>                                     a "hard
>                                     end", a "slow path end", and an
>                                     "actual end". Moreover, it seems
>                                     like the
>                                     "hard end" is actually further away
>                                     than the "actual end". So the "hard end"
>                                     seems like more of a "really
>                                     definitely actual end" or something.
>                                     I don't
>                                     know about you, but I think it looks
>                                     kind of messy. In particular, I don't
>                                     feel like the name "actual end"
>                                     reflects what it represents,
>                                     especially when
>                                     there is another end that is behind
>                                     the "actual end".
> 
>                                        413 HeapWord*
>                                     ThreadLocalAllocBuffer::hard_end() {
>                                        414   // Did a fast TLAB refill
>                                     occur?
>                                        415   if (_slow_path_end != _end) {
>                                        416     // Fix up the actual end
>                                     to be now the end of this TLAB.
>                                        417     _slow_path_end = _end;
>                                        418     _actual_end = _end;
>                                        419   }
>                                        420
>                                        421   return _actual_end +
>                                     alignment_reserve();
>                                        422 }
> 
>                                     I really do not like making getters
>                                     unexpectedly have these kind of side
>                                     effects. It is not expected that
>                                     when you ask for the "hard end", you
>                                     implicitly update the "slow path
>                                     end" and "actual end" to new values.
> 
>                                 As I said, a lot of this is due to the
>                                 FastTlabRefill. If I make this
>                                 not supporting FastTlabRefill, this goes
>                                 away. The reason the system
>                                 needs to update itself at the get is
>                                 that you only know at that get if
>                                 things have shifted underneath the tlab
>                                 slow path. I am not sure of
>                                 really better names (naming is hard!),
>                                 perhaps we could do these
>                                 names:
> 
>                                 - current_tlab_end       // Either the
>                                 allocated tlab end or a sampling point
>                                 - last_allocation_address  // The end of
>                                 the tlab allocation
>                                 - last_slowpath_allocated_end  // In
>                                 case a fast refill occurred the
>                                 end might have changed, this is to
>                                 remember slow vs fast past refills
> 
>                                 the hard_end method can be renamed to
>                                 something like:
>                                 tlab_end_pointer()        // The end of
>                                 the lab including a bit of
>                                 alignment reserved bytes
> 
>                             Those names sound better to me. Could you
>                             please provide a mapping from the old names
>                             to the new names so I understand which one
>                             is which please?
> 
>                             This is my current guess of what you are
>                             proposing:
> 
>                             end -> current_tlab_end
>                             actual_end -> last_allocation_address
>                             slow_path_end -> last_slowpath_allocated_end
>                             hard_end -> tlab_end_pointer
> 
>                         Yes that is correct, that was what I was proposing.
> 
>                             I would prefer this naming:
> 
>                             end -> slow_path_end // the end for taking a
>                             slow path; either due to sampling or refilling
>                             actual_end -> allocation_end // the end for
>                             allocations
>                             slow_path_end -> last_slow_path_end // last
>                             address for slow_path_end (as opposed to
>                             allocation_end)
>                             hard_end -> reserved_end // the end of the
>                             reserved space of the TLAB
> 
>                             About setting things in the getter... that
>                             still seems like a very unpleasant thing to
>                             me. It would be better to inspect the call
>                             hierarchy and explicitly update the ends
>                             where they need updating, and assert in the
>                             getter that they are in sync, rather than
>                             implicitly setting various ends as a
>                             surprising side effect in a getter. It looks
>                             like the call hierarchy is very small. With
>                             my new naming convention, reserved_end()
>                             would presumably return _allocation_end +
>                             alignment_reserve(), and have an assert
>                             checking that _allocation_end ==
>                             _last_slow_path_allocation_end, complaining
>                             that this invariant must hold, and that a
>                             caller to this function, such as
>                             make_parsable(), must first explicitly
>                             synchronize the ends as required, to honor
>                             that invariant.
> 
> 
>                         I've renamed the variables to how you preferred
>                         it except for the _end one. I did:
> 
>                         current_end
> 
>                         last_allocation_address
> 
>                         tlab_end_ptr
> 
>                         The reason is that the architecture dependent
>                         code use the thread.hpp API and it already has
>                         tlab included into the name so it becomes
>                         tlab_current_end (which is better that
>                         tlab_current_tlab_end in my opinion).
> 
>                         I also moved the update into a separate method
>                         with a TODO that says to remove it when
>                         FastTLABRefill is deprecated
> 
>                     This looks a lot better now. Thanks.
> 
>                     Note that the following comment now needs updating
>                     accordingly in threadLocalAllocBuffer.hpp:
> 
>                        41 //            Heap sampling is performed via
>                     the end/actual_end fields.
> 
>                        42 //            actual_end contains the real end
>                     of the tlab allocation,
> 
>                        43 //            whereas end can be set to an
>                     arbitrary spot in the tlab to
> 
>                        44 //            trip the return and sample the
>                     allocation.
> 
>                        45 //            slow_path_end is used to track
>                     if a fast tlab refill occured
> 
>                        46 //            between slowpath calls.
> 
>                     There might be other comments too, I have not looked
>                     in detail.
> 
>                 This was the only spot that still had an actual_end, I
>                 fixed it now. I'll do a sweep to double check other
>                 comments.
> 
> 
> 
>                                 Not sure it's better but before updating
>                                 the webrev, I wanted to try
>                                 to get input/consensus :)
> 
>                                 (Note hard_end was always further off
>                                 than end).
> 
>                                     src/hotspot/share/prims/jvmti.xml:
> 
>                                     10357       <capabilityfield
>                                     id="can_sample_heap" since="9">
>                                     10358         <description>
>                                     10359           Can sample the heap.
>                                     10360           If this capability
>                                     is enabled then the heap sampling
>                                     methods
>                                     can be called.
>                                     10361         </description>
>                                     10362       </capabilityfield>
> 
>                                     Looks like this capability should
>                                     not be "since 9" if it gets integrated
>                                     now.
> 
>                                 Updated now to 11, crossing my fingers :)
> 
>                                     src/hotspot/share/runtime/heapMonitoring.cpp:
> 
>                                        448       if
>                                     (is_alive->do_object_b(value)) {
>                                        449         // Update the oop to
>                                     point to the new object if it is still
>                                     alive.
>                                        450         f->do_oop(&(trace.obj));
>                                        451
>                                        452         // Copy the old
>                                     trace, if it is still live.
>                                        453       
>                                       _allocated_traces->at_put(curr_pos++, trace);
>                                        454
>                                        455         // Store the live
>                                     trace in a cache, to be served up on
>                                     /heapz.
>                                        456       
>                                       _traces_on_last_full_gc->append(trace);
>                                        457
>                                        458         count++;
>                                        459       } else {
>                                        460         // If the old trace
>                                     is no longer live, add it to the list of
>                                        461         // recently collected
>                                     garbage.
>                                        462       
>                                       store_garbage_trace(trace);
>                                        463       }
> 
>                                     In the case where the oop was not
>                                     live, I would like it to be explicitly
>                                     cleared.
> 
>                                 Done I think how you wanted it. Let me
>                                 know because I'm not familiar
>                                 with the RootAccess API. I'm unclear if
>                                 I'm doing this right or not so
>                                 reviews of these parts are highly
>                                 appreciated. Robbin had talked of
>                                 perhaps later pushing this all into a
>                                 OopStorage, should I do this now
>                                 do you think? Or can that wait a second
>                                 webrev later down the road?
> 
>                             I think using handles can and should be done
>                             later. You can use the Access API now.
>                             I noticed that you are missing an #include
>                             "oops/access.inline.hpp" in your
>                             heapMonitoring.cpp file.
> 
>                         The missing header is there for me so I don't
>                         know, I made sure it is present in the latest
>                         webrev. Sorry about that.
> 
>                                 + Did I clear it the way you wanted me
>                                 to or were you thinking of
>                                 something else?
> 
> 
>                             That is precisely how I wanted it to be
>                             cleared. Thanks.
> 
>                                 + Final question here, seems like if I
>                                 were to want to not do the
>                                 f->do_oop directly on the trace.obj, I'd
>                                 need to do something like:
> 
>                                      f->do_oop(&value);
>                                      ...
>                                      trace->store_oop(value);
> 
>                                 to update the oop internally. Is that
>                                 right/is that one of the
>                                 advantages of going to the Oopstorage
>                                 sooner than later?
> 
> 
>                             I think you really want to do the do_oop on
>                             the root directly. Is there a particular
>                             reason why you would not want to do that?
>                             Otherwise, yes - the benefit with using the
>                             handle approach is that you do not need to
>                             call do_oop explicitly in your code.
> 
>                         There is no reason except that now we have a
>                         load_oop and a get_oop_addr, I was not sure what
>                         you would think of that.
> 
>                     That's fine.
> 
>                                     Also I see a lot of
>                                     concurrent-looking use of the
>                                     following field:
>                                        267   volatile bool _initialized;
> 
>                                     Please note that the "volatile"
>                                     qualifier does not help with reordering
>                                     here. Reordering between volatile
>                                     and non-volatile fields is
>                                     completely free
>                                     for both compiler and hardware,
>                                     except for windows with MSVC, where
>                                     volatile
>                                     semantics is defined to use
>                                     acquire/release semantics, and the
>                                     hardware is
>                                     TSO. But for the general case, I
>                                     would expect this field to be stored
>                                     with
>                                     OrderAccess::release_store and
>                                     loaded with OrderAccess::load_acquire.
>                                     Otherwise it is not thread safe.
> 
>                                 Because everything is behind a mutex, I
>                                 wasn't really worried about
>                                 this. I have a test that has multiple
>                                 threads trying to hit this
>                                 corner case and it passes.
> 
>                                 However, to be paranoid, I updated it to
>                                 using the OrderAccess API
>                                 now, thanks! Let me know what you think
>                                 there too!
> 
> 
>                             If it is indeed always supposed to be read
>                             and written under a mutex, then I would
>                             strongly prefer to have it accessed as a
>                             normal non-volatile member, and have an
>                             assertion that given lock is held or we are
>                             in a safepoint, as we do in many other
>                             places. Something like this:
> 
>                             assert(HeapMonitorStorage_lock->owned_by_self()
>                             || (SafepointSynchronize::is_at_safepoint()
>                             && Thread::current()->is_VM_thread()), "this
>                             should not be accessed concurrently");
> 
>                             It would be confusing to people reading the
>                             code if there are uses of OrderAccess that
>                             are actually always protected under a mutex.
> 
>                         Thank you for the exact example to be put in the
>                         code! I put it around each access/assignment of
>                         the _initialized method and found one case where
>                         yes you can touch it and not have the lock. It
>                         actually is "ok" because you don't act on the
>                         storage until later and only when you really
>                         want to modify the storage (see the
>                         object_alloc_do_sample method which calls the
>                         add_trace method).
> 
>                         But, because of this, I'm going to put the
>                         OrderAccess here, I'll do some performance
>                         numbers later and if there are issues, I might
>                         add a "unsafe" read and a "safe" one to make it
>                         explicit to the reader. But I don't think it
>                         will come to that.
> 
> 
>                     Okay. This double return in heapMonitoring.cpp looks
>                     wrong:
> 
>                       283   bool initialized() {
>                       284     return
>                     OrderAccess::load_acquire(&_initialized) != 0;
>                       285     return _initialized;
>                       286   }
> 
>                     Since you said object_alloc_do_sample() is the only
>                     place where you do not hold the mutex while reading
>                     initialized(), I had a closer look at that. It looks
>                     like in its current shape, the lack of a mutex may
>                     lead to a memory leak. In particular, it first
>                     checks if (initialized()). Let's assume this is now
>                     true. It then allocates a bunch of stuff, and checks
>                     if the number of frames were over 0. If they were,
>                     it calls StackTraceStorage::storage()->add_trace()
>                     seemingly hoping that after grabbing the lock in
>                     there, initialized() will still return true. But it
>                     could now return false and skip doing anything, in
>                     which case the allocated stuff will never be freed.
> 
>                 I fixed this now by making add_trace return a boolean
>                 and checking for that. It will be in the next webrev.
>                 Thanks, the truth is that in our implementation the
>                 system is always on or off, so this never really occurs
>                 :). In this version though, that is not true and it's
>                 important to handle so thanks again!
> 
> 
>                     So the analysis seems to be that _initialized is
>                     only used outside of the mutex in once instance,
>                     where it is used to perform double-checked locking,
>                     that actually causes a memory leak.
> 
>                     I am not proposing how to fix that, just raising the
>                     issue. If you still want to perform this
>                     double-checked locking somehow, then the use of
>                     acquire/release still seems odd. Because the memory
>                     ordering restrictions of it never comes into play in
>                     this particular case. If it ever did, then the use
>                     of destroy_stuff(); release_store(_initialized, 0)
>                     would be broken anyway as that would imply that
>                     whatever concurrent reader there ever was would
>                     after reading _initialized with load_acquire() could
>                     *never* read the data that is concurrently destroyed
>                     anyway. I would be biased to think that
>                     RawAccess<MO_RELAXED>::load/store looks like a more
>                     appropriate solution, given that the memory leak
>                     issue is resolved. I do not know how painful it
>                     would be to not perform this double-checked locking.
> 
>                 So I agree with this entirely. I looked also a bit more
>                 and the difference and code really stems from our
>                 internal version. In this version however, there are
>                 actually a lot of things going on that I did not go
>                 entirely through in my head but this comment made me
>                 ponder a bit more on it.
> 
>                 Since every object_alloc_do_sample is protected by a
>                 check to HeapMonitoring::enabled(), there is only a
>                 small chance that the call is happening when things have
>                 been disabled. So there is no real need to do a first
>                 check on the initialized, it is a rare occurence that a
>                 call happens to object_alloc_do_sample and the
>                 initialized of the storage returns false.
> 
>                 (By the way, even if you did call object_alloc_do_sample
>                 without looking at HeapMonitoring::enabled(), that would
>                 be ok too. You would gather the stacktrace and get
>                 nowhere at the add_trace call, which would return false;
>                 so though not optimal performance wise, nothing would
>                 break).
> 
>                 Furthermore, the add_trace is really the moment of no
>                 return and we have the mutex lock and then the
>                 initialized check. So, in the end, I did two things: I
>                 removed that first check and then I removed the
>                 OrderAccess for the storage initialized. I think now I
>                 have a better grasp and understanding why it was done in
>                 our code and why it is not needed here. Thanks for
>                 pointing it out :). This now still passes my JTREG
>                 tests, especially the threaded one.
> 
> 
> 
>                                     As a kind of meta comment, I wonder
>                                     if it would make sense to add sampling
>                                     for non-TLAB allocations. Seems like
>                                     if someone is rapidly allocating a
>                                     whole bunch of 1 MB objects that
>                                     never fit in a TLAB, I might still be
>                                     interested in seeing that in my
>                                     traces, and not get surprised that the
>                                     allocation rate is very high yet not
>                                     showing up in any profiles.
> 
>                                 That is handled by the handle_sample
>                                 where you wanted me to put a
>                                 UseTlab because you hit that case if the
>                                 allocation is too big.
> 
> 
>                             I see. It was not obvious to me that
>                             non-TLAB sampling is done in the TLAB class.
>                             That seems like an abstraction crime.
>                             What I wanted in my previous comment was
>                             that we do not call into the TLAB when we
>                             are not using TLABs. If there is sampling
>                             logic in the TLAB that is used for something
>                             else than TLABs, then it seems like that
>                             logic simply does not belong inside of the
>                             TLAB. It should be moved out of the TLAB,
>                             and instead have the TLAB call this common
>                             abstraction that makes sense.
> 
>                         So in the incremental version:
> 
>                         http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/
>                         <http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.07_08/>,
>                         this is still a "crime". The reason is that the
>                         system has to have the bytes_until_sample on a
>                         per-thread level and it made "sense" to have it
>                         with the TLAB implementation. Also, I was not
>                         sure how people felt about adding something to
>                         the thread instance instead.
> 
>                         Do you think it fits better at the Thread level?
>                         I can see how difficult it is to make it happen
>                         there and add some logic there. Let me know what
>                         you think.
> 
> 
>                     We have an unfortunate situation where everyone that
>                     has some fields that are thread local tend to dump
>                     them right into Thread, making the size and
>                     complexity of Thread grow as it becomes tightly
>                     coupled with various unrelated subsystems. It would
>                     be desirable to have a separate class for this
>                     instead that encapsulates the sampling logic. That
>                     class could possibly reside in Thread though as a
>                     value object of Thread.
> 
>                 I imagined that would be the case but was not sure. I
>                 will look at the example that Robbin is talking about
>                 (ThreadSMR) and will see how to refactor my code to use
>                 that.
> 
>                 Thanks again for your help,
> 
>                 Jc
> 
> 
> 
>                             Hope I have answered your questions and that
>                             my feedback makes sense to you.
> 
>                         You have and thank you for them, I think we are
>                         getting to a cleaner implementation and things
>                         are getting better and more readable :)
> 
> 
>                     Yes it is getting better.
> 
>                     Thanks,
>                     /Erik
> 
> 
> 
>                         Thanks for your help!
> 
>                         Jc
> 
>                             Thanks,
>                             /Erik
> 
>                                 I double checked by changing the test
>                                 http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatObjectCorrectnessTest.java
>                                 <http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.05a/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatObjectCorrectnessTest.java>
> 
>                                 to use a smaller Tlab (2048) and made
>                                 the object bigger and it goes
>                                 through that and passes.
> 
>                                 Thanks again for your review and I look
>                                 forward to your pointers for
>                                 the questions I now have raised!
>                                 Jc
> 
> 
> 
> 
> 
> 
> 
>                                     Thanks,
>                                     /Erik
> 
> 
>                                     On 2018-01-26 06:45, JC Beyler wrote:
> 
>                                         Thanks Robbin for the reviews :)
> 
>                                         The new full webrev is here:
>                                         http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.03/
>                                         <http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.03/>
>                                         The incremental webrev is here:
>                                         http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.02_03/
>                                         <http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.02_03/>
> 
>                                         I inlined my answers:
> 
>                                         On Thu, Jan 25, 2018 at 1:15 AM,
>                                         Robbin Ehn
>                                         <robbin.ehn at oracle.com
>                                         <mailto:robbin.ehn at oracle.com>>
>                                         wrote:
> 
>                                             Hi JC, great to see another
>                                             revision!
> 
>                                             ####
>                                             heapMonitoring.cpp
> 
>                                             StackTraceData should not
>                                             contain the oop for 'safety'
>                                             reasons.
>                                             When StackTraceData is moved
>                                             from _allocated_traces:
>                                             L452 store_garbage_trace(trace);
>                                             it contains a dead oop.
>                                             _allocated_traces could
>                                             instead be a tupel of oop
>                                             and StackTraceData thus
>                                             dead oops are not kept.
> 
>                                         Done I used inheritance to make
>                                         the copier work regardless but the
>                                         idea is the same.
> 
>                                             You should use the new
>                                             Access API for loading the
>                                             oop, something like
>                                             this:
>                                             RootAccess<ON_PHANTOM_OOP_REF |
>                                             AS_NO_KEEPALIVE>::load(...)
>                                             I don't think you need to
>                                             use Access API for clearing
>                                             the oop, but it
>                                             would
>                                             look nicer. And you
>                                             shouldn't probably be using:
>                                             Universe::heap()->is_in_reserved(value)
> 
>                                         I am unfamiliar with this but I
>                                         think I did do it like you wanted me
>                                         to (all tests pass so that's a
>                                         start). I'm not sure how to
>                                         clear the
>                                         oop exactly, is there somewhere
>                                         that does that, which I can use
>                                         to do
>                                         the same?
> 
>                                         I removed the is_in_reserved,
>                                         this came from our internal
>                                         version, I
>                                         don't know why it was there but
>                                         my tests work without so I
>                                         removed it
>                                         :)
> 
>                                             The lock:
>                                             L424   MutexLocker
>                                             mu(HeapMonitorStorage_lock);
>                                             Is not needed as far as I
>                                             can see.
>                                             weak_oops_do is called in a
>                                             safepoint, no TLAB
>                                             allocation can happen and
>                                             JVMTI thread can't access
>                                             these data-structures. Is
>                                             there something more
>                                             to
>                                             this lock that I'm missing?
> 
>                                         Since a thread can call the
>                                         JVMTI getLiveTraces (or any of
>                                         the other
>                                         ones), it can get to the point
>                                         of trying to copying the
>                                         _allocated_traces. I imagine it
>                                         is possible that this is happening
>                                         during a GC or that it can be
>                                         started and a GC happens afterwards.
>                                         Therefore, it seems to me that
>                                         you want this protected, no?
> 
>                                             ####
>                                             You have 6 files without any
>                                             changes in them (any more):
>                                             g1CollectedHeap.cpp
>                                             psMarkSweep.cpp
>                                             psParallelCompact.cpp
>                                             genCollectedHeap.cpp
>                                             referenceProcessor.cpp
>                                             thread.hpp
> 
>                                         Done.
> 
>                                             ####
>                                             I have not looked closely,
>                                             but is it possible to hide
>                                             heap sampling in
>                                             AllocTracer ? (with some
>                                             minor changes to the
>                                             AllocTracer API)
> 
>                                         I am imagining that you are
>                                         saying to move the code that
>                                         does the
>                                         sampling code (change the tlab
>                                         end, do the call to HeapMonitoring,
>                                         etc.) into the AllocTracer code
>                                         itself? I think that is right
>                                         and I'll
>                                         look if that is possible and
>                                         prepare a webrev to show what
>                                         would be
>                                         needed to make that happen.
> 
>                                             ####
>                                             Minor nit, when declaring
>                                             pointer there is a little
>                                             mix of having the
>                                             pointer adjacent by type
>                                             name and data name. (Most
>                                             hotspot code is by
>                                             type
>                                             name)
>                                             E.g.
>                                             heapMonitoring.cpp:711   
>                                               jvmtiStackTrace *trace = ....
>                                             heapMonitoring.cpp:733     
>                                                 Method* m = vfst.method();
>                                             (not just this file)
> 
>                                         Done!
> 
>                                             ####
>                                             HeapMonitorThreadOnOffTest.java:77
>                                             I would make g_tmp volatile,
>                                             otherwise the assignment in
>                                             loop may
>                                             theoretical be skipped.
> 
>                                         Also done!
> 
>                                         Thanks again!
>                                         Jc
>