JDK-8171119: Low-Overhead Heap Profiling

JC Beyler jcbeyler at google.com
Thu Apr 5 18:15:56 UTC 2018


Thanks Boris and Derek for testing it.

Yes, I was trying to get a new version out with the tests ported as well,
but I got sidetracked adding tests and two new features.

Here is the incremental webrev:

Here is the full webrev:
http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.11/

Basically, the new tests assert the following:
  - Only one agent can currently request the sampling. I'm looking into
whether multi-agent support can be pushed to a later webrev so that this
one can be code frozen.
  - The event is not thread-enabled, meaning that, like the
VMObjectAllocationEvent, it is an all-or-nothing event. As with the
multi-agent support, I'll see whether adding it in a future webrev is
the better way to freeze this one.
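
For reference, a minimal agent that turns the sampling on would look
roughly like the sketch below. It is written against this webrev's
proposed can_sample_heap capability; the event constant name and the
SetHeapSampling call mentioned in the comment are illustrative and may
not match the final code:

    #include <cstring>
    #include <jvmti.h>

    static jvmtiEnv* jvmti = NULL;

    extern "C" JNIEXPORT jint JNICALL
    Agent_OnLoad(JavaVM* vm, char* options, void* reserved) {
      if (vm->GetEnv(reinterpret_cast<void**>(&jvmti), JVMTI_VERSION) != JNI_OK) {
        return JNI_ERR;
      }

      // Request the heap sampling capability proposed in this webrev.
      jvmtiCapabilities caps;
      memset(&caps, 0, sizeof(caps));
      caps.can_sample_heap = 1;
      if (jvmti->AddCapabilities(&caps) != JVMTI_ERROR_NONE) {
        return JNI_ERR;
      }

      // With the 512k default rate, enabling the event is all that is
      // needed; SetHeapSampling would only be called to change the rate.
      jvmti->SetEventNotificationMode(JVMTI_ENABLE,
                                      JVMTI_EVENT_SAMPLED_OBJECT_ALLOC,
                                      NULL);
      return JNI_OK;
    }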

There was one other item I added here, and I'm not sure this webrev is
stable in debug mode: I added an assertion system to ascertain that all
paths leading to a TLAB slow path (and hence a sampling point) have a
sampling collector ready to post the event if a user wants it. This
might break a few things in debug mode while I work through the kinks.
However, in release mode, this new webrev passes all the tests in
hotspot/jtreg/serviceability/jvmti/HeapMonitor.
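
The shape of that assertion system, roughly (the class and accessor
names here are illustrative, not necessarily the ones in the webrev):

    // Stack-allocated in every path that can reach the TLAB slow path,
    // so a sample taken there always has somewhere to be posted.
    class SampledAllocationCollector : public StackObj {
     public:
      SampledAllocationCollector() {
        Thread::current()->set_sampling_collector(this);
      }
      ~SampledAllocationCollector() {
        Thread::current()->set_sampling_collector(NULL);
      }
    };

    // In the TLAB slow path, at the sampling point:
    assert(Thread::current()->sampling_collector() != NULL,
           "sampling point reached without a collector on the stack");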

Let me know what you think,
Jc

On Thu, Apr 5, 2018 at 4:56 AM Boris Ulasevich <boris.ulasevich at bell-sw.com>
wrote:

> Hi JC,
>
>    I have just checked on arm32: your patch compiles and runs ok.
>
>    As far as I can see, the jtreg agentlib name "-agentlib:HeapMonitor"
> does not correspond to the actual library name: libHeapMonitorTest.c ->
> libHeapMonitorTest.so
>
> Boris
>
> On 04.04.2018 01:54, White, Derek wrote:
> > Thanks JC,
> >
> > New patch applies cleanly. Compiles and runs (simple test programs) on
> > aarch64.
> >
> >   - Derek
> >
> > *From:* JC Beyler [mailto:jcbeyler at google.com]
> > *Sent:* Monday, April 02, 2018 1:17 PM
> > *To:* White, Derek <Derek.White at cavium.com>
> > *Cc:* Erik Österlund <erik.osterlund at oracle.com>;
> > serviceability-dev at openjdk.java.net; hotspot-compiler-dev
> > <hotspot-compiler-dev at openjdk.java.net>
> > *Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling
> >
> > Hi Derek,
> >
> > I know there were a few things that went in that provoked a merge
> > conflict. I worked on it and got it up to date. Sadly my lack of
> > knowledge makes it a full rebase instead of keeping all the history.
> > However, with a newly cloned jdk/hs you should now be able to use:
> >
> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/
> >
> > The change you are referring to was done with the others, so perhaps you
> > were unlucky and I forgot it in one webrev and fixed it in another? I
> > don't know, but it's been there; I checked, and it is here:
> >
> >
> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp.udiff.html
> >
> > I double checked that tlab_end_offset no longer appears in any
> > architecture (as far as I can tell :)).
> >
> > Thanks for testing and let me know if you run into any other issues!
> >
> > Jc
> >
> > On Fri, Mar 30, 2018 at 4:24 PM White, Derek <Derek.White at cavium.com
> > <mailto:Derek.White at cavium.com>> wrote:
> >
> >     Hi Jc,
> >
> >     I’ve been having trouble getting your patch to apply correctly. I
> >     may have based it on the wrong version.
> >
> >     In any case, I think there’s a missing update to
> >     macroAssembler_aarch64.cpp, in MacroAssembler::tlab_allocate(),
> >     where “JavaThread::tlab_end_offset()” should become
> >     “JavaThread::tlab_current_end_offset()”.
> >
> >     This should correspond to the other port’s changes in
> >     templateTable_<cpu>.cpp files.
> >
> >     Thanks!
> >     - Derek
> >
> >     *From:* hotspot-compiler-dev
> >     [mailto:hotspot-compiler-dev-bounces at openjdk.java.net
> >     <mailto:hotspot-compiler-dev-bounces at openjdk.java.net>] *On Behalf
> >     Of *JC Beyler
> >     *Sent:* Wednesday, March 28, 2018 11:43 AM
> >     *To:* Erik Österlund <erik.osterlund at oracle.com
> >     <mailto:erik.osterlund at oracle.com>>
> >     *Cc:* serviceability-dev at openjdk.java.net
> >     <mailto:serviceability-dev at openjdk.java.net>; hotspot-compiler-dev
> >     <hotspot-compiler-dev at openjdk.java.net
> >     <mailto:hotspot-compiler-dev at openjdk.java.net>>
> >     *Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling
> >
> >     Hi all,
> >
> >     I've been working on deflaking the tests mostly and the wording in
> >     the JVMTI spec.
> >
> >     Here are the two incremental webrevs:
> >
> >     http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.5_6/
> >
> >     http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.06_07/
> >
> >     Here is the total webrev:
> >
> >     http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.07/
> >
> >     Here are the notes of this change:
> >
> >        - Currently the tests pass 100 times in a row; I am working on
> >     checking whether they pass 1000 times in a row.
> >
> >        - The default sampling rate is set to 512k; this is what we use
> >     internally, and having a default means that to enable the sampling
> >     with the default, the user only has to do an enable event/disable
> >     event via JVMTI (instead of enable + set sample rate).
> >
> >        - I deprecated the code that was handling the fast path tlab
> >     refill, since FastTLABRefill itself is now deprecated
> >
> >            - However, I saw that Graal is still using it, so I have to
> >     see what exactly needs to be done there
> >
> >     Finally, using the Dacapo benchmark suite, I noted a 1% overhead
> >     when the event system is turned on and the callback to the native
> >     agent is just empty. I got a 3% overhead with a 512k sampling rate
> >     and the code I put in the native side of my tests.
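> >
> >     (As an aside, for anyone wondering how an "average of 512k" is
> >     typically implemented: the next sampling point is drawn from an
> >     exponential distribution whose mean is the chosen interval, so the
> >     samples stay unbiased with respect to allocation patterns. A
> >     generic sketch, not the webrev's exact code:
> >
> >         #include <cmath>
> >         #include <cstdlib>
> >
> >         // Bytes to allocate before taking the next sample; the mean
> >         // of the distribution is the sampling interval (512 KB here).
> >         static size_t pick_next_sample(size_t mean_bytes = 512 * 1024) {
> >           double u = (std::rand() + 1.0) / (RAND_MAX + 2.0);  // (0, 1]
> >           return static_cast<size_t>(-std::log(u) * mean_bytes);
> >         }
> >
> >     In the webrev this count is kept per thread, as the
> >     bytes_until_sample discussed further down, and the sample itself
> >     is taken through the TLAB slow path.)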
> >
> >     Thanks and comments are appreciated,
> >
> >     Jc
> >
> >     On Mon, Mar 19, 2018 at 2:06 PM JC Beyler <jcbeyler at google.com
> >     <mailto:jcbeyler at google.com>> wrote:
> >
> >         Hi all,
> >
> >         The incremental webrev update is here:
> >
> >         http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event4_5/
> >
> >         The full webrev is here:
> >
> >         http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/
> >
> >         Major change here is:
> >
> >            - I've removed the heapMonitoring.cpp code in favor of just
> >         having the sampling events as per Serguei's request; I still
> >         have to do some overhead measurements but the tests prove the
> >         concept can work
> >
> >                 - Most of the tlab code is unchanged; the only major
> >         change is that things now get sent off to event collectors when
> >         used and enabled.
> >
> >            - Added the interpreter collectors to handle interpreter
> >         execution
> >
> >            - Updated the name from SetTlabHeapSampling to
> >         SetHeapSampling to be more generic
> >
> >            - Added a mutex for the thread sampling so that we can
> >         initialize an internal static array safely
> >
> >            - Ported the tests from the old system to this new one
> >
> >         I've also updated the JEP and CSR to reflect these changes:
> >
> >         https://bugs.openjdk.java.net/browse/JDK-8194905
> >
> >         https://bugs.openjdk.java.net/browse/JDK-8171119
> >
> >         In order to make this have some forward progress, I've removed
> >         the heap sampling code entirely and now rely entirely on the
> >         event sampling system. The tests reflect this by using a
> >         simplified implementation of what an agent could do:
> >
> >
> >         http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/libHeapMonitor.c
> >
> >         (Search for anything mentioning event_storage).
> >
> >         I have not taken the time to port the whole code we had
> >         originally in heapMonitoring to this. I hesitate only because
> >         that code was in C++; I'd have to port it to C, and since this
> >         is for tests, perhaps what I have now is good enough?
> >
> >         As far as testing goes, I've ported all the relevant tests and
> >         then added a few:
> >
> >             - Turning the system on/off
> >
> >             - Testing using various GCs
> >
> >             - Testing using the interpreter
> >
> >             - Testing the sampling rate
> >
> >             - Testing with objects and arrays
> >
> >             - Testing with various threads
> >
> >         Finally, as far as overhead goes, I have the numbers for the
> >         system off vs a clean build and I see 0% overhead, which is
> >         what we'd want. This was using the Dacapo benchmarks. I am now
> >         preparing to run a version with the events on using Dacapo and
> >         will report back here.
> >
> >         Any comments are welcome :)
> >
> >         Jc
> >
> >         On Thu, Mar 8, 2018 at 4:00 PM JC Beyler <jcbeyler at google.com
> >         <mailto:jcbeyler at google.com>> wrote:
> >
> >             Hi all,
> >
> >             I apologize for the delay, but I wanted to add an event
> >             system, which took a bit longer than expected, and I also
> >             reworked the code to take into account the deprecation of
> >             FastTLABRefill.
> >
> >             This update has four parts:
> >
> >             A) I moved the implementation from Thread to
> >             ThreadHeapSampler inside of Thread. Would you prefer it as a
> >             pointer inside of Thread, or does this work for you? The
> >             second question is whether you would rather have an
> >             association outside of Thread altogether that tracks which
> >             threads are live, so that we would have something like:
> >
> >             ThreadHeapSampler::get_sampling_size(this_thread);
> >
> >             I worry about the overhead of this, but perhaps it is not
> >             too bad?
> >
> >             B) I have also been working on the Allocation event system
> >             that sends out a notification at each sampled event. This
> >             will be practical when wanting to do something at the
> >             allocation point. I'm also looking at whether the whole
> >             heapMonitoring code could reside in the agent code rather
> >             than in the JDK. I'm not convinced, but I'm talking to
> >             Serguei about it to see/assess :)
> >
> >                 - Also added two tests for the new event subsystem
> >
> >             C) Removed the slow_path fields inside the TLAB code since
> >             now FastTLABRefill is deprecated
> >
> >             D) Updated the JVMTI documentation and specification for the
> >             methods.
> >
> >             So the incremental webrev is here:
> >
> >             http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.09_10/
> >
> >             and the full webrev is here:
> >
> >             http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.10
> >
> >             I believe I have updated the various JIRA issues that track
> >             this :)
> >
> >             Thanks for your input,
> >
> >             Jc
> >
> >             On Wed, Feb 14, 2018 at 10:34 PM, JC Beyler
> >             <jcbeyler at google.com <mailto:jcbeyler at google.com>> wrote:
> >
> >                 Hi Erik,
> >
> >                 I inlined my answers; the last one seems to answer
> >                 Robbin's concern about the same thing (adding things to
> >                 Thread).
> >
> >                 On Wed, Feb 14, 2018 at 2:51 AM, Erik Österlund
> >                 <erik.osterlund at oracle.com
> >                 <mailto:erik.osterlund at oracle.com>> wrote:
> >
> >                     Hi JC,
> >
> >                     Comments are inlined below.
> >
> >                     On 2018-02-13 06:18, JC Beyler wrote:
> >
> >                         Hi Erik,
> >
> >                         Thanks for your answers, I've now inlined my own
> >                         answers/comments.
> >
> >                         I've done a new webrev here:
> >
> >
> >                         http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.08/
> >
> >                         The incremental is here:
> >
> >
> >                         http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/
> >
> >                         Note to all:
> >
> >                            - I've been integrating changes from
> >                         Erik's/Serguei's/David's comments, so this
> >                         webrev incremental is a bit of an answer to all
> >                         comments in one. I apologize for that :)
> >
> >                         On Mon, Feb 12, 2018 at 6:05 AM, Erik Österlund
> >                         <erik.osterlund at oracle.com
> >                         <mailto:erik.osterlund at oracle.com>> wrote:
> >
> >                             Hi JC,
> >
> >                             Sorry for the delayed reply.
> >
> >                             Inlined answers:
> >
> >
> >
> >                             On 2018-02-06 00:04, JC Beyler wrote:
> >
> >                                 Hi Erik,
> >
> >                                 (Renaming this to be folded into the
> >                                 newly renamed thread :))
> >
> >                                 First off, thanks a lot for reviewing
> >                                 the webrev! I appreciate it!
> >
> >                                 I updated the webrev to:
> >
> >                                 http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/
> >
> >                                 And the incremental one is here:
> >
> >                                 http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.04_05a/
> >
> >                                 It contains:
> >                                 - The change of the "since" value from
> >                                 9 to 11 in the jvmti.xml
> >                                 - The use of OrderAccess for initialized
> >                                 - Clearing the oop
> >
> >                                 I also have inlined my answers to your
> >                                 comments. The biggest question
> >                                 will come from the multiple *_end
> >                                 variables. A bit of the logic there
> >                                 is due to handling the slow path refill
> >                                 vs fast path refill and
> >                                 checking that the rug was not pulled
> >                                 underneath the slowpath. I
> >                                 believe that a previous comment was that
> >                                 TlabFastRefill was going to
> >                                 be deprecated.
> >
> >                                 If this is true, we could revert this
> >                                 code a bit and just do: if
> >                                 TlabFastRefill is enabled, disable this.
> >                                 And then deprecate that when
> >                                 TlabFastRefill is deprecated.
> >
> >                                 This might simplify this webrev and I
> >                                 can work on a follow-up that
> >                                 either: removes TlabFastRefill if Robbin
> >                                 does not have the time to do
> >                                 it or add the support to the assembly
> >                                 side to handle this correctly.
> >                                 What do you think?
> >
> >                             I support removing TlabFastRefill, but I
> >                             think it is good to not depend on that
> >                             happening first.
> >
> >
> >                         I'm slowly pushing on the FastTLABRefill
> >                         (https://bugs.openjdk.java.net/browse/JDK-8194084),
> >                         I agree on keeping both separate for now though
> >                         so that we can think of both differently
> >
> >                                 Now, below, inlined are my answers:
> >
> >                                 On Fri, Feb 2, 2018 at 8:44 AM, Erik
> >                                 Österlund
> >                                 <erik.osterlund at oracle.com
> >                                 <mailto:erik.osterlund at oracle.com>>
> >                                 wrote:
> >
> >                                     Hi JC,
> >
> >                                     Hope I am reviewing the right
> >                                     version of your work. Here goes...
> >
> >
> >                                     src/hotspot/share/gc/shared/collectedHeap.inline.hpp:
> >
> >                                        159     AllocTracer::send_allocation_outside_tlab(klass, result, size * HeapWordSize, THREAD);
> >                                        160
> >                                        161     THREAD->tlab().handle_sample(THREAD, result, size);
> >                                        162     return result;
> >                                        163   }
> >
> >                                     Should not call tlab()->X without
> >                                     checking if (UseTLAB) IMO.
> >
> >                                 Done!
> >
> >
> >                             More about this later.
> >
> >
> >                                     src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp:
> >
> >                                     So first of all, there seem to be
> >                                     quite a few ends. There is an "end",
> >                                     a "hard
> >                                     end", a "slow path end", and an
> >                                     "actual end". Moreover, it seems
> >                                     like the
> >                                     "hard end" is actually further away
> >                                     than the "actual end". So the "hard
> >                                     end" seems like more of a "really
> >                                     definitely actual end" or something.
> >                                     I don't
> >                                     know about you, but I think it looks
> >                                     kind of messy. In particular, I don't
> >                                     feel like the name "actual end"
> >                                     reflects what it represents,
> >                                     especially when
> >                                     there is another end that is behind
> >                                     the "actual end".
> >
> >                                        413 HeapWord* ThreadLocalAllocBuffer::hard_end() {
> >                                        414   // Did a fast TLAB refill occur?
> >                                        415   if (_slow_path_end != _end) {
> >                                        416     // Fix up the actual end to be now the end of this TLAB.
> >                                        417     _slow_path_end = _end;
> >                                        418     _actual_end = _end;
> >                                        419   }
> >                                        420
> >                                        421   return _actual_end + alignment_reserve();
> >                                        422 }
> >
> >                                     I really do not like making getters
> >                                     unexpectedly have these kind of side
> >                                     effects. It is not expected that
> >                                     when you ask for the "hard end", you
> >                                     implicitly update the "slow path
> >                                     end" and "actual end" to new values.
> >
> >                                 As I said, a lot of this is due to the
> >                                 FastTlabRefill. If I make this
> >                                 not supporting FastTlabRefill, this goes
> >                                 away. The reason the system
> >                                 needs to update itself at the get is
> >                                 that you only know at that get if
> >                                 things have shifted underneath the tlab
> >                                 slow path. I am not sure of
> >                                 really better names (naming is hard!),
> >                                 perhaps we could do these
> >                                 names:
> >
> >                                 - current_tlab_end       // Either the
> >                                 allocated tlab end or a sampling point
> >                                 - last_allocation_address  // The end of
> >                                 the tlab allocation
> >                                 - last_slowpath_allocated_end  // In
> >                                 case a fast refill occurred the
> >                                 end might have changed, this is to
> >                                 remember slow vs fast past refills
> >
> >                                 the hard_end method can be renamed to
> >                                 something like:
> >                                 tlab_end_pointer()        // The end of
> >                                 the lab including a bit of
> >                                 alignment reserved bytes
> >
> >                             Those names sound better to me. Could you
> >                             please provide a mapping from the old names
> >                             to the new names so I understand which one
> >                             is which please?
> >
> >                             This is my current guess of what you are
> >                             proposing:
> >
> >                             end -> current_tlab_end
> >                             actual_end -> last_allocation_address
> >                             slow_path_end -> last_slowpath_allocated_end
> >                             hard_end -> tlab_end_pointer
> >
> >                         Yes that is correct, that was what I was
> >                         proposing.
> >
> >                             I would prefer this naming:
> >
> >                             end -> slow_path_end // the end for taking a
> >                             slow path; either due to sampling or
> >                             refilling
> >                             actual_end -> allocation_end // the end for
> >                             allocations
> >                             slow_path_end -> last_slow_path_end // last
> >                             address for slow_path_end (as opposed to
> >                             allocation_end)
> >                             hard_end -> reserved_end // the end of the
> >                             reserved space of the TLAB
> >
> >                             About setting things in the getter... that
> >                             still seems like a very unpleasant thing to
> >                             me. It would be better to inspect the call
> >                             hierarchy and explicitly update the ends
> >                             where they need updating, and assert in the
> >                             getter that they are in sync, rather than
> >                             implicitly setting various ends as a
> >                             surprising side effect in a getter. It looks
> >                             like the call hierarchy is very small. With
> >                             my new naming convention, reserved_end()
> >                             would presumably return _allocation_end +
> >                             alignment_reserve(), and have an assert
> >                             checking that _allocation_end ==
> >                             _last_slow_path_allocation_end, complaining
> >                             that this invariant must hold, and that a
> >                             caller to this function, such as
> >                             make_parsable(), must first explicitly
> >                             synchronize the ends as required, to honor
> >                             that invariant.
> >
> >
> >                         I've renamed the variables to how you preferred
> >                         it except for the _end one. I did:
> >
> >                         current_end
> >
> >                         last_allocation_address
> >
> >                         tlab_end_ptr
> >
> >                         The reason is that the architecture dependent
> >                         code uses the thread.hpp API and it already has
> >                         tlab included into the name so it becomes
> >                         tlab_current_end (which is better that
> >                         tlab_current_tlab_end in my opinion).
> >
> >                         I also moved the update into a separate method
> >                         with a TODO that says to remove it when
> >                         FastTLABRefill is deprecated
> >
> >                     This looks a lot better now. Thanks.
> >
> >                     Note that the following comment now needs updating
> >                     accordingly in threadLocalAllocBuffer.hpp:
> >
> >                        41 //            Heap sampling is performed via the end/actual_end fields.
> >                        42 //            actual_end contains the real end of the tlab allocation,
> >                        43 //            whereas end can be set to an arbitrary spot in the tlab to
> >                        44 //            trip the return and sample the allocation.
> >                        45 //            slow_path_end is used to track if a fast tlab refill occured
> >                        46 //            between slowpath calls.
> >
> >                     There might be other comments too, I have not looked
> >                     in detail.
> >
> >                 This was the only spot that still had an actual_end, I
> >                 fixed it now. I'll do a sweep to double check other
> >                 comments.
> >
> >
> >
> >                                 Not sure it's better but before updating
> >                                 the webrev, I wanted to try
> >                                 to get input/consensus :)
> >
> >                                 (Note hard_end was always further off
> >                                 than end).
> >
> >                                     src/hotspot/share/prims/jvmti.xml:
> >
> >                                     10357       <capabilityfield id="can_sample_heap" since="9">
> >                                     10358         <description>
> >                                     10359           Can sample the heap.
> >                                     10360           If this capability is enabled then the heap sampling methods can be called.
> >                                     10361         </description>
> >                                     10362       </capabilityfield>
> >
> >                                     Looks like this capability should
> >                                     not be "since 9" if it gets
> >                                     integrated
> >                                     now.
> >
> >                                 Updated now to 11, crossing my fingers :)
> >
> >
> >                                     src/hotspot/share/runtime/heapMonitoring.cpp:
> >
> >                                        448       if (is_alive->do_object_b(value)) {
> >                                        449         // Update the oop to point to the new object if it is still alive.
> >                                        450         f->do_oop(&(trace.obj));
> >                                        451
> >                                        452         // Copy the old trace, if it is still live.
> >                                        453         _allocated_traces->at_put(curr_pos++, trace);
> >                                        454
> >                                        455         // Store the live trace in a cache, to be served up on /heapz.
> >                                        456         _traces_on_last_full_gc->append(trace);
> >                                        457
> >                                        458         count++;
> >                                        459       } else {
> >                                        460         // If the old trace is no longer live, add it to the list of
> >                                        461         // recently collected garbage.
> >                                        462         store_garbage_trace(trace);
> >                                        463       }
> >
> >                                     In the case where the oop was not
> >                                     live, I would like it to be
> >                                     explicitly
> >                                     cleared.
> >
> >                                 Done, I think, the way you wanted it.
> >                                 Let me know, because I'm not familiar
> >                                 with the RootAccess API. I'm unclear
> >                                 whether I'm doing this right or not, so
> >                                 reviews of these parts are highly
> >                                 appreciated. Robbin had talked of
> >                                 perhaps later pushing this all into an
> >                                 OopStorage; should I do this now, do
> >                                 you think? Or can that wait for a
> >                                 second webrev later down the road?
> >
> >                             I think using handles can and should be done
> >                             later. You can use the Access API now.
> >                             I noticed that you are missing an #include
> >                             "oops/access.inline.hpp" in your
> >                             heapMonitoring.cpp file.
> >
> >                         The missing header is there for me so I don't
> >                         know, I made sure it is present in the latest
> >                         webrev. Sorry about that.
> >
> >                                 + Did I clear it the way you wanted me
> >                                 to or were you thinking of
> >                                 something else?
> >
> >
> >                             That is precisely how I wanted it to be
> >                             cleared. Thanks.
> >
> >                                 + Final question here, seems like if I
> >                                 were to want to not do the
> >                                 f->do_oop directly on the trace.obj, I'd
> >                                 need to do something like:
> >
> >                                      f->do_oop(&value);
> >                                      ...
> >                                      trace->store_oop(value);
> >
> >                                 to update the oop internally. Is that
> >                                 right/is that one of the
> >                                 advantages of going to the Oopstorage
> >                                 sooner than later?
> >
> >
> >                             I think you really want to do the do_oop on
> >                             the root directly. Is there a particular
> >                             reason why you would not want to do that?
> >                             Otherwise, yes - the benefit with using the
> >                             handle approach is that you do not need to
> >                             call do_oop explicitly in your code.
> >
> >                         There is no reason except that now we have a
> >                         load_oop and a get_oop_addr, I was not sure what
> >                         you would think of that.
> >
> >                     That's fine.
> >
> >                                     Also I see a lot of
> >                                     concurrent-looking use of the
> >                                     following field:
> >                                        267   volatile bool _initialized;
> >
> >                                     Please note that the "volatile"
> >                                     qualifier does not help with
> >                                     reordering
> >                                     here. Reordering between volatile
> >                                     and non-volatile fields is
> >                                     completely free
> >                                     for both compiler and hardware,
> >                                     except for windows with MSVC, where
> >                                     volatile
> >                                     semantics is defined to use
> >                                     acquire/release semantics, and the
> >                                     hardware is
> >                                     TSO. But for the general case, I
> >                                     would expect this field to be stored
> >                                     with
> >                                     OrderAccess::release_store and
> >                                     loaded with
> >                                     OrderAccess::load_acquire.
> >                                     Otherwise it is not thread safe.
> >
> >                                 Because everything is behind a mutex, I
> >                                 wasn't really worried about
> >                                 this. I have a test that has multiple
> >                                 threads trying to hit this
> >                                 corner case and it passes.
> >
> >                                 However, to be paranoid, I updated it to
> >                                 use the OrderAccess API
> >                                 now, thanks! Let me know what you think
> >                                 there too!
> >
> >
> >                             If it is indeed always supposed to be read
> >                             and written under a mutex, then I would
> >                             strongly prefer to have it accessed as a
> >                             normal non-volatile member, and have an
> >                             assertion that given lock is held or we are
> >                             in a safepoint, as we do in many other
> >                             places. Something like this:
> >
> >
> >                             assert(HeapMonitorStorage_lock->owned_by_self() ||
> >                                    (SafepointSynchronize::is_at_safepoint() &&
> >                                     Thread::current()->is_VM_thread()),
> >                                    "this should not be accessed concurrently");
> >
> >                             It would be confusing to people reading the
> >                             code if there are uses of OrderAccess that
> >                             are actually always protected under a mutex.
> >
> >                         Thank you for the exact example to be put in the
> >                         code! I put it around each access/assignment of
> >                         the _initialized field and found one case where
> >                         yes you can touch it and not have the lock. It
> >                         actually is "ok" because you don't act on the
> >                         storage until later and only when you really
> >                         want to modify the storage (see the
> >                         object_alloc_do_sample method which calls the
> >                         add_trace method).
> >
> >                         But, because of this, I'm going to put the
> >                         OrderAccess here, I'll do some performance
> >                         numbers later and if there are issues, I might
> >                         add an "unsafe" read and a "safe" one to make it
> >                         explicit to the reader. But I don't think it
> >                         will come to that.
> >
> >
> >                     Okay. This double return in heapMonitoring.cpp looks
> >                     wrong:
> >
> >                       283   bool initialized() {
> >                       284     return OrderAccess::load_acquire(&_initialized) != 0;
> >                       285     return _initialized;
> >                       286   }
> >
> >                     Since you said object_alloc_do_sample() is the only
> >                     place where you do not hold the mutex while reading
> >                     initialized(), I had a closer look at that. It looks
> >                     like in its current shape, the lack of a mutex may
> >                     lead to a memory leak. In particular, it first
> >                     checks if (initialized()). Let's assume this is now
> >                     true. It then allocates a bunch of stuff, and checks
> >                     if the number of frames were over 0. If they were,
> >                     it calls StackTraceStorage::storage()->add_trace()
> >                     seemingly hoping that after grabbing the lock in
> >                     there, initialized() will still return true. But it
> >                     could now return false and skip doing anything, in
> >                     which case the allocated stuff will never be freed.
> >
> >                 I fixed this now by making add_trace return a boolean
> >                 and checking for that. It will be in the next webrev.
> >                 Thanks, the truth is that in our implementation the
> >                 system is always on or off, so this never really occurs
> >                 :). In this version though, that is not true and it's
> >                 important to handle so thanks again!
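> >
> >                 The shape of the fix, roughly (the variable names are
> >                 illustrative, not the exact webrev code):
> >
> >                   // In object_alloc_do_sample(), once the frames have
> >                   // been collected:
> >                   if (!StackTraceStorage::storage()->add_trace(trace)) {
> >                     // Storage was disabled between our check and the
> >                     // lock taken inside add_trace(); free what was
> >                     // allocated so nothing leaks.
> >                     FREE_C_HEAP_ARRAY(jvmtiFrameInfo, frames);
> >                   }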
> >
> >
> >                     So the analysis seems to be that _initialized is
> >                     only used outside of the mutex in once instance,
> >                     where it is used to perform double-checked locking,
> >                     that actually causes a memory leak.
> >
> >                     I am not proposing how to fix that, just raising the
> >                     issue. If you still want to perform this
> >                     double-checked locking somehow, then the use of
> >                     acquire/release still seems odd. Because the memory
> >                     ordering restrictions of it never comes into play in
> >                     this particular case. If it ever did, then the use
> >                     of destroy_stuff(); release_store(_initialized, 0)
> >                     would be broken anyway as that would imply that
> >                     whatever concurrent reader there ever was would
> >                     after reading _initialized with load_acquire() could
> >                     *never* read the data that is concurrently destroyed
> >                     anyway. I would be biased to think that
> >                     RawAccess<MO_RELAXED>::load/store looks like a more
> >                     appropriate solution, given that the memory leak
> >                     issue is resolved. I do not know how painful it
> >                     would be to not perform this double-checked locking.
> >
> >                 So I agree with this entirely. I also looked a bit more,
> >                 and the difference in code really stems from our
> >                 internal version. In this version, however, there are
> >                 actually a lot of things going on that I did not go
> >                 entirely through in my head but this comment made me
> >                 ponder a bit more on it.
> >
> >                 Since every object_alloc_do_sample is protected by a
> >                 check to HeapMonitoring::enabled(), there is only a
> >                 small chance that the call is happening when things have
> >                 been disabled. So there is no real need to do a first
> >                 check on the initialized flag; it is a rare occurrence
> >                 that a call happens to object_alloc_do_sample while the
> >                 initialized flag of the storage returns false.
> >
> >                 (By the way, even if you did call object_alloc_do_sample
> >                 without looking at HeapMonitoring::enabled(), that would
> >                 be ok too. You would gather the stacktrace and get
> >                 nowhere at the add_trace call, which would return false;
> >                 so though not optimal performance wise, nothing would
> >                 break).
> >
> >                 Furthermore, the add_trace is really the moment of no
> >                 return and we have the mutex lock and then the
> >                 initialized check. So, in the end, I did two things: I
> >                 removed that first check and then I removed the
> >                 OrderAccess for the storage initialized. I think now I
> >                 have a better grasp and understanding why it was done in
> >                 our code and why it is not needed here. Thanks for
> >                 pointing it out :). This now still passes my JTREG
> >                 tests, especially the threaded one.
> >
> >
> >
> >                                     As a kind of meta comment, I wonder
> >                                     if it would make sense to add
> >                                     sampling
> >                                     for non-TLAB allocations. Seems like
> >                                     if someone is rapidly allocating a
> >                                     whole bunch of 1 MB objects that
> >                                     never fit in a TLAB, I might still be
> >                                     interested in seeing that in my
> >                                     traces, and not get surprised that
> >                                     the
> >                                     allocation rate is very high yet not
> >                                     showing up in any profiles.
> >
> >                                 That is handled by the handle_sample
> >                                 where you wanted me to put a
> >                                 UseTlab because you hit that case if the
> >                                 allocation is too big.
> >
> >
> >                             I see. It was not obvious to me that
> >                             non-TLAB sampling is done in the TLAB class.
> >                             That seems like an abstraction crime.
> >                             What I wanted in my previous comment was
> >                             that we do not call into the TLAB when we
> >                             are not using TLABs. If there is sampling
> >                             logic in the TLAB that is used for something
> >                             else than TLABs, then it seems like that
> >                             logic simply does not belong inside of the
> >                             TLAB. It should be moved out of the TLAB,
> >                             and instead have the TLAB call this common
> >                             abstraction that makes sense.
> >
> >                         So in the incremental version:
> >
> >
> >                         http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/,
> >                         this is still a "crime". The reason is that the
> >                         system has to have the bytes_until_sample on a
> >                         per-thread level and it made "sense" to have it
> >                         with the TLAB implementation. Also, I was not
> >                         sure how people felt about adding something to
> >                         the thread instance instead.
> >
> >                         Do you think it fits better at the Thread level?
> >                         I can see how difficult it is to make it happen
> >                         there and add some logic there. Let me know what
> >                         you think.
> >
> >
> >                     We have an unfortunate situation where everyone that
> >                     has some fields that are thread local tend to dump
> >                     them right into Thread, making the size and
> >                     complexity of Thread grow as it becomes tightly
> >                     coupled with various unrelated subsystems. It would
> >                     be desirable to have a separate class for this
> >                     instead that encapsulates the sampling logic. That
> >                     class could possibly reside in Thread though as a
> >                     value object of Thread.
> >
> >                 I imagined that would be the case but was not sure. I
> >                 will look at the example that Robbin is talking about
> >                 (ThreadSMR) and will see how to refactor my code to use
> >                 that.
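> >
> >                 For illustration, the "value object of Thread" shape
> >                 being discussed would be roughly (simplified, not the
> >                 webrev's exact fields):
> >
> >                   class ThreadHeapSampler {
> >                     size_t _bytes_until_sample;
> >                    public:
> >                     ThreadHeapSampler() : _bytes_until_sample(0) {}
> >                     size_t bytes_until_sample() const {
> >                       return _bytes_until_sample;
> >                     }
> >                     void set_bytes_until_sample(size_t v) {
> >                       _bytes_until_sample = v;
> >                     }
> >                   };
> >
> >                   // Thread then embeds the sampler by value and exposes
> >                   // an accessor, keeping the sampling state out of
> >                   // Thread's own field list:
> >                   //   ThreadHeapSampler _heap_sampler;
> >                   //   ThreadHeapSampler& heap_sampler() { return _heap_sampler; }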
> >
> >                 Thanks again for your help,
> >
> >                 Jc
> >
> >
> >
> >                             Hope I have answered your questions and that
> >                             my feedback makes sense to you.
> >
> >                         You have and thank you for them, I think we are
> >                         getting to a cleaner implementation and things
> >                         are getting better and more readable :)
> >
> >
> >                     Yes it is getting better.
> >
> >                     Thanks,
> >                     /Erik
> >
> >
> >
> >                         Thanks for your help!
> >
> >                         Jc
> >
> >                             Thanks,
> >                             /Erik
> >
> >                                 I double checked by changing the test
> >
> >                                 <http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatObjectCorrectnessTest.java>

