JDK-8171119: Low-Overhead Heap Profiling
JC Beyler
jcbeyler at google.com
Thu Apr 5 18:15:56 UTC 2018
Thanks Boris and Derek for testing it.
Yes I was trying to get a new version out that had the tests ported as well
but got sidetracked while trying to add tests and two new features.
Here is the incremental webrev:
Here is the full webrev:
http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.11/
Basically, the new tests assert this:
- Only one agent can currently ask for the sampling; I'm seeing
if I can push the multi-agent support to a later webrev, to start a
code freeze on this one
- The event is not thread-enabled, meaning that, like the
VMObjectAllocationEvent, it's an all-or-nothing event; as with the
multi-agent support, I'm going to see if adding it in a future webrev is a
better idea, to freeze this webrev a bit
There was another item I added here, and I'm unsure whether this webrev is
stable in debug mode because of it: I added an assertion system to ascertain
that all paths leading to a TLAB slow path (and hence a sampling point) have a
sampling collector ready to post the event if a user wants it. This might
break a few things in debug mode, as I'm still working through the kinks.
However, in release mode, this new webrev passes all the tests in
hotspot/jtreg/serviceability/jvmti/HeapMonitor.
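To illustrate the idea of that assertion system (this is a hypothetical sketch, not the webrev's code; the names SamplingCollectorScope and sampling_point_reached are invented here): a scoped marker object is set up on entry to any path that can reach a TLAB slow path, and the sampling point asserts that one is active before posting an event.

```cpp
#include <cassert>

// Invented sketch: a scoped "collector is ready" marker. Paths that
// can reach a TLAB slow path construct one on entry; its destructor
// clears it on exit.
class SamplingCollectorScope {
 public:
  SamplingCollectorScope() { ++_active; }
  ~SamplingCollectorScope() { --_active; }
  static bool active() { return _active > 0; }
 private:
  static int _active;
};

int SamplingCollectorScope::_active = 0;

// A sampling point reached from a TLAB slow path: in debug mode,
// assert that some caller set up a collector to post the event.
bool sampling_point_reached() {
  assert(SamplingCollectorScope::active() &&
         "TLAB slow path hit a sampling point with no collector ready");
  return SamplingCollectorScope::active();
}
```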
Let me know what you think,
Jc
On Thu, Apr 5, 2018 at 4:56 AM Boris Ulasevich <boris.ulasevich at bell-sw.com>
wrote:
> Hi JC,
>
> I have just checked on arm32: your patch compiles and runs ok.
>
> As I can see, jtreg agentlib name "-agentlib:HeapMonitor" does not
> correspond to actual library name: libHeapMonitorTest.c ->
> libHeapMonitorTest.so
>
> Boris
>
> On 04.04.2018 01:54, White, Derek wrote:
> > Thanks JC,
> >
> > New patch applies cleanly. Compiles and runs (simple test programs) on
> > aarch64.
> >
> > * Derek
> >
> > *From:* JC Beyler [mailto:jcbeyler at google.com]
> > *Sent:* Monday, April 02, 2018 1:17 PM
> > *To:* White, Derek <Derek.White at cavium.com>
> > *Cc:* Erik Österlund <erik.osterlund at oracle.com>;
> > serviceability-dev at openjdk.java.net; hotspot-compiler-dev
> > <hotspot-compiler-dev at openjdk.java.net>
> > *Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling
> >
> > Hi Derek,
> >
> > I know there were a few things that went in that provoked a merge
> > conflict. I worked on it and got it up to date. Sadly my lack of
> > knowledge makes it a full rebase instead of keeping all the history.
> > However, with a newly cloned jdk/hs you should now be able to use:
> >
> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/
> >
> > The change you are referring to was done with the others, so perhaps you
> > were unlucky and I forgot it in one webrev and fixed it in another? I
> > don't know, but it's been there; I checked, and it is here:
> >
> >
> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp.udiff.html
> >
> > I double checked that tlab_end_offset no longer appears in any
> > architecture (as far as I can tell :)).
> >
> > Thanks for testing and let me know if you run into any other issues!
> >
> > Jc
> >
> > On Fri, Mar 30, 2018 at 4:24 PM White, Derek <Derek.White at cavium.com
> > <mailto:Derek.White at cavium.com>> wrote:
> >
> > Hi Jc,
> >
> > I’ve been having trouble getting your patch to apply correctly. I
> > may have based it on the wrong version.
> >
> > In any case, I think there’s a missing update to
> > macroAssembler_aarch64.cpp, in MacroAssembler::tlab_allocate(),
> > where “JavaThread::tlab_end_offset()” should become
> > “JavaThread::tlab_current_end_offset()”.
> >
> > This should correspond to the other port’s changes in
> > templateTable_<cpu>.cpp files.
> >
> > Thanks!
> > - Derek
> >
> > *From:* hotspot-compiler-dev
> > [mailto:hotspot-compiler-dev-bounces at openjdk.java.net
> > <mailto:hotspot-compiler-dev-bounces at openjdk.java.net>] *On Behalf
> > Of *JC Beyler
> > *Sent:* Wednesday, March 28, 2018 11:43 AM
> > *To:* Erik Österlund <erik.osterlund at oracle.com
> > <mailto:erik.osterlund at oracle.com>>
> > *Cc:* serviceability-dev at openjdk.java.net
> > <mailto:serviceability-dev at openjdk.java.net>; hotspot-compiler-dev
> > <hotspot-compiler-dev at openjdk.java.net
> > <mailto:hotspot-compiler-dev at openjdk.java.net>>
> > *Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling
> >
> > Hi all,
> >
> > I've been working on deflaking the tests mostly and the wording in
> > the JVMTI spec.
> >
> > Here are the two incremental webrevs:
> >
> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.5_6/
> >
> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.06_07/
> >
> > Here is the total webrev:
> >
> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.07/
> >
> > Here are the notes of this change:
> >
> > - Currently the tests pass 100 times in a row; I am working on
> > checking whether they pass 1000 times in a row.
> >
> > - The default sampling rate is set to 512k; this is what we use
> > internally, and having a default means that to enable the sampling
> > with the default, the user only has to do an enable event/disable
> > event via JVMTI (instead of enable + set sample rate).
> >
> > - I deprecated the code that was handling the fast path TLAB
> > refill, since FastTLABRefill itself is now deprecated
> >
> > - Though I saw that Graal is still using it so I have to see
> > what needs to be done there exactly
> >
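As a rough illustration of what a 512k sampling interval means operationally (a model sketched here, not the webrev's code — the actual implementation places sampling points by adjusting the TLAB end, while ThreadSampleBudget below is an invented name and a deterministic simplification): each thread keeps a byte budget, and an allocation is reported when the budget runs out.

```cpp
#include <cstddef>

// Invented model of per-thread sampling: every allocation draws down
// a byte budget; when the budget is exhausted, the allocation is
// sampled and the budget is reset to the sampling interval
// (512 KiB by default, per the discussion above).
class ThreadSampleBudget {
 public:
  explicit ThreadSampleBudget(size_t interval_bytes)
      : _interval(interval_bytes), _bytes_until_sample(interval_bytes) {}

  // Returns true if this allocation should be sampled.
  bool record_allocation(size_t size_bytes) {
    if (size_bytes >= _bytes_until_sample) {
      _bytes_until_sample = _interval;  // reset for the next sample
      return true;
    }
    _bytes_until_sample -= size_bytes;
    return false;
  }

 private:
  size_t _interval;
  size_t _bytes_until_sample;
};
```

With a 512 KiB interval, allocating 1 MiB in 4 KiB chunks yields two samples, which is the kind of rate/overhead trade-off the default is aiming for.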
> > Finally, using the Dacapo benchmark suite, I noted a 1% overhead for
> > when the event system is turned on and the callback to the native
> > agent is just empty. I got a 3% overhead with a 512k sampling rate
> > with the code I put in the native side of my tests.
> >
> > Thanks and comments are appreciated,
> >
> > Jc
> >
> > On Mon, Mar 19, 2018 at 2:06 PM JC Beyler <jcbeyler at google.com
> > <mailto:jcbeyler at google.com>> wrote:
> >
> > Hi all,
> >
> > The incremental webrev update is here:
> >
> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event4_5/
> >
> > The full webrev is here:
> >
> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/
> >
> > Major change here is:
> >
> > - I've removed the heapMonitoring.cpp code in favor of just
> > having the sampling events as per Serguei's request; I still
> > have to do some overhead measurements but the tests prove the
> > concept can work
> >
> > - Most of the tlab code is unchanged, the only major
> > part is that now things get sent off to event collectors when
> > used and enabled.
> >
> > - Added the interpreter collectors to handle interpreter
> > execution
> >
> > - Updated the name from SetTlabHeapSampling to
> > SetHeapSampling to be more generic
> >
> > - Added a mutex for the thread sampling so that we can
> > initialize an internal static array safely
> >
> > - Ported the tests from the old system to this new one
> >
> > I've also updated the JEP and CSR to reflect these changes:
> >
> > https://bugs.openjdk.java.net/browse/JDK-8194905
> >
> > https://bugs.openjdk.java.net/browse/JDK-8171119
> >
> > In order to make this have some forward progress, I've removed
> > the heap sampling code entirely and now rely entirely on the
> > event sampling system. The tests reflect this by using a
> > simplified implementation of what an agent could do:
> >
> >
> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/libHeapMonitor.c
> >
> > (Search for anything mentioning event_storage).
> >
> > I have not taken the time to port the whole code we had
> > originally in heapMonitoring to this. I hesitate only because
> > that code was in C++, I'd have to port it to C and this is for
> > tests so perhaps what I have now is good enough?
> >
> > As far as testing goes, I've ported all the relevant tests and
> > then added a few:
> >
> > - Turning the system on/off
> >
> > - Testing using various GCs
> >
> > - Testing using the interpreter
> >
> > - Testing the sampling rate
> >
> > - Testing with objects and arrays
> >
> > - Testing with various threads
> >
> > Finally, as far as overhead goes, comparing a build with the
> > system off against a clean build shows 0% overhead, which is what
> > we'd want. This was using the Dacapo benchmarks. I am now preparing
> > to run a version with the events on using Dacapo and will report
> > back here.
> >
> > Any comments are welcome :)
> >
> > Jc
> >
> > On Thu, Mar 8, 2018 at 4:00 PM JC Beyler <jcbeyler at google.com
> > <mailto:jcbeyler at google.com>> wrote:
> >
> > Hi all,
> >
> > I apologize for the delay but I wanted to add an event
> > system and that took a bit longer than expected and I also
> > reworked the code to take into account the deprecation of
> > FastTLABRefill.
> >
> > This update has four parts:
> >
> > A) I moved the implementation from Thread to
> > ThreadHeapSampler inside of Thread. Would you prefer it as a
> > pointer inside of Thread, or does it work for you like this? A
> > second question: would you rather have an association
> > outside of Thread altogether that tries to remember which
> > threads are live, so that we would have something like:
> >
> > ThreadHeapSampler::get_sampling_size(this_thread);
> >
> > I worry about the overhead of this but perhaps it is not too
> > bad?
> >
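For concreteness, option (A) — the sampler kept as a value object embedded in Thread and reached through an accessor — could look roughly like this (a sketch only; everything beyond the name ThreadHeapSampler is invented here):

```cpp
#include <cstddef>

// Invented sketch: sampling state lives in its own small class
// instead of as loose fields spread across Thread.
class ThreadHeapSampler {
 public:
  ThreadHeapSampler() : _bytes_until_sample(0) {}
  size_t bytes_until_sample() const { return _bytes_until_sample; }
  void set_bytes_until_sample(size_t v) { _bytes_until_sample = v; }
 private:
  size_t _bytes_until_sample;
};

// Simplified stand-in for HotSpot's Thread, holding the sampler as
// a value object: no extra allocation, no lifetime bookkeeping, and
// Thread itself stays decoupled from the sampling logic.
class Thread {
 public:
  ThreadHeapSampler& heap_sampler() { return _heap_sampler; }
 private:
  ThreadHeapSampler _heap_sampler;
};
```

The value-object form avoids the lookup overhead the paragraph above worries about, since the state is reached directly through the current thread.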
> > B) I also have been working on the Allocation event system
> > that sends out a notification at each sampled event. This
> > will be practical when wanting to do something at the
> > allocation point. I'm also looking at if the whole
> > heapMonitoring code could not reside in the agent code and
> > not in the JDK. I'm not convinced but I'm talking to Serguei
> > about it to see/assess :)
> >
> > - Also added two tests for the new event subsystem
> >
> > C) Removed the slow_path fields inside the TLAB code since
> > now FastTLABRefill is deprecated
> >
> > D) Updated the JVMTI documentation and specification for the
> > methods.
> >
> > So the incremental webrev is here:
> >
> > http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.09_10/
> >
> > and the full webrev is here:
> >
> > http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.10
> >
> > I believe I have updated the various JIRA issues that track
> > this :)
> >
> > Thanks for your input,
> >
> > Jc
> >
> > On Wed, Feb 14, 2018 at 10:34 PM, JC Beyler
> > <jcbeyler at google.com <mailto:jcbeyler at google.com>> wrote:
> >
> > Hi Erik,
> >
> > I inlined my answers, which the last one seems to answer
> > Robbin's concerns about the same thing (adding things to
> > Thread).
> >
> > On Wed, Feb 14, 2018 at 2:51 AM, Erik Österlund
> > <erik.osterlund at oracle.com
> > <mailto:erik.osterlund at oracle.com>> wrote:
> >
> > Hi JC,
> >
> > Comments are inlined below.
> >
> > On 2018-02-13 06:18, JC Beyler wrote:
> >
> > Hi Erik,
> >
> > Thanks for your answers, I've now inlined my own
> > answers/comments.
> >
> > I've done a new webrev here:
> >
> >
> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.08/
> > <
> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.08/>
> >
> > The incremental is here:
> >
> >
> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/
> > <
> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.07_08/>
> >
> > Note to all:
> >
> > - I've been integrating changes from
> > Erin/Serguei/David comments so this webrev
> > incremental is a bit an answer to all comments
> > in one. I apologize for that :)
> >
> > On Mon, Feb 12, 2018 at 6:05 AM, Erik Österlund
> > <erik.osterlund at oracle.com
> > <mailto:erik.osterlund at oracle.com>> wrote:
> >
> > Hi JC,
> >
> > Sorry for the delayed reply.
> >
> > Inlined answers:
> >
> >
> >
> > On 2018-02-06 00:04, JC Beyler wrote:
> >
> > Hi Erik,
> >
> > (Renaming this to be folded into the
> > newly renamed thread :))
> >
> > First off, thanks a lot for reviewing
> > the webrev! I appreciate it!
> >
> > I updated the webrev to:
> >
> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/
> > <
> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.05a/>
> >
> > And the incremental one is here:
> >
> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.04_05a/
> > <
> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.04_05a/>
> >
> > It contains:
> > - The change for since from 9 to 11 for
> > the jvmti.xml
> > - The use of the OrderAccess for
> initialized
> > - Clearing the oop
> >
> > I also have inlined my answers to your
> > comments. The biggest question
> > will come from the multiple *_end
> > variables. A bit of the logic there
> > is due to handling the slow path refill
> > vs fast path refill and
> > checking that the rug was not pulled
> > underneath the slowpath. I
> > believe that a previous comment was that
> > TlabFastRefill was going to
> > be deprecated.
> >
> > If this is true, we could revert this
> > code a bit and just do a : if
> > TlabFastRefill is enabled, disable this.
> > And then deprecate that when
> > TlabFastRefill is deprecated.
> >
> > This might simplify this webrev and I
> > can work on a follow-up that
> > either: removes TlabFastRefill if Robbin
> > does not have the time to do
> > it or add the support to the assembly
> > side to handle this correctly.
> > What do you think?
> >
> > I support removing TlabFastRefill, but I
> > think it is good to not depend on that
> > happening first.
> >
> >
> > I'm slowly pushing on the FastTLABRefill
> > (
> https://bugs.openjdk.java.net/browse/JDK-8194084),
> > I agree on keeping both separate for now though
> > so that we can think of both differently
> >
> > Now, below, inlined are my answers:
> >
> > On Fri, Feb 2, 2018 at 8:44 AM, Erik
> > Österlund
> > <erik.osterlund at oracle.com
> > <mailto:erik.osterlund at oracle.com>>
> wrote:
> >
> > Hi JC,
> >
> > Hope I am reviewing the right
> > version of your work. Here goes...
> >
> >
> src/hotspot/share/gc/shared/collectedHeap.inline.hpp:
> >
> > 159
> >
> AllocTracer::send_allocation_outside_tlab(klass, result, size *
> > HeapWordSize, THREAD);
> > 160
> > 161
> >
> THREAD->tlab().handle_sample(THREAD, result, size);
> > 162 return result;
> > 163 }
> >
> > Should not call tlab()->X without
> > checking if (UseTLAB) IMO.
> >
> > Done!
> >
> >
> > More about this later.
> >
> >
> src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp:
> >
> > So first of all, there seem to be
> > quite a few ends. There is an "end",
> > a "hard
> > end", a "slow path end", and an
> > "actual end". Moreover, it seems
> > like the
> > "hard end" is actually further away
> > than the "actual end". So the "hard
> end"
> > seems like more of a "really
> > definitely actual end" or something.
> > I don't
> > know about you, but I think it looks
> > kind of messy. In particular, I don't
> > feel like the name "actual end"
> > reflects what it represents,
> > especially when
> > there is another end that is behind
> > the "actual end".
> >
> > 413 HeapWord*
> > ThreadLocalAllocBuffer::hard_end() {
> > 414 // Did a fast TLAB refill
> > occur?
> > 415 if (_slow_path_end != _end)
> {
> > 416 // Fix up the actual end
> > to be now the end of this TLAB.
> > 417 _slow_path_end = _end;
> > 418 _actual_end = _end;
> > 419 }
> > 420
> > 421 return _actual_end +
> > alignment_reserve();
> > 422 }
> >
> > I really do not like making getters
> > unexpectedly have these kind of side
> > effects. It is not expected that
> > when you ask for the "hard end", you
> > implicitly update the "slow path
> > end" and "actual end" to new values.
> >
> > As I said, a lot of this is due to the
> > FastTlabRefill. If I make this
> > not supporting FastTlabRefill, this goes
> > away. The reason the system
> > needs to update itself at the get is
> > that you only know at that get if
> > things have shifted underneath the tlab
> > slow path. I am not sure of
> > really better names (naming is hard!),
> > perhaps we could do these
> > names:
> >
> > - current_tlab_end // Either the
> > allocated tlab end or a sampling point
> > - last_allocation_address // The end of
> > the tlab allocation
> > - last_slowpath_allocated_end // In
> > case a fast refill occurred the
> > end might have changed, this is to
> > remember slow vs fast past refills
> >
> > the hard_end method can be renamed to
> > something like:
> > tlab_end_pointer() // The end of
> > the lab including a bit of
> > alignment reserved bytes
> >
> > Those names sound better to me. Could you
> > please provide a mapping from the old names
> > to the new names so I understand which one
> > is which please?
> >
> > This is my current guess of what you are
> > proposing:
> >
> > end -> current_tlab_end
> > actual_end -> last_allocation_address
> > slow_path_end -> last_slowpath_allocated_end
> > hard_end -> tlab_end_pointer
> >
> > Yes that is correct, that was what I was
> proposing.
> >
> > I would prefer this naming:
> >
> > end -> slow_path_end // the end for taking a
> > slow path; either due to sampling or
> refilling
> > actual_end -> allocation_end // the end for
> > allocations
> > slow_path_end -> last_slow_path_end // last
> > address for slow_path_end (as opposed to
> > allocation_end)
> > hard_end -> reserved_end // the end of the
> > reserved space of the TLAB
> >
> > About setting things in the getter... that
> > still seems like a very unpleasant thing to
> > me. It would be better to inspect the call
> > hierarchy and explicitly update the ends
> > where they need updating, and assert in the
> > getter that they are in sync, rather than
> > implicitly setting various ends as a
> > surprising side effect in a getter. It looks
> > like the call hierarchy is very small. With
> > my new naming convention, reserved_end()
> > would presumably return _allocation_end +
> > alignment_reserve(), and have an assert
> > checking that _allocation_end ==
> > _last_slow_path_allocation_end, complaining
> > that this invariant must hold, and that a
> > caller to this function, such as
> > make_parsable(), must first explicitly
> > synchronize the ends as required, to honor
> > that invariant.
> >
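A compilable sketch of what is being proposed here (invented class name, simplified integer addresses standing in for HeapWord*, arbitrary alignment reserve): the getter only asserts the invariant, and callers such as make_parsable() must synchronize the ends explicitly beforehand.

```cpp
#include <cassert>
#include <cstddef>

// Invented sketch of the TLAB end bookkeeping discussed above: the
// getter checks the invariant instead of silently restoring it.
class TlabEnds {
 public:
  TlabEnds(size_t allocation_end, size_t last_slow_path_allocation_end)
      : _allocation_end(allocation_end),
        _last_slow_path_allocation_end(last_slow_path_allocation_end) {}

  // Explicit fix-up, called by e.g. make_parsable() before it asks
  // for the reserved end; this replaces the side effect in the getter.
  void synchronize_ends() { _last_slow_path_allocation_end = _allocation_end; }

  static size_t alignment_reserve() { return 16; }  // arbitrary for the sketch

  size_t reserved_end() const {
    // The invariant is merely asserted here, never restored.
    assert(_allocation_end == _last_slow_path_allocation_end &&
           "caller must synchronize the ends before reserved_end()");
    return _allocation_end + alignment_reserve();
  }

 private:
  size_t _allocation_end;
  size_t _last_slow_path_allocation_end;
};
```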
> >
> > I've renamed the variables to how you preferred
> > it except for the _end one. I did:
> >
> > current_end
> >
> > last_allocation_address
> >
> > tlab_end_ptr
> >
> > The reason is that the architecture-dependent
> > code uses the thread.hpp API, and it already has
> > tlab included in the name, so it becomes
> > tlab_current_end (which is better than
> > tlab_current_tlab_end in my opinion).
> >
> > I also moved the update into a separate method
> > with a TODO that says to remove it when
> > FastTLABRefill is deprecated
> >
> > This looks a lot better now. Thanks.
> >
> > Note that the following comment now needs updating
> > accordingly in threadLocalAllocBuffer.hpp:
> >
> > 41 // Heap sampling is performed via
> > the end/actual_end fields.
> >
> > 42 // actual_end contains the real end
> > of the tlab allocation,
> >
> > 43 // whereas end can be set to an
> > arbitrary spot in the tlab to
> >
> > 44 // trip the return and sample the
> > allocation.
> >
> > 45 // slow_path_end is used to track
> > if a fast tlab refill occured
> >
> > 46 // between slowpath calls.
> >
> > There might be other comments too, I have not looked
> > in detail.
> >
> > This was the only spot that still had an actual_end, I
> > fixed it now. I'll do a sweep to double check other
> > comments.
> >
> >
> >
> > Not sure it's better but before updating
> > the webrev, I wanted to try
> > to get input/consensus :)
> >
> > (Note hard_end was always further off
> > than end).
> >
> > src/hotspot/share/prims/jvmti.xml:
> >
> > 10357 <capabilityfield
> > id="can_sample_heap" since="9">
> > 10358 <description>
> > 10359 Can sample the heap.
> > 10360 If this capability
> > is enabled then the heap sampling
> > methods
> > can be called.
> > 10361 </description>
> > 10362 </capabilityfield>
> >
> > Looks like this capability should
> > not be "since 9" if it gets
> integrated
> > now.
> >
> > Updated now to 11, crossing my fingers :)
> >
> >
> src/hotspot/share/runtime/heapMonitoring.cpp:
> >
> > 448 if
> > (is_alive->do_object_b(value)) {
> > 449 // Update the oop to
> > point to the new object if it is
> still
> > alive.
> > 450
> f->do_oop(&(trace.obj));
> > 451
> > 452 // Copy the old
> > trace, if it is still live.
> > 453
> >
> _allocated_traces->at_put(curr_pos++, trace);
> > 454
> > 455 // Store the live
> > trace in a cache, to be served up on
> > /heapz.
> > 456
> >
> _traces_on_last_full_gc->append(trace);
> > 457
> > 458 count++;
> > 459 } else {
> > 460 // If the old trace
> > is no longer live, add it to the
> list of
> > 461 // recently collected
> > garbage.
> > 462
> > store_garbage_trace(trace);
> > 463 }
> >
> > In the case where the oop was not
> > live, I would like it to be
> explicitly
> > cleared.
> >
> > Done I think how you wanted it. Let me
> > know because I'm not familiar
> > with the RootAccess API. I'm unclear if
> > I'm doing this right or not so
> > reviews of these parts are highly
> > appreciated. Robbin had talked of
> > perhaps later pushing this all into a
> > OopStorage, should I do this now
> > do you think? Or can that wait a second
> > webrev later down the road?
> >
> > I think using handles can and should be done
> > later. You can use the Access API now.
> > I noticed that you are missing an #include
> > "oops/access.inline.hpp" in your
> > heapMonitoring.cpp file.
> >
> > The missing header is there for me so I don't
> > know, I made sure it is present in the latest
> > webrev. Sorry about that.
> >
> > + Did I clear it the way you wanted me
> > to or were you thinking of
> > something else?
> >
> >
> > That is precisely how I wanted it to be
> > cleared. Thanks.
> >
> > + Final question here, seems like if I
> > were to want to not do the
> > f->do_oop directly on the trace.obj, I'd
> > need to do something like:
> >
> > f->do_oop(&value);
> > ...
> > trace->store_oop(value);
> >
> > to update the oop internally. Is that
> > right/is that one of the
> > advantages of going to the Oopstorage
> > sooner than later?
> >
> >
> > I think you really want to do the do_oop on
> > the root directly. Is there a particular
> > reason why you would not want to do that?
> > Otherwise, yes - the benefit with using the
> > handle approach is that you do not need to
> > call do_oop explicitly in your code.
> >
> > There is no reason except that now we have a
> > load_oop and a get_oop_addr, I was not sure what
> > you would think of that.
> >
> > That's fine.
> >
> > Also I see a lot of
> > concurrent-looking use of the
> > following field:
> > 267 volatile bool _initialized;
> >
> > Please note that the "volatile"
> > qualifier does not help with
> reordering
> > here. Reordering between volatile
> > and non-volatile fields is
> > completely free
> > for both compiler and hardware,
> > except for windows with MSVC, where
> > volatile
> > semantics is defined to use
> > acquire/release semantics, and the
> > hardware is
> > TSO. But for the general case, I
> > would expect this field to be stored
> > with
> > OrderAccess::release_store and
> > loaded with
> OrderAccess::load_acquire.
> > Otherwise it is not thread safe.
> >
> > Because everything is behind a mutex, I
> > wasn't really worried about
> > this. I have a test that has multiple
> > threads trying to hit this
> > corner case and it passes.
> >
> > However, to be paranoid, I updated it to
> > using the OrderAccess API
> > now, thanks! Let me know what you think
> > there too!
> >
> >
> > If it is indeed always supposed to be read
> > and written under a mutex, then I would
> > strongly prefer to have it accessed as a
> > normal non-volatile member, and have an
> > assertion that given lock is held or we are
> > in a safepoint, as we do in many other
> > places. Something like this:
> >
> >
> assert(HeapMonitorStorage_lock->owned_by_self()
> > || (SafepointSynchronize::is_at_safepoint()
> > && Thread::current()->is_VM_thread()), "this
> > should not be accessed concurrently");
> >
> > It would be confusing to people reading the
> > code if there are uses of OrderAccess that
> > are actually always protected under a mutex.
> >
> > Thank you for the exact example to be put in the
> > code! I put it around each access/assignment of
> > the _initialized method and found one case where
> > yes you can touch it and not have the lock. It
> > actually is "ok" because you don't act on the
> > storage until later and only when you really
> > want to modify the storage (see the
> > object_alloc_do_sample method which calls the
> > add_trace method).
> >
> > But, because of this, I'm going to put the
> > OrderAccess here, I'll do some performance
> > numbers later and if there are issues, I might
> > add a "unsafe" read and a "safe" one to make it
> > explicit to the reader. But I don't think it
> > will come to that.
> >
> >
> > Okay. This double return in heapMonitoring.cpp looks
> > wrong:
> >
> > 283 bool initialized() {
> > 284 return
> > OrderAccess::load_acquire(&_initialized) != 0;
> > 285 return _initialized;
> > 286 }
> >
> > Since you said object_alloc_do_sample() is the only
> > place where you do not hold the mutex while reading
> > initialized(), I had a closer look at that. It looks
> > like in its current shape, the lack of a mutex may
> > lead to a memory leak. In particular, it first
> > checks if (initialized()). Let's assume this is now
> > true. It then allocates a bunch of stuff, and checks
> > if the number of frames were over 0. If they were,
> > it calls StackTraceStorage::storage()->add_trace()
> > seemingly hoping that after grabbing the lock in
> > there, initialized() will still return true. But it
> > could now return false and skip doing anything, in
> > which case the allocated stuff will never be freed.
> >
> > I fixed this now by making add_trace return a boolean
> > and checking for that. It will be in the next webrev.
> > Thanks, the truth is that in our implementation the
> > system is always on or off, so this never really occurs
> > :). In this version though, that is not true and it's
> > important to handle so thanks again!
> >
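The fix described above — making add_trace report whether the storage actually took the trace, so the caller can free its allocations when sampling was disabled in between — can be sketched like this (an illustrative model with invented names and std::mutex standing in for the HotSpot mutex, not the webrev's code):

```cpp
#include <mutex>
#include <vector>

struct Trace {
  int* frames;  // the allocation that used to leak
};

// Invented sketch: add_trace() re-checks initialized() under the
// lock and returns whether it took ownership of the trace.
class StackTraceStorage {
 public:
  bool add_trace(const Trace& t) {
    std::lock_guard<std::mutex> guard(_lock);
    if (!_initialized) {
      return false;  // declined: caller keeps ownership and must free
    }
    _traces.push_back(t);
    return true;     // storage took ownership
  }
  void set_initialized(bool v) {
    std::lock_guard<std::mutex> guard(_lock);
    _initialized = v;
  }
 private:
  std::mutex _lock;
  bool _initialized = false;
  std::vector<Trace> _traces;
};

// Caller side, mirroring object_alloc_do_sample(): allocate, then
// free the allocation if the storage declined the trace.
bool do_sample(StackTraceStorage& storage) {
  Trace t;
  t.frames = new int[64];
  if (!storage.add_trace(t)) {
    delete[] t.frames;  // the path that previously leaked
    return false;
  }
  return true;
}
```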
> >
> > So the analysis seems to be that _initialized is
> > only used outside of the mutex in one instance,
> > where it is used to perform double-checked locking,
> > that actually causes a memory leak.
> >
> > I am not proposing how to fix that, just raising the
> > issue. If you still want to perform this
> > double-checked locking somehow, then the use of
> > acquire/release still seems odd. Because the memory
> > ordering restrictions of it never comes into play in
> > this particular case. If it ever did, then the use
> > of destroy_stuff(); release_store(_initialized, 0)
> > would be broken anyway as that would imply that
> > whatever concurrent reader there ever was would
> > after reading _initialized with load_acquire() could
> > *never* read the data that is concurrently destroyed
> > anyway. I would be biased to think that
> > RawAccess<MO_RELAXED>::load/store looks like a more
> > appropriate solution, given that the memory leak
> > issue is resolved. I do not know how painful it
> > would be to not perform this double-checked locking.
> >
> > So I agree with this entirely. I also looked a bit more,
> > and the difference in code really stems from our
> > internal version. In this version however, there are
> > actually a lot of things going on that I did not go
> > entirely through in my head but this comment made me
> > ponder a bit more on it.
> >
> > Since every object_alloc_do_sample is protected by a
> > check to HeapMonitoring::enabled(), there is only a
> > small chance that the call is happening when things have
> > been disabled. So there is no real need to do a first
> > check on the initialized, it is a rare occurence that a
> > call happens to object_alloc_do_sample and the
> > initialized of the storage returns false.
> >
> > (By the way, even if you did call object_alloc_do_sample
> > without looking at HeapMonitoring::enabled(), that would
> > be ok too. You would gather the stacktrace and get
> > nowhere at the add_trace call, which would return false;
> > so though not optimal performance wise, nothing would
> > break).
> >
> > Furthermore, the add_trace is really the moment of no
> > return and we have the mutex lock and then the
> > initialized check. So, in the end, I did two things: I
> > removed that first check and then I removed the
> > OrderAccess for the storage initialized. I think now I
> > have a better grasp and understanding why it was done in
> > our code and why it is not needed here. Thanks for
> > pointing it out :). This now still passes my JTREG
> > tests, especially the threaded one.
> >
> >
> >
> > As a kind of meta comment, I wonder
> > if it would make sense to add
> sampling
> > for non-TLAB allocations. Seems like
> > if someone is rapidly allocating a
> > whole bunch of 1 MB objects that
> > never fit in a TLAB, I might still be
> > interested in seeing that in my
> > traces, and not get surprised that
> the
> > allocation rate is very high yet not
> > showing up in any profiles.
> >
> > That is handled by the handle_sample
> > where you wanted me to put a
> > UseTlab because you hit that case if the
> > allocation is too big.
> >
> >
> > I see. It was not obvious to me that
> > non-TLAB sampling is done in the TLAB class.
> > That seems like an abstraction crime.
> > What I wanted in my previous comment was
> > that we do not call into the TLAB when we
> > are not using TLABs. If there is sampling
> > logic in the TLAB that is used for something
> > else than TLABs, then it seems like that
> > logic simply does not belong inside of the
> > TLAB. It should be moved out of the TLAB,
> > and instead have the TLAB call this common
> > abstraction that makes sense.
> >
> > So in the incremental version:
> >
> >
> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/
> > <
> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.07_08/>,
> > this is still a "crime". The reason is that the
> > system has to have the bytes_until_sample on a
> > per-thread level and it made "sense" to have it
> > with the TLAB implementation. Also, I was not
> > sure how people felt about adding something to
> > the thread instance instead.
> >
> > Do you think it fits better at the Thread level?
> > I can see how difficult it is to make it happen
> > there and add some logic there. Let me know what
> > you think.
> >
> >
> > We have an unfortunate situation where everyone that
> > has some fields that are thread local tend to dump
> > them right into Thread, making the size and
> > complexity of Thread grow as it becomes tightly
> > coupled with various unrelated subsystems. It would
> > be desirable to have a separate class for this
> > instead that encapsulates the sampling logic. That
> > class could possibly reside in Thread though as a
> > value object of Thread.
> >
> > I imagined that would be the case but was not sure. I
> > will look at the example that Robbin is talking about
> > (ThreadSMR) and will see how to refactor my code to use
> > that.
> >
> > Thanks again for your help,
> >
> > Jc
> >
> >
> >
> > Hope I have answered your questions and that
> > my feedback makes sense to you.
> >
> > You have and thank you for them, I think we are
> > getting to a cleaner implementation and things
> > are getting better and more readable :)
> >
> >
> > Yes it is getting better.
> >
> > Thanks,
> > /Erik
> >
> >
> >
> > Thanks for your help!
> >
> > Jc
> >
> > Thanks,
> > /Erik
> >
> > I double checked by changing the test
> >
> <http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatObjectCorrectnessTest.java>