JDK-8171119: Low-Overhead Heap Profiling

JC Beyler jcbeyler at google.com
Fri Apr 6 23:12:52 UTC 2018


Hi Karen,

Let me inline my answers, it will probably be easier :)

On Fri, Apr 6, 2018 at 2:40 PM Karen Kinnear <karen.kinnear at oracle.com>
wrote:

> JC,
>
> Thank you for the updates - really glad you are including the compiler
> folks. I reviewed the version before this one, so ignore
> any comments you’ve already covered (although I did peek at the latest)
>
> 1. JDK-8194905 CSR - could you please delete attachments that are not
> current. It is a bit confusing right now.
> I have been looking at jvmti_event6.html. I am assuming the rest are
> obsolete and could be removed please.
>

I tried but it does not allow me to do so. It seems that someone with
administrative rights has to do it :-(. That is why I had to resort to this...


>
> 2. In jvmti_event6.html under Sampled Object Allocation, there is a link
> to “Heap Sampling Monitoring System”.
> It takes me to the top of the page - seems like something is missing in
> defining it?
>

So the Heap Sampling Monitoring System used to have more methods, and it
made sense to have them in a separate category. I have now moved it to the
memory category to be consistent and grouped there. I also removed that
link, by the way.



>
> 3. Scope of memory allocation tracking
>
> I am struggling to understand the extent of memory allocation tracking
> that you are looking for (probably want to
> clarify in the JEP and CSR once we work this through).
>
> e.g. Heap Sampler vs. JVMTI VMObjectAllocEvent
> So the current jvmtiVMObjectAllocEvent says:
>
> Sent when a method causes the VM to allocate an Object visible to Java
> and allocation not detectable by other instrumentation mechanisms.
> Generally detect by instrumenting bytecodes of allocating methods
> JNI - use JNI function interception
> e.g. Reflection: java.lang.Class.newInstance()
> e.g. VM intrinsics
>
> comment:
> Not generated due to bytecodes - e.g. new and newarray VM instructions
> Not allocation due to JNI: e.g. AllocObject
> NOT VM internal objects
> NOT allocations during VM init
>
> So from the JEP I can’t tell the intended scope of the new event - is this
> intended to cover all heap allocation?
>     bytecodes
>     JVM_*
>    JNI_*
>    internal VM objects
>    other? (I’m not sure what other there are)
>    - I presume not allocations during VM init - since sent only during
> live phase
>

Yes exactly: as much as possible, I am aiming to cover all heap
allocations. In practice, though, I think we mostly care about bytecodes
and, to a lesser extent, JNI. Being independent of why the memory is being
allocated is probably even better: we can say "this thread allocated Y
bytes", no matter where or why that happened.
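That source-independent accounting can be sketched as a simple per-thread
countdown that every allocation path charges. This is only an illustration:
the names are invented, and a real sampler (as discussed elsewhere in this
thread) would randomize the interval rather than use a fixed one:

```c
#include <stddef.h>

/* Hypothetical sketch: one per-thread counter of bytes left until the
 * next sample.  Every allocation path -- bytecode 'new', JNI, VM
 * intrinsics -- charges the same counter, so the profile answers
 * "this thread allocated Y bytes" regardless of where or why. */
typedef struct {
  size_t bytes_until_sample;  /* countdown to the next sampled allocation */
  size_t sample_interval;     /* e.g. 512 KB, the default mentioned below */
  size_t samples_taken;
} ThreadSampler;

static void sampler_init(ThreadSampler* s, size_t interval) {
  s->sample_interval = interval;
  s->bytes_until_sample = interval;
  s->samples_taken = 0;
}

/* Called from every allocation path; returns 1 if this allocation
 * should be reported to the profiler. */
static int sampler_charge(ThreadSampler* s, size_t size) {
  if (size >= s->bytes_until_sample) {
    s->bytes_until_sample = s->sample_interval;  /* re-arm (fixed here;
                                                    real code would randomize) */
    s->samples_taken++;
    return 1;
  }
  s->bytes_until_sample -= size;
  return 0;
}
```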


>
> OR - is the primary goal to cover allocation for bytecodes so folks can
> skip instrumentation?
>

Yes that is the primary goal.


> OR - do you want to get performance numbers and see what is low enough
> overhead before deciding?
>

I think it amounts to the same thing: the system is mostly in place, and my
overhead numbers indicate 0% with the feature off, 1% with it on but an
empty user callback, and 3% for a naive implementation tracking live/GC'd
objects.


>
> 4. The design question is where to put the collectors in the source base -
> and that of course strongly depends on
> the scope of the information you want to collect, and on the performance
> overhead we are willing to incur.
>

Very true.


>
> I was trying to figure out a way to put the collectors farther down the
> call stack so as to both catch more
> cases and to reduce the maintenance burden - i.e. if you were to add a new
> code generator, e.g. Graal -
> if it were to go through an existing interface, that might be a place to
> already have a collector.
>
> I do not know the Graal sources - I did look at jvmci/jvmciRuntime.cpp -
> and it appears that there
> are calls to instanceKlass::new_instance,
> oopFactory::new_typeArray/new_ObjArray and ArrayKlass::multi-allocate,
> so one possibility would be to put hooks in those calls which would catch
> many? (I did not do a thorough search)
> of the slowpath calls for the bytecodes, and then check the fast paths in
> detail.
>

I'll come to a major issue with the collector and its placement in the next
paragraph.


>
> I had wondered if it made sense to move the hooks even farther down, into
> CollectedHeap:obj_allocate and array_allocate.
> I do not think so. First reason is that for multidimensional arrays,
> ArrayKlass::multi_allocate the outer dimension array would
> have an event before storing the inner sub-arrays and I don’t think we
> want that exposed, so that won’t work for arrays.
>

So the major difficulty is the ordering of the collection steps:

- An object gets allocated and is chosen for sampling
- The original pointer (where the object initially resides in memory) is
passed to the collector
- Now one important thing to note:
    (a) In the VM code, the GC is not yet aware of the oop until the point
where it is returned
    (b) so the collector can't yet send it out to the user via JVMTI;
otherwise the agent could, for example, install a weak reference to an
object the GC does not know about

I'm a bit fuzzy on this, and maybe it just means more heavy lifting would
be needed to make it possible, but my initial tests seem to show problems
when attempting this in the obj_allocate area.
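The ordering constraint above can be sketched as a two-phase collector:
record the sampled address while the object is still invisible to the GC,
and only deliver the event once the allocation has been published. All
names here are invented for illustration, not HotSpot's:

```c
#include <stddef.h>

/* Hypothetical two-phase sample collector: record-then-publish. */
typedef struct {
  void* pending_obj;      /* recorded at allocation time, not yet exposed */
  size_t pending_size;
  void* posted_obj[8];    /* events actually delivered to the "agent" */
  size_t posted_size[8];
  int posted_count;
} SampleCollector;

/* Phase 1: the allocation was sampled; remember it, but do NOT call out
 * yet -- an agent callback here could install e.g. a weak reference to
 * an object the GC is not yet aware of. */
static void collector_record(SampleCollector* c, void* obj, size_t size) {
  c->pending_obj = obj;
  c->pending_size = size;
}

/* Phase 2: the oop has been returned and is visible; it is now safe to
 * post the event to the agent. */
static void collector_publish(SampleCollector* c) {
  if (c->pending_obj != NULL && c->posted_count < 8) {
    c->posted_obj[c->posted_count] = c->pending_obj;
    c->posted_size[c->posted_count] = c->pending_size;
    c->posted_count++;
    c->pending_obj = NULL;
  }
}
```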


>
> The second reason is that I strongly suspect the scope you want is
> bytecodes only. I think once you have added hooks
> to all the fast paths and slow paths that this will be pushing the
> performance overhead constraints you proposed and
> you won’t want to see e.g. internal allocations.
>

Yes, agreed: allocations from bytecodes are generally our main concern :)


>
>
> But I think you need to experiment with the set of allocations (or possible
> alternative sets of allocations) you want recorded.
>
> The hooks I see today include:
> Interpreter: (looking at x86 as a sample)
>   - slowpath in InterpreterRuntime
>   - fastpath tlab allocation - your new threshold check handles that
>

Agreed


>   - allow_shared_alloc (GC specific): for _new isn’t handled
>

Where is that exactly? I can check why we are not catching it.


>
> C1
>   I don’t see changes in c1_Runtime.cpp
>   note: you also want to look for the fast path
>

I added the calls to c1_Runtime in the latest webrev, but was still going
through testing before pushing it out; I had waited on this one a bit. The
fast path would be handled by the threshold check, no?


> C2: changes in opto/runtime.cpp for slow path
>    did you also catch the fast path?
>

Fast path gets handled by the same threshold check, no? Perhaps I've
missed something (very likely)?
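A minimal sketch of why the threshold check can cover the fast path (field
names invented, not HotSpot's): the fast-path bump-and-compare is against a
"current end" that may have been lowered to a sampling point, so a sampled
allocation naturally falls into the slow path with no extra fast-path
instrumentation:

```c
#include <stddef.h>

/* Hypothetical TLAB with an artificially lowered fast-path limit. */
typedef struct {
  char* top;          /* next free byte */
  char* current_end;  /* fast-path limit: sampling point or real end */
  char* actual_end;   /* real end of the TLAB */
  int samples;
} Tlab;

/* Fast path: exactly the existing bump-and-compare, unchanged. */
static char* tlab_fast_alloc(Tlab* t, size_t size) {
  if (t->top + size <= t->current_end) {
    char* obj = t->top;
    t->top += size;
    return obj;
  }
  return NULL;  /* take the slow path */
}

/* Slow path: if we only tripped the sampling point (the TLAB itself
 * still has room), take the sample, restore the real end, and retry. */
static char* tlab_slow_alloc(Tlab* t, size_t size) {
  if (t->current_end != t->actual_end && t->top + size <= t->actual_end) {
    t->samples++;
    t->current_end = t->actual_end;  /* a real sampler would set the
                                        next sampling point here */
    return tlab_fast_alloc(t, size);
  }
  return NULL;  /* genuine refill / outside-TLAB allocation elsewhere */
}
```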


>
> 3. Performance -
> After you get all the collectors added - you need to rerun the performance
> numbers.
>

Agreed :)


>
> thanks,
> Karen
>
> On Apr 5, 2018, at 2:15 PM, JC Beyler <jcbeyler at google.com> wrote:
>
> Thanks Boris and Derek for testing it.
>
> Yes I was trying to get a new version out that had the tests ported as
> well but got sidetracked while trying to add tests and two new features.
>
> Here is the incremental webrev:
>
> Here is the full webrev:
> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.11/
>
> Basically, the new tests assert this:
>   - Only one agent can currently ask for the sampling, I'm currently
> seeing if I can push to a next webrev the multi-agent support to start
> doing a code freeze on this one
>   - The event is not thread-enabled, meaning like the
> VMObjectAllocationEvent, it's an all or nothing event; same as the
> multi-agent, I'm going to see if a future webrev to add the support is a
> better idea to freeze this webrev a bit
>
> There was another item that I added here and I'm unsure this webrev is
> stable in debug mode: I added an assertion system to ascertain that all
> paths leading to a TLAB slow path (and hence a sampling point) have a
> sampling collector ready to post the event if a user wants it. This might
> break a few things in debug mode as I'm working through the kinks of that as
> well. However, in release mode, this new webrev passes all the tests in
> hotspot/jtreg/serviceability/jvmti/HeapMonitor.
>
> Let me know what you think,
> Jc
>
> On Thu, Apr 5, 2018 at 4:56 AM Boris Ulasevich <
> boris.ulasevich at bell-sw.com> wrote:
>
>> Hi JC,
>>
>>    I have just checked on arm32: your patch compiles and runs ok.
>>
>>    As I can see, jtreg agentlib name "-agentlib:HeapMonitor" does not
>> correspond to actual library name: libHeapMonitorTest.c ->
>> libHeapMonitorTest.so
>>
>> Boris
>>
>> On 04.04.2018 01:54, White, Derek wrote:
>> > Thanks JC,
>> >
>> > New patch applies cleanly. Compiles and runs (simple test programs) on
>> > aarch64.
>> >
>> >   * Derek
>> >
>> > *From:* JC Beyler [mailto:jcbeyler at google.com]
>> > *Sent:* Monday, April 02, 2018 1:17 PM
>> > *To:* White, Derek <Derek.White at cavium.com>
>> > *Cc:* Erik Österlund <erik.osterlund at oracle.com>;
>> > serviceability-dev at openjdk.java.net; hotspot-compiler-dev
>> > <hotspot-compiler-dev at openjdk.java.net>
>> > *Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling
>> >
>> > Hi Derek,
>> >
>> > I know there were a few things that went in that provoked a merge
>> > conflict. I worked on it and got it up to date. Sadly my lack of
>> > knowledge makes it a full rebase instead of keeping all the history.
>> > However, with a newly cloned jdk/hs you should now be able to use:
>> >
>> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/
>> >
>> > The change you are referring to was done with the others so perhaps you
>> > were unlucky and I forgot it in a webrev and fixed it in another? I
>> > don't know but it's been there and I checked, it is here:
>> >
>> >
>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp.udiff.html
>> >
>> > I double checked that tlab_end_offset no longer appears in any
>> > architecture (as far as I can tell :)).
>> >
>> > Thanks for testing and let me know if you run into any other issues!
>> >
>> > Jc
>> >
>> > On Fri, Mar 30, 2018 at 4:24 PM White, Derek <Derek.White at cavium.com
>> > <mailto:Derek.White at cavium.com>> wrote:
>> >
>> >     Hi Jc,
>> >
>> >     I’ve been having trouble getting your patch to apply correctly. I
>> >     may have based it on the wrong version.
>> >
>> >     In any case, I think there’s a missing update to
>> >     macroAssembler_aarch64.cpp, in MacroAssembler::tlab_allocate(),
>> >     where “JavaThread::tlab_end_offset()” should become
>> >     “JavaThread::tlab_current_end_offset()”.
>> >
>> >     This should correspond to the other port’s changes in
>> >     templateTable_<cpu>.cpp files.
>> >
>> >     Thanks!
>> >     - Derek
>> >
>> >     *From:* hotspot-compiler-dev
>> >     [mailto:hotspot-compiler-dev-bounces at openjdk.java.net
>> >     <mailto:hotspot-compiler-dev-bounces at openjdk.java.net>] *On Behalf
>> >     Of *JC Beyler
>> >     *Sent:* Wednesday, March 28, 2018 11:43 AM
>> >     *To:* Erik Österlund <erik.osterlund at oracle.com
>> >     <mailto:erik.osterlund at oracle.com>>
>> >     *Cc:* serviceability-dev at openjdk.java.net
>> >     <mailto:serviceability-dev at openjdk.java.net>; hotspot-compiler-dev
>> >     <hotspot-compiler-dev at openjdk.java.net
>> >     <mailto:hotspot-compiler-dev at openjdk.java.net>>
>> >     *Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling
>> >
>> >     Hi all,
>> >
>> >     I've been working on deflaking the tests mostly and the wording in
>> >     the JVMTI spec.
>> >
>> >     Here is the two incremental webrevs:
>> >
>> >     http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.5_6/
>> >
>> >     http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.06_07/
>> >
>> >     Here is the total webrev:
>> >
>> >     http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.07/
>> >
>> >     Here are the notes of this change:
>> >
>> >        - Currently the tests pass 100 times in a row, I am working on
>> >     checking if they pass 1000 times in a row.
>> >
>> >        - The default sampling rate is set to 512k, this is what we use
>> >     internally and having a default means that to enable the sampling
>> >     with the default, the user only has to do a enable event/disable
>> >     event via JVMTI (instead of enable + set sample rate).
>> >
>> >        - I deprecated the code that was handling the fast path tlab
>> >     refill if it happened since this is now deprecated
>> >
>> >            - Though I saw that Graal is still using it so I have to see
>> >     what needs to be done there exactly
>> >
>> >     Finally, using the Dacapo benchmark suite, I noted a 1% overhead for
>> >     when the event system is turned on and the callback to the native
>> >     agent is just empty. I got a 3% overhead with a 512k sampling rate
>> >     with the code I put in the native side of my tests.
>> >
>> >     Thanks and comments are appreciated,
>> >
>> >     Jc
>> >
>> >     On Mon, Mar 19, 2018 at 2:06 PM JC Beyler <jcbeyler at google.com
>> >     <mailto:jcbeyler at google.com>> wrote:
>> >
>> >         Hi all,
>> >
>> >         The incremental webrev update is here:
>> >
>> >         http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event4_5/
>> >
>> >         The full webrev is here:
>> >
>> >         http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/
>> >
>> >         Major change here is:
>> >
>> >            - I've removed the heapMonitoring.cpp code in favor of just
>> >         having the sampling events as per Serguei's request; I still
>> >         have to do some overhead measurements but the tests prove the
>> >         concept can work
>> >
>> >                 - Most of the tlab code is unchanged, the only major
>> >         part is that now things get sent off to event collectors when
>> >         used and enabled.
>> >
>> >            - Added the interpreter collectors to handle interpreter
>> >         execution
>> >
>> >            - Updated the name from SetTlabHeapSampling to
>> >         SetHeapSampling to be more generic
>> >
>> >            - Added a mutex for the thread sampling so that we can
>> >         initialize an internal static array safely
>> >
>> >            - Ported the tests from the old system to this new one
>> >
>> >         I've also updated the JEP and CSR to reflect these changes:
>> >
>> >         https://bugs.openjdk.java.net/browse/JDK-8194905
>> >
>> >         https://bugs.openjdk.java.net/browse/JDK-8171119
>> >
>> >         In order to make this have some forward progress, I've removed
>> >         the heap sampling code entirely and now rely entirely on the
>> >         event sampling system. The tests reflect this by using a
>> >         simplified implementation of what an agent could do:
>> >
>> >
>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/libHeapMonitor.c
>> >
>> >         (Search for anything mentioning event_storage).
>> >
>> >         I have not taken the time to port the whole code we had
>> >         originally in heapMonitoring to this. I hesitate only because
>> >         that code was in C++, I'd have to port it to C and this is for
>> >         tests so perhaps what I have now is good enough?
>> >
>> >         As far as testing goes, I've ported all the relevant tests and
>> >         then added a few:
>> >
>> >             - Turning the system on/off
>> >
>> >             - Testing using various GCs
>> >
>> >             - Testing using the interpreter
>> >
>> >             - Testing the sampling rate
>> >
>> >             - Testing with objects and arrays
>> >
>> >             - Testing with various threads
>> >
>> >         Finally, as overhead goes, I have the numbers of the system off
>> >         vs a clean build and I have 0% overhead, which is what we'd
>> >         want. This was using the Dacapo benchmarks. I am now preparing
>> >         to run a version with the events on using dacapo and will report
>> >         back here.
>> >
>> >         Any comments are welcome :)
>> >
>> >         Jc
>> >
>> >         On Thu, Mar 8, 2018 at 4:00 PM JC Beyler <jcbeyler at google.com
>> >         <mailto:jcbeyler at google.com>> wrote:
>> >
>> >             Hi all,
>> >
>> >             I apologize for the delay but I wanted to add an event
>> >             system and that took a bit longer than expected and I also
>> >             reworked the code to take into account the deprecation of
>> >             FastTLABRefill.
>> >
>> >             This update has four parts:
>> >
>> >             A) I moved the implementation from Thread to
>> >             ThreadHeapSampler inside of Thread. Would you prefer it as a
>> >             pointer inside of Thread or like this works for you? Second
>> >             question would be would you rather have an association
>> >             outside of Thread altogether that tries to remember when
>> >             threads are live and then we would have something like:
>> >
>> >             ThreadHeapSampler::get_sampling_size(this_thread);
>> >
>> >             I worry about the overhead of this but perhaps it is not too
>> >             too bad?
>> >
>> >             B) I also have been working on the Allocation event system
>> >             that sends out a notification at each sampled event. This
>> >             will be practical when wanting to do something at the
>> >             allocation point. I'm also looking at if the whole
>> >             heapMonitoring code could not reside in the agent code and
>> >             not in the JDK. I'm not convinced but I'm talking to Serguei
>> >             about it to see/assess :)
>> >
>> >                 - Also added two tests for the new event subsystem
>> >
>> >             C) Removed the slow_path fields inside the TLAB code since
>> >             now FastTLABRefill is deprecated
>> >
>> >             D) Updated the JVMTI documentation and specification for the
>> >             methods.
>> >
>> >             So the incremental webrev is here:
>> >
>> >             http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.09_10/
>> >
>> >             and the full webrev is here:
>> >
>> >             http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.10
>> >
>> >             I believe I have updated the various JIRA issues that track
>> >             this :)
>> >
>> >             Thanks for your input,
>> >
>> >             Jc
>> >
>> >             On Wed, Feb 14, 2018 at 10:34 PM, JC Beyler
>> >             <jcbeyler at google.com <mailto:jcbeyler at google.com>> wrote:
>> >
>> >                 Hi Erik,
>> >
>> >                 I inlined my answers, which the last one seems to answer
>> >                 Robbin's concerns about the same thing (adding things to
>> >                 Thread).
>> >
>> >                 On Wed, Feb 14, 2018 at 2:51 AM, Erik Österlund
>> >                 <erik.osterlund at oracle.com
>> >                 <mailto:erik.osterlund at oracle.com>> wrote:
>> >
>> >                     Hi JC,
>> >
>> >                     Comments are inlined below.
>> >
>> >                     On 2018-02-13 06:18, JC Beyler wrote:
>> >
>> >                         Hi Erik,
>> >
>> >                         Thanks for your answers, I've now inlined my own
>> >                         answers/comments.
>> >
>> >                         I've done a new webrev here:
>> >
>> >
>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.08/
>> >                         <
>> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.08/>
>> >
>> >                         The incremental is here:
>> >
>> >
>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/
>> >                         <
>> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.07_08/>
>> >
>> >                         Note to all:
>> >
>> >                            - I've been integrating changes from
>> >                         Erin/Serguei/David comments so this webrev
>> >                         incremental is a bit an answer to all comments
>> >                         in one. I apologize for that :)
>> >
>> >                         On Mon, Feb 12, 2018 at 6:05 AM, Erik Österlund
>> >                         <erik.osterlund at oracle.com
>> >                         <mailto:erik.osterlund at oracle.com>> wrote:
>> >
>> >                             Hi JC,
>> >
>> >                             Sorry for the delayed reply.
>> >
>> >                             Inlined answers:
>> >
>> >
>> >
>> >                             On 2018-02-06 00:04, JC Beyler wrote:
>> >
>> >                                 Hi Erik,
>> >
>> >                                 (Renaming this to be folded into the
>> >                                 newly renamed thread :))
>> >
>> >                                 First off, thanks a lot for reviewing
>> >                                 the webrev! I appreciate it!
>> >
>> >                                 I updated the webrev to:
>> >
>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/
>> >                                 <
>> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.05a/>
>> >
>> >                                 And the incremental one is here:
>> >
>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.04_05a/
>> >                                 <
>> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.04_05a/>
>> >
>> >                                 It contains:
>> >                                 - The change for since from 9 to 11 for
>> >                                 the jvmti.xml
>> >                                 - The use of the OrderAccess for
>> initialized
>> >                                 - Clearing the oop
>> >
>> >                                 I also have inlined my answers to your
>> >                                 comments. The biggest question
>> >                                 will come from the multiple *_end
>> >                                 variables. A bit of the logic there
>> >                                 is due to handling the slow path refill
>> >                                 vs fast path refill and
>> >                                 checking that the rug was not pulled
>> >                                 underneath the slowpath. I
>> >                                 believe that a previous comment was that
>> >                                 TlabFastRefill was going to
>> >                                 be deprecated.
>> >
>> >                                 If this is true, we could revert this
>> >                                 code a bit and just do a : if
>> >                                 TlabFastRefill is enabled, disable this.
>> >                                 And then deprecate that when
>> >                                 TlabFastRefill is deprecated.
>> >
>> >                                 This might simplify this webrev and I
>> >                                 can work on a follow-up that
>> >                                 either: removes TlabFastRefill if Robbin
>> >                                 does not have the time to do
>> >                                 it or add the support to the assembly
>> >                                 side to handle this correctly.
>> >                                 What do you think?
>> >
>> >                             I support removing TlabFastRefill, but I
>> >                             think it is good to not depend on that
>> >                             happening first.
>> >
>> >
>> >                         I'm slowly pushing on the FastTLABRefill
>> >                         (
>> https://bugs.openjdk.java.net/browse/JDK-8194084),
>> >                         I agree on keeping both separate for now though
>> >                         so that we can think of both differently
>> >
>> >                                 Now, below, inlined are my answers:
>> >
>> >                                 On Fri, Feb 2, 2018 at 8:44 AM, Erik
>> >                                 Österlund
>> >                                 <erik.osterlund at oracle.com
>> >                                 <mailto:erik.osterlund at oracle.com>>
>> wrote:
>> >
>> >                                     Hi JC,
>> >
>> >                                     Hope I am reviewing the right
>> >                                     version of your work. Here goes...
>> >
>> >
>>  src/hotspot/share/gc/shared/collectedHeap.inline.hpp:
>> >
>> >                                        159
>> >
>>  AllocTracer::send_allocation_outside_tlab(klass, result, size *
>> >                                     HeapWordSize, THREAD);
>> >                                        160
>> >                                        161
>> >
>>  THREAD->tlab().handle_sample(THREAD, result, size);
>> >                                        162     return result;
>> >                                        163   }
>> >
>> >                                     Should not call tlab()->X without
>> >                                     checking if (UseTLAB) IMO.
>> >
>> >                                 Done!
>> >
>> >
>> >                             More about this later.
>> >
>> >
>>  src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp:
>> >
>> >                                     So first of all, there seems to
>> >                                     quite a few ends. There is an "end",
>> >                                     a "hard
>> >                                     end", a "slow path end", and an
>> >                                     "actual end". Moreover, it seems
>> >                                     like the
>> >                                     "hard end" is actually further away
>> >                                     than the "actual end". So the "hard
>> end"
>> >                                     seems like more of a "really
>> >                                     definitely actual end" or something.
>> >                                     I don't
>> >                                     know about you, but I think it looks
>> >                                     kind of messy. In particular, I
>> don't
>> >                                     feel like the name "actual end"
>> >                                     reflects what it represents,
>> >                                     especially when
>> >                                     there is another end that is behind
>> >                                     the "actual end".
>> >
>> >                                        413 HeapWord*
>> >                                     ThreadLocalAllocBuffer::hard_end() {
>> >                                        414   // Did a fast TLAB refill
>> >                                     occur?
>> >                                        415   if (_slow_path_end !=
>> _end) {
>> >                                        416     // Fix up the actual end
>> >                                     to be now the end of this TLAB.
>> >                                        417     _slow_path_end = _end;
>> >                                        418     _actual_end = _end;
>> >                                        419   }
>> >                                        420
>> >                                        421   return _actual_end +
>> >                                     alignment_reserve();
>> >                                        422 }
>> >
>> >                                     I really do not like making getters
>> >                                     unexpectedly have these kind of side
>> >                                     effects. It is not expected that
>> >                                     when you ask for the "hard end", you
>> >                                     implicitly update the "slow path
>> >                                     end" and "actual end" to new values.
>> >
>> >                                 As I said, a lot of this is due to the
>> >                                 FastTlabRefill. If I make this
>> >                                 not supporting FastTlabRefill, this goes
>> >                                 away. The reason the system
>> >                                 needs to update itself at the get is
>> >                                 that you only know at that get if
>> >                                 things have shifted underneath the tlab
>> >                                 slow path. I am not sure of
>> >                                 really better names (naming is hard!),
>> >                                 perhaps we could do these
>> >                                 names:
>> >
>> >                                 - current_tlab_end       // Either the
>> >                                 allocated tlab end or a sampling point
>> >                                 - last_allocation_address  // The end of
>> >                                 the tlab allocation
>> >                                 - last_slowpath_allocated_end  // In
>> >                                 case a fast refill occurred the
>> >                                 end might have changed, this is to
>> >                                 remember slow vs fast past refills
>> >
>> >                                 the hard_end method can be renamed to
>> >                                 something like:
>> >                                 tlab_end_pointer()        // The end of
>> >                                 the lab including a bit of
>> >                                 alignment reserved bytes
>> >
>> >                             Those names sound better to me. Could you
>> >                             please provide a mapping from the old names
>> >                             to the new names so I understand which one
>> >                             is which please?
>> >
>> >                             This is my current guess of what you are
>> >                             proposing:
>> >
>> >                             end -> current_tlab_end
>> >                             actual_end -> last_allocation_address
>> >                             slow_path_end -> last_slowpath_allocated_end
>> >                             hard_end -> tlab_end_pointer
>> >
>> >                         Yes that is correct, that was what I was
>> proposing.
>> >
>> >                             I would prefer this naming:
>> >
>> >                             end -> slow_path_end // the end for taking a
>> >                             slow path; either due to sampling or
>> refilling
>> >                             actual_end -> allocation_end // the end for
>> >                             allocations
>> >                             slow_path_end -> last_slow_path_end // last
>> >                             address for slow_path_end (as opposed to
>> >                             allocation_end)
>> >                             hard_end -> reserved_end // the end of the
>> >                             reserved space of the TLAB
>> >
>> >                             About setting things in the getter... that
>> >                             still seems like a very unpleasant thing to
>> >                             me. It would be better to inspect the call
>> >                             hierarchy and explicitly update the ends
>> >                             where they need updating, and assert in the
>> >                             getter that they are in sync, rather than
>> >                             implicitly setting various ends as a
>> >                             surprising side effect in a getter. It looks
>> >                             like the call hierarchy is very small. With
>> >                             my new naming convention, reserved_end()
>> >                             would presumably return _allocation_end +
>> >                             alignment_reserve(), and have an assert
>> >                             checking that _allocation_end ==
>> >                             _last_slow_path_allocation_end, complaining
>> >                             that this invariant must hold, and that a
>> >                             caller to this function, such as
>> >                             make_parsable(), must first explicitly
>> >                             synchronize the ends as required, to honor
>> >                             that invariant.
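>> >
(Editorial aside: the getter-asserts / callers-synchronize contract proposed above can be illustrated with a stripped-down standalone sketch. This is not the actual ThreadLocalAllocBuffer code; `alignment_reserve` is modeled as a constant and the class holds only the two ends under discussion.)

```cpp
#include <cassert>
#include <cstddef>

// Simplified model of the proposed TLAB end bookkeeping. Only the
// invariant matters here: reserved_end() asserts that the ends are in
// sync instead of silently repairing them as a getter side effect.
class TlabEnds {
  char* _allocation_end;        // the end for allocations
  char* _last_slow_path_end;    // last end handed out for the slow path
  static const size_t alignment_reserve = 16;  // modeled as a constant

 public:
  explicit TlabEnds(char* end)
      : _allocation_end(end), _last_slow_path_end(end) {}

  // Callers (e.g. make_parsable()) must explicitly synchronize the
  // ends first; the getter only checks the invariant.
  char* reserved_end() const {
    assert(_allocation_end == _last_slow_path_end &&
           "ends must be synchronized before computing reserved_end");
    return _allocation_end + alignment_reserve;
  }

  void set_sampling_end(char* e) { _last_slow_path_end = e; }
  void synchronize_ends()        { _allocation_end = _last_slow_path_end; }
};
```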
>> >
>> >
>> >                         I've renamed the variables to how you preferred
>> >                         it except for the _end one. I did:
>> >
>> >                         current_end
>> >
>> >                         last_allocation_address
>> >
>> >                         tlab_end_ptr
>> >
>> >                         The reason is that the architecture-dependent
>> >                         code uses the thread.hpp API, and it already has
>> >                         tlab in the name, so it becomes
>> >                         tlab_current_end (which is better than
>> >                         tlab_current_tlab_end, in my opinion).
>> >
>> >                         I also moved the update into a separate method
>> >                         with a TODO that says to remove it when
>> >                         FastTLABRefill is deprecated
>> >
>> >                     This looks a lot better now. Thanks.
>> >
>> >                     Note that the following comment now needs updating
>> >                     accordingly in threadLocalAllocBuffer.hpp:
>> >
>> >                        41 //            Heap sampling is performed via the end/actual_end fields.
>> >                        42 //            actual_end contains the real end of the tlab allocation,
>> >                        43 //            whereas end can be set to an arbitrary spot in the tlab to
>> >                        44 //            trip the return and sample the allocation.
>> >                        45 //            slow_path_end is used to track if a fast tlab refill occurred
>> >                        46 //            between slowpath calls.
>> >
>> >                     There might be other comments too, I have not looked
>> >                     in detail.
>> >
>> >                 This was the only spot that still had an actual_end, I
>> >                 fixed it now. I'll do a sweep to double check other
>> >                 comments.
>> >
>> >
>> >
>> >                                 Not sure it's better but before updating
>> >                                 the webrev, I wanted to try
>> >                                 to get input/consensus :)
>> >
>> >                                 (Note hard_end was always further off
>> >                                 than end).
>> >
>> >                                     src/hotspot/share/prims/jvmti.xml:
>> >
>> >                                     10357       <capabilityfield id="can_sample_heap" since="9">
>> >                                     10358         <description>
>> >                                     10359           Can sample the heap.
>> >                                     10360           If this capability is enabled then the heap sampling methods can be called.
>> >                                     10361         </description>
>> >                                     10362       </capabilityfield>
>> >
>> >                                     Looks like this capability should
>> >                                     not be "since 9" if it gets
>> integrated
>> >                                     now.
>> >
>> >                                 Updated now to 11, crossing my fingers
>> :)
>> >
>> >
>> >                                     src/hotspot/share/runtime/heapMonitoring.cpp:
>> >
>> >                                        448       if (is_alive->do_object_b(value)) {
>> >                                        449         // Update the oop to point to the new object if it is still alive.
>> >                                        450         f->do_oop(&(trace.obj));
>> >                                        451
>> >                                        452         // Copy the old trace, if it is still live.
>> >                                        453         _allocated_traces->at_put(curr_pos++, trace);
>> >                                        454
>> >                                        455         // Store the live trace in a cache, to be served up on /heapz.
>> >                                        456         _traces_on_last_full_gc->append(trace);
>> >                                        457
>> >                                        458         count++;
>> >                                        459       } else {
>> >                                        460         // If the old trace is no longer live, add it to the list of
>> >                                        461         // recently collected garbage.
>> >                                        462         store_garbage_trace(trace);
>> >                                        463       }
>> >
>> >                                     In the case where the oop was not
>> >                                     live, I would like it to be
>> explicitly
>> >                                     cleared.
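>> >
(Editorial aside: a standalone C++ analogy of that loop, with the dead reference explicitly cleared as requested. The GC's is_alive/do_oop closures are modeled here as plain callables; this illustrates the processing pattern, not the actual HotSpot interface.)

```cpp
#include <cstddef>
#include <vector>

// Each trace holds a pointer standing in for the sampled oop.
struct Trace { int* obj; };

// Walk the traces: live ones are adjusted and compacted in place plus
// cached; dead ones are explicitly cleared and moved to the garbage list.
template <typename IsAlive, typename Adjust>
int process_traces(std::vector<Trace>& traces,
                   std::vector<Trace>& live_cache,
                   std::vector<Trace>& garbage,
                   IsAlive is_alive, Adjust adjust) {
  int count = 0;
  size_t curr_pos = 0;
  for (Trace& trace : traces) {
    if (is_alive(trace.obj)) {
      adjust(&trace.obj);            // update the pointer, like f->do_oop
      traces[curr_pos++] = trace;    // compact live traces, like at_put
      live_cache.push_back(trace);   // served up later, like the /heapz cache
      count++;
    } else {
      trace.obj = nullptr;           // explicitly clear the dead reference
      garbage.push_back(trace);      // like store_garbage_trace
    }
  }
  traces.resize(curr_pos);
  return count;
}
```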
>> >
>> >                                 Done, I think, the way you wanted it.
>> >                                 Let me know, because I'm not familiar
>> >                                 with the RootAccess API; I'm unclear
>> >                                 whether I'm doing this right, so
>> >                                 reviews of these parts are highly
>> >                                 appreciated. Robbin had talked of
>> >                                 perhaps later pushing this all into a
>> >                                 OopStorage, should I do this now
>> >                                 do you think? Or can that wait a second
>> >                                 webrev later down the road?
>> >
>> >                             I think using handles can and should be done
>> >                             later. You can use the Access API now.
>> >                             I noticed that you are missing an #include
>> >                             "oops/access.inline.hpp" in your
>> >                             heapMonitoring.cpp file.
>> >
>> >                         The missing header is there for me, so I'm not
>> >                         sure what happened; I made sure it is present
>> >                         in the latest webrev. Sorry about that.
>> >
>> >                                 + Did I clear it the way you wanted me
>> >                                 to or were you thinking of
>> >                                 something else?
>> >
>> >
>> >                             That is precisely how I wanted it to be
>> >                             cleared. Thanks.
>> >
>> >                                 + Final question here, seems like if I
>> >                                 were to want to not do the
>> >                                 f->do_oop directly on the trace.obj, I'd
>> >                                 need to do something like:
>> >
>> >                                      f->do_oop(&value);
>> >                                      ...
>> >                                      trace->store_oop(value);
>> >
>> >                                 to update the oop internally. Is that
>> >                                 right/is that one of the
>> >                                 advantages of going to the Oopstorage
>> >                                 sooner than later?
>> >
>> >
>> >                             I think you really want to do the do_oop on
>> >                             the root directly. Is there a particular
>> >                             reason why you would not want to do that?
>> >                             Otherwise, yes - the benefit with using the
>> >                             handle approach is that you do not need to
>> >                             call do_oop explicitly in your code.
>> >
>> >                         There is no reason, except that we now have a
>> >                         load_oop and a get_oop_addr; I was not sure what
>> >                         you would think of that.
>> >
>> >                     That's fine.
>> >
>> >                                     Also I see a lot of
>> >                                     concurrent-looking use of the
>> >                                     following field:
>> >                                        267   volatile bool _initialized;
>> >
>> >                                     Please note that the "volatile"
>> >                                     qualifier does not help with
>> reordering
>> >                                     here. Reordering between volatile
>> >                                     and non-volatile fields is
>> >                                     completely free
>> >                                     for both compiler and hardware,
>> >                                     except for windows with MSVC, where
>> >                                     volatile
>> >                                     semantics is defined to use
>> >                                     acquire/release semantics, and the
>> >                                     hardware is
>> >                                     TSO. But for the general case, I
>> >                                     would expect this field to be stored
>> >                                     with
>> >                                     OrderAccess::release_store and
>> >                                     loaded with
>> OrderAccess::load_acquire.
>> >                                     Otherwise it is not thread safe.
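>> >
(Editorial aside: the release/acquire pairing suggested above maps onto standard C++ atomics roughly as follows. This is a standalone analogy; the HotSpot code itself would use the OrderAccess API, not std::atomic.)

```cpp
#include <atomic>

// Analogy for OrderAccess::release_store / OrderAccess::load_acquire:
// the release store publishes all preceding writes, and a matching
// acquire load on another thread is guaranteed to observe them.
struct MonitorState {
  int payload = 0;                  // ordinary, non-atomic field
  std::atomic<int> initialized{0};

  void initialize() {
    payload = 42;                   // must be visible before the flag flips
    initialized.store(1, std::memory_order_release);  // ~release_store
  }

  bool is_initialized() const {
    return initialized.load(std::memory_order_acquire) != 0;  // ~load_acquire
  }
};
```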
>> >
>> >                                 Because everything is behind a mutex, I
>> >                                 wasn't really worried about
>> >                                 this. I have a test that has multiple
>> >                                 threads trying to hit this
>> >                                 corner case and it passes.
>> >
>> >                                 However, to be paranoid, I updated it to
>> >                                 using the OrderAccess API
>> >                                 now, thanks! Let me know what you think
>> >                                 there too!
>> >
>> >
>> >                             If it is indeed always supposed to be read
>> >                             and written under a mutex, then I would
>> >                             strongly prefer to have it accessed as a
>> >                             normal non-volatile member, and have an
>> >                             assertion that given lock is held or we are
>> >                             in a safepoint, as we do in many other
>> >                             places. Something like this:
>> >
>> >
>> >                             assert(HeapMonitorStorage_lock->owned_by_self() ||
>> >                                    (SafepointSynchronize::is_at_safepoint() &&
>> >                                     Thread::current()->is_VM_thread()),
>> >                                    "this should not be accessed concurrently");
>> >
>> >                             It would be confusing to people reading the
>> >                             code if there are uses of OrderAccess that
>> >                             are actually always protected under a mutex.
>> >
>> >                         Thank you for the exact example to put in the
>> >                         code! I put it around each access/assignment of
>> >                         the _initialized field and found one case where
>> >                         you can touch it without holding the lock. It
>> >                         actually is "ok" because you don't act on the
>> >                         storage until later, and only when you really
>> >                         want to modify the storage (see the
>> >                         object_alloc_do_sample method, which calls the
>> >                         add_trace method).
>> >
>> >                         But, because of this, I'm going to put the
>> >                         OrderAccess here. I'll do some performance
>> >                         numbers later and, if there are issues, I might
>> >                         add an "unsafe" read and a "safe" one to make it
>> >                         explicit to the reader. But I don't think it
>> >                         will come to that.
>> >
>> >
>> >                     Okay. This double return in heapMonitoring.cpp looks
>> >                     wrong:
>> >
>> >                       283   bool initialized() {
>> >                       284     return OrderAccess::load_acquire(&_initialized) != 0;
>> >                       285     return _initialized;
>> >                       286   }
>> >
>> >                     Since you said object_alloc_do_sample() is the only
>> >                     place where you do not hold the mutex while reading
>> >                     initialized(), I had a closer look at that. It looks
>> >                     like in its current shape, the lack of a mutex may
>> >                     lead to a memory leak. In particular, it first
>> >                     checks if (initialized()). Let's assume this is now
>> >                     true. It then allocates a bunch of stuff, and checks
>> >                     if the number of frames were over 0. If they were,
>> >                     it calls StackTraceStorage::storage()->add_trace()
>> >                     seemingly hoping that after grabbing the lock in
>> >                     there, initialized() will still return true. But it
>> >                     could now return false and skip doing anything, in
>> >
>
>