JDK-8171119: Low-Overhead Heap Profiling

JC Beyler jcbeyler at google.com
Mon Apr 9 05:48:46 UTC 2018


Hi all,

Here is the new webrev:
http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.12/

with the incremental here:
http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.11_12/

After banging my head against the interactions between the runtime and the GC, I
finally got the following features up and ready:
    - The multi-agent support (with a new test)
    - The thread support (with a new test)
    - I have an assert at the moment of allocation to check that there is a
sampling collector available (though it could be disabled); a standalone
sketch of the idea follows below

    - This webrev puts the collector at the collectedHeap.inline.hpp level
(so removing all of the outer collectors)
        - Note there is one current caveat: if the agent requests the
VMObjectAlloc event, the sampler defers to that event due to a limitation
in its implementation (i.e., I am not convinced I can safely send out that
event with the VM collector enabled; I'll happily whiteboard that).
           - I updated the jvmti.xml accordingly btw.
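
For the assert mentioned above, here is a standalone sketch of the idea
(hypothetical names, not the webrev code): the collector is a scoped object
registered with the current thread, and the allocation path asserts that one
is present whenever a sampling point can be reached.

    #include <cassert>
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>

    struct ThreadSketch {
      void* sampling_collector = nullptr;   // stand-in for the VM Thread state
    };

    thread_local ThreadSketch current_thread;

    class ScopedSamplingCollector {
     public:
      ScopedSamplingCollector()  { current_thread.sampling_collector = this; }
      ~ScopedSamplingCollector() { current_thread.sampling_collector = nullptr; }
    };

    void* allocate_with_sampling_check(std::size_t size) {
      // The assert: in debug builds, fail fast if an allocation that could be
      // sampled is reached without a collector in place.
      assert(current_thread.sampling_collector != nullptr &&
             "no sampling collector on this path");
      return std::malloc(size);
    }

    int main() {
      ScopedSamplingCollector collector;   // what the collectedHeap level sets up
      void* obj = allocate_with_sampling_check(64);
      std::printf("allocated %p\n", obj);
      std::free(obj);
      return 0;
    }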

Let me know what you think!
Jc


On Fri, Apr 6, 2018 at 4:12 PM JC Beyler <jcbeyler at google.com> wrote:

> Hi Karen,
>
> Let me inline my answers, it will probably be easier :)
>
> On Fri, Apr 6, 2018 at 2:40 PM Karen Kinnear <karen.kinnear at oracle.com>
> wrote:
>
>> JC,
>>
>> Thank you for the updates - really glad you are including the compiler
>> folks. I reviewed the version before this one, so ignore
>> any comments you’ve already covered (although I did peek at the latest)
>>
>> 1. JDK-8194905 CSR - could you please delete attachments that are not
>> current. It is a bit confusing right now.
>> I have been looking at jvmti_event6.html. I am assuming the rest are
>> obsolete and could be removed please.
>>
>
> I tried but it does not allow me to do so. It seems that someone with
> administrative rights has to do it :-(. That is why I had to resort to this...
>
>
>>
>> 2. In jvmti_event6.html under Sampled Object Allocation, there is a link
>> to “Heap Sampling Monitoring System”.
>> It takes me to the top of the page - seems like something is missing in
>> defining it?
>>
>
> So the Heap Sampling Monitoring System used to have more methods, so it made
> sense to have them in a separate category. I have now moved it to the
> memory category to be consistent and grouped there. I also removed that
> link, btw.
>
>
>
>>
>> 3. Scope of memory allocation tracking
>>
>> I am struggling to understand the extent of memory allocation tracking
>> that you are looking for (probably want to
>> clarify in the JEP and CSR once we work this through).
>>
>> e.g. Heap Sampler vs. JVMTI VMObjectAllocEvent
>> So the current jvmtiVMObjectAllocEvent says:
>>
>> Sent when a method causes the VM to allocate an Object visible to Java
>> and allocation not detectable by other instrumentation mechanisms.
>> Generally detect by instrumenting bytecodes of allocating methods
>> JNI - use JNI function interception
>> e.g. Reflection: java.lang.Class.newInstance()
>> e.g. VM intrinsics
>>
>> comment:
>> Not generated due to bytecodes - e.g. new and newarray VM instructions
>> Not allocation due to JNI: e.g. AllocObject
>> NOT VM internal objects
>> NOT allocations during VM init
>>
>> So from the JEP I can’t tell the intended scope of the new event - is
>> this intended to cover all heap allocation?
>>     bytecodes
>>     JVM_*
>>    JNI_*
>>    internal VM objects
>>    other? (I’m not sure what other there are)
>>    - I presume not allocations during VM init - since sent only during
>> live phase
>>
>
> Yes, exactly: as much as possible, I am aiming to cover all heap
> allocations. Mostly, though, and in practice, I think we care about bytecodes
> and, to a lesser extent, JNI. Being independent of why the memory is being
> allocated is probably even better: this thread allocated Y, no matter
> where or why.
>
>
>>
>> OR - is the primary goal to cover allocation for bytecodes so folks can
>> skip instrumentation?
>>
>
> Yes that is the primary goal.
>
>
>> OR - do you want to get performance numbers and see what is low enough
>> overhead before deciding?
>>
>
> I think it is the same: the system is mostly in place, and my overhead
> numbers seem to indicate 0% when off, 1% when on with an empty callback to
> the user, and 3% for a naive implementation tracking live/GC'd objects.
>
>
>>
>> 4. The design question is where to put the collectors in the source base
>> - and that of course strongly depends on
>> the scope of the information you want to collect, and on the performance
>> overhead we are willing to incur.
>>
>
> Very true.
>
>
>>
>> I was trying to figure out a way to put the collectors farther down the
>> call stack so as to both catch more
>> cases and to reduce the maintenance burden - i.e. if you were to add a
>> new code generator, e.g. Graal -
>> if it were to go through an existing interface, that might be a place to
>> already have a collector.
>>
>> I do not know the Graal sources - I did look at jvmci/jvmciRuntime.cpp -
>> and it appears that there
>> are calls to instanceKlass::new_instance,
>> oopFactory::new_typeArray/new_ObjArray and ArrayKlass::multi_allocate,
>> so one possibility would be to put hooks in those calls, which would catch
>> many (I did not do a thorough search) of the slowpath calls for the
>> bytecodes, and then check the fast paths in detail.
>>
>
> I'll come to a major issue with the collector and its placement in the
> next paragraph.
>
>
>>
>> I had wondered if it made sense to move the hooks even farther down, into
>> CollectedHeap::obj_allocate and array_allocate.
>> I do not think so. The first reason is that for multidimensional arrays,
>> in ArrayKlass::multi_allocate the outer dimension array would
>> have an event before the inner sub-arrays are stored, and I don't think we
>> want that exposed, so that won't work for arrays.
>>
>
> So the major difficulty is that the collection steps go like this:
>
> - An object gets allocated and is chosen to be sampled
> - The original pointer placement (where it resides originally in memory)
> is passed to the collector
> - Now, one important thing to note:
>     (a) In the VM code, until the point where the oop is about to be
> returned, the GC is not yet aware of it
>     (b) so the collector can't yet send it out to the user via JVMTI;
> otherwise, the agent could put a weak reference on it, for example
>
> I'm a bit fuzzy on this, and maybe it's just that there would be more heavy
> lifting needed to make this possible, but my initial tests seem to show
> problems when attempting this in the obj_allocate area.
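>
> To make the constraint concrete, here is a standalone sketch (hypothetical
> names, not HotSpot code) of what the collector can do at that point: it may
> record the sampled address during allocation, but posting the JVMTI event is
> deferred until the object is fully published and visible to the GC:
>
>     #include <cstddef>
>     #include <cstdio>
>     #include <vector>
>
>     struct PendingSample { void* addr; std::size_t size; };
>
>     class DeferredSampler {              // placeholder, not the VM collector
>       std::vector<PendingSample> _pending;
>      public:
>       // Called inside the allocation path: record only, do not notify yet.
>       void record(void* addr, std::size_t size) {
>         _pending.push_back({addr, size});
>       }
>       // Called once the oop is visible to the GC: now it is safe to notify
>       // the agent (which could, for example, take a weak reference).
>       void post_all() {
>         for (const PendingSample& s : _pending)
>           std::printf("sampled allocation: %p (%zu bytes)\n", s.addr, s.size);
>         _pending.clear();
>       }
>     };
>
>     int main() {
>       DeferredSampler sampler;
>       int* obj = new int[4];             // "allocation": record only
>       sampler.record(obj, 4 * sizeof(int));
>       // ... object initialized and returned to Java here ...
>       sampler.post_all();                // deferred event posting
>       delete[] obj;
>       return 0;
>     }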
>
>
>>
>> The second reason is that I strongly suspect the scope you want is
>> bytecodes only. I think once you have added hooks
>> to all the fast paths and slow paths that this will be pushing the
>> performance overhead constraints you proposed and
>> you won’t want to see e.g. internal allocations.
>>
>
> Yes, agreed; allocations from bytecodes are generally our main concern :)
>
>
>>
>>
>> But I think you need to experiment with the set of allocations (or
>> possible alternative sets of allocations) you want recorded.
>>
>> The hooks I see today include:
>> Interpreter: (looking at x86 as a sample)
>>   - slowpath in InterpreterRuntime
>>   - fastpath tlab allocation - your new threshold check handles that
>>
>
> Agreed
>
>
>>   - allow_shared_alloc (GC specific): for _new isn’t handled
>>
>
> Where is that exactly? I can check why we are not catching it.
>
>
>>
>> C1
>>   I don’t see changes in c1_Runtime.cpp
>>   note: you also want to look for the fast path
>>
>
> I added the calls to c1_Runtime in the latest webrev, but was still going
> through testing before pushing it out. I had waited on this one a bit. Fast
> path would be handled by the threshold check, no?
>
>
>> C2: changes in opto/runtime.cpp for slow path
>>    did you also catch the fast path?
>>
>
> Fast path gets handled by the same threshold check, no? Perhaps I've
> missed something (very likely)?
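>
> To make that concrete, here is a standalone sketch (illustrative names only,
> not the webrev code) of why the compiled fast paths are covered by the
> threshold check: the end pointer the fast path compares against is the
> nearer of the real TLAB end and the next sampling point, so crossing a
> sampling point makes the inline bump-pointer check fail and the allocation
> drops into the slow path, where the sample is taken:
>
>     #include <algorithm>
>     #include <cstddef>
>     #include <cstdio>
>
>     struct TlabSketch {
>       char* top;
>       char* allocation_end;   // real end of the buffer
>       char* current_end;      // what the fast path compares against
>     };
>
>     char* fast_path_allocate(TlabSketch& t, std::size_t size) {
>       if (t.top + size <= t.current_end) {   // the check emitted by C1/C2
>         char* obj = t.top;
>         t.top += size;
>         return obj;
>       }
>       return nullptr;                        // fall into the slow path
>     }
>
>     char* slow_path_allocate(TlabSketch& t, std::size_t size,
>                              char* next_sample_point) {
>       if (t.current_end != t.allocation_end) {
>         // The fast path stopped at an artificial end: a sampling point.
>         std::printf("sampling point crossed, take a sample here\n");
>         // Re-arm: the fast-path end is the nearer of the real end and the
>         // next sampling point.
>         t.current_end = std::min(t.allocation_end, next_sample_point);
>       }
>       if (t.top + size <= t.allocation_end) {
>         char* obj = t.top;
>         t.top += size;
>         return obj;
>       }
>       return nullptr;                        // a genuine refill is needed
>     }
>
>     int main() {
>       char buffer[1024];
>       TlabSketch t{buffer, buffer + 1024, buffer + 128};  // sample at 128 bytes
>       for (int i = 0; i < 6; i++) {
>         char* p = fast_path_allocate(t, 64);
>         if (p == nullptr) p = slow_path_allocate(t, 64, buffer + 1024);
>         std::printf("allocated at offset %td\n", p - buffer);
>       }
>       return 0;
>     }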
>
>
>>
>> 5. Performance -
>> After you get all the collectors added - you need to rerun the
>> performance numbers.
>>
>
> Agreed :)
>
>
>>
>> thanks,
>> Karen
>>
>> On Apr 5, 2018, at 2:15 PM, JC Beyler <jcbeyler at google.com> wrote:
>>
>> Thanks Boris and Derek for testing it.
>>
>> Yes, I was trying to get a new version out that had the tests ported as
>> well, but I got sidetracked while trying to add tests and two new features.
>>
>> Here is the incremental webrev:
>>
>> Here is the full webrev:
>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.11/
>>
>> Basically, the new tests assert this:
>>   - Only one agent can currently ask for the sampling; I'm currently
>> seeing if I can push the multi-agent support to a next webrev so that I can
>> start doing a code freeze on this one
>>   - The event is not thread-enabled, meaning that, like the
>> VMObjectAlloc event, it's an all-or-nothing event; as with the
>> multi-agent support, I'm going to see if adding it in a future webrev is a
>> better idea, in order to freeze this webrev a bit
>>
>> There was another item that I added here and I'm unsure this webrev is
>> stable in debug mode: I added an assertion system to ascertain that all
>> paths leading to a TLAB slow path (and hence a sampling point) have a
>> sampling collector ready to post the event if a user wants it. This might
>> break a few things in debug mode as I'm working through the kinks of that as
>> well. However, in release mode, this new webrev passes all the tests in
>> hotspot/jtreg/serviceability/jvmti/HeapMonitor.
>>
>> Let me know what you think,
>> Jc
>>
>> On Thu, Apr 5, 2018 at 4:56 AM Boris Ulasevich <
>> boris.ulasevich at bell-sw.com> wrote:
>>
>>> Hi JC,
>>>
>>>    I have just checked on arm32: your patch compiles and runs ok.
>>>
>>>    As far as I can see, the jtreg agentlib name "-agentlib:HeapMonitor" does not
>>> correspond to the actual library name: libHeapMonitorTest.c ->
>>> libHeapMonitorTest.so
>>>
>>> Boris
>>>
>>> On 04.04.2018 01:54, White, Derek wrote:
>>> > Thanks JC,
>>> >
>>> > New patch applies cleanly. Compiles and runs (simple test programs) on
>>> > aarch64.
>>> >
>>> >   * Derek
>>> >
>>> > *From:* JC Beyler [mailto:jcbeyler at google.com]
>>> > *Sent:* Monday, April 02, 2018 1:17 PM
>>> > *To:* White, Derek <Derek.White at cavium.com>
>>> > *Cc:* Erik Österlund <erik.osterlund at oracle.com>;
>>> > serviceability-dev at openjdk.java.net; hotspot-compiler-dev
>>> > <hotspot-compiler-dev at openjdk.java.net>
>>> > *Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling
>>> >
>>> > Hi Derek,
>>> >
>>> > I know there were a few things that went in that provoked a merge
>>> > conflict. I worked on it and got it up to date. Sadly, my lack of
>>> > knowledge made it a full rebase instead of keeping all the history.
>>> > However, with a newly cloned jdk/hs you should now be able to use:
>>> >
>>> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/
>>> >
>>> > The change you are referring to was done with the others, so perhaps you
>>> > were unlucky and I forgot it in one webrev and fixed it in another? I
>>> > don't know, but it's been there; I checked, and it is here:
>>> >
>>> >
>>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp.udiff.html
>>> >
>>> > I double checked that tlab_end_offset no longer appears in any
>>> > architecture (as far as I can tell :)).
>>> >
>>> > Thanks for testing and let me know if you run into any other issues!
>>> >
>>> > Jc
>>> >
>>> > On Fri, Mar 30, 2018 at 4:24 PM White, Derek <Derek.White at cavium.com
>>> > <mailto:Derek.White at cavium.com>> wrote:
>>> >
>>> >     Hi Jc,
>>> >
>>> >     I’ve been having trouble getting your patch to apply correctly. I
>>> >     may have based it on the wrong version.
>>> >
>>> >     In any case, I think there’s a missing update to
>>> >     macroAssembler_aarch64.cpp, in MacroAssembler::tlab_allocate(),
>>> >     where “JavaThread::tlab_end_offset()” should become
>>> >     “JavaThread::tlab_current_end_offset()”.
>>> >
>>> >     This should correspond to the other port’s changes in
>>> >     templateTable_<cpu>.cpp files.
>>> >
>>> >     Thanks!
>>> >     - Derek
>>> >
>>> >     *From:* hotspot-compiler-dev
>>> >     [mailto:hotspot-compiler-dev-bounces at openjdk.java.net
>>> >     <mailto:hotspot-compiler-dev-bounces at openjdk.java.net>] *On Behalf
>>> >     Of *JC Beyler
>>> >     *Sent:* Wednesday, March 28, 2018 11:43 AM
>>> >     *To:* Erik Österlund <erik.osterlund at oracle.com
>>> >     <mailto:erik.osterlund at oracle.com>>
>>> >     *Cc:* serviceability-dev at openjdk.java.net
>>> >     <mailto:serviceability-dev at openjdk.java.net>; hotspot-compiler-dev
>>> >     <hotspot-compiler-dev at openjdk.java.net
>>> >     <mailto:hotspot-compiler-dev at openjdk.java.net>>
>>> >     *Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling
>>> >
>>> >     Hi all,
>>> >
>>> >     I've been working mostly on deflaking the tests and on the wording in
>>> >     the JVMTI spec.
>>> >
>>> >     Here are the two incremental webrevs:
>>> >
>>> >     http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.5_6/
>>> >
>>> >     http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.06_07/
>>> >
>>> >     Here is the total webrev:
>>> >
>>> >     http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.07/
>>> >
>>> >     Here are the notes of this change:
>>> >
>>> >        - Currently the tests pass 100 times in a row; I am working on
>>> >     checking if they pass 1000 times in a row.
>>> >
>>> >        - The default sampling rate is set to 512k; this is what we use
>>> >     internally, and having a default means that to enable the sampling
>>> >     with the default, the user only has to do an enable event/disable
>>> >     event via JVMTI (instead of enable + set sample rate); see the agent
>>> >     sketch after these notes.
>>> >
>>> >        - I dropped the code that was handling the fast path TLAB
>>> >     refill, since FastTLABRefill is now deprecated
>>> >
>>> >            - Though I saw that Graal is still using it, so I have to see
>>> >     what needs to be done there exactly
>>> >
>>> >     Finally, using the Dacapo benchmark suite, I noted a 1% overhead
>>> >     when the event system is turned on and the callback to the native
>>> >     agent is just empty. I got a 3% overhead with a 512k sampling rate
>>> >     with the code I put in the native side of my tests.
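>>> >
>>> >     As a concrete illustration of the default mentioned in the notes
>>> >     above, here is a minimal agent sketch written against the proposed
>>> >     spec (the can_sample_heap capability name comes from the draft
>>> >     jvmti.xml; the event constant and callback field names below are
>>> >     assumptions and may differ in the final version):
>>> >
>>> >       #include <jvmti.h>
>>> >       #include <cstring>
>>> >
>>> >       // Empty callback: roughly the 1% overhead configuration above.
>>> >       static void JNICALL SampledObjectAlloc(jvmtiEnv* jvmti, JNIEnv* jni,
>>> >                                              jthread thread, jobject object,
>>> >                                              jclass klass, jlong size) {
>>> >       }
>>> >
>>> >       JNIEXPORT jint JNICALL
>>> >       Agent_OnLoad(JavaVM* vm, char* options, void* reserved) {
>>> >         jvmtiEnv* jvmti = nullptr;
>>> >         if (vm->GetEnv(reinterpret_cast<void**>(&jvmti),
>>> >                        JVMTI_VERSION_1_2) != JNI_OK) {
>>> >           return JNI_ERR;
>>> >         }
>>> >
>>> >         jvmtiCapabilities caps;
>>> >         std::memset(&caps, 0, sizeof(caps));
>>> >         caps.can_sample_heap = 1;              // capability from the draft
>>> >         jvmti->AddCapabilities(&caps);
>>> >
>>> >         jvmtiEventCallbacks callbacks;
>>> >         std::memset(&callbacks, 0, sizeof(callbacks));
>>> >         callbacks.SampledObjectAlloc = &SampledObjectAlloc;  // assumed name
>>> >         jvmti->SetEventCallbacks(&callbacks, sizeof(callbacks));
>>> >
>>> >         // No SetHeapSampling call needed: the 512k default rate applies.
>>> >         jvmti->SetEventNotificationMode(JVMTI_ENABLE,
>>> >                                         JVMTI_EVENT_SAMPLED_OBJECT_ALLOC,
>>> >                                         nullptr);  // assumed constant name
>>> >         return JNI_OK;
>>> >       }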
>>> >
>>> >     Thanks and comments are appreciated,
>>> >
>>> >     Jc
>>> >
>>> >     On Mon, Mar 19, 2018 at 2:06 PM JC Beyler <jcbeyler at google.com
>>> >     <mailto:jcbeyler at google.com>> wrote:
>>> >
>>> >         Hi all,
>>> >
>>> >         The incremental webrev update is here:
>>> >
>>> >         http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event4_5/
>>> >
>>> >         The full webrev is here:
>>> >
>>> >         http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/
>>> >
>>> >         Major change here is:
>>> >
>>> >            - I've removed the heapMonitoring.cpp code in favor of just
>>> >         having the sampling events, as per Serguei's request; I still
>>> >         have to do some overhead measurements, but the tests prove the
>>> >         concept can work
>>> >
>>> >                 - Most of the tlab code is unchanged; the only major
>>> >         change is that now things get sent off to event collectors when
>>> >         used and enabled.
>>> >
>>> >            - Added the interpreter collectors to handle interpreter
>>> >         execution
>>> >
>>> >            - Updated the name from SetTlabHeapSampling to
>>> >         SetHeapSampling to be more generic
>>> >
>>> >            - Added a mutex for the thread sampling so that we can
>>> >         initialize an internal static array safely
>>> >
>>> >            - Ported the tests from the old system to this new one
>>> >
>>> >         I've also updated the JEP and CSR to reflect these changes:
>>> >
>>> >         https://bugs.openjdk.java.net/browse/JDK-8194905
>>> >
>>> >         https://bugs.openjdk.java.net/browse/JDK-8171119
>>> >
>>> >         In order to make this have some forward progress, I've removed
>>> >         the heap sampling code entirely and now rely entirely on the
>>> >         event sampling system. The tests reflect this by using a
>>> >         simplified implementation of what an agent could do:
>>> >
>>> >
>>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/libHeapMonitor.c
>>> >
>>> >         (Search for anything mentioning event_storage).
>>> >
>>> >         I have not taken the time to port the whole code we had
>>> >         originally in heapMonitoring to this. I hesitate only because
>>> >         that code was in C++; I'd have to port it to C, and since this is
>>> >         for tests, perhaps what I have now is good enough?
>>> >
>>> >         As far as testing goes, I've ported all the relevant tests and
>>> >         then added a few:
>>> >
>>> >             - Turning the system on/off
>>> >
>>> >             - Testing using various GCs
>>> >
>>> >             - Testing using the interpreter
>>> >
>>> >             - Testing the sampling rate
>>> >
>>> >             - Testing with objects and arrays
>>> >
>>> >             - Testing with various threads
>>> >
>>> >         Finally, as far as overhead goes, I have the numbers for the system
>>> >         off vs. a clean build and I see 0% overhead, which is what we'd
>>> >         want. This was using the Dacapo benchmarks. I am now preparing
>>> >         to run a version with the events on using Dacapo and will report
>>> >         back here.
>>> >
>>> >         Any comments are welcome :)
>>> >
>>> >         Jc
>>> >
>>> >         On Thu, Mar 8, 2018 at 4:00 PM JC Beyler <jcbeyler at google.com
>>> >         <mailto:jcbeyler at google.com>> wrote:
>>> >
>>> >             Hi all,
>>> >
>>> >             I apologize for the delay, but I wanted to add an event
>>> >             system, and that took a bit longer than expected; I also
>>> >             reworked the code to take into account the deprecation of
>>> >             FastTLABRefill.
>>> >
>>> >             This update has four parts:
>>> >
>>> >             A) I moved the implementation from Thread to
>>> >             ThreadHeapSampler inside of Thread. Would you prefer it as a
>>> >             pointer inside of Thread, or does it work for you like this? A
>>> >             second question would be: would you rather have an association
>>> >             outside of Thread altogether that tries to remember when
>>> >             threads are live, and then we would have something like:
>>> >
>>> >             ThreadHeapSampler::get_sampling_size(this_thread);
>>> >
>>> >             I worry about the overhead of this, but perhaps it is not
>>> >             too bad?
>>> >
>>> >             B) I have also been working on the Allocation event system
>>> >             that sends out a notification at each sampled event. This
>>> >             will be practical when wanting to do something at the
>>> >             allocation point. I'm also looking at whether the whole
>>> >             heapMonitoring code could reside in the agent code rather than
>>> >             in the JDK. I'm not convinced, but I'm talking to Serguei
>>> >             about it to see/assess :)
>>> >
>>> >                 - Also added two tests for the new event subsystem
>>> >
>>> >             C) Removed the slow_path fields inside the TLAB code since
>>> >             now FastTLABRefill is deprecated
>>> >
>>> >             D) Updated the JVMTI documentation and specification for
>>> the
>>> >             methods.
>>> >
>>> >             So the incremental webrev is here:
>>> >
>>> >             http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.09_10/
>>> >
>>> >             and the full webrev is here:
>>> >
>>> >             http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.10
>>> >
>>> >             I believe I have updated the various JIRA issues that track
>>> >             this :)
>>> >
>>> >             Thanks for your input,
>>> >
>>> >             Jc
>>> >
>>> >             On Wed, Feb 14, 2018 at 10:34 PM, JC Beyler
>>> >             <jcbeyler at google.com <mailto:jcbeyler at google.com>> wrote:
>>> >
>>> >                 Hi Erik,
>>> >
>>> >                 I inlined my answers, the last of which seems to answer
>>> >                 Robbin's concerns about the same thing (adding things to
>>> >                 Thread).
>>> >
>>> >                 On Wed, Feb 14, 2018 at 2:51 AM, Erik Österlund
>>> >                 <erik.osterlund at oracle.com
>>> >                 <mailto:erik.osterlund at oracle.com>> wrote:
>>> >
>>> >                     Hi JC,
>>> >
>>> >                     Comments are inlined below.
>>> >
>>> >                     On 2018-02-13 06:18, JC Beyler wrote:
>>> >
>>> >                         Hi Erik,
>>> >
>>> >                         Thanks for your answers, I've now inlined my
>>> own
>>> >                         answers/comments.
>>> >
>>> >                         I've done a new webrev here:
>>> >
>>> >
>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.08/
>>> >                         <
>>> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.08/>
>>> >
>>> >                         The incremental is here:
>>> >
>>> >
>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/
>>> >                         <
>>> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.07_08/>
>>> >
>>> >                         Note to all:
>>> >
>>> >                            - I've been integrating changes from
>>> >                         Erin/Serguei/David comments so this webrev
>>> >                         incremental is a bit an answer to all comments
>>> >                         in one. I apologize for that :)
>>> >
>>> >                         On Mon, Feb 12, 2018 at 6:05 AM, Erik Österlund
>>> >                         <erik.osterlund at oracle.com
>>> >                         <mailto:erik.osterlund at oracle.com>> wrote:
>>> >
>>> >                             Hi JC,
>>> >
>>> >                             Sorry for the delayed reply.
>>> >
>>> >                             Inlined answers:
>>> >
>>> >
>>> >
>>> >                             On 2018-02-06 00:04, JC Beyler wrote:
>>> >
>>> >                                 Hi Erik,
>>> >
>>> >                                 (Renaming this to be folded into the
>>> >                                 newly renamed thread :))
>>> >
>>> >                                 First off, thanks a lot for reviewing
>>> >                                 the webrev! I appreciate it!
>>> >
>>> >                                 I updated the webrev to:
>>> >
>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/
>>> >                                 <
>>> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.05a/>
>>> >
>>> >                                 And the incremental one is here:
>>> >
>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.04_05a/
>>> >                                 <
>>> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.04_05a/>
>>> >
>>> >                                 It contains:
>>> >                                 - The change of "since" from 9 to 11 for
>>> >                                 the jvmti.xml
>>> >                                 - The use of the OrderAccess for
>>> initialized
>>> >                                 - Clearing the oop
>>> >
>>> >                                 I also have inlined my answers to your
>>> >                                 comments. The biggest question
>>> >                                 will come from the multiple *_end
>>> >                                 variables. A bit of the logic there
>>> >                                 is due to handling the slow path refill
>>> >                                 vs fast path refill and
>>> >                                 checking that the rug was not pulled
>>> >                                 underneath the slowpath. I
>>> >                                 believe that a previous comment was
>>> that
>>> >                                 TlabFastRefill was going to
>>> >                                 be deprecated.
>>> >
>>> >                                 If this is true, we could revert this
>>> >                                 code a bit and just do: if
>>> >                                 TlabFastRefill is enabled, disable this.
>>> >                                 And then deprecate that when
>>> >                                 TlabFastRefill is deprecated.
>>> >
>>> >                                 This might simplify this webrev and I
>>> >                                 can work on a follow-up that
>>> >                                 either: removes TlabFastRefill if
>>> Robbin
>>> >                                 does not have the time to do
>>> >                                 it or add the support to the assembly
>>> >                                 side to handle this correctly.
>>> >                                 What do you think?
>>> >
>>> >                             I support removing TlabFastRefill, but I
>>> >                             think it is good to not depend on that
>>> >                             happening first.
>>> >
>>> >
>>> >                         I'm slowly pushing on the FastTLABRefill
>>> >                         (
>>> https://bugs.openjdk.java.net/browse/JDK-8194084),
>>> >                         I agree on keeping both separate for now though
>>> >                         so that we can think of both differently
>>> >
>>> >                                 Now, below, inlined are my answers:
>>> >
>>> >                                 On Fri, Feb 2, 2018 at 8:44 AM, Erik
>>> >                                 Österlund
>>> >                                 <erik.osterlund at oracle.com
>>> >                                 <mailto:erik.osterlund at oracle.com>>
>>> wrote:
>>> >
>>> >                                     Hi JC,
>>> >
>>> >                                     Hope I am reviewing the right
>>> >                                     version of your work. Here goes...
>>> >
>>> >
>>> >                                     src/hotspot/share/gc/shared/collectedHeap.inline.hpp:
>>> >
>>> >                                        159     AllocTracer::send_allocation_outside_tlab(klass, result, size * HeapWordSize, THREAD);
>>> >                                        160
>>> >                                        161     THREAD->tlab().handle_sample(THREAD, result, size);
>>> >                                        162     return result;
>>> >                                        163   }
>>> >
>>> >                                     Should not call tlab()->X without
>>> >                                     checking if (UseTLAB) IMO.
>>> >
>>> >                                 Done!
>>> >
>>> >
>>> >                             More about this later.
>>> >
>>> >
>>>  src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp:
>>> >
>>> >                                     So first of all, there seem to be
>>> >                                     quite a few ends. There is an "end",
>>> >                                     a "hard
>>> >                                     end", a "slow path end", and an
>>> >                                     "actual end". Moreover, it seems
>>> >                                     like the
>>> >                                     "hard end" is actually further away
>>> >                                     than the "actual end". So the
>>> "hard end"
>>> >                                     seems like more of a "really
>>> >                                     definitely actual end" or
>>> something.
>>> >                                     I don't
>>> >                                     know about you, but I think it
>>> looks
>>> >                                     kind of messy. In particular, I
>>> don't
>>> >                                     feel like the name "actual end"
>>> >                                     reflects what it represents,
>>> >                                     especially when
>>> >                                     there is another end that is behind
>>> >                                     the "actual end".
>>> >
>>> >                                        413 HeapWord* ThreadLocalAllocBuffer::hard_end() {
>>> >                                        414   // Did a fast TLAB refill occur?
>>> >                                        415   if (_slow_path_end != _end) {
>>> >                                        416     // Fix up the actual end to be now the end of this TLAB.
>>> >                                        417     _slow_path_end = _end;
>>> >                                        418     _actual_end = _end;
>>> >                                        419   }
>>> >                                        420
>>> >                                        421   return _actual_end + alignment_reserve();
>>> >                                        422 }
>>> >
>>> >                                     I really do not like making getters
>>> >                                     unexpectedly have these kinds of side
>>> >                                     effects. It is not expected that
>>> >                                     when you ask for the "hard end",
>>> you
>>> >                                     implicitly update the "slow path
>>> >                                     end" and "actual end" to new
>>> values.
>>> >
>>> >                                 As I said, a lot of this is due to the
>>> >                                 FastTlabRefill. If I make this
>>> >                                 not support FastTlabRefill, this goes
>>> >                                 away. The reason the system
>>> >                                 needs to update itself at the get is
>>> >                                 that you only know at that get if
>>> >                                 things have shifted underneath the tlab
>>> >                                 slow path. I am not sure of
>>> >                                 really better names (naming is hard!),
>>> >                                 perhaps we could do these
>>> >                                 names:
>>> >
>>> >                                 - current_tlab_end       // Either the allocated tlab end or a sampling point
>>> >                                 - last_allocation_address  // The end of the tlab allocation
>>> >                                 - last_slowpath_allocated_end  // In case a fast refill occurred the
>>> >                                   end might have changed, this is to remember slow vs fast past refills
>>> >
>>> >                                 the hard_end method can be renamed to something like:
>>> >                                 tlab_end_pointer()        // The end of the lab including a bit of
>>> >                                   alignment reserved bytes
>>> >
>>> >                             Those names sound better to me. Could you
>>> >                             please provide a mapping from the old names
>>> >                             to the new names so I understand which one
>>> >                             is which please?
>>> >
>>> >                             This is my current guess of what you are
>>> >                             proposing:
>>> >
>>> >                             end -> current_tlab_end
>>> >                             actual_end -> last_allocation_address
>>> >                             slow_path_end ->
>>> last_slowpath_allocated_end
>>> >                             hard_end -> tlab_end_pointer
>>> >
>>> >                         Yes that is correct, that was what I was
>>> proposing.
>>> >
>>> >                             I would prefer this naming:
>>> >
>>> >                             end -> slow_path_end // the end for taking
>>> a
>>> >                             slow path; either due to sampling or
>>> refilling
>>> >                             actual_end -> allocation_end // the end for
>>> >                             allocations
>>> >                             slow_path_end -> last_slow_path_end // last
>>> >                             address for slow_path_end (as opposed to
>>> >                             allocation_end)
>>> >                             hard_end -> reserved_end // the end of the
>>> >                             reserved space of the TLAB
>>> >
>>> >                             About setting things in the getter... that
>>> >                             still seems like a very unpleasant thing to
>>> >                             me. It would be better to inspect the call
>>> >                             hierarchy and explicitly update the ends
>>> >                             where they need updating, and assert in the
>>> >                             getter that they are in sync, rather than
>>> >                             implicitly setting various ends as a
>>> >                             surprising side effect in a getter. It
>>> looks
>>> >                             like the call hierarchy is very small. With
>>> >                             my new naming convention, reserved_end()
>>> >                             would presumably return _allocation_end +
>>> >                             alignment_reserve(), and have an assert
>>> >                             checking that _allocation_end ==
>>> >                             _last_slow_path_allocation_end, complaining
>>> >                             that this invariant must hold, and that a
>>> >                             caller to this function, such as
>>> >                             make_parsable(), must first explicitly
>>> >                             synchronize the ends as required, to honor
>>> >                             that invariant.
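>>> >
>>> >                             A tiny standalone sketch of that shape
>>> >                             (simplified, using the proposed names; not the
>>> >                             actual TLAB class): the getter only asserts the
>>> >                             invariant, and callers such as make_parsable()
>>> >                             synchronize the ends explicitly first.
>>> >
>>> >       #include <cassert>
>>> >
>>> >       class TlabEndsSketch {
>>> >         char* _allocation_end;
>>> >         char* _last_slow_path_end;
>>> >         static const int _alignment_reserve = 8;
>>> >        public:
>>> >         explicit TlabEndsSketch(char* end)
>>> >           : _allocation_end(end), _last_slow_path_end(end) {}
>>> >
>>> >         // Explicit sync, called by e.g. make_parsable() before reserved_end().
>>> >         void synchronize_ends() { _last_slow_path_end = _allocation_end; }
>>> >
>>> >         char* reserved_end() const {
>>> >           // The getter only checks the invariant; it no longer repairs it.
>>> >           assert(_allocation_end == _last_slow_path_end && "ends out of sync");
>>> >           return _allocation_end + _alignment_reserve;
>>> >         }
>>> >       };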
>>> >
>>> >
>>> >                         I've renamed the variables to how you preferred
>>> >                         it except for the _end one. I did:
>>> >
>>> >                         current_end
>>> >
>>> >                         last_allocation_address
>>> >
>>> >                         tlab_end_ptr
>>> >
>>> >                         The reason is that the architecture dependent
>>> >                         code uses the thread.hpp API and it already has
>>> >                         tlab included in the name, so it becomes
>>> >                         tlab_current_end (which is better than
>>> >                         tlab_current_tlab_end in my opinion).
>>> >
>>> >                         I also moved the update into a separate method
>>> >                         with a TODO that says to remove it when
>>> >                         FastTLABRefill is deprecated
>>> >
>>> >                     This looks a lot better now. Thanks.
>>> >
>>> >                     Note that the following comment now needs updating
>>> >                     accordingly in threadLocalAllocBuffer.hpp:
>>> >
>>> >                        41 //            Heap sampling is performed via the end/actual_end fields.
>>> >                        42 //            actual_end contains the real end of the tlab allocation,
>>> >                        43 //            whereas end can be set to an arbitrary spot in the tlab to
>>> >                        44 //            trip the return and sample the allocation.
>>> >                        45 //            slow_path_end is used to track if a fast tlab refill occured
>>> >                        46 //            between slowpath calls.
>>> >
>>> >                     There might be other comments too, I have not
>>> looked
>>> >                     in detail.
>>> >
>>> >                 This was the only spot that still had an actual_end, I
>>> >                 fixed it now. I'll do a sweep to double check other
>>> >                 comments.
>>> >
>>> >
>>> >
>>> >                                 Not sure it's better but before
>>> updating
>>> >                                 the webrev, I wanted to try
>>> >                                 to get input/consensus :)
>>> >
>>> >                                 (Note hard_end was always further off
>>> >                                 than end).
>>> >
>>> >                                     src/hotspot/share/prims/jvmti.xml:
>>> >
>>> >                                     10357       <capabilityfield id="can_sample_heap" since="9">
>>> >                                     10358         <description>
>>> >                                     10359           Can sample the heap.
>>> >                                     10360           If this capability is enabled then the heap sampling methods can be called.
>>> >                                     10361         </description>
>>> >                                     10362       </capabilityfield>
>>> >
>>> >                                     Looks like this capability should
>>> >                                     not be "since 9" if it gets
>>> integrated
>>> >                                     now.
>>> >
>>> >                                 Updated now to 11, crossing my fingers
>>> :)
>>> >
>>> >
>>> >                                     src/hotspot/share/runtime/heapMonitoring.cpp:
>>> >
>>> >                                        448       if (is_alive->do_object_b(value)) {
>>> >                                        449         // Update the oop to point to the new object if it is still alive.
>>> >                                        450         f->do_oop(&(trace.obj));
>>> >                                        451
>>> >                                        452         // Copy the old trace, if it is still live.
>>> >                                        453         _allocated_traces->at_put(curr_pos++, trace);
>>> >                                        454
>>> >                                        455         // Store the live trace in a cache, to be served up on /heapz.
>>> >                                        456         _traces_on_last_full_gc->append(trace);
>>> >                                        457
>>> >                                        458         count++;
>>> >                                        459       } else {
>>> >                                        460         // If the old trace is no longer live, add it to the list of
>>> >                                        461         // recently collected garbage.
>>> >                                        462         store_garbage_trace(trace);
>>> >                                        463       }
>>> >
>>> >                                     In the case where the oop was not
>>> >                                     live, I would like it to be
>>> explicitly
>>> >                                     cleared.
>>> >
>>> >                                 Done, I think, the way you wanted it. Let me
>>> >                                 know, because I'm not familiar
>>> >                                 with the RootAccess API. I'm unclear whether
>>> >                                 I'm doing this right or not, so
>>> >                                 reviews of these parts are highly
>>> >                                 appreciated. Robbin had talked of
>>> >                                 perhaps later pushing this all into an
>>> >                                 OopStorage; should I do this now,
>>> >                                 do you think? Or can that wait for a second
>>> >                                 webrev later down the road?
>>> >
>>> >                             I think using handles can and should be
>>> done
>>> >                             later. You can use the Access API now.
>>> >                             I noticed that you are missing an #include
>>> >                             "oops/access.inline.hpp" in your
>>> >                             heapMonitoring.cpp file.
>>> >
>>> >                         The missing header is there for me so I don't
>>> >                         know, I made sure it is present in the latest
>>> >                         webrev. Sorry about that.
>>> >
>>> >                                 + Did I clear it the way you wanted me
>>> >                                 to or were you thinking of
>>> >                                 something else?
>>> >
>>> >
>>> >                             That is precisely how I wanted it to be
>>> >                             cleared. Thanks.
>>> >
>>> >                                 + Final question here, seems like if I
>>> >                                 were to want to not do the
>>> >                                 f->do_oop directly on the trace.obj,
>>> I'd
>>> >                                 need to do something like:
>>> >
>>> >                                      f->do_oop(&value);
>>> >                                      ...
>>> >                                      trace->store_oop(value);
>>> >
>>> >                                 to update the oop internally. Is that
>>> >                                 right/is that one of the
>>> >                                 advantages of going to the Oopstorage
>>> >                                 sooner than later?
>>> >
>>> >
>>> >                             I think you really want to do the do_oop on
>>> >                             the root directly. Is there a particular
>>> >                             reason why you would not want to do that?
>>> >                             Otherwise, yes - the benefit with using the
>>> >                             handle approach is that you do not need to
>>> >                             call do_oop explicitly in your code.
>>> >
>>> >                         There is no reason except that now we have a
>>> >                         load_oop and a get_oop_addr, I was not sure
>>> what
>>> >                         you would think of that.
>>> >
>>> >                     That's fine.
>>> >
>>> >                                     Also I see a lot of
>>> >                                     concurrent-looking use of the
>>> >                                     following field:
>>> >                                        267   volatile bool
>>> _initialized;
>>> >
>>> >                                     Please note that the "volatile"
>>> >                                     qualifier does not help with
>>> reordering
>>> >                                     here. Reordering between volatile
>>> >                                     and non-volatile fields is
>>> >                                     completely free
>>> >                                     for both compiler and hardware,
>>> >                                     except for windows with MSVC, where
>>> >                                     volatile
>>> >                                     semantics is defined to use
>>> >                                     acquire/release semantics, and the
>>> >                                     hardware is
>>> >                                     TSO. But for the general case, I
>>> >                                     would expect this field to be
>>> stored
>>> >                                     with
>>> >                                     OrderAccess::release_store and
>>> >                                     loaded with
>>> OrderAccess::load_acquire.
>>> >                                     Otherwise it is not thread safe.
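>>> >
>>> >                                     For reference, a standalone C++11
>>> >                                     illustration of that pairing
>>> >                                     (HotSpot's OrderAccess::release_store /
>>> >                                     load_acquire play the same role as the
>>> >                                     std::atomic operations here):
>>> >
>>> >       #include <atomic>
>>> >
>>> >       struct StorageSketch {
>>> >         int data = 0;
>>> >         std::atomic<int> initialized{0};
>>> >
>>> >         void init() {
>>> >           data = 42;                                        // initialize first...
>>> >           initialized.store(1, std::memory_order_release);  // ...then publish
>>> >         }
>>> >         bool ready() const {
>>> >           // The acquire load makes the preceding initialization visible.
>>> >           return initialized.load(std::memory_order_acquire) != 0;
>>> >         }
>>> >       };
>>> >
>>> >       int main() {
>>> >         StorageSketch s;
>>> >         s.init();
>>> >         return s.ready() ? 0 : 1;
>>> >       }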
>>> >
>>> >                                 Because everything is behind a mutex, I
>>> >                                 wasn't really worried about
>>> >                                 this. I have a test that has multiple
>>> >                                 threads trying to hit this
>>> >                                 corner case and it passes.
>>> >
>>> >                                 However, to be paranoid, I updated it
>>> to
>>> >                                 using the OrderAccess API
>>> >                                 now, thanks! Let me know what you think
>>> >                                 there too!
>>> >
>>> >
>>> >                             If it is indeed always supposed to be read
>>> >                             and written under a mutex, then I would
>>> >                             strongly prefer to have it accessed as a
>>> >                             normal non-volatile member, and have an
>>> >                             assertion that given lock is held or we are
>>> >                             in a safepoint, as we do in many other
>>> >                             places. Something like this:
>>> >
>>> >
>>> >                             assert(HeapMonitorStorage_lock->owned_by_self()
>>> >                                    || (SafepointSynchronize::is_at_safepoint()
>>> >                                        && Thread::current()->is_VM_thread()),
>>> >                                    "this should not be accessed concurrently");
>>> >
>>> >                             It would be confusing to people reading the
>>> >                             code if there are uses of OrderAccess that
>>> >                             are actually always protected under a
>>> mutex.
>>> >
>>> >                         Thank you for the exact example to be put in the
>>> >                         code! I put it around each access/assignment of
>>> >                         the _initialized field and found one case where
>>> >                         you can indeed touch it without holding the lock. It
>>> >                         actually is "ok" because you don't act on the
>>> >                         storage until later, and only when you really
>>> >                         want to modify the storage (see the
>>> >                         object_alloc_do_sample method, which calls the
>>> >                         add_trace method).
>>> >
>>> >                         But, because of this, I'm going to put the
>>> >                         OrderAccess here; I'll do some performance
>>> >                         numbers later and if there are issues, I might
>>> >                         add an "unsafe" read and a "safe" one to make it
>>> >                         explicit to the reader. But I don't think it
>>> >                         will come to that.
>>> >
>>> >
>>> >                     Okay. This double return in heapMonitoring.cpp looks
>>> >                     wrong:
>>> >
>>> >                       283   bool initialized() {
>>> >                       284     return OrderAccess::load_acquire(&_initialized) != 0;
>>> >                       285     return _initialized;
>>> >                       286   }
>>> >
>>> >                     Since you said object_alloc_do_sample() is the only
>>> >                     place where you do not hold the mutex while reading
>>> >                     initialized(), I had a closer look at that. It
>>> looks
>>> >                     like in its current shape, the lack of a mutex may
>>> >                     lead to a memory leak. In particular, it first
>>> >                     checks if (initialized()). Let's assume this is now
>>> >                     true. It then allocates a bunch of stuff, and
>>> checks
>>> >                     if the number of frames were over 0. If they were,
>>> >                     it calls StackTraceStorage::storage()->add_trace()
>>> >                     seemingly hoping that after grabbing the lock in
>>> >                     there, initialized() will still return true. But it
>>> >                     could now return false and skip doing anything, in
>>> >
>>
>>