JDK-8171119: Low-Overhead Heap Profiling
JC Beyler
jcbeyler at google.com
Thu Apr 12 00:17:04 UTC 2018
Hi Karen,
I put up a new webrev that is, in my mind, feature complete in terms of
implementation. There could be a few tidbits of optimization here and
there, but I believe everything is now there feature-wise; what remains is
the question of collector placement (I've now put it lower than what
you talk about in this email thread).
The incremental webrev is here:
http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.12_13/
and the full webrev is here:
http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.13/
The incremental webrev contains the TLAB field renames from the ongoing
conversation in the GC thread, plus the Graal changes needed so this
webrev does not break anything. It also contains a test for when VM and
sampled events are enabled at the same time.
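For reference, the agent side of that combined test is roughly the
following. This is a minimal sketch only; since the spec is still in flux,
the capability and event names below
(can_generate_sampled_object_alloc_events, JVMTI_EVENT_SAMPLED_OBJECT_ALLOC)
are assumptions rather than the webrev's final names:

    #include <jvmti.h>

    // Empty callbacks; the real test records the events it receives.
    static void JNICALL onSampledObjectAlloc(jvmtiEnv* jvmti, JNIEnv* jni,
                                             jthread thread, jobject obj,
                                             jclass klass, jlong size) {}
    static void JNICALL onVMObjectAlloc(jvmtiEnv* jvmti, JNIEnv* jni,
                                        jthread thread, jobject obj,
                                        jclass klass, jlong size) {}

    JNIEXPORT jint JNICALL Agent_OnLoad(JavaVM* vm, char* opts, void* res) {
      jvmtiEnv* jvmti;
      vm->GetEnv(reinterpret_cast<void**>(&jvmti), JVMTI_VERSION_9);

      jvmtiCapabilities caps = {};
      caps.can_generate_vm_object_alloc_events = 1;
      caps.can_generate_sampled_object_alloc_events = 1;  // assumed name
      jvmti->AddCapabilities(&caps);

      jvmtiEventCallbacks cb = {};
      cb.VMObjectAlloc = &onVMObjectAlloc;
      cb.SampledObjectAlloc = &onSampledObjectAlloc;      // assumed name
      jvmti->SetEventCallbacks(&cb, sizeof(cb));

      // Both events enabled at the same time, as in the new test.
      jvmti->SetEventNotificationMode(JVMTI_ENABLE,
                                      JVMTI_EVENT_VM_OBJECT_ALLOC, NULL);
      jvmti->SetEventNotificationMode(JVMTI_ENABLE,
                                      JVMTI_EVENT_SAMPLED_OBJECT_ALLOC, NULL);
      return JNI_OK;
    }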
I'll answer your questions inline below:
>
> I have a couple of design questions before getting to the detailed code
> review:
>
> 1. jvmtiExport.cpp
> You have simplified the JvmtiObjectAllocEventCollector to assume there is
> only a single object.
> Do you have a test that allocates a multi-dimensional array?
> I would expect that to have multiple subarrays - and need the logic that
> you removed.
>
I "think" you misread this. I did not change the implementation of
JvmtiObjectAllocEventCollector
to assume only one object. Actually the implementation is still doing what
it was doing initially but now JvmtiObjectAllocEventCollector is a main
class with two inherited behaviors:
- JvmtiVMObjectAllocEventCollector is following the same logic as before
- JvmtiSampledObjectAllocEventCollector has the logic of a single
allocation during collection since it is placed that low. I'm thinking of
perhaps separating the two classes now that the sampled case is so low and
does not require the extra logic of handling a growable array.
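Schematically, the split now looks like this (a toy sketch of the shape
only, not the webrev code):

    // Toy sketch of the collector hierarchy described above.
    struct JvmtiEventCollector {
      virtual ~JvmtiEventCollector() {}
    };

    struct JvmtiObjectAllocEventCollector : JvmtiEventCollector {
      // Shared logic: remember allocated oops while the collector is
      // active, post the allocation events when it unwinds.
    };

    struct JvmtiVMObjectAllocEventCollector : JvmtiObjectAllocEventCollector {
      // Same logic as before: a growable array of oops, one
      // VMObjectAlloc event posted per collected object.
    };

    struct JvmtiSampledObjectAllocEventCollector : JvmtiObjectAllocEventCollector {
      // Sits so low in the allocation path that it only ever holds the
      // single sampled allocation.
    };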
I don't have a test that exercises multi-dimensional arrays. I have not
looked at exactly what VMObjectAllocEvents track, but I ran an experiment:
- On a clean JDK, if I allocate:
int[][][] myAwesomeTable = new int[10][10][10];
I get one single VMObjectAlloc callback.
- With my change, I get the same behavior.
So instead, I made an object containing an array and cloned it; there I get
two VM callbacks, both with the clean JDK and with my change.
I'll add a test in the next webrev to ensure correctness.
> *** Please add a test in which you allocate a multi-dimensional array -
> and try it with your earlier version which
> will show the full set of allocated objects
>
As I said, this works because it sits so low in the system. I don't know
the internals of allocating a three-dimensional array, but:
- If I try to collect all the allocations required for the new
int[10][10][10], I get more than a hundred allocation callbacks when the
sampling rate is 0, which is what I'd expect: one callback per allocation
(a 10x10x10 array means 1 outer array + 10 2D arrays + 100 leaf arrays,
i.e. 111 allocations).
So although I only collect one object at a time, I do get a callback for
each allocation, consistent with the sampling rate.
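(For the record, the counting on the agent side in that experiment is
nothing fancier than a callback bumping a counter; the names below are
hypothetical and the callback signature follows the draft spec:)

    #include <jvmti.h>
    #include <atomic>

    static std::atomic<long> g_alloc_callbacks(0);

    static void JNICALL onSampledObjectAlloc(jvmtiEnv* jvmti, JNIEnv* jni,
                                             jthread thread, jobject obj,
                                             jclass klass, jlong size) {
      // With a sampling rate of 0 this fires once per allocation, so
      // new int[10][10][10] should bump the counter 111 times.
      g_alloc_callbacks.fetch_add(1, std::memory_order_relaxed);
    }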
> 2. Tests - didn’t read them all - just ran into a hardcoded “10% error
> ensures a sanity test without becoming flaky”
> do check with Serguei on this - that looks like a potential future test
> failure
>
Because the sampling interval is a geometric random variable whose mean is
the sampling rate, any meaningful test has to be statistical. Therefore, I
accept a bit of error so we can test the real mechanism rather than hack
the code into less "real" tests. This is what we do internally; let me know
if you want me to do it otherwise.
"Flaky" is probably the wrong term, or perhaps I need to explain this
better. I'll change the comment in the tests to explain that the potential
flakiness comes from the statistical nature of the geometric distribution.
Because we don't want overly long-running tests, this error percentage
makes sense to me.
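For context on why any meaningful test is statistical: the sampler picks
each next sample point by drawing an interval whose mean is the configured
sampling rate, along these lines (a simplified sketch; the real code uses
a fast log approximation and its own PRNG):

    #include <cmath>
    #include <cstdlib>

    // Draw the next sampling interval (in bytes) so that intervals are
    // exponentially/geometrically distributed around the requested mean.
    static size_t pick_next_sample_interval(size_t mean_bytes) {
      // u is uniform in (0, 1]; -ln(u) * mean is then exponentially
      // distributed with the requested mean, which is why a test can only
      // check the average rate within some error margin.
      double u = (std::rand() + 1.0) / (RAND_MAX + 2.0);
      return static_cast<size_t>(-std::log(u) * mean_bytes) + 1;
    }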
Let me now answer the comments from the other email here as well so we have
all answers and conversations in a single thread:
> - Note there is one current caveat: if the agent requests the
> VMObjectAlloc event, the sampler defers to that event due to a limitation
> in its implementation (ie: I am not convinced I can safely send out that
> event with the VM collector enabled, I'll happily white board that).
>
> Please work that one out with Serguei and the serviceability folks.
>
>
Agreed. I'll follow up with Serguei on whether this is a real problem; I
have to double-check that there actually is an issue. I see it as a fact of
life rather than a blocker for now: if you do want both events, dropping a
sample due to this limitation does not invalidate the system in my mind. I
could be wrong about it, though, and would happily go over what I saw.
>> So the Heap Sampling Monitoring System used to have more methods. It made
>> sense to have them in a separate category. I now have moved it to the
>> memory category to be consistent and grouped there. I also removed that
>> link btw.
>>
> Thanks.
>
Actually, it did seem weird to put it there since there was only
Allocate/Deallocate, so for now the method is back in its own category. If
someone has a better spot, let me know.
>>
>>>
>>> I was trying to figure out a way to put the collectors farther down the
>>> call stack so as to both catch more
>>> cases and to reduce the maintenance burden - i.e. if you were to add a
>>> new code generator, e.g. Graal -
>>> if it were to go through an existing interface, that might be a place to
>>> already have a collector.
>>>
>>> I do not know the Graal sources - I did look at jvmci/jvmciRuntime.cpp -
>>> and it appears that there
>>> are calls to instanceKlass::new_instance,
>>> oopFactory::new_typeArray/new_ObjArray and ArrayKlass::multi-allocate,
>>> so one possibility would be to put hooks in those calls which would
>>> catch many? (I did not do a thorough search)
>>> of the slowpath calls for the bytecodes, and then check the fast paths
>>> in detail.
>>>
>>
>> I'll come to a major issue with the collector and its placement in the
>> next paragraph.
>>
> Still not clear on why you did not move the collectors into
> instanceKlass::new_instance and oopFactory::newtypeArray/newObjArray
> and ArrayKlass::multi-allocate.
>
I think what was happening is that the collectors would wrap the objects in
handles, but references to the originally allocated object would still be
on the stack and would not get updated when required. I believe that issue
is what caused the weird bugs I was seeing.
Because of this, it seems any enabled VM collector has to guarantee one of
the following:
- its path to destruction (and thus to posting events) has no means of
triggering a GC that could move things around (as long as you are in VM
code you should be fine, I believe), or
- if a GC does occur, no object in its internal array is sitting on the
stack without a handle around it, so it can be updated if a GC operation
moves it.
I'm not convinced this holds in the multithreaded case with both sampling
and VM collection enabled.
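Schematically, the hazard is the classic raw-oop-across-a-safepoint bug; in
HotSpot-style pseudocode (post_sampled_event, use and ik are stand-ins for
the real call sites, not webrev code):

    // Failure mode: a raw oop held across a possible safepoint.
    oop obj = ik->allocate_instance(THREAD);  // raw pointer into the heap
    post_sampled_event(THREAD, obj);          // may reach a safepoint; a
                                              // moving GC can relocate obj
    use(obj);                                 // stale pointer: subtle bugs

    // Safe pattern: register the oop with a Handle so GC updates it.
    Handle h(THREAD, obj);
    post_sampled_event(THREAD, h());          // h() re-reads the possibly
                                              // moved oop after the call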
>>
>>>
>>> I had wondered if it made sense to move the hooks even farther down,
>>> into CollectedHeap:obj_allocate and array_allocate.
>>> I do not think so. First reason is that for multidimensional arrays,
>>> ArrayKlass::multi_allocate the outer dimension array would
>>> have an event before storing the inner sub-arrays and I don’t think we
>>> want that exposed, so that won’t work for arrays.
>>>
>>
>> So the major difficulty is that the steps of collection do this:
>>
>> - An object gets allocated and is decided to be sampled
>> - The original pointer placement (where it resides originally in memory)
>> is passed to the collector
>> - Now one important thing of note:
>> (a) In the VM code, until the point where the oop is going to be
>> returned, GC is not yet aware of it
>>
> (b) so the collector can't yet send it out to the user via JVMTI
>> otherwise, the agent could put a weak reference for example
>>
>> I'm a bit fuzzy on this and maybe it's just that there would be more
>> heavy lifting to make this possible but my initial tests seem to show
>> problems when attempting this in the obj_allocate area.
>>
> Not sure what you are seeing here -
>
> Let me state it the way I understand it.
> 1) You can collect the object into internal metadata at allocation point -
> which you already had
> Note: see comment in JvmtiExport.cpp:
> // In the case of the sampled object collector, we don’t want to perform
> the
> // oops_do because the object in the collector is still stored in
> registers
> // on the VM stack
> - so GC will find these objects as roots, once we allow GC to run,
> which should be after the header is initialized
>
> Totally agree with you that you can not post the event in the source code
> in which you allocate the memory - keep reading
>
> 2) event posting:
> - you want to ensure that the object has been fully initialized
> - the object needs to have the header set up - not just the memory
> allocated - so that applies to all objects
> (and that is done in a caller of the allocation code - so it can’t be
> done at the location which does the memory allocation)
> - the example I pointed out was the multianewarray case - all the
> subarrays need to be allocated
>
> - please add a test case for multi-array, so as you experiment with where
> to post the event, you ensure that you can
> access the subarrays (e.g. a 3D array of length 5 has 5 2D arrays as
> subarrays)
>
Technically it's more than just being initialized: you cannot perform a
callback for an object while any object of that thread is both held by a
collector and sitting in a register or on the stack without protection.
> - prior to setting up the object header information, GC would not know
> about the object
> - was this by chance the issue you ran into?
>
No, I believe the issue I was running into was the one above: an object on
the stack was pointing to an oop that got moved during a GC triggered while
that thread was doing an event callback.
Thanks for your help,
Jc
> 3) event posting
> - when you post the event to JVMTI
> - in JvmtiObjectAllocEventMark: sets _jobj (object)to_jobject(obj),
> which creates JNIHandles::make_local(_thread, obj)
>
>
>>
>>>
>>> The second reason is that I strongly suspect the scope you want is
>>> bytecodes only. I think once you have added hooks
>>> to all the fast paths and slow paths that this will be pushing the
>>> performance overhead constraints you proposed and
>>> you won’t want to see e.g. internal allocations.
>>>
>>
>> Yes, agreed: allocations from bytecodes are generally our main concern :)
>>
>>
>>>
>>>
>> But I think you need to experiment with the set of allocations (or
>>> possible alternative sets of allocations) you want recorded.
>>>
>>> The hooks I see today include:
>>> Interpreter: (looking at x86 as a sample)
>>> - slowpath in InterpreterRuntime
>>> - fastpath tlab allocation - your new threshold check handles that
>>>
>>
>> Agreed
>>
>>
>>> - allow_shared_alloc (GC specific): for _new isn’t handled
>>>
>>
>> Where is that exactly? I can check why we are not catching it?
>>
>>
>>>
>>> C1
>>> I don’t see changes in c1_Runtime.cpp
>>> note: you also want to look for the fast path
>>>
>>
>> I added the calls to c1_Runtime in the latest webrev, but was still going
>> through testing before pushing it out. I had waited on this one a bit. Fast
>> path would be handled by the threshold check no?
>>
>>
>>> C2: changes in opto/runtime.cpp for slow path
>>> did you also catch the fast path?
>>>
>>
>> Fast path gets handled by the same threshold check, no? Perhaps I've
>> missed something (very likely)?
>>
>>
>>>
>>> 3. Performance -
>>> After you get all the collectors added - you need to rerun the
>>> performance numbers.
>>>
>>
>> Agreed :)
>>
>>
>>>
>>> thanks,
>>> Karen
>>>
>>> On Apr 5, 2018, at 2:15 PM, JC Beyler <jcbeyler at google.com> wrote:
>>>
>>> Thanks Boris and Derek for testing it.
>>>
>>> Yes I was trying to get a new version out that had the tests ported as
>>> well but got sidetracked while trying to add tests and two new features.
>>>
>>> Here is the incremental webrev:
>>>
>>> Here is the full webrev:
>>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.11/
>>>
>>> Basically, the new tests assert this:
>>> - Only one agent can currently ask for the sampling, I'm currently
>>> seeing if I can push to a next webrev the multi-agent support to start
>>> doing a code freeze on this one
>>> - The event is not thread-enabled, meaning like the
>>> VMObjectAllocationEvent, it's an all or nothing event; same as the
>>> multi-agent, I'm going to see if a future webrev to add the support is a
>>> better idea to freeze this webrev a bit
>>>
>>> There was another item that I added here and I'm unsure this webrev is
>>> stable in debug mode: I added an assertion system to ascertain that all
>>> paths leading to a TLAB slow path (and hence a sampling point) have a
>>> sampling collector ready to post the event if a user wants it. This might
>>> break a few things in debug mode as I'm working through the kinks of that as
>>> well. However, in release mode, this new webrev passes all the tests in
>>> hotspot/jtreg/serviceability/jvmti/HeapMonitor.
>>>
>>> Let me know what you think,
>>> Jc
>>>
>>> On Thu, Apr 5, 2018 at 4:56 AM Boris Ulasevich <
>>> boris.ulasevich at bell-sw.com> wrote:
>>>
>>>> Hi JC,
>>>>
>>>> I have just checked on arm32: your patch compiles and runs ok.
>>>>
>>>> As far as I can see, the jtreg agentlib name "-agentlib:HeapMonitor" does not
>>>> correspond to the actual library name: libHeapMonitorTest.c ->
>>>> libHeapMonitorTest.so
>>>>
>>>> Boris
>>>>
>>>> On 04.04.2018 01:54, White, Derek wrote:
>>>> > Thanks JC,
>>>> >
>>>> > New patch applies cleanly. Compiles and runs (simple test programs) on
>>>> > aarch64.
>>>> >
>>>> > * Derek
>>>> >
>>>> > *From:* JC Beyler [mailto:jcbeyler at google.com]
>>>> > *Sent:* Monday, April 02, 2018 1:17 PM
>>>> > *To:* White, Derek <Derek.White at cavium.com>
>>>> > *Cc:* Erik Österlund <erik.osterlund at oracle.com>;
>>>> > serviceability-dev at openjdk.java.net; hotspot-compiler-dev
>>>> > <hotspot-compiler-dev at openjdk.java.net>
>>>> > *Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling
>>>> >
>>>> > Hi Derek,
>>>> >
>>>> > I know there were a few things that went in that provoked a merge
>>>> > conflict. I worked on it and got it up to date. Sadly my lack of
>>>> > knowledge makes it a full rebase instead of keeping all the history.
>>>> > However, with a newly cloned jdk/hs you should now be able to use:
>>>> >
>>>> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/
>>>> >
>>>> > The change you are referring to was done with the others so perhaps
>>>> you
>>>> > were unlucky and I forgot it in a webrev and fixed it in another? I
>>>> > don't know but it's been there and I checked, it is here:
>>>> >
>>>> >
>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.10/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp.udiff.html
>>>> >
>>>> > I double checked that tlab_end_offset no longer appears in any
>>>> > architecture (as far as I can tell :)).
>>>> >
>>>> > Thanks for testing and let me know if you run into any other issues!
>>>> >
>>>> > Jc
>>>> >
>>>> > On Fri, Mar 30, 2018 at 4:24 PM White, Derek <Derek.White at cavium.com
>>>> > <mailto:Derek.White at cavium.com>> wrote:
>>>> >
>>>> > Hi Jc,
>>>> >
>>>> > I’ve been having trouble getting your patch to apply correctly. I
>>>> > may have based it on the wrong version.
>>>> >
>>>> > In any case, I think there’s a missing update to
>>>> > macroAssembler_aarch64.cpp, in MacroAssembler::tlab_allocate(),
>>>> > where “JavaThread::tlab_end_offset()” should become
>>>> > “JavaThread::tlab_current_end_offset()”.
>>>> >
>>>> > This should correspond to the other port’s changes in
>>>> > templateTable_<cpu>.cpp files.
>>>> >
>>>> > Thanks!
>>>> > - Derek
>>>> >
>>>> > *From:* hotspot-compiler-dev
>>>> > [mailto:hotspot-compiler-dev-bounces at openjdk.java.net
>>>> > <mailto:hotspot-compiler-dev-bounces at openjdk.java.net>] *On
>>>> Behalf
>>>> > Of *JC Beyler
>>>> > *Sent:* Wednesday, March 28, 2018 11:43 AM
>>>> > *To:* Erik Österlund <erik.osterlund at oracle.com
>>>> > <mailto:erik.osterlund at oracle.com>>
>>>> > *Cc:* serviceability-dev at openjdk.java.net
>>>> > <mailto:serviceability-dev at openjdk.java.net>;
>>>> hotspot-compiler-dev
>>>> > <hotspot-compiler-dev at openjdk.java.net
>>>> > <mailto:hotspot-compiler-dev at openjdk.java.net>>
>>>> > *Subject:* Re: JDK-8171119: Low-Overhead Heap Profiling
>>>> >
>>>> > Hi all,
>>>> >
>>>> > I've been working on deflaking the tests mostly and the wording in
>>>> > the JVMTI spec.
>>>> >
>>>> > Here is the two incremental webrevs:
>>>> >
>>>> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.5_6/
>>>> >
>>>> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.06_07/
>>>> >
>>>> > Here is the total webrev:
>>>> >
>>>> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event.07/
>>>> >
>>>> > Here are the notes of this change:
>>>> >
>>>> > - Currently the tests pass 100 times in a row, I am working on
>>>> > checking if they pass 1000 times in a row.
>>>> >
>>>> > - The default sampling rate is set to 512k, this is what we use
>>>> > internally and having a default means that to enable the sampling
>>>> > with the default, the user only has to do a enable event/disable
>>>> > event via JVMTI (instead of enable + set sample rate).
>>>> >
>>>> > - I removed the code that was handling the fast path tlab
>>>> > refill, since FastTLABRefill is now deprecated
>>>> >
>>>> > - Though I saw that Graal is still using it so I have to
>>>> see
>>>> > what needs to be done there exactly
>>>> >
>>>> > Finally, using the Dacapo benchmark suite, I noted a 1% overhead
>>>> for
>>>> > when the event system is turned on and the callback to the native
>>>> > agent is just empty. I got a 3% overhead with a 512k sampling rate
>>>> > with the code I put in the native side of my tests.
>>>> >
>>>> > Thanks and comments are appreciated,
>>>> >
>>>> > Jc
>>>> >
>>>> > On Mon, Mar 19, 2018 at 2:06 PM JC Beyler <jcbeyler at google.com
>>>> > <mailto:jcbeyler at google.com>> wrote:
>>>> >
>>>> > Hi all,
>>>> >
>>>> > The incremental webrev update is here:
>>>> >
>>>> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event4_5/
>>>> >
>>>> > The full webrev is here:
>>>> >
>>>> > http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/
>>>> >
>>>> > Major change here is:
>>>> >
>>>> > - I've removed the heapMonitoring.cpp code in favor of just
>>>> > having the sampling events as per Serguei's request; I still
>>>> > have to do some overhead measurements but the tests prove the
>>>> > concept can work
>>>> >
>>>> > - Most of the tlab code is unchanged, the only major
>>>> > part is that now things get sent off to event collectors when
>>>> > used and enabled.
>>>> >
>>>> > - Added the interpreter collectors to handle interpreter
>>>> > execution
>>>> >
>>>> > - Updated the name from SetTlabHeapSampling to
>>>> > SetHeapSampling to be more generic
>>>> >
>>>> > - Added a mutex for the thread sampling so that we can
>>>> > initialize an internal static array safely
>>>> >
>>>> > - Ported the tests from the old system to this new one
>>>> >
>>>> > I've also updated the JEP and CSR to reflect these changes:
>>>> >
>>>> > https://bugs.openjdk.java.net/browse/JDK-8194905
>>>> >
>>>> > https://bugs.openjdk.java.net/browse/JDK-8171119
>>>> >
>>>> > In order to make this have some forward progress, I've removed
>>>> > the heap sampling code entirely and now rely entirely on the
>>>> > event sampling system. The tests reflect this by using a
>>>> > simplified implementation of what an agent could do:
>>>> >
>>>> >
>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/heap_event5/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/libHeapMonitor.c
>>>> >
>>>> > (Search for anything mentioning event_storage).
>>>> >
>>>> > I have not taken the time to port the whole code we had
>>>> > originally in heapMonitoring to this. I hesitate only because
>>>> > that code was in C++, I'd have to port it to C and this is for
>>>> > tests so perhaps what I have now is good enough?
>>>> >
>>>> > As far as testing goes, I've ported all the relevant tests and
>>>> > then added a few:
>>>> >
>>>> > - Turning the system on/off
>>>> >
>>>> > - Testing using various GCs
>>>> >
>>>> > - Testing using the interpreter
>>>> >
>>>> > - Testing the sampling rate
>>>> >
>>>> > - Testing with objects and arrays
>>>> >
>>>> > - Testing with various threads
>>>> >
>>>> > Finally, as overhead goes, I have the numbers of the system
>>>> off
>>>> > vs a clean build and I have 0% overhead, which is what we'd
>>>> > want. This was using the Dacapo benchmarks. I am now preparing
>>>> > to run a version with the events on using dacapo and will
>>>> report
>>>> > back here.
>>>> >
>>>> > Any comments are welcome :)
>>>> >
>>>> > Jc
>>>> >
>>>> > On Thu, Mar 8, 2018 at 4:00 PM JC Beyler <jcbeyler at google.com
>>>> > <mailto:jcbeyler at google.com>> wrote:
>>>> >
>>>> > Hi all,
>>>> >
>>>> > I apologize for the delay but I wanted to add an event
>>>> > system and that took a bit longer than expected and I also
>>>> > reworked the code to take into account the deprecation of
>>>> > FastTLABRefill.
>>>> >
>>>> > This update has four parts:
>>>> >
>>>> > A) I moved the implementation from Thread to
>>>> > ThreadHeapSampler inside of Thread. Would you prefer it
>>>> as a
>>>> > pointer inside of Thread or like this works for you?
>>>> Second
>>>> > question would be would you rather have an association
>>>> > outside of Thread altogether that tries to remember when
>>>> > threads are live and then we would have something like:
>>>> >
>>>> > ThreadHeapSampler::get_sampling_size(this_thread);
>>>> >
>>>> > I worry about the overhead of this but perhaps it is not
>>>> too
>>>> > too bad?
>>>> >
>>>> > B) I also have been working on the Allocation event system
>>>> > that sends out a notification at each sampled event. This
>>>> > will be practical when wanting to do something at the
>>>> > allocation point. I'm also looking at if the whole
>>>> > heapMonitoring code could not reside in the agent code and
>>>> > not in the JDK. I'm not convinced but I'm talking to
>>>> Serguei
>>>> > about it to see/assess :)
>>>> >
>>>> > - Also added two tests for the new event subsystem
>>>> >
>>>> > C) Removed the slow_path fields inside the TLAB code since
>>>> > now FastTLABRefill is deprecated
>>>> >
>>>> > D) Updated the JVMTI documentation and specification for
>>>> the
>>>> > methods.
>>>> >
>>>> > So the incremental webrev is here:
>>>> >
>>>> >
>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.09_10/
>>>> >
>>>> > and the full webrev is here:
>>>> >
>>>> > http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.10
>>>> >
>>>> > I believe I have updated the various JIRA issues that
>>>> track
>>>> > this :)
>>>> >
>>>> > Thanks for your input,
>>>> >
>>>> > Jc
>>>> >
>>>> > On Wed, Feb 14, 2018 at 10:34 PM, JC Beyler
>>>> > <jcbeyler at google.com <mailto:jcbeyler at google.com>> wrote:
>>>> >
>>>> > Hi Erik,
>>>> >
>>>> > I inlined my answers, which the last one seems to
>>>> answer
>>>> > Robbin's concerns about the same thing (adding things
>>>> to
>>>> > Thread).
>>>> >
>>>> > On Wed, Feb 14, 2018 at 2:51 AM, Erik Österlund
>>>> > <erik.osterlund at oracle.com
>>>> > <mailto:erik.osterlund at oracle.com>> wrote:
>>>> >
>>>> > Hi JC,
>>>> >
>>>> > Comments are inlined below.
>>>> >
>>>> > On 2018-02-13 06:18, JC Beyler wrote:
>>>> >
>>>> > Hi Erik,
>>>> >
>>>> > Thanks for your answers, I've now inlined my
>>>> own
>>>> > answers/comments.
>>>> >
>>>> > I've done a new webrev here:
>>>> >
>>>> >
>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.08/
>>>> > <
>>>> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.08/>
>>>> >
>>>> > The incremental is here:
>>>> >
>>>> >
>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.07_08/
>>>> > <
>>>> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.07_08/>
>>>> >
>>>> > Note to all:
>>>> >
>>>> > - I've been integrating changes from
>>>> > Erin/Serguei/David comments so this webrev
>>>> > incremental is a bit an answer to all comments
>>>> > in one. I apologize for that :)
>>>> >
>>>> > On Mon, Feb 12, 2018 at 6:05 AM, Erik
>>>> Österlund
>>>> > <erik.osterlund at oracle.com
>>>> > <mailto:erik.osterlund at oracle.com>> wrote:
>>>> >
>>>> > Hi JC,
>>>> >
>>>> > Sorry for the delayed reply.
>>>> >
>>>> > Inlined answers:
>>>> >
>>>> >
>>>> >
>>>> > On 2018-02-06 00:04, JC Beyler wrote:
>>>> >
>>>> > Hi Erik,
>>>> >
>>>> > (Renaming this to be folded into the
>>>> > newly renamed thread :))
>>>> >
>>>> > First off, thanks a lot for reviewing
>>>> > the webrev! I appreciate it!
>>>> >
>>>> > I updated the webrev to:
>>>> >
>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.05a/
>>>> > <
>>>> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.05a/>
>>>> >
>>>> > And the incremental one is here:
>>>> >
>>>> http://cr.openjdk.java.net/~jcbeyler/8171119/webrev.04_05a/
>>>> > <
>>>> http://cr.openjdk.java.net/%7Ejcbeyler/8171119/webrev.04_05a/>
>>>> >
>>>> > It contains:
>>>> > - The change for since from 9 to 11
>>>> for
>>>> > the jvmti.xml
>>>> > - The use of the OrderAccess for
>>>> initialized
>>>> > - Clearing the oop
>>>> >
>>>> > I also have inlined my answers to your
>>>> > comments. The biggest question
>>>> > will come from the multiple *_end
>>>> > variables. A bit of the logic there
>>>> > is due to handling the slow path
>>>> refill
>>>> > vs fast path refill and
>>>> > checking that the rug was not pulled
>>>> > underneath the slowpath. I
>>>> > believe that a previous comment was
>>>> that
>>>> > TlabFastRefill was going to
>>>> > be deprecated.
>>>> >
>>>> > If this is true, we could revert this
>>>> > code a bit and just do a : if
>>>> > TlabFastRefill is enabled, disable
>>>> this.
>>>> > And then deprecate that when
>>>> > TlabFastRefill is deprecated.
>>>> >
>>>> > This might simplify this webrev and I
>>>> > can work on a follow-up that
>>>> > either: removes TlabFastRefill if
>>>> Robbin
>>>> > does not have the time to do
>>>> > it or add the support to the assembly
>>>> > side to handle this correctly.
>>>> > What do you think?
>>>> >
>>>> > I support removing TlabFastRefill, but I
>>>> > think it is good to not depend on that
>>>> > happening first.
>>>> >
>>>> >
>>>> > I'm slowly pushing on the FastTLABRefill
>>>> > (
>>>> https://bugs.openjdk.java.net/browse/JDK-8194084),
>>>> > I agree on keeping both separate for now
>>>> though
>>>> > so that we can think of both differently
>>>> >
>>>> > Now, below, inlined are my answers:
>>>> >
>>>> > On Fri, Feb 2, 2018 at 8:44 AM, Erik
>>>> > Österlund
>>>> > <erik.osterlund at oracle.com
>>>> > <mailto:erik.osterlund at oracle.com>>
>>>> wrote:
>>>> >
>>>> > Hi JC,
>>>> >
>>>> > Hope I am reviewing the right
>>>> > version of your work. Here goes...
>>>> >
>>>> >
>>>> src/hotspot/share/gc/shared/collectedHeap.inline.hpp:
>>>> >
>>>> > 159
>>>> >
>>>> AllocTracer::send_allocation_outside_tlab(klass, result, size *
>>>> > HeapWordSize, THREAD);
>>>> > 160
>>>> > 161
>>>> >
>>>> THREAD->tlab().handle_sample(THREAD, result, size);
>>>> > 162 return result;
>>>> > 163 }
>>>> >
>>>> > Should not call tlab()->X without
>>>> > checking if (UseTLAB) IMO.
>>>> >
>>>> > Done!
>>>> >
>>>> >
>>>> > More about this later.
>>>> >
>>>> >
>>>> src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp:
>>>> >
>>>> > So first of all, there seems to
>>>> > quite a few ends. There is an
>>>> "end",
>>>> > a "hard
>>>> > end", a "slow path end", and an
>>>> > "actual end". Moreover, it seems
>>>> > like the
>>>> > "hard end" is actually further
>>>> away
>>>> > than the "actual end". So the
>>>> "hard end"
>>>> > seems like more of a "really
>>>> > definitely actual end" or
>>>> something.
>>>> > I don't
>>>> > know about you, but I think it
>>>> looks
>>>> > kind of messy. In particular, I
>>>> don't
>>>> > feel like the name "actual end"
>>>> > reflects what it represents,
>>>> > especially when
>>>> > there is another end that is
>>>> behind
>>>> > the "actual end".
>>>> >
>>>> > 413 HeapWord*
>>>> >
>>>> ThreadLocalAllocBuffer::hard_end() {
>>>> > 414 // Did a fast TLAB refill
>>>> > occur?
>>>> > 415 if (_slow_path_end !=
>>>> _end) {
>>>> > 416 // Fix up the actual
>>>> end
>>>> > to be now the end of this TLAB.
>>>> > 417 _slow_path_end = _end;
>>>> > 418 _actual_end = _end;
>>>> > 419 }
>>>> > 420
>>>> > 421 return _actual_end +
>>>> > alignment_reserve();
>>>> > 422 }
>>>> >
>>>> > I really do not like making
>>>> getters
>>>> > unexpectedly have these kind of
>>>> side
>>>> > effects. It is not expected that
>>>> > when you ask for the "hard end",
>>>> you
>>>> > implicitly update the "slow path
>>>> > end" and "actual end" to new
>>>> values.
>>>> >
>>>> > As I said, a lot of this is due to the
>>>> > FastTlabRefill. If I make this
>>>> > not supporting FastTlabRefill, this
>>>> goes
>>>> > away. The reason the system
>>>> > needs to update itself at the get is
>>>> > that you only know at that get if
>>>> > things have shifted underneath the
>>>> tlab
>>>> > slow path. I am not sure of
>>>> > really better names (naming is hard!),
>>>> > perhaps we could do these
>>>> > names:
>>>> >
>>>> > - current_tlab_end // Either the
>>>> > allocated tlab end or a sampling point
>>>> > - last_allocation_address // The end
>>>> of
>>>> > the tlab allocation
>>>> > - last_slowpath_allocated_end // In
>>>> > case a fast refill occurred the
>>>> > end might have changed, this is to
>>>> > remember slow vs fast past refills
>>>> >
>>>> > the hard_end method can be renamed to
>>>> > something like:
>>>> > tlab_end_pointer() // The end
>>>> of
>>>> > the lab including a bit of
>>>> > alignment reserved bytes
>>>> >
>>>> > Those names sound better to me. Could you
>>>> > please provide a mapping from the old
>>>> names
>>>> > to the new names so I understand which one
>>>> > is which please?
>>>> >
>>>> > This is my current guess of what you are
>>>> > proposing:
>>>> >
>>>> > end -> current_tlab_end
>>>> > actual_end -> last_allocation_address
>>>> > slow_path_end ->
>>>> last_slowpath_allocated_end
>>>> > hard_end -> tlab_end_pointer
>>>> >
>>>> > Yes that is correct, that was what I was
>>>> proposing.
>>>> >
>>>> > I would prefer this naming:
>>>> >
>>>> > end -> slow_path_end // the end for
>>>> taking a
>>>> > slow path; either due to sampling or
>>>> refilling
>>>> > actual_end -> allocation_end // the end
>>>> for
>>>> > allocations
>>>> > slow_path_end -> last_slow_path_end //
>>>> last
>>>> > address for slow_path_end (as opposed to
>>>> > allocation_end)
>>>> > hard_end -> reserved_end // the end of the
>>>> > reserved space of the TLAB
>>>> >
>>>> > About setting things in the getter... that
>>>> > still seems like a very unpleasant thing
>>>> to
>>>> > me. It would be better to inspect the call
>>>> > hierarchy and explicitly update the ends
>>>> > where they need updating, and assert in
>>>> the
>>>> > getter that they are in sync, rather than
>>>> > implicitly setting various ends as a
>>>> > surprising side effect in a getter. It
>>>> looks
>>>> > like the call hierarchy is very small.
>>>> With
>>>> > my new naming convention, reserved_end()
>>>> > would presumably return _allocation_end +
>>>> > alignment_reserve(), and have an assert
>>>> > checking that _allocation_end ==
>>>> > _last_slow_path_allocation_end,
>>>> complaining
>>>> > that this invariant must hold, and that a
>>>> > caller to this function, such as
>>>> > make_parsable(), must first explicitly
>>>> > synchronize the ends as required, to honor
>>>> > that invariant.
>>>> >
>>>> >
>>>> > I've renamed the variables to how you
>>>> preferred
>>>> > it except for the _end one. I did:
>>>> >
>>>> > current_end
>>>> >
>>>> > last_allocation_address
>>>> >
>>>> > tlab_end_ptr
>>>> >
>>>> > The reason is that the architecture dependent
>>>> > code use the thread.hpp API and it already has
>>>> > tlab included into the name so it becomes
>>>> > tlab_current_end (which is better that
>>>> > tlab_current_tlab_end in my opinion).
>>>> >
>>>> > I also moved the update into a separate method
>>>> > with a TODO that says to remove it when
>>>> > FastTLABRefill is deprecated
>>>> >
>>>> > This looks a lot better now. Thanks.
>>>> >
>>>> > Note that the following comment now needs updating
>>>> > accordingly in threadLocalAllocBuffer.hpp:
>>>> >
>>>> > 41 // Heap sampling is performed via
>>>> > the end/actual_end fields.
>>>> >
>>>> > 42 // actual_end contains the real
>>>> end
>>>> > of the tlab allocation,
>>>> >
>>>> > 43 // whereas end can be set to an
>>>> > arbitrary spot in the tlab to
>>>> >
>>>> > 44 // trip the return and sample the
>>>> > allocation.
>>>> >
>>>> > 45 // slow_path_end is used to track
>>>> > if a fast tlab refill occurred
>>>> >
>>>> > 46 // between slowpath calls.
>>>> >
>>>> > There might be other comments too, I have not
>>>> looked
>>>> > in detail.
>>>> >
>>>> > This was the only spot that still had an actual_end, I
>>>> > fixed it now. I'll do a sweep to double check other
>>>> > comments.
>>>> >
>>>> >
>>>> >
>>>> > Not sure it's better but before
>>>> updating
>>>> > the webrev, I wanted to try
>>>> > to get input/consensus :)
>>>> >
>>>> > (Note hard_end was always further off
>>>> > than end).
>>>> >
>>>> > src/hotspot/share/prims/jvmti.xml:
>>>> >
>>>> > 10357 <capabilityfield
>>>> > id="can_sample_heap" since="9">
>>>> > 10358 <description>
>>>> > 10359 Can sample the
>>>> heap.
>>>> > 10360 If this capability
>>>> > is enabled then the heap sampling
>>>> > methods
>>>> > can be called.
>>>> > 10361 </description>
>>>> > 10362 </capabilityfield>
>>>> >
>>>> > Looks like this capability should
>>>> > not be "since 9" if it gets
>>>> integrated
>>>> > now.
>>>> >
>>>> > Updated now to 11, crossing my
>>>> fingers :)
>>>> >
>>>> >
>>>> src/hotspot/share/runtime/heapMonitoring.cpp:
>>>> >
>>>> > 448 if
>>>> > (is_alive->do_object_b(value)) {
>>>> > 449 // Update the oop
>>>> to
>>>> > point to the new object if it is
>>>> still
>>>> > alive.
>>>> > 450
>>>> f->do_oop(&(trace.obj));
>>>> > 451
>>>> > 452 // Copy the old
>>>> > trace, if it is still live.
>>>> > 453
>>>> >
>>>> _allocated_traces->at_put(curr_pos++, trace);
>>>> > 454
>>>> > 455 // Store the live
>>>> > trace in a cache, to be served up
>>>> on
>>>> > /heapz.
>>>> > 456
>>>> >
>>>> _traces_on_last_full_gc->append(trace);
>>>> > 457
>>>> > 458 count++;
>>>> > 459 } else {
>>>> > 460 // If the old trace
>>>> > is no longer live, add it to
>>>
>>>