Low-Overhead Heap Profiling
JC Beyler
jcbeyler at google.com
Mon Sep 25 22:01:45 UTC 2017
Hi all,
After a bit of a break, I am back working on this :). As before, here are
two webrevs:
- Full change set: http://cr.openjdk.java.net/~rasbold/8171119/webrev.09/
- Compared to version 8:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/
(This version is compared to version 8, which I last showed, but ported to
the new folder hierarchy)
In this version I have:
- Handled Thomas' comments from his email of 07/03:
- Merged the logging to be standard
- Fixed up the code a bit where asked
- Added some notes about the code not being thread-safe yet
- Removed additional dead code from the version that modifies
interpreter/c1/c2
- Fixed compiler issues so that it compiles with
--disable-precompiled-headers
- Tested with ./configure --with-boot-jdk=<jdk8>
--with-debug-level=slowdebug --disable-precompiled-headers
Additionally, I added a test to check the sanity of the sampler:
HeapMonitorStatCorrectnessTest
(
http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatCorrectnessTest.java.patch
)
- This allocates a number of arrays and checks that we obtain the number
of samples we want, with an accepted error of 5%. I tested it 100 times and
it passed every time; I can test more if wanted
- Not in the test are the actual numbers I got for the various array
sizes. I ran the program 30 times and parsed the output; here are the
average errors and standard deviations:
1000: 1.28% average; 1.13% standard deviation
10000: 1.59% average; 1.25% standard deviation
100000: 1.26% average; 1.26% standard deviation
This means we were always within about 1~2% of the number of
samples the test expected.
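As a rough illustration of the kind of check the correctness test performs, here is a sketch (hypothetical class and method names; this is not the actual jtreg test code): it derives an expected sample count from the total bytes allocated and an assumed 512KB sampling interval, and accepts a 5% relative error.

```java
// Sketch of the sanity-check idea behind HeapMonitorStatCorrectnessTest.
// The names and the 512KB interval here are assumptions for illustration.
public class SampleCountSanity {
  static final long SAMPLING_INTERVAL = 512 * 1024; // bytes between samples

  // Expected samples for `count` allocations of `arraySize` 4-byte ints,
  // ignoring object headers for simplicity.
  static long expectedSamples(long count, long arraySize) {
    long totalBytes = count * arraySize * 4L;
    return totalBytes / SAMPLING_INTERVAL;
  }

  // Accept the observed sample count if it is within 5% of the expectation.
  static boolean withinTolerance(long observed, long expected) {
    return Math.abs(observed - expected) <= 0.05 * expected;
  }
}
```

For instance, 640,000 allocations of 1000-int arrays is about 2.56GB, so one would expect roughly 4882 samples at a 512KB interval.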
Let me know what you think,
Jc
On Wed, Jul 5, 2017 at 9:31 PM, JC Beyler <jcbeyler at google.com> wrote:
> Hi all,
>
> I apologize, I have not yet handled your remarks but thought this new
> webrev would also be useful to see and comment on perhaps.
>
> Here is the latest webrev; it is generated slightly differently from the
> others since now I'm using webrev.ksh without the -N option:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.08/
>
> And the webrev.07 to webrev.08 diff is here:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07_08/
>
> (Let me know if it works well)
>
> It's a small change between versions but it:
> - provides a fix that makes the average sample rate correct (more on
> that below).
> - fixes the code to actually play nicely with the fast TLAB
> refill
> - cleans up a bit the JVMTI text and now uses jvmtiFrameInfo
> - moves the capability to be onload solo
>
> With this webrev, I've done a small study of the random number generator
> we use here for the sampling rate. I took a small program and it can be
> simplified to:
>
> for (outer loop) {
>   for (inner loop) {
>     int[] tmp = new int[arraySize];
>   }
> }
>
> - I've fixed the outer and inner loops to 800 for this experiment,
> meaning we allocate an array of a given array size 640,000 times.
>
> - Each program provides the average sample size used for the whole
> execution
>
> - Then, I ran each variation 30 times and calculated the average of
> the average sample size used for various array sizes. I selected the array
> size to be one of the following: 1, 10, 100, 1000, 10000.
>
> - Compared to the expected 512KB sample size, the error over 30 runs:
> 1: 4.62% of error
> 10: 3.09% of error
> 100: 0.36% of error
> 1000: 0.1% of error
> 10000: 0.03% of error
>
> What it shows is that the more samples there are, the better the average
> becomes. This is because with an allocation of 1 element per
> array, it takes longer to hit one of the sampling thresholds, so far
> fewer samples are taken. This is seen by
> looking at the sample count statistic I put in. For the same number of
> iterations (800 * 800), the different array sizes provoke:
> 1: 62 samples
> 10: 125 samples
> 100: 788 samples
> 1000: 6166 samples
> 10000: 57721 samples
>
> And of course, the more samples you have, the more sampling intervals you
> pick, which means that your average gets closer to the expected rate.
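The convergence described above can be illustrated with a tiny simulation. This is only a sketch under the assumption that the sampler draws each interval from an exponential distribution with a 512KB mean (a common scheme for Poisson-style heap sampling); it is not the VM's actual generator:

```java
import java.util.Random;

// Toy model: average of n exponentially distributed sampling intervals.
// With few samples (small arrays) the empirical mean wanders; with many
// samples it converges to the configured 512KB mean, matching the trend
// in the error table above.
public class SampleRateConvergence {
  static final double MEAN_INTERVAL = 512.0 * 1024.0;

  static double averageInterval(int n, long seed) {
    Random rng = new Random(seed);
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
      // Inverse-CDF sampling of an exponential with the given mean.
      sum += -MEAN_INTERVAL * Math.log(1.0 - rng.nextDouble());
    }
    return sum / n;
  }

  // Relative error of the empirical mean against the configured mean.
  static double relativeError(int n, long seed) {
    return Math.abs(averageInterval(n, seed) - MEAN_INTERVAL) / MEAN_INTERVAL;
  }
}
```

With roughly 62 samples (the 1-element-array case) the relative error is typically a few percent, while with tens of thousands of samples it usually drops well under 1%, in line with the numbers reported above.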
>
> Thanks,
> Jc
>
> On Thu, Jun 29, 2017 at 10:01 PM, JC Beyler <jcbeyler at google.com> wrote:
>
>> Thanks Robbin,
>>
>> This seems to have worked. When I have the next webrev ready, we will
>> find out, but I'm fairly confident it will work!
>>
>> Thanks again!
>> Jc
>>
>> On Wed, Jun 28, 2017 at 11:46 PM, Robbin Ehn <robbin.ehn at oracle.com>
>> wrote:
>>
>>> Hi JC,
>>>
>>> On 06/29/2017 12:15 AM, JC Beyler wrote:
>>>
>>>> B) Incremental changes
>>>>
>>>
>>> I guess the most common workflow here is using mq:
>>> hg qnew fix_v1
>>> edit files
>>> hg qrefresh
>>> hg qnew fix_v2
>>> edit files
>>> hg qrefresh
>>>
>>> if you do hg log you will see 2 commits
>>>
>>> webrev.ksh -r -2 -o my_inc_v1_v2
>>> webrev.ksh -o my_full_v2
>>>
>>>
>>> In your .hgrc you might need:
>>> [extensions]
>>> mq =
>>>
>>> /Robbin
>>>
>>>
>>>> Again another newbie question here...
>>>>
>>>> For showing the incremental changes, is there a link that explains how
>>>> to do that? I apologize for my newbie questions all the time :)
>>>>
>>>> Right now, I do:
>>>>
>>>> ksh ../webrev.ksh -m -N
>>>>
>>>> That generates a webrev.zip and send it to Chuck Rasbold. He then
>>>> uploads it to a new webrev.
>>>>
>>>> I tried committing my change and adding a small change. Then if I just
>>>> do ksh ../webrev.ksh without any options, it seems to produce a similar
>>>> page but now with only the changes I had (so the 06-07 comparison you were
>>>> talking about) and a changeset that has it all. I imagine that is what you
>>>> meant.
>>>>
>>>> Which means that my workflow would become:
>>>>
>>>> 1) Make changes
>>>> 2) Make a webrev without any options to show just the differences with
>>>> the tip
>>>> 3) Amend my changes to my local commit so that I have it done with
>>>> 4) Go to 1
>>>>
>>>> Does that seem correct to you?
>>>>
>>>> Note that when I do this, I only see the full change of a file in the
>>>> full change set (Side note here: now the page says change set and not
>>>> patch, which is maybe why Serguei was having issues?).
>>>>
>>>> Thanks!
>>>> Jc
>>>>
>>>>
>>>>
>>>> On Wed, Jun 28, 2017 at 1:12 AM, Robbin Ehn <robbin.ehn at oracle.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 06/28/2017 12:04 AM, JC Beyler wrote:
>>>>
>>>> Dear Thomas et al,
>>>>
>>>> Here is the newest webrev:
>>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/
>>>>
>>>>
>>>>
>>>> You have some more bits to fix in there, but generally this looks good,
>>>> and really nice with more tests.
>>>> I'll do a deep dive and re-test this when I get back from my long
>>>> vacation, with whatever patch version you have then.
>>>>
>>>> Also I think it's time you provide incremental (v06->07 changes) as
>>>> well as complete change-sets.
>>>>
>>>> Thanks, Robbin
>>>>
>>>>
>>>>
>>>>
>>>> Thomas, I "think" I have answered all your remarks. The summary
>>>> is:
>>>>
>>>> - The statistics system is up and provides insight into what the
>>>> heap sampler is doing
>>>> - I've noticed that, though the sampling rate is at the
>>>> right mean, we are missing some samples; I have not yet tracked down why
>>>> (details below)
>>>>
>>>> - I've run a tiny benchmark that is the worst case: a
>>>> very tight loop allocating a small array
>>>> - In this case, I see no overhead when the system is off,
>>>> so that is a good start :)
>>>> - I see right now a high overhead in this case when
>>>> sampling is on. This is not really too surprising, but I'm going to see if
>>>> this is consistent with our
>>>> internal implementation. The benchmark is really allocation
>>>> stressful, so I'm not too surprised, but I want to do the due diligence.
>>>>
>>>> - The statistics system is up and I have a new test:
>>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTest.java.patch
>>>> - I did a bit of a study about the random generator here,
>>>> more details are below but basically it seems to work well
>>>>
>>>> - I added a capability but since this is the first time
>>>> doing this, I was not sure I did it right
>>>> - I did add a test though for it and the test seems to do
>>>> what I expect (all methods are failing with the
>>>> JVMTI_ERROR_MUST_POSSESS_CAPABILITY error).
>>>> - http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapabilityTest.java.patch
>>>>
>>>> - I still need to figure out what to do about the
>>>> multi-agent vs single-agent issue
>>>>
>>>> - As far as measurements, it seems I still need to look at:
>>>> - Why we do the 20 random calls first, are they necessary?
>>>> - Look at the mean of the sampling rate that the random
>>>> generator does and also what is actually sampled
>>>> - What is the overhead in terms of memory/performance when
>>>> on?
>>>>
>>>> I have inlined my answers, I think I got them all in the new
>>>> webrev, let me know your thoughts.
>>>>
>>>> Thanks again!
>>>> Jc
>>>>
>>>>
>>>> On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl
>>>> <thomas.schatzl at oracle.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote:
>>>> > Hi all,
>>>> >
>>>> > First off: Thanks again to Robbin and Thomas for their
>>>> reviews :)
>>>> >
>>>> > Next, I've uploaded a new webrev:
>>>> > http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/
>>>>
>>>> >
>>>> > Here is an update:
>>>> >
>>>> > - @Robbin, I forgot to say that yes I need to look at
>>>> implementing
>>>> > this for the other architectures and testing it before
>>>> it is all
>>>> > ready to go. Is it common to have it working on all
>>>> possible
>>>> > combinations or is there a subset that I should be doing
>>>> first and we
>>>> > can do the others later?
>>>> > - I've tested slowdebug, built and ran the JTreg tests I
>>>> wrote with
>>>> > slowdebug and fixed a few more issues
>>>> > - I've refactored a bit of the code following Thomas'
>>>> comments
>>>> > - I think I've handled all the comments from Thomas
>>>> (I put
>>>> > comments inline below for the specifics)
>>>>
>>>> Thanks for handling all those.
>>>>
>>>> > - Following Thomas' comments on statistics, I want to
>>>> add some
>>>> > quality assurance tests and find that the easiest way
>>>> would be to
>>>> > have a few counters of what is happening in the sampler
>>>> and expose
>>>> > that to the user.
>>>> > - I'll be adding that in the next version if no one
>>>> sees any
>>>> > objections to that.
>>>> > - This will allow me to add a sanity test in JTreg
>>>> about number of
>>>> > samples and average of sampling rate
>>>> >
>>>> > @Thomas: I had a few questions that I inlined below but
>>>> I will
>>>> > summarize the "bigger ones" here:
>>>> > - You mentioned constants are not using the right
>>>> conventions, I
>>>> > looked around and didn't see any convention except
>>>> normal naming then
>>>> > for static constants. Is that right?
>>>>
>>>> I looked through
>>>> https://wiki.openjdk.java.net/display/HotSpot/StyleGuide
>>>> and the rule is to "follow an existing pattern and must have a
>>>> distinct appearance from other names". Which does not help a lot I
>>>> guess :/ The GC team started using upper camel case, e.g.
>>>> SomeOtherConstant, but very likely this is probably not applied
>>>> consistently throughout. So I am fine with not adding another style
>>>> (like kMaxStackDepth with the "k" in front with some unknown meaning).
>>>>
>>>> (Chances are you will find that style somewhere used
>>>> anyway too,
>>>> apologies if so :/)
>>>>
>>>>
>>>> Thanks for that link, now I know where to look. I used the
>>>> upper camel case in my code as well then :) I should have gotten them all.
>>>>
>>>>
>>>> > PS: I've also inlined my answers to Thomas below:
>>>> >
>>>> > On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl
>>>> > <thomas.schatzl at oracle.com> wrote:
>>>> > > Hi all,
>>>> > >
>>>> > > On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote:
>>>> > > > Dear all,
>>>> > > >
>>>> > > > I've continued working on this and have done the
>>>> following
>>>> > > webrev:
>>>> > > > http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/
>>>>
>>>> > >
>>>> > > [...]
>>>> > > > Things I still need to do:
>>>> > > > - Have to fix that TLAB case for the
>>>> FastTLABRefill
>>>> > > > - Have to start looking at the data to see that
>>>> it is
>>>> > > consistent and does gather the right samples, right
>>>> frequency, etc.
>>>> > > > - Have to check the GC elements and what that
>>>> produces
>>>> > > > - Run a slowdebug run and ensure I fixed all
>>>> those issues you
>>>> > > saw > Robbin
>>>> > > >
>>>> > > > Thanks for looking at the webrev and have a great
>>>> week!
>>>> > >
>>>> > > scratching a bit on the surface of this change, so
>>>> apologies for
>>>> > > rather shallow comments:
>>>> > >
>>>> > > - macroAssembler_x86.cpp:5604: while this is compiler
>>>> code, and I
>>>> > > am not sure this is final, please avoid littering the
>>>> code with
>>>> > > TODO remarks :) They tend to be candidates for later
>>>> wtf moments
>>>> > > only.
>>>> > >
>>>> > > Just file a CR for that.
>>>> > >
>>>> > Newcomer question: what is a CR? And I'm not sure I have the
>>>> > rights to do that yet :)
>>>>
>>>> Apologies. CR is a change request, this suggests to file a
>>>> bug in the
>>>> bug tracker. And you are right, you can't just create a
>>>> new account in
>>>> the OpenJDK JIRA yourself. :(
>>>>
>>>>
>>>> Ok good to know, I'll continue with my own todo list but I'll
>>>> work hard on not letting it slip in the webrevs anymore :)
>>>>
>>>>
>>>> I was mostly referring to the "... but it is a TODO" part
>>>> of that
>>>> comment in macroassembler_x86.cpp. Comments about the why
>>>> of the code
>>>> are appreciated.
>>>>
>>>> [Note that I now understand that this is to some degree
>>>> still work in
>>>> progress. As long as the final changeset does not contain
>>>> TODOs I am
>>>> fine (and it's not a hard objection; rather, their use in
>>>> "final" code
>>>> is typically limited in my experience)]
>>>>
>>>> 5603 // Currently, if this happens, just set back the
>>>> actual end to
>>>> where it was.
>>>> 5604 // We miss a chance to sample here.
>>>>
>>>> Would be okay, if explaining "this" and the "why" of
>>>> missing a chance
>>>> to sample here would be best.
>>>>
>>>> Like maybe:
>>>>
>>>> // If we needed to refill TLABs, just set the actual end
>>>> point to
>>>> // the end of the TLAB again. We do not sample here
>>>> although we could.
>>>>
>>>> Done with your comment, it works well in my mind.
>>>>
>>>> I am not sure whether "miss a chance to sample" meant "we
>>>> could, but
>>>> consciously don't because it's not that useful" or "it
>>>> would be
>>>> necessary but don't because it's too complicated to do.".
>>>>
>>>> Looking at the original comment once more, I am also not
>>>> sure if that
>>>> comment shouldn't be referring to the "end" variable (not
>>>> actual_end)
>>>> because that's the variable that is responsible for taking
>>>> the sampling
>>>> path? (Going from the member description of
>>>> ThreadLocalAllocBuffer).
>>>>
>>>>
>>>> I've moved this code and it no longer shows up here but the
>>>> rationale and answer was:
>>>>
>>>> So.. Yes, end is the variable provoking the sampling. Actual
>>>> end is the actual end of the TLAB.
>>>>
>>>> What was happening here is that the code is resetting _end to
>>>> point towards the end of the new TLAB. Because we now have _end for
>>>> sampling and _actual_end for
>>>> the actual end, we need to update _actual_end as well.
>>>>
>>>> Normally, were we to do the real work here, we would calculate
>>>> the (end - start) offset, then do:
>>>>
>>>> - Set the new end to: start + (old_end - old_start)
>>>> - Set the actual end as we do now, because it is
>>>> the actual end.
>>>>
>>>> Why is this not done here now anymore?
>>>> - I was still debating which path to take:
>>>> - Do it in the fast refill code, it has its perks:
>>>> - In a world where fast refills are happening all
>>>> the time or a lot, we can augment there the code to do the sampling
>>>> - Remember what we had as an end before leaving the
>>>> slowpath and check on return
>>>> - This is what I'm doing now, it removes the need to
>>>> go fix up all fast refill paths but if you remain in fast refill paths, you
>>>> won't get sampling. I
>>>> have to think of the consequences of that, maybe a future
>>>> change later on?
>>>> - I have the statistics now so I'm going to study
>>>> that
>>>> -> By the way, though my statistics are
>>>> showing I'm missing some samples, if I turn off FastTlabRefill, it is the
>>>> same loss so for now, it seems
>>>> this does not occur in my simple test.
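To make the end/actual_end distinction discussed above concrete, here is a toy model in Java (all names are hypothetical; the real bookkeeping lives in HotSpot's C++ ThreadLocalAllocBuffer): end is pulled in to the next sample point so that allocation trips into the slow path, while actual_end remembers the true end of the buffer.

```java
// Toy bump-pointer TLAB with a sampling "end" in front of the actual end.
// Illustrative only; this is not the HotSpot implementation.
public class TlabModel {
  long top;             // next free address
  long end;             // artificial end used to trigger sampling
  final long actualEnd; // real end of the buffer

  TlabModel(long start, long size, long sampleBytes) {
    this.top = start;
    this.actualEnd = start + size;
    this.end = Math.min(start + sampleBytes, actualEnd);
  }

  // Returns the object address, or -1 when we hit `end`: in the VM this is
  // where allocation falls into the slow path and a sample can be taken.
  long allocate(long bytes) {
    if (top + bytes > end) {
      return -1;
    }
    long obj = top;
    top += bytes;
    return obj;
  }

  // After sampling, move `end` out to the next sample point, never past
  // the actual end of the TLAB.
  void pickNextSample(long sampleBytes) {
    end = Math.min(top + sampleBytes, actualEnd);
  }
}
```

The design question in the thread is then where pickNextSample-style updates happen: in every fast refill path, or only on return from the slow path.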
>>>>
>>>>
>>>>
>>>> But maybe I am only confused and it's best to just leave
>>>> the comment
>>>> away. :)
>>>>
>>>> Thinking about it some more, doesn't this not-sampling in
>>>> this case
>>>> mean that sampling does not work in any collector that
>>>> does inline TLAB
>>>> allocation at the moment? (Or is inline TLAB alloc
>>>> automatically
>>>> disabled with sampling somehow?)
>>>>
>>>> That would indeed be a bigger TODO then :)
>>>>
>>>>
>>>> Agreed, this remark made me think that perhaps as a first step
>>>> the new way of doing it is better but I did have to:
>>>> - Remove the const of the ThreadLocalBuffer remaining and
>>>> hard_end methods
>>>> - Move hard_end out of the header file to have a bit more
>>>> logic there
>>>>
>>>> Please let me know what you think of that and if you prefer it
>>>> this way or changing the fast refills. (I prefer this way now because it is
>>>> more incremental).
>>>>
>>>>
>>>> > > - calling HeapMonitoring::do_weak_oops() (which should
>>>> probably be
>>>> > > called weak_oops_do() like other similar methods) only
>>>> if string
>>>> > > deduplication is enabled (in g1CollectedHeap.cpp:4511)
>>>> seems wrong.
>>>> >
>>>> > The call should be at least around 6 lines up outside
>>>> the if.
>>>> >
>>>> > Preferentially in a method like
>>>> process_weak_jni_handles(), including
>>>> > additional logging. (No new (G1) gc phase without
>>>> minimal logging
>>>> > :)).
>>>> > Done but really not sure because:
>>>> >
>>>> > I put for logging:
>>>> > log_develop_trace(gc, freelist)("G1ConcRegionFreeing
>>>> [other] : heap
>>>> > monitoring");
>>>>
>>>> I would think that "gc, ref" would be more appropriate log
>>>> tags for
>>>> this, similar to jni handles.
>>>> (I am also not sure what weak reference handling has to do
>>>> with
>>>> G1ConcRegionFreeing, so I am a bit puzzled)
>>>>
>>>>
>>>> I was not sure what to put for the tags or really as the
>>>> message. I cleaned it up a bit now to:
>>>> log_develop_trace(gc, ref)("HeapSampling [other] : heap
>>>> monitoring processing");
>>>>
>>>>
>>>>
>>>> > Since weak_jni_handles didn't have logging for me to be
>>>> inspired
>>>> > from, I did that but unconvinced this is what should be
>>>> done.
>>>>
>>>> The JNI handle processing does have logging, but only in
>>>> ReferenceProcessor::process_discovered_references(). In
>>>> process_weak_jni_handles() only overall time is measured
>>>> (in a G1
>>>> specific way, since only G1 supports disabling reference
>>>> processing) :/
>>>>
>>>> The code in ReferenceProcessor prints both time taken
>>>> referenceProcessor.cpp:254, as well as the count, but
>>>> strangely only in
>>>> debug VMs.
>>>>
>>>> I have no idea why this logging is that unimportant to
>>>> only print that
>>>> in a debug VM. However there are reviews out for changing
>>>> this area a
>>>> bit, so it might be useful to wait for that (JDK-8173335).
>>>>
>>>>
>>>> I cleaned it up a bit anyway and now it returns the count of
>>>> objects that are in the system.
>>>>
>>>>
>>>> > > - the change doubles the size of
>>>> > > CollectedHeap::allocate_from_tlab_slow() above the
>>>> "small and nice"
>>>> > > threshold. Maybe it could be refactored a bit.
>>>> > Done I think, it looks better to me :).
>>>>
>>>> In ThreadLocalAllocBuffer::handle_sample() I think the
>>>> set_back_actual_end()/pick_next_sample() calls could be
>>>> hoisted out of
>>>> the "if" :)
>>>>
>>>>
>>>> Done!
>>>>
>>>>
>>>> > > - referenceProcessor.cpp:261: the change should add
>>>> logging about
>>>> > > the number of references encountered, maybe after the
>>>> corresponding
>>>> > > "JNI weak reference count" log message.
>>>> > Just to double check, are you saying that you'd like to
>>>> have the heap
>>>> > sampler to keep in store how many sampled objects were
>>>> encountered in
>>>> > the HeapMonitoring::weak_oops_do?
>>>> > - Would a return of the method with the number of
>>>> handled
>>>> > references and logging that work?
>>>>
>>>> Yes, it's fine if HeapMonitoring::weak_oops_do() only
>>>> returned the
>>>> number of processed weak oops.
>>>>
>>>>
>>>> Done also (but I admit I have not tested the output yet) :)
>>>>
>>>>
>>>> > - Additionally, would you prefer it in a separate
>>>> block with its
>>>> > GCTraceTime?
>>>>
>>>> Yes. Both kinds of information are interesting: while the
>>>> time taken is
>>>> typically more important, the next question would be why,
>>>> and the
>>>> number of references typically goes a long way there.
>>>>
>>>> See above though, it is probably best to wait a bit.
>>>>
>>>>
>>>> Agreed that I "could" wait but, if it's ok, I'll just
>>>> refactor/remove this when we get closer to something final. Either,
>>>> JDK-8173335
>>>> has gone in and I will notice it now or it will soon and I can
>>>> change it then.
>>>>
>>>>
>>>> > > - threadLocalAllocBuffer.cpp:331: one more "TODO"
>>>> > Removed it and added it to my personal todos to look at.
>>>> > > >
>>>> > > - threadLocalAllocBuffer.hpp: ThreadLocalAllocBuffer
>>>> class
>>>> > > documentation should be updated about the sampling
>>>> additions. I
>>>> > > would have no clue what the difference between
>>>> "actual_end" and
>>>> > > "end" would be from the given information.
>>>> > If you are talking about the comments in this file, I
>>>> made them more
>>>> > clear I hope in the new webrev. If it was somewhere
>>>> else, let me know
>>>> > where to change.
>>>>
>>>> Thanks, that's much better. Maybe a note in the comment of
>>>> the class
>>>> that ThreadLocalBuffer provides some sampling facility by
>>>> modifying the
>>>> end() of the TLAB to cause "frequent" calls into the
>>>> runtime call where
>>>> actual sampling takes place.
>>>>
>>>>
>>>> Done, I think it's better now. Added something about the
>>>> slow_path_end as well.
>>>>
>>>>
>>>> > > - in heapMonitoring.hpp: there are some random
>>>> comments about some
>>>> > > code that has been grabbed from
>>>> "util/math/fastmath.[h|cc]". I
>>>> > > can't tell whether this is code that can be used but I
>>>> assume that
>>>> > > Noam Shazeer is okay with that (i.e. that's all Google
>>>> code).
>>>> > Jeremy and I double checked and we can release that as I
>>>> thought. I
>>>> > removed the comment from that piece of code entirely.
>>>>
>>>> Thanks.
>>>>
>>>> > > - heapMonitoring.hpp/cpp static constant naming does
>>>> not correspond
>>>> > > to Hotspot's. Additionally, in Hotspot static methods
>>>> are cased
>>>> > > like other methods.
>>>> > I think I fixed the methods to be cased the same way as
>>>> all other
>>>> > methods. For static constants, I was not sure. I fixed a
>>>> few other
>>>> > variables but I could not seem to really see a
>>>> consistent trend for
>>>> > constants. I made them as variables but I'm not sure now.
>>>>
>>>> Sorry again, style is a kind of mess. The goal of my
>>>> suggestions here
>>>> is only to prevent yet another style creeping in.
>>>>
>>>> > > - in heapMonitoring.cpp there are a few cryptic
>>>> comments at the top
>>>> > > that seem to refer to internal stuff that should
>>>> probably be
>>>> > > removed.
>>>> > Sorry about that! My personal todos not cleared out.
>>>>
>>>> I am happy about comments, but I simply did not understand
>>>> any of that
>>>> and I do not know about other readers as well.
>>>>
>>>> If you think you will remember removing/updating them
>>>> until the review
>>>> proper (I misunderstood the review situation a little it
>>>> seems).
>>>>
>>>> > > I did not think through the impact of the TLAB changes
>>>> on collector
>>>> > > behavior yet (if there are). Also I did not check for
>>>> problems with
>>>> > > concurrent mark and SATB/G1 (if there are).
>>>> > I would love to know your thoughts on this, I think this
>>>> is fine. I
>>>>
>>>> I think so too now. No objects are made live out of thin
>>>> air :)
>>>>
>>>> > see issues with multiple threads right now hitting the
>>>> stack storage
>>>> > instance. Previous webrevs had a mutex lock here but we
>>>> took it out
>>>> > for simplicity (and only for now).
>>>>
>>>> :) When looking at this after some thinking I now assume
>>>> for this
>>>> review that this code is not MT safe at all. There seems
>>>> to be more
>>>> synchronization missing than just the one for the
>>>> StackTraceStorage. So
>>>> no comments about this here.
>>>>
>>>>
>>>> I double checked a bit (quickly I admit) but it seems that
>>>> synchronization in StackTraceStorage is really all you need (all methods
>>>> lead to a StackTraceStorage one
>>>> and can be multithreaded outside of that).
>>>> There is a question about the initialization where the method
>>>> HeapMonitoring::initialize_profiling is not thread safe.
>>>> It would work (famous last words) and not crash if there was a
>>>> race but we could add a synchronization point there as well (and therefore
>>>> on the stop as well).
>>>>
>>>> But anyway I will really check and do this once we add back
>>>> synchronization.
>>>>
>>>>
>>>> Also, this would require some kind of specification of
>>>> what is allowed
>>>> to be called when and where.
>>>>
>>>>
>>>> Would we specify this with the methods in the jvmti.xml file?
>>>> We could start by specifying in each that they are not thread safe but I
>>>> saw no mention of that for
>>>> other methods.
>>>>
>>>>
>>>> One potentially relevant observation about locking here:
>>>> depending on
>>>> sampling frequency, StackTraceStore::add_trace() may be
>>>> rather
>>>> frequently called. I assume that you are going to do
>>>> measurements :)
>>>>
>>>>
>>>> Though we don't have the TLAB implementation in our code, the
>>>> compiler-generated sampler incurs 2% overhead with a 512k sampling rate. I
>>>> can do real measurements
>>>> when the code settles and we can see how costly this is as a
>>>> TLAB implementation.
>>>> However, my theory is that if the rate is 512k, the
>>>> memory/performance overhead should be minimal since it is what we saw with
>>>> our code/workloads (though not called
>>>> the same way, we call it essentially at the same rate).
>>>> If you have a benchmark you'd like me to test, let me know!
>>>>
>>>> Right now, with my really small test, this does use a bit of
>>>> overhead even for a 512k sample size. I don't know yet why, I'm going to
>>>> see what is going on.
>>>>
>>>> Finally, I think it is not reasonable to suppose the overhead
>>>> to be negligible if the sampling rate used is too low. The user should know
>>>> that the lower the rate,
>>>> the higher the overhead (documentation TODO?).
>>>>
>>>>
>>>> I am not sure what the expected usage of the API is, but
>>>> StackTraceStore::add_trace() seems to be able to grow
>>>> without bounds.
>>>> Only a GC truncates them to the live ones. That in itself
>>>> seems to be
>>>> problematic (GCs can be *wide* apart), and of course some
>>>> of the API
>>>> methods add to that because they duplicate that unbounded
>>>> array. Do you
>>>> have any concerns/measurements about this?
>>>>
>>>>
>>>> So, the theory is that yes, add_trace can grow
>>>> without bounds, but it grows at a sample per 512k of allocated space. The
>>>> stacks it gathers are currently
>>>> maxed at 64 (I'd like to expand that to an option to the user
>>>> though at some point). So I have no concerns because:
>>>>
>>>> - If really this is taking a lot of space, that means the job
>>>> is keeping a lot of objects in memory as well, therefore the entire heap is
>>>> getting huge
>>>> - If this is the case, you will be triggering a GC at some
>>>> point anyway.
>>>>
>>>> (I'm putting under the rug the issue of "What if we set the
>>>> rate to 1 for example" because as you lower the sampling rate, we cannot
>>>> guarantee low overhead; the
>>>> idea behind this feature is to have a means of having
>>>> meaningful allocated samples at a low overhead)
>>>>
>>>> I have no measurements really right now but since I now have
>>>> some statistics I can poll, I will look a bit more at this question.
>>>>
>>>> I have the same last sentence as above: the user should
>>>> expect this to happen if the sampling rate is too small. That probably can
>>>> be reflected in the
>>>> StartHeapSampling as a note : careful this might impact your
>>>> performance.
>>>>
>>>>
>>>> Also, these stack traces might hold on to huge arrays. Any
>>>> consideration of that? Particularly it might be the cause
>>>> for OOMEs in
>>>> tight memory situations.
>>>>
>>>>
>>>> There is a stack size maximum that is set to 64 so it should
>>>> not hold huge arrays. I don't think this is an issue but I can double check
>>>> with a test or two.
>>>>
>>>>
>>>> - please consider adding a safepoint check in
>>>> HeapMonitoring::weak_oops_do to prevent accidental misuse.
>>>>
>>>> - in struct StackTraceStorage, the public fields may also
>>>> need
>>>> underscores. At least some files in the runtime directory
>>>> have structs
>>>> with underscored public members (and some don't). The
>>>> runtime team
>>>> should probably comment on that.
>>>>
>>>>
>>>> Agreed, I did not know. I looked around and it seemed a lot of structs
>>>> did not have them, so I left it as is. I will happily change it if
>>>> someone prefers (I was not
>>>> sure if you really preferred it or not; your sentence seemed to be
>>>> more a note of "this might need to change but I don't know if the runtime
>>>> team enforces that"; let
>>>> me know if I read that wrongly).
>>>>
>>>>
>>>> - In StackTraceStorage::weak_oops_do(), when examining the
>>>> StackTraceData, maybe it is useful to consider having a
>>>> non-NULL
>>>> reference outside of the heap's reserved space an error.
>>>> There should
>>>> be no oop outside of the heap's reserved space ever.
>>>>
>>>> Unless you allow storing random values in
>>>> StackTraceData::obj, which I
>>>> would not encourage.
>>>>
>>>>
>>>> I suppose you are talking about this part:
>>>> if ((value != NULL && Universe::heap()->is_in_reserved(value)) &&
>>>>     (is_alive == NULL || is_alive->do_object_b(value))) {
>>>>
>>>> What you are saying is that I could have something like:
>>>> if (value != my_non_null_reference &&
>>>> (is_alive == NULL || is_alive->do_object_b(value)))
>>>> {
>>>>
>>>> Is that what you meant? Is there really a reason to do so? When
>>>> I look at the code, is_in_reserved seems like a O(1) method call. I'm not
>>>> even sure we can have a
>>>> NULL value to be honest. I might have to study that to see if
>>>> this was not a paranoid test to begin with.
>>>>
>>>> The is_alive code has now morphed due to the comment below.
>>>>
>>>>
>>>>
>>>> - HeapMonitoring::weak_oops_do() does not seem to use the
>>>> passed AbstractRefProcTaskExecutor.
>>>>
>>>>
>>>> It did use it:
>>>>
>>>>   size_t HeapMonitoring::weak_oops_do(
>>>>       AbstractRefProcTaskExecutor* task_executor,
>>>>       BoolObjectClosure* is_alive,
>>>>       OopClosure* f,
>>>>       VoidClosure* complete_gc) {
>>>>     assert(SafepointSynchronize::is_at_safepoint(), "must be at safepoint");
>>>>
>>>>     if (task_executor != NULL) {
>>>>       task_executor->set_single_threaded_mode();
>>>>     }
>>>>     return StackTraceStorage::storage()->weak_oops_do(is_alive, f, complete_gc);
>>>>   }
>>>>
>>>> But due to the comment below, I refactored this, so this is no
>>>> longer here. Now I have an always true closure that is passed.
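>>>> The always-true closure pattern looks roughly like this (a minimal
>>>> sketch with simplified stand-ins for HotSpot's oop and
>>>> BoolObjectClosure types, not the actual patch code):

```cpp
#include <cstddef>

// Simplified stand-in for HotSpot's oop type, for illustration only.
typedef void* oop;

class BoolObjectClosure {
 public:
  virtual bool do_object_b(oop obj) = 0;
  virtual ~BoolObjectClosure() {}
};

// The "always true" closure: every object is reported live, so callers
// that have no liveness filter pass this instead of NULL.
class AlwaysTrueClosure : public BoolObjectClosure {
 public:
  virtual bool do_object_b(oop /*obj*/) { return true; }
};

// With a guaranteed non-NULL closure, call sites lose their NULL checks:
bool is_live(BoolObjectClosure* is_alive, oop obj) {
  return is_alive->do_object_b(obj);
}
```

>>>> This keeps the iteration code on a single path whether or not a real
>>>> liveness filter was supplied.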
>>>>
>>>>
>>>> - I do not understand allowing to call this method with a
>>>> NULL
>>>> complete_gc closure. This would mean that objects
>>>> referenced from the
>>>> object that is referenced by the StackTraceData are not
>>>> pulled, meaning
>>>> they would get stale.
>>>>
>>>> - same with is_alive parameter value of NULL
>>>>
>>>>
>>>> These questions made me look a bit closer at this code. I think it
>>>> was written this way to have a very small impact on the file, but you
>>>> are right, there is no reason for that here. I've simplified the code
>>>> by adding a process_HeapSampling method in referenceProcessor.cpp that
>>>> handles everything there.
>>>>
>>>> The code allowed NULLs because it depended on where you were coming
>>>> from and how the code was being called.
>>>>
>>>> - I added a static always_true variable and pass that now to be
>>>> more consistent with the rest of the code.
>>>> - I moved the complete_gc into process_phaseHeapSampling now
>>>> (new method) and handle the task_executor and the complete_gc there
>>>> - Newbie question: in our code we did a set_single_threaded_mode,
>>>> but I see that process_phaseJNI does it right before its call; do I
>>>> need to do the same for process_phaseHeapSample?
>>>> That API is much cleaner (in my mind) and is consistent with
>>>> what is done around it (again in my mind).
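>>>> A hedged sketch of that shape (hypothetical names and simplified
>>>> types, not the real referenceProcessor.cpp API): the entry point owns
>>>> the executor mode switch and always runs complete_gc itself, so
>>>> callers never pass NULL closures:

```cpp
#include <cstddef>

class VoidClosure {
 public:
  virtual void do_void() = 0;
  virtual ~VoidClosure() {}
};

// Simplified stand-in for AbstractRefProcTaskExecutor.
class TaskExecutor {
 public:
  bool single_threaded;
  TaskExecutor() : single_threaded(false) {}
  void set_single_threaded_mode() { single_threaded = true; }
};

int g_completed = 0;  // counts complete_gc runs, for demonstration

class CountingComplete : public VoidClosure {
 public:
  virtual void do_void() { g_completed++; }
};

// Hypothetical process_phaseHeapSampling-style entry point: the executor
// is optional, but complete_gc is required and always invoked.
void process_heap_sampling(TaskExecutor* executor, VoidClosure* complete_gc) {
  if (executor != NULL) {
    executor->set_single_threaded_mode();
  }
  // ... walk the sampled-object table here ...
  complete_gc->do_void();  // always runs; stale references get pulled
}
```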
>>>>
>>>>
>>>> - heapMonitoring.cpp:590: I do not completely understand
>>>> the purpose of this code: in the end this results in a fixed
>>>> value directly dependent on the Thread address anyway?
>>>> IOW, what is special about exactly 20 rounds?
>>>>
>>>>
>>>> We really want a fast random number generator with a specific mean
>>>> (512k is the default we use). The code uses the thread address as the
>>>> start of the sequence (the rationale being: why not, it is random
>>>> enough). Then, instead of just starting there, we prime the sequence
>>>> and really only start at the 21st number; that is arbitrary, and I
>>>> have not done a study to see if we could do more or less of that.
>>>>
>>>> As I have the statistics of the system up and running, I'll run
>>>> some experiments to see whether the priming is needed and whether 20
>>>> is a good number.
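>>>> To illustrate (a simplified sketch, not the heapMonitoring.cpp code;
>>>> the LCG constants and the shift amount are assumptions for the
>>>> example), here is the scheme with the priming rounds, combined with
>>>> Thomas' suggestion below of stripping the always-zero low alignment
>>>> bits from the seed:

```cpp
#include <cstdint>

const int kPrimingRounds = 20;  // the arbitrary 20 rounds discussed above

// One step of a standard 64-bit LCG (Knuth's MMIX constants); any fast
// generator with the desired distribution would do here.
uint64_t next_random(uint64_t state) {
  return state * 6364136223846793005ULL + 1442695040888963407ULL;
}

// Hypothetical initializer: seed from the thread address, then discard
// the first kPrimingRounds outputs before the sequence is used.
uint64_t init_sampling_rng(const void* thread_addr) {
  // Thread objects are allocated on aligned boundaries, so the low three
  // (or more) bits are always zero; shift them out of the seed.
  uint64_t seed = (uint64_t)(uintptr_t)thread_addr >> 3;
  for (int i = 0; i < kPrimingRounds; i++) {
    seed = next_random(seed);
  }
  return seed;
}
```

>>>> The priming is meant to decorrelate nearby thread addresses, which
>>>> otherwise start the sequence at nearly identical points.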
>>>>
>>>>
>>>> - also I would consider stripping a few bits of the
>>>> threads' address as
>>>> initialization value for your rng. The last three bits
>>>> (and probably
>>>> more, check whether the Thread object is allocated on
>>>> special
>>>> boundaries) are always zero for them.
>>>> Not sure if the given "random" value is random enough
>>>> before/after this method, so just skip this comment if you
>>>> think it is not required.
>>>>
>>>>
>>>> I don't know, is the honest answer. I think what is important is
>>>> that we tend towards the mean and that the sequence is random "enough"
>>>> not to fall into the pitfall of only sampling a subset of objects due
>>>> to their allocation order. I added this as a test to do, to see if it
>>>> changes the mean in any way for the 512k default value and/or if the
>>>> first 1000 elements look better.
>>>>
>>>>
>>>> Some more random nits I did not find a place to put
>>>> anywhere:
>>>>
>>>> - ThreadLocalAllocBuffer::_extra_space does not seem to
>>>> be used
>>>> anywhere?
>>>>
>>>>
>>>> Good catch :).
>>>>
>>>>
>>>> - Maybe indent the declaration of
>>>> ThreadLocalAllocBuffer::_bytes_until_sample to align below the other
>>>> members of that group.
>>>>
>>>>
>>>> Done; I moved it up a bit to keep the non-static members together
>>>> and the static ones separate.
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>> Thanks for your review!
>>>> Jc
>>>>
>>>>
>>>>
>>
>
More information about the serviceability-dev
mailing list