Low-Overhead Heap Profiling
Robbin Ehn
robbin.ehn at oracle.com
Mon Oct 16 15:46:07 UTC 2017
Hi JC,
I saw a webrev.12 in the directory, with only test changes (11->12), so I took that version.
I had a look and ran the tests; they worked fine!
At first glance at the code (looking at full v12), some minor things below, mostly unused stuff.
Thanks, Robbin
diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.cpp
--- a/src/hotspot/share/runtime/heapMonitoring.cpp Mon Oct 16 16:54:06 2017 +0200
+++ b/src/hotspot/share/runtime/heapMonitoring.cpp Mon Oct 16 17:42:42 2017 +0200
@@ -211,2 +211,3 @@
void initialize(int max_storage) {
+ // Validate max_storage to a sane value? What would 0 mean?
MutexLocker mu(HeapMonitor_lock);
@@ -227,8 +228,4 @@
bool initialized() { return _initialized; }
- volatile bool *initialized_address() { return &_initialized; }
private:
- // Protects the traces currently sampled (below).
- volatile intptr_t _stack_storage_lock[1];
-
// The traces currently sampled.
@@ -313,3 +310,2 @@
_initialized(false) {
- _stack_storage_lock[0] = 0;
}
@@ -532,13 +528,2 @@
-// Delegate the initialization question to the underlying storage system.
-bool HeapMonitoring::initialized() {
- return StackTraceStorage::storage()->initialized();
-}
-
-// Delegate the initialization question to the underlying storage system.
-bool *HeapMonitoring::initialized_address() {
- return
- const_cast<bool*>(StackTraceStorage::storage()->initialized_address());
-}
-
void HeapMonitoring::get_live_traces(jvmtiStackTraces *traces) {
diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.hpp
--- a/src/hotspot/share/runtime/heapMonitoring.hpp Mon Oct 16 16:54:06 2017 +0200
+++ b/src/hotspot/share/runtime/heapMonitoring.hpp Mon Oct 16 17:42:42 2017 +0200
@@ -35,3 +35,2 @@
static uint64_t _rnd;
- static bool _initialized;
static jint _monitoring_rate;
@@ -92,7 +91,2 @@
- // Is the profiler initialized and where is the address to the initialized
- // boolean.
- static bool initialized();
- static bool *initialized_address();
-
// Called when o is to be sampled from a given thread and a given size.
On 10/10/2017 12:57 AM, JC Beyler wrote:
> Dear all,
>
> Thread-safety is back!! Here is the update webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/
>
> Full webrev is here:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.11/
>
> In order to really test thread safety, I needed to add it, so I thought now was a good time. It required a few changes to the creation path to ensure correctness and safety. Now we
> keep the static pointer but clear the data internally, so re-initialization will be a bit more costly than before. I don't think this is a huge use case, so I did not
> think it was a problem. I used the internal MutexLocker; I think I used it well, let me know.
>
> I also added three tests:
>
> 1) Stack depth test:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStackDepthTest.java.patch
>
> This test shows that the maximum stack depth system is working.
>
> 2) Thread safety:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadTest.java.patch
>
> The test creates 24 threads and they all allocate at the same time. The test then checks that it finds samples from all the threads.
>
> 3) Thread on/off safety
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadOnOffTest.java.patch
>
> The test creates 24 threads that all allocate a bunch of memory. Then another thread turns the sampling on/off.
>
> Btw, both tests 2 & 3 failed without the locks.
>
> As I worked on this, I saw a lot of places where the tests are doing very similar things; I'm going to clean up the code a bit and make a HeapAllocator class that all tests
> can call directly. This will greatly simplify the code.
>
> Thanks for any comments/criticisms!
> Jc
>
>
> On Mon, Oct 2, 2017 at 8:52 PM, JC Beyler <jcbeyler at google.com <mailto:jcbeyler at google.com>> wrote:
>
> Dear all,
>
> Small update to the webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.09_10/ <http://cr.openjdk.java.net/~rasbold/8171119/webrev.09_10/>
>
> Full webrev is here:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/ <http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/>
>
> I updated a bit of the naming, removed a TODO comment, and I added a test for testing the sampling rate. I also updated the maximum stack depth to 1024, there is no
> reason to keep it so small. I did a micro benchmark that tests the overhead and it seems relatively the same.
>
> I compared allocations from a stack depth of 10 and allocations from a stack depth of 1024 (allocations are from the same helper method in
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatRateTest.java
> <http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatRateTest.java>):
> - For an array of 1 integer allocated in a loop; stack depth 1024 vs stack depth 10: 1% slower
> - For an array of 200k integers allocated in a loop; stack depth 1024 vs stack depth 10: 3% slower
>
> So basically the maximum stack depth is now 1024, but we only copy over the stack frames actually used.
>
> For the next webrev, I will be adding a stack depth test to show that it works and probably put back the mutex locking so that we can see how difficult it is to keep
> thread safe.
>
> Let me know what you think!
> Jc
>
>
>
> On Mon, Sep 25, 2017 at 3:02 PM, JC Beyler <jcbeyler at google.com <mailto:jcbeyler at google.com>> wrote:
>
> Forgot to say that for my numbers:
> - Not included in the test are the actual numbers I got for the various array sizes; I ran the program 30 times and parsed the output; here are the averages and standard
> deviations:
> 1000: 1.28% average; 1.13% standard deviation
> 10000: 1.59% average; 1.25% standard deviation
> 100000: 1.26% average; 1.26% standard deviation
>
> The 1000/10000/100000 are the sizes of the arrays being allocated. These are allocated 100k times and the sampling rate is 111 times the size of the array.
>
> Thanks!
> Jc
>
>
> On Mon, Sep 25, 2017 at 3:01 PM, JC Beyler <jcbeyler at google.com <mailto:jcbeyler at google.com>> wrote:
>
> Hi all,
>
> After a bit of a break, I am back working on this :). As before, here are two webrevs:
>
> - Full change set: http://cr.openjdk.java.net/~rasbold/8171119/webrev.09/ <http://cr.openjdk.java.net/~rasbold/8171119/webrev.09/>
> - Compared to version 8: http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/ <http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/>
> (This version is compared to version 8 I last showed but ported to the new folder hierarchy)
>
> In this version I have:
> - Handled Thomas' comments from his email of 07/03:
> - Merged the logging to be standard
> - Fixed up the code a bit where asked
> - Added some notes about the code not being thread-safe yet
> - Removed additional dead code from the version that modifies interpreter/c1/c2
> - Fixed compiler issues so that it compiles with --disable-precompiled-header
> - Tested with ./configure --with-boot-jdk=<jdk8> --with-debug-level=slowdebug --disable-precompiled-headers
>
> Additionally, I added a test to check the sanity of the sampler: HeapMonitorStatCorrectnessTest
> (http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatCorrectnessTest.java.patch <http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatCorrectnessTest.java.patch>)
> - This allocates a number of arrays and checks that we obtain the number of samples we want with an accepted error of 5%. I tested it 100 times and it
> passed every time; I can test more if wanted
> - Not included in the test are the actual numbers I got for the various array sizes; I ran the program 30 times and parsed the output; here are the averages and
> standard deviations:
> 1000: 1.28% average; 1.13% standard deviation
> 10000: 1.59% average; 1.25% standard deviation
> 100000: 1.26% average; 1.26% standard deviation
>
> What this means is that we were always at about 1~2% of the number of samples the test expected.
>
> Let me know what you think,
> Jc
>
> On Wed, Jul 5, 2017 at 9:31 PM, JC Beyler <jcbeyler at google.com <mailto:jcbeyler at google.com>> wrote:
>
> Hi all,
>
> I apologize, I have not yet handled your remarks but thought this new webrev would also be useful to see and comment on perhaps.
>
> Here is the latest webrev, it is generated slightly different than the others since now I'm using webrev.ksh without the -N option:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.08/ <http://cr.openjdk.java.net/~rasbold/8171119/webrev.08/>
>
> And the webrev.07 to webrev.08 diff is here:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07_08/ <http://cr.openjdk.java.net/~rasbold/8171119/webrev.07_08/>
>
> (Let me know if it works well)
>
> It's a small change between versions but it:
> - provides a fix that makes the average sample rate correct (more on that below).
> - fixes the code to actually have it play nicely with the fast tlab refill
> - cleaned up a bit the JVMTI text and now use jvmtiFrameInfo
> - moved the capability to be onload solo
>
> With this webrev, I've done a small study of the random number generator we use here for the sampling rate. I took a small program and it can be simplified to:
>
> for (outer loop)
> for (inner loop)
> int[] tmp = new int[arraySize];
>
> - I've fixed the outer and inner loops to 800 for this experiment, meaning we allocate an array of a given size 640,000 times.
>
> - Each program provides the average sample size used for the whole execution
>
> - Then, I ran each variation 30 times and calculated the average of the average sample size used for various array sizes. I selected the array size to
> be one of the following: 1, 10, 100, 1000, 10000.
>
> - When compared to 512kb, the average sample size of 30 runs:
> 1: 4.62% of error
> 10: 3.09% of error
> 100: 0.36% of error
> 1000: 0.1% of error
> 10000: 0.03% of error
>
> What it shows is that, depending on the number of samples, the average does become better. This is because with an allocation of 1 element per array, it
> will take longer to hit one of the thresholds. This is seen by looking at the sample count statistic I put in. For the same number of iterations (800 *
> 800), the different array sizes produce:
> 1: 62 samples
> 10: 125 samples
> 100: 788 samples
> 1000: 6166 samples
> 10000: 57721 samples
>
> And of course, the more samples you have, the more sample intervals you draw, which means that your average gets closer to the configured rate.
>
> Thanks,
> Jc
>
> On Thu, Jun 29, 2017 at 10:01 PM, JC Beyler <jcbeyler at google.com <mailto:jcbeyler at google.com>> wrote:
>
> Thanks Robbin,
>
> This seems to have worked. When I have the next webrev ready, we will find out but I'm fairly confident it will work!
>
> Thanks again!
> Jc
>
> On Wed, Jun 28, 2017 at 11:46 PM, Robbin Ehn <robbin.ehn at oracle.com <mailto:robbin.ehn at oracle.com>> wrote:
>
> Hi JC,
>
> On 06/29/2017 12:15 AM, JC Beyler wrote:
>
> B) Incremental changes
>
>
> I guess the most common workflow here is using mq:
> hg qnew fix_v1
> edit files
> hg qrefresh
> hg qnew fix_v2
> edit files
> hg qrefresh
>
> if you do hg log you will see 2 commits
>
> webrev.ksh -r -2 -o my_inc_v1_v2
> webrev.ksh -o my_full_v2
>
>
> In your .hgrc you might need:
> [extensions]
> mq =
>
> /Robbin
>
>
> Again another newbie question here...
>
> For showing the incremental changes, is there a link that explains how to do that? I apologize for my newbie questions all the time :)
>
> Right now, I do:
>
> ksh ../webrev.ksh -m -N
>
> That generates a webrev.zip and send it to Chuck Rasbold. He then uploads it to a new webrev.
>
> I tried committing my change and adding a small change. Then if I just do ksh ../webrev.ksh without any options, it seems to produce a similar
> page but now with only the changes I had (so the 06-07 comparison you were talking about) and a changeset that has it all. I imagine that is
> what you meant.
>
> Which means that my workflow would become:
>
> 1) Make changes
> 2) Make a webrev without any options to show just the differences with the tip
> 3) Amend my changes to my local commit so that I have it done with
> 4) Go to 1
>
> Does that seem correct to you?
>
> Note that when I do this, I only see the full change of a file in the full change set (Side note here: now the page says change set and not
> patch, which is maybe why Serguei was having issues?).
>
> Thanks!
> Jc
>
>
>
> On Wed, Jun 28, 2017 at 1:12 AM, Robbin Ehn <robbin.ehn at oracle.com <mailto:robbin.ehn at oracle.com>> wrote:
>
> Hi,
>
> On 06/28/2017 12:04 AM, JC Beyler wrote:
>
> Dear Thomas et al,
>
> Here is the newest webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/ <http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/>
>
>
>
> You have some more bits in there, but generally this looks good and really nice with more tests.
> I'll do a deep dive and re-test this when I get back from my long vacation, with whatever patch version you have then.
>
> Also I think it's time you provide incremental (v06->07 changes) as well as complete change-sets.
>
> Thanks, Robbin
>
>
>
>
> Thomas, I "think" I have answered all your remarks. The summary is:
>
> - The statistic system is up and provides insight on what the heap sampler is doing
> - I've noticed that, though the sampling rate is at the right mean, we are missing some samples; I have not yet tracked down why
> (details below)
>
> - I've run a tiny benchmark that is the worst case: it is a very tight loop that allocates a small array
> - In this case, I see no overhead when the system is off, so that is a good start :)
> - I see right now a high overhead in this case when sampling is on. This is not really too surprising, but I'm going to see if
> this is consistent with our
> internal implementation. The benchmark is really allocation-stressful, so I'm not too surprised, but I want to do the due diligence.
>
> - The statistic system is up and I have a new test
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTest.java.patch
> - I did a bit of a study about the random generator here, more details are below but basically it seems to work well
>
> - I added a capability but since this is the first time doing this, I was not sure I did it right
> - I did add a test though for it and the test seems to do what I expect (all methods are failing with the
> JVMTI_ERROR_MUST_POSSESS_CAPABILITY error).
> -
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapabilityTest.java.patch
>
> - I still need to figure out what to do about the multi-agent vs single-agent issue
>
> - As far as measurements, it seems I still need to look at:
> - Why we do the 20 random calls first, are they necessary?
> - Look at the mean of the sampling rate that the random generator does and also what is actually sampled
> - What is the overhead in terms of memory/performance when on?
>
> I have inlined my answers, I think I got them all in the new webrev, let me know your thoughts.
>
> Thanks again!
> Jc
>
>
> On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl <thomas.schatzl at oracle.com <mailto:thomas.schatzl at oracle.com>> wrote:
>
> Hi,
>
> On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote:
> > Hi all,
> >
> > First off: Thanks again to Robbin and Thomas for their reviews :)
> >
> > Next, I've uploaded a new webrev:
> > http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/ <http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/>
>
> >
> > Here is an update:
> >
> > - @Robbin, I forgot to say that yes I need to look at implementing
> > this for the other architectures and testing it before it is all
> > ready to go. Is it common to have it working on all possible
> > combinations or is there a subset that I should be doing first and we
> > can do the others later?
> > - I've tested slowdebug, built and ran the JTreg tests I wrote with
> > slowdebug and fixed a few more issues
> > - I've refactored a bit of the code following Thomas' comments
> > - I think I've handled all the comments from Thomas (I put
> > comments inline below for the specifics)
>
> Thanks for handling all those.
>
> > - Following Thomas' comments on statistics, I want to add some
> > quality assurance tests and find that the easiest way would be to
> > have a few counters of what is happening in the sampler and expose
> > that to the user.
> > - I'll be adding that in the next version if no one sees any
> > objections to that.
> > - This will allow me to add a sanity test in JTreg about number of
> > samples and average of sampling rate
> >
> > @Thomas: I had a few questions that I inlined below but I will
> > summarize the "bigger ones" here:
> > - You mentioned constants are not using the right conventions, I
> > looked around and didn't see any convention except normal naming then
> > for static constants. Is that right?
>
> I looked through https://wiki.openjdk.java.net/display/HotSpot/StyleGuide
> and the rule is to "follow an existing pattern and must have a
> distinct appearance from other names". Which does not help a lot I
> guess :/ The GC team started using upper camel case, e.g.
> SomeOtherConstant, but very likely this is probably not applied
> consistently throughout. So not adding another style
> (like kMaxStackDepth with the "k" in front with some unknown meaning)
> is fine.
>
> (Chances are you will find that style somewhere used anyway too,
> apologies if so :/)
>
>
> Thanks for that link, now I know where to look. I used the upper camel case in my code as well then :) I should have gotten them all.
>
>
> > PS: I've also inlined my answers to Thomas below:
> >
> > On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
> > > Hi all,
> > >
> > > On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote:
> > > > Dear all,
> > > >
> > > > I've continued working on this and have done the following
> > > webrev:
> > > > http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/ <http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/>
>
> > >
> > > [...]
> > > > Things I still need to do:
> > > > - Have to fix that TLAB case for the FastTLABRefill
> > > > - Have to start looking at the data to see that it is
> > > consistent and does gather the right samples, right frequency, etc.
> > > > - Have to check the GC elements and what that produces
> > > > - Run a slowdebug run and ensure I fixed all those issues you
> > > saw > Robbin
> > > >
> > > > Thanks for looking at the webrev and have a great week!
> > >
> > > scratching a bit on the surface of this change, so apologies for
> > > rather shallow comments:
> > >
> > > - macroAssembler_x86.cpp:5604: while this is compiler code, and I
> > > am not sure this is final, please avoid littering the code with
> > > TODO remarks :) They tend to be candidates for later wtf moments
> > > only.
> > >
> > > Just file a CR for that.
> > >
> > Newcomer question: what is a CR, and I'm not sure I have the rights to do
> > that yet ? :)
>
> Apologies. CR is a change request, this suggests to file a bug in the
> bug tracker. And you are right, you can't just create a new account in
> the OpenJDK JIRA yourselves. :(
>
>
> Ok good to know, I'll continue with my own todo list but I'll work hard on not letting it slip in the webrevs anymore :)
>
>
> I was mostly referring to the "... but it is a TODO" part of that
> comment in macroassembler_x86.cpp. Comments about the why of the code
> are appreciated.
>
> [Note that I now understand that this is to some degree still work in
> progress. As long as the final changeset does not contain TODOs I am
> fine (and it's not a hard objection, rather their use in "final" code
> is typically limited in my experience)]
>
> 5603 // Currently, if this happens, just set back the actual end to
> where it was.
> 5604 // We miss a chance to sample here.
>
> Would be okay, if explaining "this" and the "why" of missing a chance
> to sample here would be best.
>
> Like maybe:
>
> // If we needed to refill TLABs, just set the actual end point to
> // the end of the TLAB again. We do not sample here although we could.
>
> Done with your comment, it works well in my mind.
>
> I am not sure whether "miss a chance to sample" meant "we could, but
> consciously don't because it's not that useful" or "it would be
> necessary but don't because it's too complicated to do.".
>
> Looking at the original comment once more, I am also not sure if that
> comment shouldn't referring to the "end" variable (not actual_end)
> because that's the variable that is responsible for taking the sampling
> path? (Going from the member description of ThreadLocalAllocBuffer).
>
>
> I've moved this code and it no longer shows up here but the rationale and answer was:
>
> So.. Yes, end is the variable provoking the sampling. Actual end is the actual end of the TLAB.
>
> What was happening here is that the code is resetting _end to point towards the end of the new TLAB. Because we now have end for
> sampling and _actual_end for
> the actual end, we need to update the actual_end as well.
>
> Normally, were we to do the real work here, we would calculate the (end - start) offset, then do:
>
> - Set the new end to : start + (old_end - old_start)
> - Set the actual end like we do here now, because it is the actual end.
>
> Why is this not done here now anymore?
> - I was still debating which path to take:
> - Do it in the fast refill code, it has its perks:
> - In a world where fast refills are happening all the time or a lot, we can augment there the code to do the sampling
> - Remember what we had as an end before leaving the slowpath and check on return
> - This is what I'm doing now, it removes the need to go fix up all fast refill paths but if you remain in fast refill paths,
> you won't get sampling. I
> have to think of the consequences of that, maybe a future change later on?
> - I have the statistics now so I'm going to study that
> -> By the way, though my statistics are showing I'm missing some samples, if I turn off FastTlabRefill, it is the same
> loss so for now, it seems
> this does not occur in my simple test.
>
>
>
> But maybe I am only confused and it's best to just leave the comment
> away. :)
>
> Thinking about it some more, doesn't this not-sampling in this case
> mean that sampling does not work in any collector that does inline TLAB
> allocation at the moment? (Or is inline TLAB alloc automatically
> disabled with sampling somehow?)
>
> That would indeed be a bigger TODO then :)
>
>
> Agreed, this remark made me think that perhaps as a first step the new way of doing it is better but I did have to:
> - Remove the const of the ThreadLocalAllocBuffer remaining and hard_end methods
> - Move hard_end out of the header file to have a bit more logic there
>
> Please let me know what you think of that and if you prefer it this way or changing the fast refills. (I prefer this way now because it
> is more incremental).
>
>
> > > - calling HeapMonitoring::do_weak_oops() (which should probably be
> > > called weak_oops_do() like other similar methods) only if string
> > > deduplication is enabled (in g1CollectedHeap.cpp:4511) seems wrong.
> >
> > The call should be at least around 6 lines up outside the if.
> >
> > Preferentially in a method like process_weak_jni_handles(), including
> > additional logging. (No new (G1) gc phase without minimal logging
> > :)).
> > Done but really not sure because:
> >
> > I put for logging:
> > log_develop_trace(gc, freelist)("G1ConcRegionFreeing [other] : heap
> > monitoring");
>
> I would think that "gc, ref" would be more appropriate log tags for
> this similar to jni handles.
> (I am also not sure what weak reference handling has to do with
> G1ConcRegionFreeing, so I am a bit puzzled)
>
>
> I was not sure what to put for the tags or really as the message. I cleaned it up a bit now to:
> log_develop_trace(gc, ref)("HeapSampling [other] : heap monitoring processing");
>
>
>
> > Since weak_jni_handles didn't have logging for me to be inspired
> > from, I did that but unconvinced this is what should be done.
>
> The JNI handle processing does have logging, but only in
> ReferenceProcessor::process_discovered_references(). In
> process_weak_jni_handles() only overall time is measured (in a G1
> specific way, since only G1 supports disabling reference processing) :/
>
> The code in ReferenceProcessor prints both time taken
> referenceProcessor.cpp:254, as well as the count, but strangely only in
> debug VMs.
>
> I have no idea why this logging is that unimportant to only print that
> in a debug VM. However there are reviews out for changing this area a
> bit, so it might be useful to wait for that (JDK-8173335).
>
>
> I cleaned it up a bit anyway and now it returns the count of objects that are in the system.
>
>
> > > - the change doubles the size of
> > > CollectedHeap::allocate_from_tlab_slow() above the "small and nice"
> > > threshold. Maybe it could be refactored a bit.
> > Done I think, it looks better to me :).
>
> In ThreadLocalAllocBuffer::handle_sample() I think the
> set_back_actual_end()/pick_next_sample() calls could be hoisted out of
> the "if" :)
>
>
> Done!
>
>
> > > - referenceProcessor.cpp:261: the change should add logging about
> > > the number of references encountered, maybe after the corresponding
> > > "JNI weak reference count" log message.
> > Just to double check, are you saying that you'd like to have the heap
> > sampler to keep in store how many sampled objects were encountered in
> > the HeapMonitoring::weak_oops_do?
> > - Would a return of the method with the number of handled
> > references and logging that work?
>
> Yes, it's fine if HeapMonitoring::weak_oops_do() only returned the
> number of processed weak oops.
>
>
> Done also (but I admit I have not tested the output yet) :)
>
>
> > - Additionally, would you prefer it in a separate block with its
> > GCTraceTime?
>
> Yes. Both kinds of information is interesting: while the time taken is
> typically more important, the next question would be why, and the
> number of references typically goes a long way there.
>
> See above though, it is probably best to wait a bit.
>
>
> Agreed that I "could" wait but, if it's ok, I'll just refactor/remove this when we get closer to something final. Either, JDK-8173335
> has gone in and I will notice it now or it will soon and I can change it then.
>
>
> > > - threadLocalAllocBuffer.cpp:331: one more "TODO"
> > Removed it and added it to my personal todos to look at.
> > > >
> > > - threadLocalAllocBuffer.hpp: ThreadLocalAllocBuffer class
> > > documentation should be updated about the sampling additions. I
> > > would have no clue what the difference between "actual_end" and
> > > "end" would be from the given information.
> > If you are talking about the comments in this file, I made them more
> > clear I hope in the new webrev. If it was somewhere else, let me know
> > where to change.
>
> Thanks, that's much better. Maybe a note in the comment of the class
> that ThreadLocalAllocBuffer provides some sampling facility by modifying the
> end() of the TLAB to cause "frequent" calls into the runtime where
> actual sampling takes place.
>
>
> Done, I think it's better now. Added something about the slow_path_end as well.
>
>
> > > - in heapMonitoring.hpp: there are some random comments about some
> > > code that has been grabbed from "util/math/fastmath.[h|cc]". I
> > > can't tell whether this is code that can be used but I assume that
> > > Noam Shazeer is okay with that (i.e. that's all Google code).
> > Jeremy and I double checked and we can release that as I thought. I
> > removed the comment from that piece of code entirely.
>
> Thanks.
>
> > > - heapMonitoring.hpp/cpp static constant naming does not correspond
> > > to Hotspot's. Additionally, in Hotspot static methods are cased
> > > like other methods.
> > I think I fixed the methods to be cased the same way as all other
> > methods. For static constants, I was not sure. I fixed a few other
> > variables but I could not seem to really see a consistent trend for
> > constants. I made them as variables but I'm not sure now.
>
> Sorry again, style is a kind of mess. The goal of my suggestions here
> is only to prevent yet another style creeping in.
>
> > > - in heapMonitoring.cpp there are a few cryptic comments at the top
> > > that seem to refer to internal stuff that should probably be
> > > removed.
> > Sorry about that! My personal todos not cleared out.
>
> I am happy about comments, but I simply did not understand any of that
> and I do not know about other readers as well.
>
> If you think you will remember removing/updating them until the review
> proper (I misunderstood the review situation a little it seems).
>
> > > I did not think through the impact of the TLAB changes on collector
> > > behavior yet (if there are). Also I did not check for problems with
> > > concurrent mark and SATB/G1 (if there are).
> > I would love to know your thoughts on this, I think this is fine. I
>
> I think so too now. No objects are made live out of thin air :)
>
> see issues with multiple threads right now hitting the stack storage
> instance. Previous webrevs had a mutex lock here but we took it out
> for simplicity (and only for now).
>
> :) When looking at this after some thinking I now assume for this
> review that this code is not MT safe at all. There seems to be more
> synchronization missing than just the one for the StackTraceStorage. So
> no comments about this here.
>
>
> I double-checked a bit (quickly, I admit), but it seems that the synchronization in StackTraceStorage is really all you need: all methods
> lead to a StackTraceStorage one and can be multithreaded outside of that.
> There is a question about the initialization, where the method HeapMonitoring::initialize_profiling is not thread-safe.
> It would work (famous last words) and not crash if there were a race, but we could add a synchronization point there as well (and
> therefore on the stop as well).
>
> But anyway I will really check and do this once we add back synchronization.
>
>
> Also, this would require some kind of specification of what is allowed
> to be called when and where.
>
>
> Would we specify this with the methods in the jvmti.xml file? We could start by specifying in each that they are not thread safe but I
> saw no mention of that for
> other methods.
>
>
> One potentially relevant observation about locking here: depending on
> sampling frequency, StackTraceStore::add_trace() may be rather
> frequently called. I assume that you are going to do measurements :)
>
>
> Though we don't have the TLAB implementation in our code, the compiler-generated sampler incurs about 2% overhead with a 512k sampling rate.
> I can do real measurements when the code settles, and then we can see how costly this is as a TLAB implementation.
> However, my theory is that if the rate is 512k, the memory/performance overhead should be minimal, since that is what we saw with our
> code/workloads (though not called the same way, we call it at essentially the same rate).
> If you have a benchmark you'd like me to test, let me know!
>
> Right now, with my really small test, this does add a bit of overhead even for a 512k sampling rate. I don't know yet why; I'm going to
> see what is going on.
>
> Finally, I think it is not reasonable to assume the overhead is negligible if the sampling rate used is too low. The user should
> know that the lower the rate, the higher the overhead (documentation TODO?).
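[Editor's sketch: back-of-the-envelope arithmetic for the rate/overhead trade-off above; my own illustration, not numbers from the patch.]

```cpp
#include <cassert>
#include <cstdint>

// Expected number of samples (and thus runtime slow-path entries) for a
// given allocation volume and mean sampling interval.
uint64_t expected_samples(uint64_t allocated_bytes, uint64_t interval_bytes) {
  return allocated_bytes / interval_bytes;
}
```

At the 512 KiB default, a workload allocating 1 GiB triggers only about 2048 samples; shrinking the interval to 1 KiB multiplies that by 512, which is why the low-overhead claim only holds for large intervals.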
>
>
> I am not sure what the expected usage of the API is, but
> StackTraceStore::add_trace() seems to be able to grow without bounds.
> Only a GC truncates them to the live ones. That in itself seems to be
> problematic (GCs can be *wide* apart), and of course some of the API
> methods add to that because they duplicate that unbounded array. Do you
> have any concerns/measurements about this?
>
>
> So, the theory is that yes, add_trace can grow without bounds, but it only grows by one sample per 512k of allocated space. The
> stacks it gathers are currently capped at 64 (though at some point I'd like to make that a user option). So I have no concerns because:
>
> - If this really is taking a lot of space, that means the job is keeping a lot of objects in memory as well, so the entire heap
> is getting huge.
> - If that is the case, you will be triggering a GC at some point anyway.
>
> (I'm putting under the rug the issue of "what if we set the rate to 1, for example", because as you lower the sampling rate, we cannot
> guarantee low overhead; the idea behind this feature is to have a means of getting meaningful allocation samples at a low overhead.)
>
> I have no measurements really right now but since I now have some statistics I can poll, I will look a bit more at this question.
>
> I have the same last sentence as above: the user should expect this to happen if the sampling rate is too small. That could probably be
> reflected in StartHeapSampling as a note: careful, this might impact your performance.
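[Editor's sketch: a rough bound on the trace storage the argument above implies; the per-frame size is hypothetical, and the real StackTraceData layout differs.]

```cpp
#include <cassert>
#include <cstdint>

// Rough upper bound on trace storage accumulated between GCs: one trace
// per sampling interval of allocation, each capped at max_frames frames.
uint64_t trace_storage_bound(uint64_t allocated_since_gc,
                             uint64_t interval_bytes,
                             uint64_t max_frames,
                             uint64_t bytes_per_frame) {
  return (allocated_since_gc / interval_bytes) * max_frames * bytes_per_frame;
}
```

With 1 GiB allocated between GCs, a 512 KiB interval, 64 frames, and (say) 16 bytes per frame, that comes to about 2 MiB of trace storage, small relative to the heap that produced it, which is the point being made.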
>
>
> Also, these stack traces might hold on to huge arrays. Any
> consideration of that? Particularly it might be the cause for OOMEs in
> tight memory situations.
>
>
> There is a stack size maximum that is set to 64, so it should not hold huge arrays. I don't think this is an issue, but I can
> double-check with a test or two.
>
>
> - please consider adding a safepoint check in
> HeapMonitoring::weak_oops_do to prevent accidental misuse.
>
> - in struct StackTraceStorage, the public fields may also need
> underscores. At least some files in the runtime directory have structs
> with underscored public members (and some don't). The runtime team
> should probably comment on that.
>
>
> Agreed, I did not know. I looked around and it seemed that a lot of structs did not have them, so I left it as is. I will happily change
> it if someone prefers. (I was not sure if you really preferred it or not; your sentence read more as a note of "this might need to change,
> but I don't know if the runtime team enforces that"; let me know if I read that wrongly.)
>
>
> - In StackTraceStorage::weak_oops_do(), when examining the
> StackTraceData, maybe it is useful to consider having a non-NULL
> reference outside of the heap's reserved space an error. There should
> be no oop outside of the heap's reserved space ever.
>
> Unless you allow storing random values in StackTraceData::obj, which I
> would not encourage.
>
>
> I suppose you are talking about this part:
>
>   if ((value != NULL && Universe::heap()->is_in_reserved(value)) &&
>       (is_alive == NULL || is_alive->do_object_b(value))) {
>
> What you are saying is that I could have something like:
>
>   if (value != my_non_null_reference &&
>       (is_alive == NULL || is_alive->do_object_b(value))) {
>
> Is that what you meant? Is there really a reason to do so? When I look at the code, is_in_reserved seems like an O(1) method call. I'm
> not even sure we can have a NULL value, to be honest. I might have to study that to see if this was not a paranoid test to begin with.
>
> The is_alive code has now morphed due to the comment below.
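[Editor's sketch: the shape of the check being debated, with simplified stand-in types; the real code uses oop and BoolObjectClosure.]

```cpp
#include <cassert>

// Simplified stand-in for the GC liveness closure in the discussion.
struct IsAlive {
  virtual bool do_object_b(void* obj) = 0;
  virtual ~IsAlive() = default;
};

// Keep a sampled trace only if its object is non-null, inside the
// heap's reserved range, and still alive according to the closure.
bool keep_trace(void* value, bool in_reserved, IsAlive* is_alive) {
  return value != nullptr && in_reserved &&
         (is_alive == nullptr || is_alive->do_object_b(value));
}
```

The reviewer's point is that a non-null value outside the reserved range should arguably be treated as an error rather than a silently dropped trace, since no oop should ever live outside the heap's reserved space.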
>
>
>
> - HeapMonitoring::weak_oops_do() does not seem to use the
> passed AbstractRefProcTaskExecutor.
>
>
> It did use it:
>
>   size_t HeapMonitoring::weak_oops_do(
>       AbstractRefProcTaskExecutor *task_executor,
>       BoolObjectClosure* is_alive,
>       OopClosure *f,
>       VoidClosure *complete_gc) {
>     assert(SafepointSynchronize::is_at_safepoint(), "must be at safepoint");
>
>     if (task_executor != NULL) {
>       task_executor->set_single_threaded_mode();
>     }
>     return StackTraceStorage::storage()->weak_oops_do(is_alive, f, complete_gc);
>   }
>
> But due to the comment below, I refactored this, so it is no longer here. Now an always-true closure is passed instead.
>
>
> - I do not understand allowing to call this method with a NULL
> complete_gc closure. This would mean that objects referenced from the
> object that is referenced by the StackTraceData are not pulled, meaning
> they would get stale.
>
> - same with is_alive parameter value of NULL
>
>
> So these questions made me look a bit closer at this code. I think this code was written this way to have a very small impact on the
> file, but you are right: there is no reason for this here. I've simplified the code by adding a process_HeapSampling method in
> referenceProcessor.cpp that handles everything there.
>
> The code allowed NULLs because it depended on where you were coming from and how the code was being called.
>
> - I added a static always_true variable and pass that now, to be more consistent with the rest of the code.
> - I moved the complete_gc into process_phaseHeapSampling now (new method) and handle the task_executor and the complete_gc there.
> - Newbie question: in our code we did a set_single_threaded_mode, but I see that process_phaseJNI does it right before its call; do
>   I need to do it for the process_phaseHeapSample?
>
> That API is much cleaner (in my mind) and is consistent with what is done around it (again, in my mind).
>
>
> - heapMonitoring.cpp:590: I do not completely understand the purpose of
> this code: in the end this results in a fixed value directly dependent
> on the Thread address anyway?
> IOW, what is special about exactly 20 rounds?
>
>
> So we really want a fast random number generator that has a specific mean (512k is the default we use). The code uses the thread
> address as the start number of the sequence (the rationale being: why not, it is random enough). Then, instead of just starting there,
> we prime the sequence and really only start at the 21st number. It is arbitrary, and I have not done a study to see if we could do more
> or less of that.
>
> As I have the statistics of the system up and running, I'll run some experiments to see if this is needed: is 20 good, or not?
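[Editor's sketch: the seeding-and-priming scheme described above, using a generic 64-bit LCG; the constants and the actual HotSpot _rnd update differ.]

```cpp
#include <cassert>
#include <cstdint>

// Generic 64-bit LCG step (Knuth's MMIX constants); HotSpot's actual
// generator and constants differ.
uint64_t next_random(uint64_t x) {
  return x * 6364136223846793005ULL + 1442695040888963407ULL;
}

// Seed from a thread address: strip the low alignment bits (always zero
// for aligned Thread objects), then "prime" the sequence by discarding
// the first `rounds` outputs -- the 20 rounds in question.
uint64_t primed_seed(uint64_t thread_addr, int rounds) {
  uint64_t x = thread_addr >> 3;   // drop 8-byte-alignment zero bits
  for (int i = 0; i < rounds; ++i) {
    x = next_random(x);
  }
  return x;
}
```

Because the LCG step is a bijection on the 64-bit state, nearby thread addresses never collapse to the same state; the priming rounds mainly decorrelate the first few sampling intervals from the raw address bits.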
>
>
> - also I would consider stripping a few bits of the threads' address as
> initialization value for your rng. The last three bits (and probably
> more, check whether the Thread object is allocated on special
> boundaries) are always zero for them.
> Not sure if the given "random" value is random enough before/after
> this method, so just skip that comment if you think this is not
> required.
>
>
> I don't know, is the honest answer. I think what is important is that we tend towards a mean and that it is random "enough" to not fall
> into the pitfall of only sampling a subset of objects due to their allocation order. I added that as a test to do, to see if it changes
> the mean in any way for the 512k default value and/or if the first 1000 elements look better.
>
>
> Some more random nits I did not find a place to put anywhere:
>
> - ThreadLocalAllocBuffer::_extra_space does not seem to be used
> anywhere?
>
>
> Good catch :).
>
>
> - Maybe indent the declaration of ThreadLocalAllocBuffer::_bytes_until_sample to align below the other members of that group.
>
>
> Done, I moved it up a bit to keep non-static members together and static ones separate.
>
> Thanks,
> Thomas
>
>
> Thanks for your review!
> Jc
>
More information about the hotspot-compiler-dev mailing list