Low-Overhead Heap Profiling
Thomas Schatzl
thomas.schatzl at oracle.com
Mon Jul 3 15:25:12 UTC 2017
Hi,
here's a first cut of my thoughts on the latest changes. The
compilation problems (at the bottom) prevented me from digging much
further into it.
Please strongly consider providing incremental webrevs; it takes some
effort to find the changes otherwise.
On Tue, 2017-06-27 at 15:04 -0700, JC Beyler wrote:
> Dear Thomas et al,
>
> Here is the newest webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/
>
> Thomas, I "think" I have answered all your remarks. The summary is:
>
> - The statistics system is up and provides insight into what the heap
> sampler is doing
>   - I've noticed that, though the sampling rate is at the right
> mean, we are missing some samples; I have not yet tracked down why
> (details below)
> - I've run a tiny benchmark that is the worst case: it is a very
> tight loop that allocates a small array
>   - In this case, I see no overhead when the system is off, so that
> is a good start :)
>   - Right now I see a high overhead in this case when sampling is
> on. This is not really too surprising, but I'm going to see if this
> is consistent with our internal implementation. The benchmark is
> really allocation heavy, so I'm not too surprised, but I want to do
> the due diligence.
I am mostly curious about what happens if you run heap sampling with an
otherwise "reasonable" sampling rate on very large machines (e.g. SPARC
M6/M7 with 2k threads)... the costs of single locks tend to explode when
running such pieces of code on such machines.
But yes, that is probably not a use case we should worry too much about
here.
> - The statistics system is up and I have a new test:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTest.java.patch
> - I did a bit of a study of the random generator here; more
> details are below, but basically it seems to work well
I think the RNG itself is fine, it uses a well-known algorithm. I was
only doubting the need for the 20 iterations of "priming" it. I.e. do
the first 20 random numbers really have a different distribution than
the next 20?
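To make the question concrete, a quick check along these lines would
already answer it (next_random() below is only a placeholder xorshift
step, not the generator from the webrev):

  #include <cstdint>
  #include <cstdio>

  static uint64_t _rng_state = 0x853c49e6748fea9bULL;  // arbitrary non-zero seed

  // Placeholder 64-bit xorshift step; the generator in the patch may differ.
  static uint64_t next_random() {
    _rng_state ^= _rng_state << 13;
    _rng_state ^= _rng_state >> 7;
    _rng_state ^= _rng_state << 17;
    return _rng_state;
  }

  int main() {
    const int window = 20;
    double first = 0.0, second = 0.0;
    for (int i = 0; i < window; i++) first  += (double)(next_random() >> 11);
    for (int i = 0; i < window; i++) second += (double)(next_random() >> 11);
    printf("mean of draws  1-20: %e\n", first / window);
    printf("mean of draws 21-40: %e\n", second / window);
    return 0;
  }

If the two means (and, over a few more runs and seeds, the spread) look
the same, the priming loop can just go.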
> > I am not sure whether "miss a chance to sample" meant "we could,
> > but consciously don't because it's not that useful" or "it would be
> > necessary but don't because it's too complicated to do.".
> >
> > Looking at the original comment once more, I am also not sure if
> > that comment shouldn't be referring to the "end" variable (not
> > actual_end), because that's the variable that is responsible for
> > taking the sampling path? (Going from the member description of
> > ThreadLocalAllocBuffer).
> > I've moved this code and it no longer shows up here but the
> > rationale and answer was:
>
> So... Yes, end is the variable provoking the sampling. Actual end is
> the actual end of the TLAB.
>
> What was happening here is that the code resets _end to point to
> the end of the new TLAB. Because we now have _end for
> sampling and _actual_end for the actual end, we need to update
> _actual_end as well.
>
> Normally, were we to do the real work here, we would calculate the
> (end - start) offset, then do:
>
> - Set the new end to: start + (old_end - old_start)
> - Set the actual end like we do here now, because it is the
> actual end.
>
> Why is this not done here now anymore?
> - I was still debating which path to take:
>   - Do it in the fast refill code; it has its perks:
>     - In a world where fast refills are happening all the time
> or a lot, we can augment the code there to do the sampling
>   - Remember what we had as an end before leaving the slowpath and
> check on return
>     - This is what I'm doing now; it removes the need to go fix
> up all fast refill paths, but if you remain in fast refill paths, you
> won't get sampling. I have to think about the consequences of that,
> maybe a future change later on?
>     - I have the statistics now, so I'm going to study that
>       -> By the way, though my statistics show I'm
> missing some samples, turning off FastTlabRefill gives the same
> loss, so for now it seems this does not occur in my simple test.
For all collectors but G1, fast TLAB refill is the default. Threads
basically never go into the runtime for allocation, because allocation
itself is easy there: eden is always a contiguous range of memory.
There may be a fast TLAB refill path for G1 in some distant future too
(it can be done), particularly if we wanted G1 to scale down a bit
better. Fast TLAB refill tends to have its main advantages with
smallish TLABs, and that tends to happen most with smaller heaps, as
indicated in the other email.
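Just so we are talking about the same mechanism, this is how I read
the _end/_actual_end approach. The field names mirror
ThreadLocalAllocBuffer, but the code below is only an illustrative
sketch, not the patch:

  #include <cstddef>

  struct TLABSketch {
    char*  _top;         // next free byte
    char*  _end;         // limit seen by the inlined fast path
                         // (lowered below the real end to force sampling)
    char*  _actual_end;  // real end of the current TLAB

    // Fast path: bump-pointer allocation checked against _end only.
    void* allocate(size_t size) {
      if (_top + size <= _end) {
        void* obj = _top;
        _top += size;
        return obj;            // the sampling code is never reached here
      }
      return slow_path(size);  // hit the (possibly artificial) limit
    }

    // Slow path: this is where the sampling hook lives. If a fast refill
    // path resets _end straight to the new TLAB's real end, allocation
    // stays on the fast path above and samples are lost.
    void* slow_path(size_t /* size */) {
      // ... take a sample if _end < _actual_end, otherwise refill ...
      return nullptr;          // placeholder only
    }
  };

So whether a fast refill path preserves the artificial _end or resets
it decides whether samples get lost.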
> > But maybe I am only confused and it's best to just leave the
> > comment away. :)
> >
> > Thinking about it some more, doesn't this not-sampling in this case
> > mean that sampling does not work in any collector that does inline
> > TLAB allocation at the moment? (Or is inline TLAB alloc
> > automatically disabled with sampling somehow?)
> > That would indeed be a bigger TODO then :)
>
> Agreed, this remark made me think that perhaps, as a first step, the
> new way of doing it is better, but I did have to:
> - Remove the const from the ThreadLocalAllocBuffer remaining and
> hard_end methods
> - Move hard_end out of the header file to have a bit more logic
> there
>
> Please let me know what you think of that and if you prefer it this
> way or changing the fast refills. (I prefer this way now because it
> is more incremental).
I have no real opinion here; however, it feels (as far as I
understand the code) that unless FastTLABRefill is disabled when doing
memory sampling with this implementation, a significant number of
samples will be lost.
> > > > - calling HeapMonitoring::do_weak_oops() (which should probably
> > > > be called weak_oops_do() like other similar methods) only if
> > > > string deduplication is enabled (in g1CollectedHeap.cpp:4511)
> > > > seems wrong.
> > >
> > > The call should be at least around 6 lines up, outside the if.
> > >
> > > Preferentially in a method like process_weak_jni_handles(),
> > > including additional logging. (No new (G1) gc phase without
> > > minimal logging :)).
> > > Done, but really not sure because:
> > >
> > > I put for logging:
> > > log_develop_trace(gc, freelist)("G1ConcRegionFreeing [other] :
> > > heap monitoring");
> >
> > I would think that "gc, ref" would be more appropriate log tags for
> > this, similar to the JNI handles.
> > (I am also not sure what weak reference handling has to do with
> > G1ConcRegionFreeing, so I am a bit puzzled)
> I was not sure what to put for the tags, or really for the message. I
> cleaned it up a bit; it is now:
> log_develop_trace(gc, ref)("HeapSampling [other] : heap monitoring
> processing");
>
As for log messages, there are no strict guidelines; however, I
recommend being consistent across the messages. In this case the
code adds three messages:
g1CollectedHeap.cpp:4292: "HeapSampling [other] : heap monitoring
processing")
referenceProcessor.cpp:263: "Heap Monitoring Weak Reference"
referenceProcessor.cpp:270: "Heap Sampler Weak Reference handled:"
I would prefer something containing "Heap Monitoring" in all of them,
because the main class is called that way as well.
(Maybe even changing everything to "Heap Object Sampling"/"Allocation
Sampling" would be more concise than "Heap Monitoring", but that's
just a weird idea.)
> > > - Additionally, would you prefer it in a separate block with
> > > its GCTraceTime?
> >
> > Yes. Both kinds of information are interesting: while the time taken
> > is typically more important, the next question would be why, and
> > the number of references typically goes a long way there.
> >
> > See above though, it is probably best to wait a bit.
> Agreed that I "could" wait but, if it's ok, I'll just refactor/remove
> this when we get closer to something final. Either JDK-8173335
> has already gone in and I will notice it now, or it will go in soon
> and I can change it then.
Okay.
[...]
> > :) When looking at this again after some thinking, I now assume for
> > this review that this code is not MT-safe at all. There seems to be
> > more synchronization missing than just the one for the
> > StackTraceStorage. So no comments about this here.
>
> I double-checked a bit (quickly, I admit), but it seems that the
> synchronization in StackTraceStorage is really all you need (all
> methods lead to a StackTraceStorage one and can be multithreaded
> outside of that).
> There is a question about initialization, where the
> method HeapMonitoring::initialize_profiling is not thread safe.
> It would work (famous last words) and not crash if there was a race,
> but we could add a synchronization point there as well (and therefore
> on the stop as well).
That method has been my main concern:
HeapMonitoring::initialize_profiling() calls
StackTraceStorage::initialize_stack_trace_storage(), which frees
some memory (and I would prefer if the single-liner in
StackTraceStorage::reset_stack_trace_storage() were broken up). That
storage is then potentially lazily recreated in
StackTraceStorage::storage() while multiple threads access it.
I can imagine a lot of ways that could actually go wrong.
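To illustrate what I mean, a minimal sketch of a guarded accessor
(std::mutex is only used to keep the example self-contained; inside
HotSpot this would be a Mutex/MutexLocker, or simply eager one-time
initialization before any thread can call in; the class name is made
up):

  #include <mutex>

  class StackTraceStorageSketch {
   public:
    static StackTraceStorageSketch* storage() {
      std::lock_guard<std::mutex> guard(_lock);   // serialize create/reset
      if (_instance == nullptr) {
        _instance = new StackTraceStorageSketch();
      }
      return _instance;
    }

    static void reset() {
      std::lock_guard<std::mutex> guard(_lock);
      delete _instance;       // freeing and lazy recreation no longer race
      _instance = nullptr;
    }

   private:
    static std::mutex _lock;
    static StackTraceStorageSketch* _instance;
  };

  std::mutex StackTraceStorageSketch::_lock;
  StackTraceStorageSketch* StackTraceStorageSketch::_instance = nullptr;

Note that even then, a caller keeping the returned pointer across a
reset() still needs a higher-level protocol, which is the part I am
most worried about.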
> But anyway I will really check and do this once we add back
> synchronization.
>
Okay, and at that time I will review that part :P
> >
> > Also, this would require some kind of specification of what is
> > allowed to be called when and where.
> Would we specify this with the methods in the jvmti.xml file? We
> could start by specifying in each that they are not thread safe but I
> saw no mention of that for other methods.
Well, the various static methods in HeapMonitoring *could* get at least
some minimal documentation including this information ;)
> > One potentially relevant observation about locking here: depending
> > on sampling frequency, StackTraceStore::add_trace() may be rather
> > frequently called. I assume that you are going to do measurements
> > :)
> Though we don't have the TLAB implementation in our code, the
> compiler-generated sampler has 2% overhead with a 512k sampling
> rate. I can do real measurements when the code settles, and we can see
> how costly this is as a TLAB implementation.
Which collector? Please specify, as the default collector (G1) is
different than all others.
> However, my theory is that if the rate is 512k, the
> memory/performance overhead should be minimal, since that is what we
> saw with our code/workloads (though it is not called the same way, we
> call it essentially at the same rate).
> If you have a benchmark you'd like me to test, let me know!
Something allocating a lot, e.g. specjbb*? Also try with small region
size (1M) and smallish (fixed size, e.g. -XX:-ResizeTLAB
-XX:TLABSize=something) TLABs.
The test application might not be representative and might not cause a
lot of calls into the runtime, because its object allocation is very
regular, so the existing TLAB (sizing) mechanism works really well and
makes the TLABs huge.
gc+tlab=trace might give some information about TLAB usage.
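For reference, something along these lines is what I have in mind
(heap size, TLAB size and the benchmark name are placeholders, of
course):

  java -XX:+UseG1GC -XX:G1HeapRegionSize=1m \
       -XX:-ResizeTLAB -XX:TLABSize=16k \
       -Xlog:gc+tlab=trace -Xmx512m YourBenchmark

Small, fixed-size TLABs should maximize the number of refills, and with
that the number of times the sampling path can be taken.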
> Right now, with my really small test, this does add a bit of overhead
> even for a 512k sampling rate. I don't know yet why; I'm going to see
> what is going on.
>
> Finally, I think it is not reasonable to assume the overhead is
> negligible if the sampling rate used is too low. The user should know
> that the lower the rate, the higher the overhead (documentation
> TODO?).
That's certainly something to point out.
> > I am not sure what the expected usage of the API is, but
> > StackTraceStore::add_trace() seems to be able to grow without
> > bounds.
> > Only a GC truncates them to the live ones. That in itself seems to
> > be problematic (GCs can be *wide* apart), and of course some of the
> > API methods add to that because they duplicate that unbounded
> > array. Do you have any concerns/measurements about this?
> So, the theory is that yes, add_trace can grow without
> bounds, but it grows at one sample per 512k of allocated space. The
[...]
> (I'm sweeping under the rug the issue of "what if we set the rate to 1,
> for example", because as you lower the sampling rate we cannot
> guarantee low overhead; the idea behind this feature is to have a
> means of getting meaningful allocation samples at a low overhead)
The heap monitoring could print a log message at startup showing its
configuration, btw. Not sure if that is useful. Or it could print some
statistics at every GC about the actual sampling rate (e.g. in s^-1),
internal memory usage and such.
That would help with issue diagnosis quite a bit.
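Roughly the kind of per-GC summary I have in mind (the counters and the
function are hypothetical; inside HotSpot this would go through unified
logging rather than printf):

  #include <cinttypes>
  #include <cstdio>

  // Hypothetical per-GC summary of what the sampler actually did.
  static void print_sampling_summary(uint64_t samples_taken,
                                     uint64_t bytes_allocated,
                                     size_t   storage_bytes,
                                     double   interval_seconds) {
    printf("Heap sampling: %" PRIu64 " samples (%.1f/s), "
           "%" PRIu64 " bytes allocated since last GC, "
           "%zu bytes of sample storage\n",
           samples_taken,
           samples_taken / interval_seconds,
           bytes_allocated,
           storage_bytes);
  }

  int main() {
    // Example values only.
    print_sampling_summary(1200, 600UL * 1024 * 1024, 96 * 1024, 2.5);
    return 0;
  }

Comparing the reported rate against the configured one would also make
the "missing samples" question above much easier to chase.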
> I have no measurements really right now but since I now have some
> statistics I can poll, I will look a bit more at this question.
>
> I have the same last sentence as above: the user should expect this
> to happen if the sampling rate is too small. That could probably be
> reflected in StartHeapSampling as a note: careful, this might
> impact your performance.
Note that some (minimal) amount of code is executed for every TLAB
refill. So if TLABs are small, there might be an issue regardless of
sampling rate (which I do not believe).
> > Also, these stack traces might hold on to huge arrays. Any
> > consideration of that? Particularly it might be the cause for OOMEs
> > in tight memory situations.
> There is a stack size maximum that is set to 64 so it should not hold
> huge arrays. I don't think this is an issue but I can double check
> with a test or two.
I meant the objects the stack traces reference. :)
But looking at the change, I think it does not catch large object
allocations at all, nor allocations outside TLABs, only allocations
within TLABs.
Still, on G1 with a 32M region size you may get a 16M array of j.l.O.
as a sample... I do not know the limits for large objects for the other
collectors, but they may just allocate them into eden space as long as
they fit (the eden is contiguous for them after all).
> > - please consider adding a safepoint check in
> > HeapMonitoring::weak_oops_do to prevent accidental misuse.
> >
> > - in struct StackTraceStorage, the public fields may also need
> > underscores. At least some files in the runtime directory have
> > structs
> > with underscored public members (and some don't). The runtime team
> > should probably comment on that.
> Agreed, I did not know. I looked around and it seemed a lot of structs
> did not have them, so I left it as is. I will happily change it if
> someone prefers (I was not sure if you really preferred it or not;
> your sentence seemed to be more a note of "this might need to change
> but I don't know if the runtime team enforces that", let me know if I
> read that wrongly).
That's the correct interpretation.
> > - In StackTraceStorage::weak_oops_do(), when examining the
> > StackTraceData, maybe it is useful to consider having a non-NULL
> > reference outside of the heap's reserved space an error. There
> > should be no oop outside of the heap's reserved space ever.
> >
> > Unless you allow storing random values in StackTraceData::obj,
> > which I would not encourage.
> I suppose you are talking about this part:
> if ((value != NULL && Universe::heap()->is_in_reserved(value)) &&
> (is_alive == NULL || is_alive->do_object_b(value))) {
>
> What you are saying is that I could have something like:
> if (value != my_non_null_reference &&
> (is_alive == NULL || is_alive->do_object_b(value))) {
>
> Is that what you meant? Is there really a reason to do so? When I
> look at the code, is_in_reserved seems like an O(1) method call. I'm
> not even sure we can have a NULL value, to be honest. I might have to
The "not being sure" part makes me nervous, not the performance
implications. I recommend not adding code allowing situations that "you
are not sure" to creep into the code. Just guarantee/assert that value
must be non-NULL proactively. If it crashes during testing, well then
we know for sure and can investigate whether it is a valid case (and
add the necessary code) ;)
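To be concrete, the shape I had in mind is roughly this (the helpers
below are stand-ins so the snippet compiles on its own; in the real
code this would be guarantee()/assert() plus
Universe::heap()->is_in_reserved() and the is_alive closure):

  #include <cassert>

  struct OopStandIn {};                                            // stand-in for oop

  static bool is_in_reserved(const OopStandIn*) { return true; }   // placeholder
  static bool is_alive(const OopStandIn*)       { return true; }   // placeholder

  static void examine_sample(OopStandIn* value) {
    // Fail loudly instead of silently skipping suspicious values: a stored
    // sample reference should never be NULL or outside the reserved heap.
    assert(value != nullptr && "sampled object must not be NULL");
    assert(is_in_reserved(value) && "sampled object must be inside the heap");
    if (is_alive(value)) {
      // ... keep the sample / apply the OopClosure ...
    }
  }

  int main() {
    OopStandIn obj;
    examine_sample(&obj);
    return 0;
  }

If that ever fires in testing we learn something; if it never fires,
the checks cost nothing in product builds anyway.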
> study that to see if this was not a paranoid test to begin with.
>
> The is_alive code has now morphed due to the comment below.
>
>
> >
> > - HeapMonitoring::weak_oops_do() does not seem to use the
> > passed AbstractRefProcTaskExecutor.
> It did use it:
> size_t HeapMonitoring::weak_oops_do(
> AbstractRefProcTaskExecutor *task_executor,
> BoolObjectClosure* is_alive,
> OopClosure *f,
> VoidClosure *complete_gc) {
> assert(SafepointSynchronize::is_at_safepoint(), "must be at
> safepoint");
>
> if (task_executor != NULL) {
> task_executor->set_single_threaded_mode();
> }
> return StackTraceStorage::storage()->weak_oops_do(is_alive, f,
> complete_gc);
> }
Using something for no reason is "not using something" in my definition
:) The whole use of the task_executor seems superfluous.
The task_executor is used to distribute Java reference processing work
across threads in a collector-agnostic way (because there are two MT
work processing frameworks in the HotSpot code).
The non-Java reference processing (JNI handles, heap sampling) does not
do that work in an MT way and does not use the task_executor. [Actually,
the two calls to JNI handle/heap sampling processing should be moved out
of the ReferenceProcessor in the future, to let ReferenceProcessor handle
only Java-level references. Yes, and I just saw that this code is just a
copy&paste of the existing JNI handle code.]
> But due to the comment below, I refactored this, so this is no longer
> here. Now I have an always true closure that is passed.
>
> >
> > - I do not understand allowing to call this method with a NULL
> > complete_gc closure. This would mean that objects referenced from
> > the object that is referenced by the StackTraceData are not pulled,
> > meaning they would get stale.
> >
> > - same with is_alive parameter value of NULL
> So these questions made me look a bit closer at this code. This code,
> I think, was written this way to have a very small impact on the file,
> but you are right, there is no reason for that here. I've simplified
> the code by adding a process_HeapSampling method in
> referenceProcessor.cpp that handles everything there.
>
> The code allowed NULLs because it depended on where you were coming
> from and how the code was being called.
>
> - I added a static always_true variable and pass that now to be more
> consistent with the rest of the code.
> - I moved the complete_gc into process_phaseHeapSampling now (new
> method) and handle the task_executor and the complete_gc there
Still, setting the task_executor to single-threaded processing there is
superfluous. The following code does not use the task_executor any
more...
> - Newbie question: in our code we did a set_single_threaded_mode,
> but I see that process_phaseJNI does it right before its call; do I
> need to do it for process_phaseHeapSample?
Unneeded in both cases :)
> That API is much cleaner (in my mind) and is consistent with what is
> done around it (again in my mind).
>
Agree.
Other issues, after some attempts at compilation:
- the change does not seem to build without precompiled headers ;)
(heapMonitoring.cpp includes forte.hpp, which does not, e.g., include
memory/allocation.hpp; there are a few more of these issues - strictly
not your fault, but maybe still to be fixed in advance in separate
changes)
To disable precompiled headers, configure with
--disable-precompiled-headers (if you are using jib, use
-- --disable-precompiled-headers).
- heapMonitoring.hpp/cpp apparently need a few more includes to other
hotspot code (compiler complaining when compiling without precompiled
headers). Please look through the code and add them as necessary.
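For example (memory/allocation.hpp is the one the compiler actually
complains about via forte.hpp; treat the rest as "whatever the non-PCH
build reports"):

  // In heapMonitoring.hpp/.cpp (or forte.hpp), spell out the dependencies
  // instead of relying on the precompiled header to pull them in:
  #include "memory/allocation.hpp"
  // ... plus whichever other headers the non-PCH compile errors point at.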
Thanks,
Thomas