New heap allocation event proposal and PoC
Erik Gahlin
erik.gahlin at oracle.com
Wed Sep 30 17:30:52 UTC 2020
Hello Jaroslav,
Nice to see progress on this!
The adaptive sampling approach is probably the most viable as it allows easy configuration while maintaining an upper bound on the overhead. It can also be useful for other events in the future, for example exception or method profiling.
If we find that x events per minute causes too much overhead in the default configuration, we can reduce it to x / 5 events, with little impact on clients that consume the data. It will have a lower resolution, but the algorithm for a client to produce a report, let's say a top ten list of the intensive allocation sites, will be the same.
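As a sketch of how such a client could look (assuming the proposed event ends up named "jdk.ObjectAllocationSample" and records stack traces; a plain event count stands in for any weighting):

```
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// Prints the top ten allocation sites from a recording. Counting events
// is enough; a lower rate only reduces the resolution, not the shape of
// the report.
public class TopAllocationSites {
    public static void main(String... args) throws Exception {
        Map<String, Long> sites = new HashMap<>();
        for (RecordedEvent event : RecordingFile.readAllEvents(Path.of(args[0]))) {
            if (!"jdk.ObjectAllocationSample".equals(event.getEventType().getName())) {
                continue;
            }
            var stackTrace = event.getStackTrace();
            if (stackTrace == null || stackTrace.getFrames().isEmpty()) {
                continue;
            }
            var method = stackTrace.getFrames().get(0).getMethod();
            sites.merge(method.getType().getName() + "." + method.getName(), 1L, Long::sum);
        }
        sites.entrySet().stream()
             .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
             .limit(10)
             .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```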
I suggest that we create an option called rate, so users can specify something like this in a .jfc file:
<event name="ObjectAllocationSample">
  <setting name="enabled">true</setting>
  <setting name="stackTrace">true</setting>
  <setting name="rate">100</setting>
</event>
It could be that rate should include the window size, though it might be tricky to come up with a syntax; perhaps "1000 / 10 s"? If multiple recordings are running at the same time, the smallest window size will be chosen, while not exceeding the highest rate. Or perhaps skip the window and go with 100 Hz?
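Programmatically, the same configuration would look something like this ("rate" being the new setting; the rest is the existing jdk.jfr API):

```
import jdk.jfr.Recording;

public class RateSettingExample {
    public static void main(String... args) {
        try (Recording recording = new Recording()) {
            recording.enable("ObjectAllocationSample") // proposed event name
                     .withStackTrace()
                     .with("rate", "100");             // proposed setting; value syntax still open
            recording.start();
            // ... run the workload, then stop and dump the recording
        }
    }
}
```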
The option will only apply to events that support rate control, for now only ObjectAllocationSample.
We want this event enabled by default. I am also leaning towards having the TLAB events disabled in profile.jfc, but with a control attribute in the .jfc so users can enable them for troubleshooting in the JMC recording wizard.
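In profile.jfc that could look something like this (the control name "tlab-events" is only a placeholder):

```
<event name="jdk.ObjectAllocationInNewTLAB">
  <setting name="enabled" control="tlab-events">false</setting>
</event>
```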
Before looking at the implementation, could you produce a webrev on cr.openjdk.java.net?
Thanks
Erik
> On 30 Sep 2020, at 15:17, Jaroslav Bachorík <jaroslav.bachorik at datadoghq.com> wrote:
>
> Hello all,
>
> I would like to present our (Datadog) proposal for a new heap
> allocation event. The proposal is based on the writeup
> created by Marcus Hirt and the follow-up discussion
> (https://docs.google.com/document/d/191QzZIEPgOi-KGs82Sh9B6_dVtXudUonhdNrRgtt444/edit)
>
> == Introduction
> Let me cite the rationale for the new heap allocation event from the
> aforementioned document.
> ```
> In JFR there are two allocation profiling events today - one for
> allocations inside of thread local area buffers (TLABs) and one for
> allocations outside. The events are quite useful for both allocation
> profiling and for TLAB tuning. They are, however, quite hard to reason
> about in terms of data production rate, and in certain edge cases both
> the overhead and the data volume can be quite high. In always-on
> production time profiling, arguably the most important domain for JFR,
> these are quite serious drawbacks.
> ```
>
>
> == Detailed description
> This proposal and the (fully functional) PoC are based on the idea of
> 'subsampling' described in more detail in the linked document. The
> subsampling is performed by a rate-limiting, adaptive sampler
> which keeps the event emission rate (more or less) constant
> while providing a fairly accurate statistical picture of the heap
> allocations happening in the system.
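>
> A much-simplified sketch of such a sampler (illustrative Java only -
> the PoC implements this inside HotSpot in C++):
>
> ```
> import java.util.concurrent.ThreadLocalRandom;
> import java.util.concurrent.atomic.AtomicLong;
>
> // Simplified rate-limiting, adaptive sampler: the per-window sampling
> // probability is derived from the raw-sample count observed in the
> // previous window, so the emission rate stays close to
> // 'samplesPerWindow' regardless of the allocation pressure.
> final class AdaptiveSampler {
>     private final long samplesPerWindow;
>     private final long windowNanos;
>     private final AtomicLong rawCount = new AtomicLong();
>     private volatile double probability = 1.0;
>     private volatile long windowStart = System.nanoTime();
>
>     AdaptiveSampler(long samplesPerWindow, long windowMillis) {
>         this.samplesPerWindow = samplesPerWindow;
>         this.windowNanos = windowMillis * 1_000_000L;
>     }
>
>     // Called for every 'raw' sample; true means "emit an event".
>     boolean sample() {
>         long raw = rawCount.incrementAndGet();
>         if (System.nanoTime() - windowStart >= windowNanos) {
>             rotateWindow(raw); // lock taken only around window boundaries
>         }
>         return ThreadLocalRandom.current().nextDouble() < probability;
>     }
>
>     private synchronized void rotateWindow(long rawSoFar) {
>         long now = System.nanoTime();
>         if (now - windowStart < windowNanos) {
>             return; // another thread has already rotated this window
>         }
>         probability = Math.min(1.0, (double) samplesPerWindow / Math.max(1L, rawSoFar));
>         rawCount.set(0);
>         windowStart = now;
>     }
> }
> ```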
>
> The PoC is built upon the existing inside/outside TLAB hooks used by
> JFR. These hooks are used to obtain the 'raw' allocation samples -
> basically each time a TLAB gets retired or an outside-of-TLAB
> allocation happens. These 'raw' samples are then pushed through the
> adaptive sampler to generate the heap allocation events at the desired
> rate.
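>
> Expressed in illustrative Java (the hook method and event shape are
> made up for this sketch; the PoC performs the equivalent steps in the
> HotSpot TLAB code), the flow through the sampler sketched above looks
> roughly like this:
>
> ```
> import jdk.jfr.Event;
> import jdk.jfr.Label;
> import jdk.jfr.Name;
>
> public class AllocationSampleHook {
>     @Name("ObjectAllocationSample")
>     @Label("Object Allocation Sample")
>     static class ObjectAllocationSample extends Event {
>         @Label("Object Class") Class<?> objectClass;
>         @Label("Allocation Size") long allocationSize;
>     }
>
>     // e.g. a budget of ~100 events per 1 s window (numbers are arbitrary)
>     static final AdaptiveSampler SAMPLER = new AdaptiveSampler(100, 1_000);
>
>     // Would be invoked each time a TLAB is retired or an
>     // outside-of-TLAB allocation happens.
>     static void onRawSample(Class<?> objectClass, long allocationSize) {
>         if (SAMPLER.sample()) {
>             ObjectAllocationSample event = new ObjectAllocationSample();
>             event.objectClass = objectClass;
>             event.allocationSize = allocationSize;
>             event.commit();
>         }
>     }
> }
> ```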
>
> === PoC Sources and Binaries
> - Source: https://github.com/jbachorik/jdk/tree/allocation_sampling
> - Binaries: https://github.com/jbachorik/jdk/actions/runs/276358906
>
> === PoC Performance
> The initial performance assessment was done using the open-source
> Renaissance benchmark suite (https://renaissance.dev/), in particular
> the 'akka-uct' benchmark. The benchmark is run on a dedicated EC2
> c5.metal instance with nothing else running concurrently. The
> benchmark app is given a 40GB heap. The performance is described in
> terms of the cumulative CPU time (kernel + user) reported by the
> 'time' command.
> The results show a CPU time overhead in the range of 1% for
> avg, p95 and p99, measured for the akka-uct, scrabble and
> future-genetic benchmark applications.
>
> === PoC Limitations
> The current implementation has the following limitations:
> - the target rate is not configurable (currently set to 5k events per minute)
> - the adaptive sampler implementation introduces a potential
> contention point over a mutex (although that contention point should
> not be hit more often than once per 10ms)
> - the code structure may not be optimal
>
>
> Thank you for your attention! I am looking forward to your comments and remarks.
>
> -JB-