New heap allocation event proposal and PoC

Jaroslav Bachorík jaroslav.bachorik at datadoghq.com
Wed Sep 30 13:17:16 UTC 2020


Hello all,

  I would like to present our (Datadog) proposal for a new heap
allocation event, together with a PoC. The proposal is based on the
write-up created by Marcus Hirt and the follow-up discussion
(https://docs.google.com/document/d/191QzZIEPgOi-KGs82Sh9B6_dVtXudUonhdNrRgtt444/edit)

== Introduction
  Let me quote the rationale for the new heap allocation event from
the aforementioned document:
```
In JFR there are two allocation profiling events today - one for
allocations inside of thread local area buffers (TLABs) and one for
allocations outside. The events are quite useful for both allocation
profiling and for TLAB tuning. They are, however, quite hard to reason
about in terms of data production rate, and in certain edge cases both
the overhead and the data volume can be quite high. In always-on
production time profiling, arguably the most important domain for JFR,
these are quite serious drawbacks.
```


== Detailed description
  This proposal and the (fully functional) PoC are based on the idea of
'subsampling' described in more detail in the linked document. The
subsampling is performed by a rate-limiting, adaptive sampler which
keeps the event emission rate roughly constant while still providing a
fairly accurate statistical picture of the heap allocations happening
in the system.
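To make the idea concrete, here is a minimal Java sketch of such a
rate-limiting adaptive sampler. All names and parameters below are
mine, not the PoC's; in particular, the actual PoC only takes its lock
when the sampling window rotates (roughly once per 10ms), whereas this
sketch synchronizes on every call for simplicity.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch only - not the PoC's actual implementation.
final class AdaptiveSampler {
    private final long windowNanos;       // length of one adjustment window, e.g. 10ms
    private final double budgetPerWindow; // target number of events per window

    private long windowStart = System.nanoTime();
    private long seenInWindow;            // raw samples observed in the current window
    private double probability = 1.0;     // current sampling probability

    AdaptiveSampler(double targetEventsPerSecond, long windowMillis) {
        this.windowNanos = windowMillis * 1_000_000L;
        this.budgetPerWindow = targetEventsPerSecond * windowMillis / 1000.0;
    }

    // Returns true if this raw sample should be turned into an event.
    synchronized boolean sample() {
        long now = System.nanoTime();
        if (now - windowStart >= windowNanos) {
            // Re-derive the probability from the raw rate seen in the
            // window that just ended: budget / observed input rate.
            probability = seenInWindow == 0
                    ? 1.0
                    : Math.min(1.0, budgetPerWindow / seenInWindow);
            seenInWindow = 0;
            windowStart = now;
        }
        seenInWindow++;
        return ThreadLocalRandom.current().nextDouble() < probability;
    }
}
```

The key property is that the decision is probabilistic per raw sample,
so the surviving events remain a statistically fair subsample of the
allocation stream while the output rate tracks the configured budget.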

  The PoC is built upon the existing inside/outside-TLAB hooks used by
JFR. These hooks provide the 'raw' allocation samples - essentially one
each time a TLAB is retired or an outside-of-TLAB allocation happens.
These 'raw' samples are then pushed through the adaptive sampler to
generate the heap allocation events at the desired rate.
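The hook wiring described above could be sketched as follows. This is
Java-flavored illustration only - the actual PoC hooks live in
HotSpot's C++ code, and every name here is hypothetical:

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of how raw TLAB samples could feed the sampler.
final class AllocationSampleHooks {
    private final BooleanSupplier sampler; // decides which raw samples survive
    int emittedCount;                      // events emitted (for illustration)

    AllocationSampleHooks(BooleanSupplier sampler) {
        this.sampler = sampler;
    }

    // Called each time a TLAB is retired; the retired TLAB's size
    // approximates the memory allocated since the previous sample point.
    void onTlabRetired(Class<?> type, long tlabSize) {
        if (sampler.getAsBoolean()) {
            emitEvent(type, tlabSize, /*outsideTlab=*/ false);
        }
    }

    // Called for each allocation that bypasses the TLAB entirely.
    void onOutsideTlabAllocation(Class<?> type, long allocationSize) {
        if (sampler.getAsBoolean()) {
            emitEvent(type, allocationSize, /*outsideTlab=*/ true);
        }
    }

    private void emitEvent(Class<?> type, long size, boolean outsideTlab) {
        // In the PoC this would commit a JFR event (with a stack trace);
        // here we just count and print for illustration.
        emittedCount++;
        System.out.println((outsideTlab ? "outside-TLAB " : "in-TLAB ")
                + type.getName() + " ~" + size + " bytes");
    }
}
```

Note that the hooks themselves stay cheap: the only per-sample work is
the sampler decision, and the (comparatively expensive) event emission
happens only for the raw samples that survive subsampling.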

=== PoC Sources and Binaries
  - Source: https://github.com/jbachorik/jdk/tree/allocation_sampling
  - Binaries: https://github.com/jbachorik/jdk/actions/runs/276358906

=== PoC Performance
  An initial performance assessment was done using the open-source
Renaissance benchmark suite (https://renaissance.dev/). Each benchmark
was run on a dedicated EC2 c5.metal instance with nothing else running
concurrently, and the benchmark app was given a 40GB heap. Performance
is reported as the cumulative amount of CPU time (kernel + user) given
by the 'time' command.
  The results show a CPU time overhead in the range of 1% for avg, p95
and p99, measured for the akka-uct, scrabble and future-genetic
benchmark applications.

=== PoC Limitations
  The current implementation has the following limitations:
- the target rate is not configurable (currently set to 5k events per minute)
- the adaptive sampler implementation introduces a potential
contention point over a mutex (although that contention point should
not be hit more often than once per 10ms)
- the code structure may not be optimal


Thank you for your attention; I am looking forward to your comments and remarks!

-JB-


More information about the hotspot-jfr-dev mailing list