JFR thread sampling mechanism

Markus Gronlund markus.gronlund at oracle.com
Wed Jul 3 22:02:01 UTC 2019


Hi Gil,

thanks for the suggestion and idea.

The idea of using per-thread timers seems like a very interesting approach; I don't think we have considered it in quite that way before. Per-thread, timer-based SIGPROFs, in a way.

I agree that there are indeed many platform-specific (even hardware-specific) solutions that provide good (accurate, fast, lightweight) sampling. A platform-specific solution would probably be able to give better density / coverage than what the JFR sampling mechanism currently provides.

This must of course be weighed against a lot of other aspects we are trying to address, so here I will give some context / background:

Platform support:

The most important goal so far has been to have a single system that works (somewhat in the same vein) on all supported platforms (Windows, Linux, Mac, SPARC). Of course, when JFR was brought to HotSpot from JRockit, one aspect to address was whether it was possible to reuse the existing AsyncGetCallTrace (Forte) support. Unfortunately, there is no AsyncGetCallTrace support for Windows. So, instead of the POSIX SIGPROF mechanism of having the threads take their own stack traces (POSIX signals do not exist on Windows), we introduced a sampler thread to do the stack walking. By using the Windows Suspend/Resume APIs we could incorporate a sampling solution for that platform as well. We could reuse some of the existing stack walking code (shared with AsyncGetCallTrace) with a few customizations, and then we had a single system with a uniform way of configuration.
There could be value in incorporating platform-specific sampling mechanisms, but that of course comes at the cost of higher maintenance and support. In addition, configuration aspects need to be addressed; we strive to have a common configuration across platforms. Having some platforms use a timer-based sampling system would introduce divergence: for periodic events, for example (sampling is a periodic event), the configured interval would now be in CPU time on some platforms but in wall-clock time on others. Doable, but with more divergence and spread.

Overhead:

By having a fixed (relatively small) set of threads sampled on each iteration, we limit the number of events generated, keeping thread interrupts and recording-size overheads in check. It is essentially a downsized, throttled sampling mechanism, not designed for perfect coverage, but for coverage that is "good enough" in relation to other JFR events. Of course, throttling trades accuracy for lower overhead, and you mentioned throttling in your suggestion as well.
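
To illustrate, the per-cycle selection is conceptually just a bounded rotation over the thread list. A much simplified sketch (the names are made up for illustration and this is not the actual jfrThreadSampler code):

#include <cstddef>
#include <vector>

struct ThreadRef;   // stand-in for the JVM's per-thread bookkeeping

// Pick at most 'quota' candidates for this sample cycle, continuing the
// rotation where the previous cycle left off ('cursor' persists across cycles).
static std::vector<ThreadRef*> pick_for_cycle(const std::vector<ThreadRef*>& all,
                                              std::size_t& cursor,
                                              std::size_t quota) {
  std::vector<ThreadRef*> picked;
  for (std::size_t i = 0; i < all.size() && picked.size() < quota; i++) {
    picked.push_back(all[(cursor + i) % all.size()]);
  }
  if (!all.empty()) {
    cursor = (cursor + picked.size()) % all.size();
  }
  return picked;
}
// In the real sampler, one such rotation picks the "in java" candidates and a
// separate rotation (with its own, smaller quota) picks the "in native" one.
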
Large amounts of JFR event data generate buffer pressure and can induce more frequent chunk rotations (including safepoints). Also, on the consumer side (GUIs), the tooling needs to be able to handle a much larger number of sampling events (and to render them effectively). As a specific example, we recently throttled the number of "native" samples down to only 1 thread per sample cycle (it used to be 5). The reason was a very large number of native sample events, taking up almost 50% of the recording size.

More Java samples would, however, be more beneficial to the end user than the native samples just described. In relation to this, we recently decoupled the JFR sampler thread from the Periodic Task thread. This was done in an effort to support shorter sampling intervals without starving other periodic tasks.
Another aspect to consider is that stack walking is expensive. All threads walking their own stacks could have a noticeable overhead, compared to a single thread walking a very small subset.

So, in summary - yes, it is a bit complicated, we need to weigh a lot of things here :)

Platform-specific sampling must offer much higher value than what currently exists in order to be considered in full (staffing / maintenance). It must adhere to the established JFR configuration model. Every platform-specific sampling system must also be re-certified against the existing overhead targets. In addition, we have been, and to some extent still are, reluctant to introduce platform-specific events (e.g. a "LinuxSignalEvent").
At the same time, of course, one of the ideals is to provide information that is as accurate as we can, within acceptable boundaries for overhead. So, trade-offs, trade-offs...

Thanks
Markus

-----Original Message-----
From: Gil Tene <gil at azul.com> 
Sent: den 1 juli 2019 03:47
To: Kirk Pepperdine <kirk.pepperdine at gmail.com>
Cc: hotspot-jfr-dev at openjdk.java.net
Subject: Re: JFR thread sampling mechanism

[Redirected to hotspot-jfr-dev]

Note that the proposed mechanism would not "wake up" any threads for sampling, ever. It would only ever signal a currently running thread whose accumulated CPU consumption since the last time it was sampled has reached the configured quantum (e.g. 10 msec), and the signal would tend to be handled on the same vcore that is already running that thread. Without actually being running, a thread could not accumulate the CPU time needed for the trigger.
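
To make that concrete, here is a rough Linux-only sketch of the kind of per-thread, CPU-clocked timer I have in mind (the helper name, the choice of SIGPROF and the 10 msec quantum are just illustrative, and the real thing would need async-signal-safe stack walking in the handler):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE                            // for SIGEV_THREAD_ID (Linux-specific)
#endif
#include <signal.h>
#include <time.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef sigev_notify_thread_id
#define sigev_notify_thread_id _sigev_un._tid  // older glibc spelling
#endif

// Runs on the thread whose CPU-time timer fired; this is where the thread
// would walk its own stack and hand the trace off, async-signal-safely.
static void sample_handler(int, siginfo_t*, void*) { }

// Called once by each thread we want sampled (hypothetical helper name).
static void install_cpu_time_sampler() {
  struct sigaction sa = {};
  sa.sa_sigaction = sample_handler;
  sa.sa_flags = SA_SIGINFO | SA_RESTART;
  sigaction(SIGPROF, &sa, nullptr);            // process-wide; harmless to repeat

  clockid_t cpu_clock;
  pthread_getcpuclockid(pthread_self(), &cpu_clock);   // this thread's CPU clock

  struct sigevent sev = {};
  sev.sigev_notify = SIGEV_THREAD_ID;          // deliver the signal to this thread
  sev.sigev_signo  = SIGPROF;
  sev.sigev_notify_thread_id = (pid_t) syscall(SYS_gettid);

  timer_t timer;
  timer_create(cpu_clock, &sev, &timer);       // clocked by thread CPU time

  struct itimerspec spec = {};
  spec.it_value.tv_nsec    = 10 * 1000 * 1000; // fire after 10 msec of CPU time
  spec.it_interval.tv_nsec = 10 * 1000 * 1000; // and every further 10 msec of CPU
  timer_settime(timer, 0, &spec, nullptr);     // link with -lrt on older glibc
}
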

This mechanism is actually the cheapest possible thing for a kernel to support (much, much cheaper than wall-clock-time based timers), since the scheduler already intimately manages the running thread's CPU time as part of time slicing and quantum management. Simply put, the (per-core) scheduler executes only one thread on the core at a time, and when it context-switches the thread away for any reason (quantum ended, thread blocks, or some form of preemption) it adds the execution time up to that point to the CPU time consumed by the thread. To implement a CPU-time-based timer, the scheduler simply has to limit the quantum (the current "time slice" it would let a thread run for before preempting it for no other reason) to the remaining time on the shortest pending timer for that thread, and check whether timer signals need to be delivered when the quantum expires (no polling or testing during the time slice, and no need for any other check).

The rate at which you'd do the actual sampling of a 100%-running thread is configurable. I chose 10 msec as an example because it matches the highest possible rate that one thread could see under the current defaults.

It's also possible to do all sorts of throttling on the total sampling rate, and on the total timer signaling rate if needed. Without throttling, 1,000 non-stop runnable threads on a 1-core instance would only do a total of 100 samples per second, but on a 20-core instance they could generate 2,000 samples per second. Adding sample throttling (that e.g. would simply avoid tracing individual samples if doing so would exceed the rate quota) would remedy that. The intuitive default process-wide throttling limit should probably be 600/sec, for continuity with the current implementation, which currently maxes out at that sample generation rate by default (5x java at 100/sec plus 1x native at 100/sec).
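
A throttle like that could be as simple as a shared per-second budget that the handler consults before doing the (expensive) stack walk. A simplified sketch (the names, the 600/sec constant placement and the fixed one-second window are just illustrative):

#include <atomic>
#include <time.h>

// Hypothetical process-wide budget: at most 600 samples per wall-clock second.
static const long kMaxSamplesPerSec = 600;
static std::atomic<long> g_window_sec{0};
static std::atomic<long> g_samples_in_window{0};

// Called from the signal handler before walking a stack; returns false once
// the budget for the current one-second window has been spent.
static bool within_sample_budget() {
  struct timespec now;
  clock_gettime(CLOCK_MONOTONIC, &now);        // async-signal-safe
  long window = g_window_sec.load(std::memory_order_relaxed);
  if (now.tv_sec != window) {
    // New one-second window; a racy reset between threads is acceptable here.
    g_window_sec.store(now.tv_sec, std::memory_order_relaxed);
    g_samples_in_window.store(0, std::memory_order_relaxed);
  }
  return g_samples_in_window.fetch_add(1, std::memory_order_relaxed) < kMaxSamplesPerSec;
}
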

— Gil.

> On Jun 30, 2019, at 3:08 PM, Kirk Pepperdine <kirk.pepperdine at gmail.com> wrote:
> 
> Hi Gil,
> 
> I would support an improvement in sampling as there is an obvious bias which allows me to write benchmarks where JFR completely misses things it should find. That said, I’m not sure that waking a thread up every 10ms is a great idea as it is very disruptive to Linux thread scheduling. I’d very much like to experiment with lower sampling rates.
> 
> Kind regards,
> Kirk
> 
> 
>> On Jun 30, 2019, at 9:20 AM, Gil Tene <gil at azul.com> wrote:
>> 
>> I would like to discuss a potential improvement to the JFR thread 
>> sampling mechanism, and would like to see if the change we'd propose 
>> has already been considered in the past.
>> 
>> I believe that the current thread sampling mechanism (mostly via
>> hotspot/share/jfr/periodic/sampling/jfrThreadSampler.cpp) can be 
>> summarized as: A control thread wakes up periodically (e.g. 100 times 
>> per second) and in each period chooses a number (e.g. 5) of threads to 
>> sample (by rotating through the overall list of threads) only if they 
>> are "in java", and a number (e.g. 1) of threads (by separately rotating 
>> through the overall list of threads) to sample "only if it is in 
>> native". For each thread targeted to sample, the control thread 
>> suspends the target thread (e.g. for linux this is done by preparing 
>> a suspend request and sending a SIGUSR2 to make the thread deal with 
>> it), takes a stacktrace of the suspended thread, adds the stacktrace 
>> to JfrStackTraceRepository, and resumes the thread (e.g. on linux 
>> resumption involves setting up a resume request and again sending a 
>> SIGUSR2 to the thread to get it to handle it and resume).
>> 
>> We've been contemplating a change to make thread sampling use Posix 
>> timers instead, such that each thread would use a separate timer, and 
>> threads would receive signals based on their CPU consumption (the 
>> timer, e.g. created with timer_create(2), would be clocked by the 
>> thread CPU time of their associated threads, and signal their threads 
>> when that CPU time reaches a level [of e.g. 10 msec]). The signal 
>> handler will then perform the stacktrace sampling locally on the  
>> thread, and deliver the stacktrace to JfrStackTraceRepository (either 
>> directly or by enqueuing through an intermediary).
>> 
>> There are multiple potential benefits that may arise from switching 
>> to such a scheme, including significant reduction of sampling cost, 
>> improvement of density and focus of samples (fewer lost samples, 
>> ensuring that enough activity in a given thread will end up leading 
>> to a sample for that thread, etc.), and, potentially, an ability to 
>> (with additional  changes) better account for time spent "outside of 
>> java" in e.g. native and runtime code.
>> 
>> Has this (using thread-cpu-time-based posix timer sampling) been 
>> considered before?
>> 


