JFR thread sampling mechanism

Mon Jul 1 01:47:12 UTC 2019

[Redirected to hotspot-jfr-dev]

Note that the proposed mechanism would not "wake up" any threads
for sampling, ever. It would only ever signal a running thread that is
currently active and whose accumulated CPU consumption since the
last time it was sampled has reached the configured quantum (e.g.
10 msec), and will tend to handle the signal on the same vcore that
is already running that thread. Without actually being currently
running, a thread could not accumulate the CPU time needed for
the trigger.
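
To make that last point concrete, here is a small sketch that measures
per-thread CPU time for a thread that blocks and for one that runs. It
uses Python's time.thread_time() as a stand-in for the thread CPU clock
that such a timer would be driven by. The helper names (cpu_time_used,
block_for_50ms, run_for_50ms, measure) are made up for illustration and
are not part of JFR or of the proposal.

```python
import threading
import time


def cpu_time_used(work, results, key):
    """Run work() and record the CPU time this thread consumed doing it."""
    start = time.thread_time()  # per-thread CPU clock, not wall clock
    work()
    results[key] = time.thread_time() - start


def block_for_50ms():
    # Off-CPU: wall time passes, but the thread's CPU clock barely moves.
    time.sleep(0.05)


def run_for_50ms():
    # Spin until this thread has consumed 50 ms of its own CPU time.
    deadline = time.thread_time() + 0.05
    while time.thread_time() < deadline:
        pass


def measure():
    """Return thread CPU seconds consumed by a blocked and a running thread."""
    results = {}
    workers = [
        threading.Thread(target=cpu_time_used,
                         args=(block_for_50ms, results, "blocked")),
        threading.Thread(target=cpu_time_used,
                         args=(run_for_50ms, results, "running")),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

Calling measure() should report roughly 50 msec for the "running"
entry and close to zero for the "blocked" one, so a 10 msec CPU-time
trigger could only ever be reached by the thread that is actually
executing.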

This mechanism is actually the cheapest possible thing for a kernel
to support (much cheaper than wall-clock-based timers), since the
scheduler is already intimately managing the running thread's CPU
time as part of time slicing and quantum management. Simply put,
the (per core) scheduler executes only one thread on the core at
a time, and when it context switches the thread away for any reason
(quantum ended, thread blocked, or some form of preemption), it adds
the execution time up to that point to the CPU time consumed by the
thread. To implement a CPU-time-based timer, the scheduler simply
has to limit the quantum (the current "time slice" it would let a thread
run for before preempting it for no other reason) to the remaining time
on the shortest pending timer for that thread, and check whether any
timer signals need to be raised when the quantum expires (no polling
or testing during the time slice, and no need for any other check).
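
As a rough illustration of the scheduler logic described above, here
is a toy model (not kernel code). SimThread and run_slice are
hypothetical names, times are in microseconds, and each thread has at
most one pending timer. The only timer-specific work is capping the
slice and one check when the slice ends.

```python
from dataclasses import dataclass


@dataclass
class SimThread:
    """Hypothetical model of a thread as a per-core scheduler sees it."""
    cpu_time_us: int = 0             # CPU time consumed so far
    timer_interval_us: int = 10_000  # CPU-time timer period (10 msec)
    next_expiry_us: int = 10_000     # CPU time at which the timer fires next
    signals: int = 0                 # timer signals raised so far


def run_slice(thread, default_slice_us, wants_us):
    """Run one time slice; return how long the thread actually ran.

    wants_us is how long the thread would run before blocking on its
    own (0 models a thread that is not runnable at all).
    """
    remaining_on_timer = thread.next_expiry_us - thread.cpu_time_us
    # The one change a CPU-time timer needs: cap the slice at the time
    # remaining on the thread's pending timer.
    slice_us = min(default_slice_us, remaining_on_timer)
    ran_us = min(slice_us, wants_us)
    # CPU time is accounted once, when the thread is switched out.
    thread.cpu_time_us += ran_us
    # Timer expiry is checked once, at the end of the slice.
    if thread.cpu_time_us >= thread.next_expiry_us:
        thread.signals += 1
        thread.next_expiry_us += thread.timer_interval_us
    return ran_us
```

With a 4 msec default slice, a busy thread runs slices of 4, 4 and 2
msec and gets one signal per 10 msec of CPU time consumed. A thread
that never runs (wants_us of 0) accumulates nothing and is never
signaled.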

The rate at which you'd do the actual sampling on a 100% running
thread is configurable. I chose 10 msec as an example because
it matches the highest possible rate that one thread could see under
the current defaults.

It's also possible to do all sorts of throttling on the total sampling rate,
and on the total timer signaling rate if needed. Without throttling,
1000 non-stop runnable threads on a 1-core instance will only do a
total of 100 samples per second, but on a 20-core instance they could
generate 2,000 samples per second. Adding sample throttling (which
would, e.g., simply skip recording individual samples if doing so would
exceed the rate quota) would remedy that. The intuitive default
process-wide throttling limit should probably be 600/sec, for continuity
with the current implementation, which maxes out at that sample
generation rate by default (5 Java samples per period plus 1 native
sample per period, at 100 periods per second).
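
The arithmetic above, plus one possible shape for such a throttle, can
be sketched as follows. All names here are hypothetical. The
fixed one-second window is just the simplest choice for illustration,
not something the proposal prescribes, and a real version running in a
signal handler would have to be async-signal-safe, which this sketch
does not attempt.

```python
def unthrottled_samples_per_sec(runnable_threads, cores, quantum_ms=10):
    """Upper bound on samples/sec with one sample per quantum of CPU time."""
    running = min(runnable_threads, cores)  # only running threads burn CPU
    return running * (1000 // quantum_ms)


def current_default_max_per_sec(periods_per_sec=100,
                                java_per_period=5,
                                native_per_period=1):
    """Maximum rate of the existing sampler under the defaults quoted above."""
    return periods_per_sec * (java_per_period + native_per_period)


class SampleThrottle:
    """Fixed-window limiter: skips samples once the per-second quota is used."""

    def __init__(self, max_per_sec=600):
        self.max_per_sec = max_per_sec
        self.window = None  # the whole second currently being counted
        self.count = 0      # samples recorded in that second

    def try_record(self, now_sec):
        """Return True if a sample taken at now_sec may be recorded."""
        window = int(now_sec)
        if window != self.window:  # a new second has started: reset the quota
            self.window = window
            self.count = 0
        if self.count >= self.max_per_sec:
            return False           # over quota: skip this sample
        self.count += 1
        return True
```

For example, unthrottled_samples_per_sec(1000, 1) gives 100,
unthrottled_samples_per_sec(1000, 20) gives 2000, and
current_default_max_per_sec() gives 600. Feeding 2,000 evenly spaced
sample attempts from one second into SampleThrottle(600) records 600 of
them and skips the rest.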

— Gil.

> On Jun 30, 2019, at 3:08 PM, Kirk Pepperdine <kirk.pepperdine at gmail.com> wrote:
> 
> Hi Gil,
> 
> I would support an improvement in sampling as there is an obvious bias which allows me to write benchmarks where JFR completely misses things it should find. That said, I’m not sure that waking a thread up every 10ms is a great idea as it is very disruptive to Linux thread scheduling. I’d very much like to experiment with lower sampling rates.
> 
> Kind regards,
> Kirk
> 
> 
>> On Jun 30, 2019, at 9:20 AM, Gil Tene <gil at azul.com> wrote:
>> 
>> I would like to discuss a potential improvement to the JFR thread
>> sampling mechanism, and would like to see if the change we'd
>> propose has already been considered in the past.
>> 
>> I believe that the current thread sampling mechanism (mostly via
>> hotspot/share/jfr/periodic/sampling/jfrThreadSampler.cpp) can be
>> summarized as: A control thread wakes up periodically (e.g. 100
>> times per second) and in each period chooses a number (e.g. 5) of
>> threads to sample (by rotating through the overall list of threads)
>> only if they are "in java", and a number (e.g. 1) of threads (by
>> separately rotating through the overall list of threads) to sample
>> only if they are "in native". For each thread targeted to sample, the
>> control thread suspends the target thread (e.g. on Linux this is
>> done by preparing a suspend request and sending a SIGUSR2 to
>> make the thread deal with it), takes a stacktrace of the suspended
>> thread, adds the stacktrace to JfrStackTraceRepository, and
>> resumes the thread (e.g. on Linux resumption involves setting up
>> a resume request and again sending a SIGUSR2 to the thread to
>> get it to handle it and resume).
>> 
>> We've been contemplating a change to make thread sampling use
>> POSIX timers instead, such that each thread would use a separate
>> timer, and threads would receive signals based on their CPU
>> consumption (each timer, e.g. created with timer_create(2), would
>> be clocked by the CPU time of its associated thread, and would
>> signal that thread when that CPU time reaches a level
>> [of e.g. 10 msec]). The signal handler will then perform the
>> stacktrace sampling locally on the thread, and deliver the
>> stacktrace to JfrStackTraceRepository (either directly or by
>> enqueuing through an intermediary).
>> 
>> There are multiple potential benefits that may arise from switching
>> to such a scheme, including significant reduction of sampling cost,
>> improvement of density and focus of samples (fewer lost samples,
>> ensuring that enough activity in a given thread will end up leading
>> to a sample for that thread, etc.), and, potentially, an ability to
>> (with additional changes) better account for time spent "outside
>> of java" in e.g. native and runtime code.
>> 
>> Has this (using thread-cpu-time-based POSIX timer sampling) been
>> considered before?
>>