RFR: 8342818: Implement JEP 509: JFR CPU-Time Profiling [v42]

Andrei Pangin apangin at openjdk.org
Wed Jun 4 03:15:36 UTC 2025


On Tue, 3 Jun 2025 14:09:29 GMT, Johannes Bechberger <jbechberger at openjdk.org> wrote:

>> This is the code for the [JEP 509: CPU Time based profiling for JFR](https://openjdk.org/jeps/509).
>> 
>> Currently tested using [this test suite](https://github.com/parttimenerd/basic-profiler-tests). This runs profiles the [Renaissance](https://renaissance.dev/) benchmark with
>> - ... different heap sizes
>> - ... different GCs
>> - ... different samplers (the standard JFR and the new CPU Time Sampler and both)
>> - ... different JFR recording durations
>> - ... different chunk-sizes
>
> Johannes Bechberger has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Rename autoadapt

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 51:

> 49: static bool is_excluded(JavaThread* jt) {
> 50:   return jt->is_hidden_from_external_view() ||
> 51:          jt->jfr_thread_local()->is_excluded() ||

These restrictions cause a large blind spot in observability. There is no technical limitation for recording cpu samples for internal threads too, even without a Java stack trace. Consider removing this restriction, although not in this PR.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 135:

> 133:   while ((new_lost_samples = Atomic::cmpxchg(&_lost_samples, lost_samples, 0)) != lost_samples) {
> 134:     lost_samples = new_lost_samples;
> 135:   }

Why not `Atomic::xchg`?

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 161:

> 159:     return 0;
> 160:   }
> 161:   return os::active_processor_count() * 1000000000.0 / rate;

If sampling period is configured as an absolute number in milliseconds, this value must be passed as is.
Double conversion via `Runtime.availableProcessors()` / `active_processor_count()` is unobvious and error-prone. First, because of asymmetry: e.g. `Runtime.availableProcessors()` may be redefined by an agent so that its value is not aligned with `active_processor_count()`. Second, because number of available processors may change at runtime, e.g., by adjusting cgroup quotas.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 198:

> 196:   virtual void post_run();
> 197: public:
> 198:   virtual const char* name() const { return "JFR CPU Time Thread Sampler"; }

Thread name is too long and does not sound right. Logically, it is not "Thread Sampler", but rather "Sampler Thread", which also aligns with the existing "JFR Sampler Thread". But I'd simplify it to `JFR CPU Time Sampler` or maybe `JFR CPU Sampler Thread`.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 202:

> 200:   void run();
> 201:   void on_javathread_create(JavaThread* thread);
> 202:   bool create_timer_for_thread(JavaThread* thread, timer_t &timerid);

Should it be `private`?

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 252:

> 250:   timer_delete(*timer);
> 251:   tl->unset_cpu_timer();
> 252:   tl->deallocate_cpu_time_jfr_queue();

Either this line is not needed or there is a possible resource leak: if `create_timer_for_thread` fails, queue is allocated but not deallocated.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 281:

> 279:     stop_timer();
> 280:     Atomic::store(&_stop_signals, true);
> 281:     while (Atomic::load_acquire(&_active_signal_handlers) > 0) {

There can be a race when `handle_timer_signal` has already passed `_stop_signals` check but has not yet incremented `_active_signal_handlers`.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 308:

> 306: 
> 307:     if (Atomic::load_acquire(&_is_async_processing_of_cpu_time_jfr_requests_triggered)) {
> 308:       Atomic::release_store(&_is_async_processing_of_cpu_time_jfr_requests_triggered, false);

acquire/release seem to be used for no good reason. Also, this could be a single `cmpxchg`.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 326:

> 324:       if (jt->thread_state() != _thread_in_native || !tl->try_acquire_cpu_time_jfr_dequeue_lock()) {
> 325:         tl->set_do_async_processing_of_cpu_time_jfr_requests(false);
> 326:         continue; // thread doesn't have a last Java frame or queue is already being processed

This comment may sound confusing, since `has_last_Java_frame` is checked separately below.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 472:

> 470: 
> 471: void handle_timer_signal(int signo, siginfo_t* info, void* context) {
> 472:   assert(_instance != nullptr, "invariant");

There can be an arbitrary delay in async signal delivery.
It's unlikely, but not impossible for `_instance` to be deleted by the time signal handler is called. There should be a better way to synchronize with JFR shutdown.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 477:

> 475: 
> 476: 
> 477: void JfrCPUTimeThreadSampling::handle_timer_signal(siginfo_t* info, void* context) {

It may be a good idea to validate `info->si_code` in order to protect from things like `kill -SIGPROF` after profiling has stopped. For a similar reason, `_sampler->_stop_signals` should default to `true` whenever profiler is not running.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 509:

> 507:   JfrCPUTimeTraceQueue& queue = tl->cpu_time_jfr_queue();
> 508:   if (!check_state(jt)) {
> 509:       queue.increment_lost_samples();

nit: wrong indent

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 576:

> 574:   }
> 575:   if (timer_create(clock, &sev, &t) < 0) {
> 576:     log_error(jfr)("Failed to register the signal handler for thread sampling: %s", os::strerror(os::get_last_error()));

If an application has many threads and current RLIMIT_SIGPENDING is low, logs will be flooded with this error message.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.cpp line 606:

> 604: void JfrCPUTimeThreadSampler::init_timers() {
> 605:   // install sig handler for sig
> 606:   PosixSignals::install_generic_signal_handler(SIG, (void*)::handle_timer_signal);

SIGPROF is also used by external profilers. Need to check if SIGPROF handler is already installed and warn user.

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.hpp line 58:

> 56:   volatile u4 _head;
> 57: 
> 58:   volatile s4 _lost_samples;

Why signed int? Can it be negative?

src/hotspot/share/jfr/periodic/sampling/jfrCPUTimeThreadSampler.hpp line 128:

> 126:   static void send_lost_event(const JfrTicks& time, traceid tid, s4 lost_samples);
> 127: 
> 128:   // Trigger sampling while a thread is not in a safepoint, from a seperate thread

typo: separate

src/hotspot/share/jfr/support/jfrThreadLocal.cpp line 558:

> 556: void JfrThreadLocal::set_cpu_timer(timer_t* timer) {
> 557:   if (_cpu_timer == nullptr) {
> 558:     _cpu_timer = JfrCHeapObj::new_array<timer_t>(1);

`timer_t` is a primitive type, at most one machine word. Why extra indirection and allocation?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2124528320
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2124503100
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125157311
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125128723
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125130332
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125190998
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125203700
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125342289
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125249171
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125230422
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125241099
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125320255
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125411074
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125430231
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2124507884
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125333563
PR Review Comment: https://git.openjdk.org/jdk/pull/25302#discussion_r2125183931


More information about the hotspot-dev mailing list