RFR: 8326012: JFR: Event for safepoint timeout

Fri Feb 16 13:51:04 UTC 2024

JFR already has some events related to safepoints. When time-to-safepoint (aka ttsp) is too long, however, these events are of limited help: they do not tell us which threads caused the delay or what those threads were doing.

Users can run with `-XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100` to see which threads fail to reach the safepoint in time, but the output contains no stack traces. `-XX:+AbortVMOnSafepointTimeout` does capture the stack traces, but it crashes the process, so it is not sensible to enable the flag in production.

This patch adds a new JFR event, `EventSafepoint`, to record the threads that cause ttsp to be too long.

This event includes two fields:

- safepointId: the relevant safepoint id
- timeExceeded: how long beyond `SafepointTimeoutDelay` the thread took to reach the safepoint
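
To make the two fields concrete, here is a rough sketch of how a recording could be inspected with the standard `jdk.jfr.consumer` API. It is not part of the patch. The event name as it appears in a recording is not stated in this mail, so it is taken from the command line; the field types are likewise unknown here, so the values are read untyped. The `timeExceeded` helper only illustrates the definition of the field above and is not how the patch computes it.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.time.Duration;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordedFrame;
import jdk.jfr.consumer.RecordedStackTrace;
import jdk.jfr.consumer.RecordedThread;
import jdk.jfr.consumer.RecordingFile;

final class SafepointTimeoutReport {

    // Illustrates the meaning of timeExceeded: the part of a thread's
    // time-to-safepoint beyond SafepointTimeoutDelay (zero if it was on time).
    static Duration timeExceeded(Duration timeToSafepoint, Duration timeoutDelay) {
        Duration over = timeToSafepoint.minus(timeoutDelay);
        return over.isNegative() ? Duration.ZERO : over;
    }

    // Field types are not stated in the mail, so values are read untyped.
    static Object field(RecordedEvent event, String name) {
        if (!event.hasField(name)) {
            return "?";
        }
        return event.getValue(name);
    }

    // Prints the two fields, the thread and the stack of every matching event.
    static void printEvents(Path recording, String eventName) throws IOException {
        for (RecordedEvent event : RecordingFile.readAllEvents(recording)) {
            if (!event.getEventType().getName().equals(eventName)) {
                continue;
            }
            RecordedThread thread = event.getThread();
            String threadName = "?";
            if (thread != null) {
                threadName = thread.getJavaName() != null
                        ? thread.getJavaName() : thread.getOSName();
            }
            System.out.println("safepointId=" + field(event, "safepointId")
                    + " timeExceeded=" + field(event, "timeExceeded")
                    + " thread=" + threadName);

            // Stack of the thread when it finally reached the safepoint.
            RecordedStackTrace stack = event.getStackTrace();
            if (stack == null) {
                continue;
            }
            for (RecordedFrame frame : stack.getFrames()) {
                System.out.println("    at " + frame.getMethod().getType().getName()
                        + "." + frame.getMethod().getName()
                        + ":" + frame.getLineNumber());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: SafepointTimeoutReport <recording.jfr> <event-name>");
            return;
        }
        printEvents(Path.of(args[0]), args[1]);
    }
}
```

It can be run in source-file mode, e.g. `java SafepointTimeoutReport.java recording.jfr <event-name>`, where both arguments are placeholders. As an example of the field's meaning: with `SafepointTimeoutDelay=100`, a thread that needs 350 ms to reach the safepoint would have a timeExceeded of 250 ms.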

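In the current version, this event records the stack of each problematic thread at the moment it finally reaches the safepoint. Hence the recorded stack is biased: it shows where the thread stopped, not necessarily what it was doing while the safepoint was pending. It is still helpful for tracking down the root cause.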

A better implementation would record a more accurate stack, but that would increase complexity. The native stack may also be important for this problem, but JFR does not currently support it.

Any input would be greatly appreciated.

-------------

Commit messages:
 - update
 - 8326012: JFR: Event for safepoint timeout

Changes: https://git.openjdk.org/jdk/pull/17888/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=17888&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8326012
  Stats: 143 lines in 7 files changed: 136 ins; 0 del; 7 mod
  Patch: https://git.openjdk.org/jdk/pull/17888.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/17888/head:pull/17888

PR: https://git.openjdk.org/jdk/pull/17888