RFR: 8326012: JFR: Event for time to safepoint [v2]

Sun Feb 18 02:58:21 UTC 2024

On Fri, 16 Feb 2024 14:03:21 GMT, Denghui Dong <ddong at openjdk.org> wrote:

>> JFR already has several safepoint-related events. When time-to-safepoint (ttsp) is too long, however, these events are of little help, because they do not show which threads are delaying the safepoint or what those threads are doing.
>> 
>> Users can pass `-XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100` to see the threads that fail to reach the safepoint in time, but without stack traces. `-XX:+AbortVMOnSafepointTimeout` does capture the stack traces, but it crashes the process, so it is not sensible to enable the flag in production.
>> 
>> ~~This patch adds a new JFR event `EventSafepointTimeout` to record the threads that cause ttsp too long.~~
>> 
>> ~~This event includes two fields:~~
>> 
>> ~~- safepointId: the relevant safepoint id~~
>> ~~- timeExceeded: the amount of time exceeding `SafepointTimeoutDelay` used by the thread to reach safepoint~~
>> 
>> ~~In the current version, this event records the stack of those problematic threads when they finally reach safepoint. Hence, there is a bias, but it's still helpful to deduce the root place.~~
>> 
>> A better implementation would record a more accurate stack, but this would increase complexity. The native stack may also be important for this problem, but JFR does not currently support it.
>> 
>> Any input would be greatly appreciated.
>> 
>> Testing: jdk/jdk/jfr
>
> Denghui Dong has updated the pull request incrementally with one additional commit since the last revision:
> 
>   remove debug code

Convert to draft.

I improved the implementation and updated the issue title.

This patch introduces a new event `TimeToSafepoint` with the following fields:

- `safepointId`: the safepoint id, as in the other safepoint-related events

- `thread`: the target Java thread

- `iterations`: the number of state check iterations for the thread

The implementation:

- First, collect the ttsp information during the thread state check iterations. Note that even with a threshold of 0, the number of events can be lower than the number of Java threads, because nothing is collected in the first `for` loop (in my opinion it is not necessary there).

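- Then, emit the events at the end of `SafepointSynchronize::begin`, where the safepoint has been reached and stack traces can be walked safely. The recorded stack is biased, since it shows where each thread finally stopped rather than what it was doing while the safepoint was pending, but it is still helpful.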
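
As a rough illustration of the two steps above, here is a minimal Java model of the collect-then-emit idea. It is not the HotSpot code: `TtspSketch`, `ThreadSample`, `TtspEvent` and `synchronize` are hypothetical names, the per-thread ttsp values are made up, and counting the first pass as iteration 1 is an assumption of the sketch. Only the three event fields and the "skip the first pass, emit at the end" behaviour come from the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the collect-then-emit scheme. Not HotSpot code.
class TtspSketch {
    // Simulated thread: safe once `iterationsNeeded` state checks have run
    // (1 = already safe in the first pass); `ttspNanos` is a made-up ttsp.
    record ThreadSample(String name, int iterationsNeeded, long ttspNanos) {}

    // Mirrors the three event fields described in the post.
    record TtspEvent(long safepointId, String thread, int iterations) {}

    static List<TtspEvent> synchronize(long safepointId,
                                       List<ThreadSample> threads,
                                       long thresholdNanos) {
        Map<ThreadSample, Integer> collected = new LinkedHashMap<>();
        List<ThreadSample> running = new ArrayList<>();

        // First pass: threads that are already safe are dropped without
        // collecting anything, so they never produce an event.
        for (ThreadSample t : threads) {
            if (t.iterationsNeeded() > 1) {
                running.add(t);
            }
        }

        // Later passes: note the iteration in which each thread becomes safe.
        int iteration = 1;
        while (!running.isEmpty()) {
            final int current = ++iteration;
            running.removeIf(t -> {
                if (t.iterationsNeeded() <= current) {
                    collected.put(t, current);
                    return true;
                }
                return false;
            });
        }

        // All threads are stopped now: emit the events that pass the threshold.
        List<TtspEvent> events = new ArrayList<>();
        collected.forEach((t, its) -> {
            if (t.ttspNanos() >= thresholdNanos) {
                events.add(new TtspEvent(safepointId, t.name(), its));
            }
        });
        return events;
    }
}
```

In this model a thread that is already safe in the first pass produces no event even when `thresholdNanos` is 0, which is why the event count can be lower than the thread count.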

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17888#issuecomment-1950053636
PR Comment: https://git.openjdk.org/jdk/pull/17888#issuecomment-1950848310