RFR: 8284828: Use `os::ThreadCrashProtection` to protect AsyncGetCallTrace from crashing [v6]

Ludovic Henry luhenry at openjdk.java.net
Thu Apr 14 13:55:59 UTC 2022


On Thu, 14 Apr 2022 13:06:58 GMT, Johannes Bechberger <duke at openjdk.java.net> wrote:

>> Move the AsyncGetCallTrace method implementation into a separate method and wrap its call in non-assert compilation mode in `os::ThreadCrashProtection` like it is done in [JFR](https://github.com/openjdk/jdk/blob/965ea8d9cd29aee41ba2b1b0b0c67bb67eca22dd/src/hotspot/share/jfr/periodic/sampling/jfrThreadSampler.cpp#L165).
>> This prevents AsyncGetCallTrace from crashing on segmentation faults (but not on `guarantee`s).
>> 
>> If a crash is observed, then the `num_frames` field of the trace is set to `ticks_unknown_state` (-7) to signal a state that cannot be properly handled. `ticks_unknown_state` is currently also used for signaling unknown thread states but this should not be a problem, as the semantic is the same. If `num_frames` already has an error code then this error code is not changed. This helps to distinguish between errors in walking threads in Java and non-Java mode, as `num_frames` is set there before the walking to the appropriate error code.
>> 
>> _Thanks for @tstuefe for suggesting this._
>
> Johannes Bechberger has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Handle nested ThreadCrashProtection on POSIX

It could be worth exploring unifying the JFR and AGCT stackwalking mechanism then. If the crash handler is good enough for JFR, it should be considered good enough for AGCT.

As raised in a separate email exchange, this doesn't however fix all kind of crashes, such as calls to `guarantee(...)` (like in `CodeCache::find_blob`) since it does not trigger a signal but calls `VMError::report_and_die` directly (or equivalent). Since the code of JFR for stackunwinding is pretty much a copy/paste of AGCT, it's not clear to me what's the difference is (which locks are being held? which other synchronization mechanism?). By unifying, and making sure we use JFR's mechanism wholesale in AGCT, we would make sure AGCT is at least as stable as JFR. From our production use, JFR is stable (we haven't observe consistent crashes like we do with AGCT), so that is to me the reference we need to match.

I understand the AGCT2 proposal is unifying + native frames. So we could take the unifying and make it part of this enhancement to AGCT and keep the native frames for AGCT2.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8225


More information about the hotspot-dev mailing list