RFR: 8284828: Use `os::ThreadCrashProtection` to protect AsyncGetCallTrace from crashing
Jaroslav Bachorik
jbachorik at openjdk.java.net
Wed Apr 13 15:47:08 UTC 2022
On Wed, 13 Apr 2022 15:29:00 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:
>> Move the AsyncGetCallTrace method implementation into a separate method and wrap its call in non-assert compilation mode in `os::ThreadCrashProtection` like it is done in [JFR](https://github.com/openjdk/jdk/blob/965ea8d9cd29aee41ba2b1b0b0c67bb67eca22dd/src/hotspot/share/jfr/periodic/sampling/jfrThreadSampler.cpp#L165).
>> This prevents AsyncGetCallTrace from crashing on segmentation faults (but not on `guarantee`s).
>>
>> If a crash is observed, then the `num_frames` field of the trace is set to `ticks_unknown_state` (-7) to signal a state that cannot be properly handled. `ticks_unknown_state` is currently also used for signaling unknown thread states but this should not be a problem, as the semantic is the same. If `num_frames` already has an error code then this error code is not changed. This helps to distinguish between errors in walking threads in Java and non-Java mode, as `num_frames` is set there before the walking to the appropriate error code.
>>
>> _Thanks for @tstuefe for suggesting this._
>
> I'm actually ambivalent about the "production only" thing.
>
> I dislike having different behavior between debug and release, I prefer to test what later runs in the field. Also, if we prevent crashes because we want to ignore them, we should ignore them in debug too, otherwise, we burn error analysis cycles needlessly.
>
> Optionally, I'd make the behavior of ThreadCrashProtection switchable at runtime with a diagnostic XX switch. That switch would control the jumping back in the signal handler. That way, if you encounter strange bugs while using async profiler, one can disable the crash guard and enable asserts and crashes. If you decide to do that, we should also do tests for it, so maybe leave it for a future RFE.
What @tstuefe said about debug vs. production made me wonder whether the assert in `ThreadCrashProtection` is still useful at all. I mean, we are deliberately using it from threads other than the JFR sampler thread and if that usage is not breaking stuff left and right, probably we could remove it? (/cc @mgronlun)
-------------
PR: https://git.openjdk.java.net/jdk/pull/8225
More information about the serviceability-dev
mailing list