RFR: 8284828: Use `os::ThreadCrashProtection` to protect AsyncGetCallTrace from crashing [v9]

Thu Apr 14 16:47:31 UTC 2022

On Thu, 14 Apr 2022 15:19:17 GMT, Johannes Bechberger <duke at openjdk.java.net> wrote:

>> Move the AsyncGetCallTrace method implementation into a separate method and wrap its call in non-assert compilation mode in `os::ThreadCrashProtection` like it is done in [JFR](https://github.com/openjdk/jdk/blob/965ea8d9cd29aee41ba2b1b0b0c67bb67eca22dd/src/hotspot/share/jfr/periodic/sampling/jfrThreadSampler.cpp#L165).
>> This prevents AsyncGetCallTrace from crashing on segmentation faults (but not on `guarantee`s).
>> 
>> If a crash is observed, then the `num_frames` field of the trace is set to `ticks_unknown_state` (-7) to signal a state that cannot be properly handled. `ticks_unknown_state` is currently also used for signaling unknown thread states but this should not be a problem, as the semantic is the same. If `num_frames` already has an error code then this error code is not changed. This helps to distinguish between errors in walking threads in Java and non-Java mode, as `num_frames` is set there before the walking to the appropriate error code.
>> 
>> _Thanks for @tstuefe for suggesting this._
>
> Johannes Bechberger has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Use JavaThread::current_or_null

One important difference to consider is that in JFR, in contrast to AGCT. there is only a single thread, the ThreadSampler thread, that is wrapped in the CrashProtection. Stack walking is different in JFR compared to AGCT, in that it is done by a _different_thread_, during a point where the target is suspended. Originally, this thread sampler thread was not even part of the VM, although now it is a NonJavaThread. It has been trimmed to not involve malloc(), raii, and other hard-to-recover from constructs, from the moment it has another thread suspended. Over the years, some transitive malloc() calls has snuck in, but it was eventually found due to rare deadlocking. Thomas brings a good point about crashes needing to be recoverable.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8225