RFR: 8284828: Use `os::ThreadCrashProtection` to protect AsyncGetCallTrace from crashing [v6]

Thu Apr 14 15:45:31 UTC 2022

On Thu, 14 Apr 2022 14:33:35 GMT, Johannes Bechberger <duke at openjdk.java.net> wrote:

> I compared the code of ASGCT and JFR and found only one instance of a method that is used in ASGCT, but not in JFR and which does affect the VM state. It is the `JavaThread::block_if_vm_exited` method which is transitively called, a fix for this is coming. Therefore there is, in my opinion, no reason why we cannot wrap AsyncGetCallTrace in ThreadCrashProtection.

Sorry, you did address none of my arguments. I see a reasoning chain that starts by JFR being "good enough" and from there concluding that using it in AGCT its good enough too. Well, maybe it was a bad idea in the first place. And right now, this code is only used by JFR. Reusing it perpetuates this pattern which others will follow. Because its so convenient just to catch all crashes. 

Does the code use Resource Area? Then its not sigjmp safe. Does it do any kind of dynamic allocation which could leak? Then its not sigjmp safe. Does it use locks? Again, not sigjmp safe. Does it use any kind of RAII objects? You guess it, its not sigjmp safe. Does it call libc functions that cause it to change state? Probably unsafe too.

And even if you today can prove that the whole code is safe, how do you prevent bitrot? These are general purpose hotspot functions. Someone may tomorrow use malloc, or RA. How would you know?

The fact that we never saw a problem in JFR does not prove much. A corruption may never show symptoms. It may only effect certain platforms, certain conditions, certain flukes of the memory layout god. There may be that one crash that you really should not have ignored or sigjmp'ed out of, but you never saw it in your tests, because it only happens to that one customer. What I really worry about is the delayed effect. Hiding the crash may cause delayed errors that are really hard to track down. Effectively transforming what would be a simple crash with an hs-err file and a core into a maintenance nightmare. That is why sigjmp is used very sparingly, and only in very controlled environments.

Put it another way: to really prove that what you do is safe, you have to answer, for every possible crash location:
1) is the crash itself benign and can be ignored?
2) is it safe to jump out of the crash code back to the entrance of AGCT, or would that interruption corrupt the VM or a system library?

My point is, if you can answer these questions, you can fix the crash. Therefore it makes no sense to wholesale catch signals. Either the code is safe and does not crash, then no need for catch-all. Or it is not, then its important to understand where we crash and why. Ignoring unknown crashes is not a good idea.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8225