RFR: 8284828: Use `os::ThreadCrashProtection` to protect AsyncGetCallTrace from crashing [v4]

Thu Apr 14 12:59:26 UTC 2022

On Thu, 14 Apr 2022 11:30:13 GMT, Johannes Bechberger <duke at openjdk.java.net> wrote:

> > We have only one other subsystem that does this kind of wholesale catching of error signals and then unwinding the stack (JFR) and presumably they fine-combed their coding and are sure exactly what they do under the sigjmp guard. Are we also certain?
> 
> They share almost all of their code. Allocating on the heap is not safe in the signal handler and should be considered an error.
>
> My goal is remove all code that (potentially) modifies the heap or the state of the VM from AsyncGetCallTrace, with the focus on the methods that are called from AsyncGetCallTrace but from JFR. Both are so similar that this practical.
> 
> Regarding the `NoHandleMark`: I must have missed this.
> 
> > I think the safer approach, albeit much more work intensive, would be to make sure we do not crash. Starting with removing RA allocation. I tried to make RA signal safe with https://bugs.openjdk.java.net/browse/JDK-8282405, but that got totally stuck, so better just remove it altogether from AGCT. Over using SafeFetch or plain defensive coding to avoid crashing.
> 
> They are complementary goals. My goal is to improve ASGCT and its stability but this PR should help prevent rare segmentation faults to appear in production.

Sorry, I remain unconvinced.

Catching and ignoring error signals is akin to silencing the smoke detector because the noise bothers you. Your house still burns.

With this patch, we would ignore error signals and the VM would continue running, with a potentially corrupted state of VM and/or system libraries. That may cause silent data corruption, or even security holes. A hard fast crash is always better.

We deliberately never return from error handling for this very reason. Once we report a crash, we never even unwind the stack. There had been attempts in the past to return from error handling to normal VM mode for this reason and that, and we always avoided it.

The sigjmp technique is fine if you can guarantee that your code does not change state beyond what lives on the thread stack. No global state can change then, not in the VM itself, nor in system libraries. In order to prove that your catch-all patch is safe, you would have to fine-comb AGCT and all the code it calls and make it safe to always jump out of. I believe the work involved would be much higher than just fixing the crashes, and it would be much more prone to bitrot.

I believe there is no easy way here. I think if you want to make AGCT safe, you have to analyze and solve the individual crash points. Because if you catch them wholesale, how do you know which crashes are safe to ignore, and which should stop the VM? And even if the crash is safe to ignore, how do you know that by sigjumping back you won't leave the VM in an inconsistent state?

-------------

PR: https://git.openjdk.java.net/jdk/pull/8225