RFR: 8282475: SafeFetch should not rely on existence of Thread::current [v6]

Fri Mar 11 23:37:41 UTC 2022

On Fri, 11 Mar 2022 16:34:29 GMT, Anton Kozlov <akozlov at openjdk.org> wrote:

> > Blocking SIGSEGV and SIGBUS, or other synchronous error signals like SIGFPE, and then triggering said signal is UB. What happens is OS-dependent. I have seen processes vanish, hang, or dump core. That makes sense: what is the kernel supposed to do? It cannot deliver the signal, and deferring it would require returning to the faulting instruction, which would just re-fault.
> > For some more details see e.g. https://bugs.openjdk.java.net/browse/JDK-8252533
> 
> This UB looks reasonable. My point is that a native thread would run fine with SIGSEGV blocked. But then the JVM decides it can do SafeFetch, and things get nasty.

Blocking synchronous error signals makes zero sense even for normal programs, since you lose the ability to get core dumps. For the JVM in particular, it also breaks facilities that rely on these signals, like polling pages or dynamically querying CPU abilities. So a JVM would not even start with synchronous error signals blocked.

> 
> > > Is there a crash that is fixed by the change? I just spotted it is an enhancement, not a bug. Just trying to understand the problem.
> > 
> > 
> > Yes, this issue is a breakout from https://bugs.openjdk.java.net/browse/JDK-8282306, where we'd like to use SafeFetch to make stack walking in AsyncGetCallTrace more robust. AGCT is called from the signal handler, and it may run in any number of situations (e.g. in foreign threads, or threads that are in the process of getting dismantled, etc).
> 
> I mean, some way to verify the issue is fixed, e.g. a test that does not fail anymore.

No, unfortunately no such test exists. Otherwise this regression would have been detected right away and we would not need this PR.

We do have a test that exercises SafeFetch during error handling, though. That test can be tweaked for this purpose. So a test does not exist yet, but one can easily be written.

> 
> I see that AsyncGetCallTrace assumes a JavaThread very early on, or am I looking at the wrong place? https://github.com/openjdk/jdk/blob/master/src/hotspot/share/prims/forte.cpp#L569
> 
> > Another situation is error handling itself. When writing an hs-err file, we use SafeFetch to carefully tiptoe around the possibly corrupt VM state. If the original crash happened in a foreign thread, we still want some of these reports to work (e.g. dumping register content or printing stacks). So SafeFetch should be as robust as possible.
> 
> OK, thanks. I think we also handle recursive segfaults and recover after interpreting the corrupted VM state. Otherwise, implementing the printing functions would be too tedious and hard with SafeFetch alone. But I see it's used in printing register content, at least.

Secondary error handling is a very coarse-grained tool. If an error reporting step crashes out, we continue with the next step. It has disadvantages though. The total number of retries is very limited. And a faulting error reporting step still hurts, because its report is compromised. E.g. if the call stack printing crashes out, we have no call stack. This is not an abstract problem. It's a very concrete and typical problem.

I spend a large part of my work with hs-err reports. They are of very high importance to us. We (SAP) have invested a lot of time and effort in hardening our OpenJDK error reporting, and SafeFetch is an important part of that. For example, we provided the facility that made SafeFetch usable in signal handling. It would be nice if our work was not compromised. Please let us find a way forward here.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7727