RFR: 8293166: jdk/jfr/jvm/TestDumpOnCrash.java fails on Linux ppc64le and Linux aarch64 [v2]

Wed Nov 9 14:44:46 UTC 2022

On Wed, 9 Nov 2022 08:44:56 GMT, Ralf Schmelter <rschmelter at openjdk.org> wrote:

>> Disabling tiered compilation avoids the sporadic failures of the test.
>> 
>> On ppc64 and aarch64 a trap-based mechanism is used to switch from tier 1 to higher tiers. In the test a crash is provoked and a secondary error handler is installed at the start of error reporting, which doesn't handle these traps anymore and just stops the thread. But since the thread state is 'in Java', this prevents any safepoint from being executed. This in turn causes the JFR emergency dump to hang in a native to VM transition, so the dump is not written and the test fails (see the bug report for more details).
>
> Ralf Schmelter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Adjust copyright

+1

After discussing this offline with Ralf, I understand the problem better, and I think the workaround makes sense. 

IIUC the problem is that the JFR dumper attempts a SafePoint while in fatal error reporting. Another thread (the JFR sampler thread) happens to be in Java and attempts to enter the safepoint, but gets stuck because we switched the signal handler and nobody is there to handle SIGTRAP. It gets stalled in an endless loop in VMError::report_and_die ("Thread ABC also had an error"). The reporting thread then times out, so no JFR dump is written.

The underlying problem is that we don't handle SafePoint faults in the crash handler. We could add SafePoint handling to the secondary crash handler - we already handle SafeFetch faults the same way. But do we want to do this? We are in a fatal error situation. The fact that all threads stop cold once they enter SafePoints could even be seen as a feature: if the VM is in fatal error mode and does error reporting, we want as little code running concurrently as possible, so as not to interfere with error reporting and to get unspoiled cores.

---

Some details I don't understand yet. Why does it only happen on aarch64 and PPC? Should this not happen on all platforms? The SIGSEGV mechanism should not work there either.

And this is more of a @TheRealMDoerr question: I'm confused about how ppc implements SafePoints. Why do we use SIGTRAP? Its handling seems so complex. On all other platforms, we just do "SIGSEGV + crash address in polling page -> goto safepoint stub". On PPC, on SIGTRAP, we need to check the instruction (TD), and then we search for the code blob. We do three linear searches, two over the code heaps and one for the blob within the code heap (CodeCache::contains and CodeCache::find_blob). That is much more work than what the other platforms do. I must be missing something obvious. Why is SIGTRAP better?
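
To make the comparison concrete, here is a rough toy model of the two dispatch styles as I described them above. It is only a sketch, not HotSpot code: `Range`, `Action`, `classify_segv` and `classify_trap` are names made up for illustration, the TD instruction check is reduced to a boolean parameter, and the three searches are collapsed into a single loop over hypothetical blob ranges.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model only: none of these names exist in HotSpot.
enum class Action { GotoSafepointStub, NotASafepointPoll };

struct Range {
    std::uintptr_t begin;
    std::uintptr_t end;  // exclusive
    bool contains(std::uintptr_t a) const { return a >= begin && a < end; }
};

// Style 1 (SIGSEGV-based): one range check of the fault address
// against the polling page decides everything.
Action classify_segv(std::uintptr_t fault_addr, const Range& polling_page) {
    assert(polling_page.begin < polling_page.end);
    return polling_page.contains(fault_addr) ? Action::GotoSafepointStub
                                             : Action::NotASafepointPoll;
}

// Style 2 (SIGTRAP-based): first check that the trapping instruction is
// the poll trap (modeled here as a boolean), then look up the pc in the
// code blobs with a linear search.
Action classify_trap(std::uintptr_t pc, bool is_poll_trap_instruction,
                     const std::vector<Range>& code_blobs) {
    if (!is_poll_trap_instruction) {
        return Action::NotASafepointPoll;
    }
    for (const Range& blob : code_blobs) {
        if (blob.contains(pc)) {
            return Action::GotoSafepointStub;
        }
    }
    return Action::NotASafepointPoll;
}
```

Under these assumptions the first style is a constant-time comparison, while the second grows with the number of blobs that have to be searched, which is the extra work I am asking about.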

Cheers, Thomas

-------------

Marked as reviewed by stuefe (Reviewer).

PR: https://git.openjdk.org/jdk/pull/10943