RFR: 8304824: NMT should not use ThreadCritical [v2]

Sat Sep 7 17:33:40 UTC 2024

On Fri, 6 Sep 2024 10:35:39 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> We have a healthy amount of tests already. We need a wide use of assert_locked and need to take a close look at error reporting. Locking, in NMT, is rather simple.
>> 
>> Error handling: My pragmatic "good enough for this rare scenario" proposal would be: 
>> 
>> At the entry of error reporting, if the NMT mutex is locked by the crashing thread, just unlock it manually and try to continue. The chance is high that this will actually work well enough to survive the few mallocs we may do during error reporting. 
>> 
>> Reasoning: We use NMT mutex as we did ThreadCritical: rather widely scoped. The time windows where things are out of sync and reentrant use of the same functionality can bite us are small. Moreover, NMT's two subsystems - malloc tracker and mmap tracker - are mostly independent from each other, and even leaving the mmap tracker in an inconsistent state will not prevent us from doing malloc during error reporting. We don't use mmap during error reporting, just malloc. Finally, moreover, NMT malloc code is comparatively simple and stable; mmap code is more complex, and we plan to rework parts of it in the near future.
>> 
>> TL;DR I think just manually releasing the lock in error reporting has a high chance of succeeding. Especially if we disable the NMT report in the hs-err file I am not even sure this is necessary, though. We have safeguards against error reporting locking up (step timeouts and secondary crash guards).
>
> Thinking through this more, we can probably shorten my advice to this:
> 
> When entering error handling, if NMT lock is held by crashing thread and NMT level is detail, switch back to NMT level summary.
> 
> That's it.
> 
> Reasoning: we only do mallocs inside hs-err generation. For malloc, we only lock inside NMT if the NMT level is detail. And this should never change: malloc is too hot to justify a lock in summary mode.
> 
> The only point where possible crashes or deadlocks can occur in error reporting is when doing the NMT report itself - but chances for this to happen are remote, and we have secondary crash handling and error step timeouts to deal with this if it happens.

Ok I've added some logic in `VMError::report` to switch to summary mode (I think this is the right place). However, if we need to do error reporting while the `NmtGuard` is already held, then I think we might still have reentrancy even in summary mode. After the latest commit that replaces `MemoryFileTracker`'s locker with `NmtGuard`, `MemBaseline::baseline_summary()` now acquires `NmtGuard` too.   

Because of this I've also made `NmtGuard` reentrant. Maybe handling reentrancy would be beneficial in future/unforeseen situations as well. But if this is unnecessary, I don't mind removing it.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20852#discussion_r1748744931