RFR: 8304824: NMT should not use ThreadCritical

Fri Sep 6 10:39:52 UTC 2024

On Fri, 6 Sep 2024 08:32:01 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> src/hotspot/share/nmt/nmtCommon.hpp line 153:
>> 
>>> 151:       intx const current =  os::current_thread_id();
>>> 152:       intx const owner = Atomic::load(&_owner);
>>> 153:       assert(current != owner, "Lock is not reentrant");
>> 
>> ThreadCritical is reentrant though. We would need to do a lot of testing and/or code analysis to ensure we don't have the possibility of needing a reentrant lock with NMT. The most likely problems would be if we triggered a crash whilst in a locked section of code.
>
> We have a healthy amount of tests already. We need a wide use of assert_locked and need to take a close look at error reporting. Locking, in NMT, is rather simple.
> 
> Error handling: My pragmatic "good enough for this rare scenario" proposal would be: 
> 
> At the entry of error reporting, if the NMT mutex is locked by the crashing thread, just unlock it manually and try to continue. The chance is high that this will actually work well enough to survive the few mallocs we may do during error reporting. 
> 
> Reasoning: We use NMT mutex as we did ThreadCritical: rather widely scoped. The time windows where things are out of sync and reentrant use of the same functionality can bite us are small. Moreover, NMT's two subsystems - malloc tracker and mmap tracker - are mostly independent from each other, and even leaving the mmap tracker in an inconsistent state will not prevent us from doing malloc during error reporting. We don't use mmap during error reporting, just malloc. Finally, moreover, NMT malloc code is comparatively simple and stable; mmap code is more complex, and we plan to rework parts of it in the near future.
> 
> TL;DR I think just manually releasing the lock in error reporting has a high chance of succeeding. Especially if we disable the NMT report in the hs-err file I am not even sure this is necessary, though. We have safeguards against error reporting locking up (step timeouts and secondary crash guards).

Thinking through this more, we can probably shorten my advice to this:

When entering error handling, if NMT lock is held by crashing thread and NMT level is detail, switch back to NMT level summary.

That's it.

Reasoning: we only do mallocs inside hs-err generation. For malloc, we only lock inside NMT if the NMT level is detail. And this should never change: malloc is too hot to justify a lock in summary mode.

The only point where possible crashes or deadlocks can occur in error reporting is when doing the NMT report itself - but chances for this to happen are remote, and we have secondary crash handling and error step timeouts to deal with this if it happens.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20852#discussion_r1746910791