RFR: 8304824: NMT should not use ThreadCritical

Fri Sep 6 08:34:51 UTC 2024

On Fri, 6 Sep 2024 04:27:44 GMT, David Holmes <dholmes at openjdk.org> wrote:

>> ### Summary
>> This PR just replaces `ThreadCritical` with a lock specific to NMT.  `ThreadCritical` is a big lock and is unnecessary for the purposes of NMT. I've implemented the new lock with a semaphore so that it can be used early before VM init.  There is also the possibility of adding assertions in places we expect NMT to have synchronization. I haven't added assertions yet in many places because some code paths such as the (NMT tests)  don't lock yet. I think it makes sense to close any gaps in locking in another PR in which I can also add more assertions. 
>> 
>> Testing:
>> - hotspot_nmt
>> - gtest:VirtualSpace
>> - tier1
>
> src/hotspot/share/nmt/nmtCommon.hpp line 153:
> 
>> 151:       intx const current =  os::current_thread_id();
>> 152:       intx const owner = Atomic::load(&_owner);
>> 153:       assert(current != owner, "Lock is not reentrant");
> 
> ThreadCritical is reentrant though. We would need to do a lot of testing and/or code analysis to ensure we don't have the possibility of needing a reentrant lock with NMT. The most likely problems would be if we triggered a crash whilst in a locked section of code.

We have a healthy amount of tests already. We need a wide use of assert_locked and need to take a close look at error reporting. Locking, in NMT, is rather simple.

Error handling: My pragmatic "good enough for this rare scenario" proposal would be: 

At the entry of error reporting, if the NMT mutex is locked by the crashing thread, just unlock it manually and try to continue. The chance is high that this will actually work well enough to survive the few mallocs we may do during error reporting. 

Reasoning: We use NMT mutex as we did ThreadCritical: rather widely scoped. The time windows where things are out of sync and reentrant use of the same functionality can bite us are small. Moreover, NMT's two subsystems - malloc tracker and mmap tracker - are mostly independent from each other, and even leaving the mmap tracker in an inconsistent state will not prevent us from doing malloc during error reporting. We don't use mmap during error reporting, just malloc. Finally, moreover, NMT malloc code is comparatively simple and stable; mmap code is more complex, and we plan to rework parts of it in the near future.

TL;DR I think just manually releasing the lock in error reporting has a high chance of succeeding. Especially if we disable the NMT report in the hs-err file I am not even sure this is necessary, though. We have safeguards against error reporting locking up (step timeouts and secondary crash guards).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20852#discussion_r1746714270