RFR: 8346123: [REDO] NMT should not use ThreadCritical [v6]
Thomas Stuefe
stuefe at openjdk.org
Fri Dec 20 18:26:49 UTC 2024
On Thu, 19 Dec 2024 21:04:51 GMT, Robert Toyonaga <duke at openjdk.org> wrote:
>> This is a redo of [JDK-8304824](https://bugs.openjdk.org/browse/JDK-8304824) which was backed out by [JDK-8343726](https://bugs.openjdk.org/browse/JDK-8343726) due to problems documented in [JDK-8343244](https://bugs.openjdk.org/browse/JDK-8343244).
>>
>> The problem was that `NmtVirtualMemoryLocker` was not locking when the current thread is not attached by checking `Thread::current_or_null_safe() != nullptr`. This is necessary during VM init, but should not be allowed afterward. NMT may be used in `attach_current_thread` before the current thread is set. The lock was not being acquired in that case, which intermittently caused NMT accounting to become corrupted, triggering various assertions when future NMT operations are done. To fix this, I've adopted [Thomas' suggestion](https://github.com/openjdk/jdk/pull/21928#issuecomment-2460238057) to reverse the order of
>>
>>
>> thread->register_thread_stack_with_NMT();
>> thread->initialize_thread_current();
>>
>>
>> in `attach_current_thread`. This allows `NmtVirtualMemoryLocker` to be locked after current thread is set.
>>
>> To allow for `NmtVirtualMemoryLocker` to be used during VM init, I've replaced the `ConditionalMutexLocker` check `Thread::current_or_null_safe() != nullptr` with a new flag: `_done_bootstrap`. This flag prevents the lock from being used during VM init, at which point we are single threaded anyway. This avoids errors due to Hotspot mutexes and current thread not yet being ready.
>>
>> I also added new asserts in `virtualMemoryTracker.cpp` to catch future bugs like this where the lock is not held when it should be. I updated the appropriate VMT tests to also lock (there were a few cases where locking was being bypassed) so they can pass the new asserts.
>>
>> I also removed the unused `_query_lock` variable in `MemTracker`.
>>
>> Testing:
>>
>> - On Linux amd64, I was able to consistently reproduce the errors described in [JDK-8343244](https://bugs.openjdk.org/browse/JDK-8343244) by increasing the number of test threads in `java/lang/Thread/jni/AttachCurrentThread/AttachTest.java`. The test consistently passes with the new changes in this PR.
>> - hotspot_nmt , gtest:VirtualSpace, tier1
>
> Robert Toyonaga has updated the pull request incrementally with one additional commit since the last revision:
>
> Revert to using NmtVirtualMemoryLocker. Use defaultStream. Add comment for SharedDecoder_lock
@roberttoyonaga thank you for taking this on again.
Could you please add a comment somewhere that describes the locking problem we try to solve in all its ugliness? I keep forgetting the details myself.
Something along these lines:
We need locking in NMT for mmap-based allocated (malloc is lockless). Because mmap-based allocations use global data structures. But we also need those lock sections to include the OS-side mmap/munmap calls that caused modifications of said sections. So, `mmap` calls must happen together with their associated `MemTracker::record_reserve`, same for `munmap` and `MemTracker::record_release`. Without locking, NMT may see states that are invalid. For example this order: `T1 calls munmap; T2 calls mmap (same region); T2 calls MemTracker::record_reserve; T1 calls MemTracker::record_release;` would result in NMT seeing a `MemTracker::record_reserve` for a region that, in its opinion, is still occupied.
So we lock, and we want to use Mutex. This has two problems: a) does not work before mutexes have been initialized. b) does not work if Thread::current is NULL.
(a) is a problem for early mmaps. (Example: polling page setup ?)
(b) is a problem if a thread tries to use mmap but has not yet set Thread::current. This was stack region registration with NMT (because we misuse NMT mmap management for registering stack threads).
src/hotspot/os/windows/os_windows.cpp line 3626:
> 3624: os::print_memory_mappings((char*)start, bytes, &fs);
> 3625: assert(false, "bad release: [" PTR_FORMAT "-" PTR_FORMAT "): %s", p2i(start), p2i(end), err);
> 3626: #endif
@roberttoyonaga Thinking about this, I propose to just remove the os::print_memory_mappings call here. I added this way back when we had problems on Windows with removing "striped" NUMA mappings, but those issues have long been solved. We still have the assertion check, so we will notice if something goes wrong.
-------------
PR Review: https://git.openjdk.org/jdk/pull/22745#pullrequestreview-2513685220
PR Review Comment: https://git.openjdk.org/jdk/pull/22745#discussion_r1894197615
More information about the serviceability-dev
mailing list