RFR: 8304824: NMT should not use ThreadCritical [v3]

Robert Toyonaga duke at openjdk.org
Sat Sep 14 02:38:07 UTC 2024


On Fri, 13 Sep 2024 16:17:56 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>>> After switching to a Hotspot Mutex, it looks like the `windows-x64 / test (hs/tier1 common) GHA` is failing because the test `release_bad_ranges` in test_os.cpp is expecting an assertion and an error message to be printed. However, when this printing happens, `tty_lock` gets acquired out of rank order with the already held `NMT_lock`, causing the lock rank assertion fail. One solution would be to lower the rank of `tty_lock`. I'm not sure that's the best solution because that might cause rank conflicts with other locks (and it makes sense to give locks the highest possible rank to minimize future conflicts).
>> 
>> What code exactly locks tty_lock?
>
>> > After switching to a Hotspot Mutex, it looks like the `windows-x64 / test (hs/tier1 common) GHA` is failing because the test `release_bad_ranges` in test_os.cpp is expecting an assertion and an error message to be printed. However, when this printing happens, `tty_lock` gets acquired out of rank order with the already held `NMT_lock`, causing the lock rank assertion fail. One solution would be to lower the rank of `tty_lock`. I'm not sure that's the best solution because that might cause rank conflicts with other locks (and it makes sense to give locks the highest possible rank to minimize future conflicts).
>> 
>> What code exactly locks tty_lock?
> 
> This is weird. We print the error in this case via print_error_for_unit_test(), which takes care only to use raw fprintf(stderr) for this very reason... therefore I am interested in understanding the lock semantics, to know if deadlock detection found a real problem or if this is just a weird error setup.

Hi @tstuefe, it looks like calling [`os::print_memory_mappings`](https://github.com/openjdk/jdk/blob/master/src/hotspot/os/windows/os_windows.cpp#L3785) from the windows implementation of `os::pd_release_memory` is causing the rank conflict between `tty_lock` and `NMT_lock`.  This is getting called from `os::release_memory` [**after** we've already acquired the lock for NMT](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/os.cpp#L2185-L2189).

gtest `release_bad_ranges` --> `os::release_memory` and acquire `NMT_lock` --> `os::pd_release_memory` --> `os::print_memory_mappings` and acquire `tty_lock`

Is there a reason we cast a wider net and lock from outside of NMT code in `os::release_memory`, `os::uncommit_memory`,  and `os::unmap_memory`?  By contrast, in other functions such as `os::commit_memory`  and `os::reserve_memory` we lock inside of `MemTracker` instead. 
For example, `pd_release_memory` is protected by the NMT lock, but `pd_reserve_memory` is not. Is the lock meant to synchronize the actual memory operations as well as NMT recording in some cases? If not, then maybe it makes sense to  delay and move the lock acquisition inside `MemTracker` for consistency and to reduce surface area for lock conflicts.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20852#issuecomment-2350794913


More information about the serviceability-dev mailing list