RFR: 8304824: NMT should not use ThreadCritical [v9]

Thu Oct 31 16:51:37 UTC 2024

On Thu, 31 Oct 2024 12:34:38 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>>> I had to analyze this again, to understand why we need this locking, since my mind slips.
>>> 
>>> I updated my findings in https://bugs.openjdk.org/browse/JDK-8325890 . Please see details there.
>>> 
>>> But I don't think the current locking makes much sense, tbh (even before this patch). Or I don't get it.
>>> 
>>> We have three counters in play here: A) the `malloc[mtChunk]` malloc counter for the mtChunk tag B) the global malloc counter C) the `arena[xxx]` arena counter for the mtXXX tag the arena is tagged with
>>> 
>>> Upon arena growth, we incremement all four. In two steps - first (A) and (B) under os::malloc, then (C) in the arena code.
>>> 
>>> Upon arena death, we may delete the chunk, in which case we adjust all three counters as well.
>>> 
>>> But since we don't want (B) to include arena sizes, we have several places where we lazily adjust (B) and (A) from the sum of all arena counters (see JBS issue). These adjustments must be synchronized with arena allocations. But we don't do this. The lock does not include (C). For that, the lock would have to be at a different place, inside arena code. We only lock around (A) and (B).
>>> 
>>> I am not sure what the point of this whole thing is. I am also not convinced that it always works correctly. This whole "updating counters lazily" business makes the code brittle.
>>> 
>>> Also, the current locking does not guarantee correctness anyway. There are several places in the coding where we calculate e.g. the sum of all mallocs, but malloc counters are modified (rightly so) out of lock protection.
>>> 
>>> To make matters even more complex, we have not two locks here but three:
>>> 
>>>     * ThreadCritical
>>> 
>>>     * the new `NmtVirtualMemory_lock`
>>> 
>>>     * we also have the `NMTQuery_lock`, which guards NMT reports from jcmd exclusively
>>> 
>>> 
>>> Not sure what the point of the latter even is. It certainly does not prevent us from getting reports while underlying data structures are being changed, unless I am overlooking something.
>>> 
>>> It would be very nice if someone other than me could analyze this.
>> 
>> If this is so brittle and complex, then I wonder what you even get out of us doing the `ThreadCritical` trick. In other words, if we just removed it, would anyone notice and be able to discern what's occurred? Open question, I might do that change and run the tests on it.
>
>> If this is so brittle and complex, then I wonder what you even get out of us doing the ThreadCritical trick. In other words, if we just removed it, would anyone notice and be able to discern what's occurred? Open question, I might do that change and run the tests on it.
> 
> @jdksjolen good question. I am not sure who introduced that, may have been Afshin as a fix for some NMT test.

Hi @tstuefe,

> But since we don't want (B) to include arena sizes, we have several places where we lazily adjust (B) and (A) from the sum of all arena counters (see JBS issue). These adjustments must be synchronized with arena allocations. But we don't do this. The lock does not include (C). For that, the lock would have to be at a different place, inside arena code

Yes, updating all three counters (A/B/C) is not atomic when [growing](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/memory/arena.cpp#L294-L308) an arena or when [destroying](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/memory/arena.cpp#L240) an arena. I think the `ThreadCritical`, that's meant to keep the accounting in sync, needs to be further up the stack when destroying arenas. What's strange is that there's no `ThreadCritical` protecting the growing operation at all. Why was there an attempt to only protect the freeing?  

I think this is also confusing because its not consistent whether we count arena sizes as part of malloc. In the NMT report total, arena is counted as malloc. But in the reported individual categories, arena is disjunct from malloc. 

The solution you mentioned in https://bugs.openjdk.org/browse/JDK-8325890 seems like it would solve this consistency problem while also avoiding the double counting problem with mtChunk.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20852#issuecomment-2450355713