RFR: 8157023: Integrate NMT with JFR

Fri Dec 2 14:10:13 UTC 2022

On Fri, 2 Dec 2022 11:51:52 GMT, Stefan Johansson <sjohanss at openjdk.org> wrote:

>> Hi @kstefanj ,
>> 
>>> Haven't looked at the details of `baseline(bool summaryOnly)` that much, but since it `summaryOnly = true` I don't think it actually walk the thread stacks, right?
>> 
>> I think we do, since summary info contains information about committed thread stack size. I think the call chain is 
>> 
>> 
>> MemBaseline::baseline()->
>> MemBaseline::baseline_summary()->
>> VirtualMemorySummary::snapshot()->
>> VirtualMemoryTracker::snapshot_thread_stacks()->
>> VirtualMemoryTracker::walk_virtual_memory()->
>> SnapshotThreadStackWalker::do_allocation_site()->
>> RegionIterator::next_committed()->
>> os::committed_in_range()->
>> mincore loop
>> 
>> 
>>> 
>>> We don't do this at a safepoint but looking at `MallocMemorySnapshot::copy_to(...)` it uses `ThreadCritical` to avoid things being cleaned out at the same time. Not sure if there are other thread safety problems though. I would expect this to have the same problems as a summary report triggered through `jcmd` because that isn't run at a safepoint either.
>> 
>> You are right, I could have sworn we do this with a VM op. Must have confused this with metaspace printing.
>> 
>>> But I do see that when used from `jcmd` we take a lock to serialize the NMT query, so we should probably do the same here.
>>> 
>>> ```
>>>   // Query lock is used to synchronize the access to tracking data.
>>>   // So far, it is only used by JCmd query, but it may be used by
>>>   // other tools.
>>>   static inline Mutex* query_lock() {
>>>     assert(NMTQuery_lock != NULL, "not initialized!");
>>>     return NMTQuery_lock;
>>>   }
>>> ```
>>> 
>>> So we certainly need to look closer at this. I would like to understand why the query lock is needed.
>> 
>> To serialize jcmds running in parallel. Since jcmd is done by the attach listener thread, the only way I can think of would be if someone in parallel executes the command via mbean? Not sure.
>> 
>> Note that there is no way to get consistent counter values across all categories anyway. All the counters are updated atomically without lock, and that is all we can afford in malloc. Counters are not synchronized among themselves.
>
>> I think we do, since summary info contains information about committed thread stack size. 
> 
> You are totally correct, somehow missed that part of `VirtualMemorySummary::snapshot()`. So yes, we should really look at only doing one snapshot per period. Any feeling for exactly how expensive this walk is?
> 
>> To serialize jcmds running in parallel. Since jcmd is done by the attach listener thread, the only way I can think of would be if someone in parallel executes the command via mbean? Not sure.
> 
> Ok, so if executed via the mbean it would be executed in another thread but still use `DCMD::execute()` and the lock to make sure those are synchronized. Anyhow, do you agree with my above understanding that what is protected by the lock is the "cached baseline"? So as long as JFR doesn't use it, the lock should not be needed?
> 
>> Note that there is no way to get consistent counter values across all categories anyway. All the counters are updated atomically without lock, and that is all we can afford in malloc. Counters are not synchronized among themselves.
> 
> Agreed, this is a snapshot at a given point of a moving target.

> Looking a bit more at this. To me it seems like we are protecting the stored baseline with the lock is:
> 
> ```
> class MemTracker : AllStatic {
> ...
>   // Stored baseline
>   static MemBaseline      _baseline;
> ...
> };
> ```
> 
> That the `NMTDCmd` code uses through `MemTracker::get_baseline()`. So if the JFR events keep its own baseline we should be fine without the lock. Or am I missing something here?

I think this should be safe. 

But I still think about thread stack probing. Your sampler will compete via ThreadCritical with thread creation and destruction (and a number of other unrelated things) because it needs to walk the virtual memory region chains. I think NMT reporting was never planned to be done in high volume. 

I experimented a lot with a large number of threads and NMT thread stack probing. Its results are sometimes surprising, e.g. I regularly get much higher numbers from NMT than for process RSS. E.g. NMT tells me we have 4g committed thread stack when RSS is just 630M. This is without swapping.

So I am not sure how reliable thread stack probing actually is in reality. I know that Zhengyu put a lot of work into it and it needed a number of iterations. Since all the rest together is just reading out counters and reporting them, thread stack probing is the only unknown time-wise. It is O(n) over both number of threads and thread stack size and will potentially lock. We can improve matters somewhat with smarter locking, but the O(n) part remains.

Maybe leave that stuff out and just report the normal counters? Or can we live with large jitters here? This is a JFR question and I'm not a JFR developer :)

-------------

PR: https://git.openjdk.org/jdk/pull/11449