RFR: 8304442: Defer VirtualMemoryTracker work until reporting

Thomas Stuefe stuefe at openjdk.org
Sun Mar 19 07:55:18 UTC 2023


On Sat, 18 Mar 2023 14:50:48 GMT, Johan Sjölen <jsjolen at openjdk.org> wrote:

> Hi,
> 
> The virtual memory tracker of NMT used to do a lot of heavy linked list calculations when memory was reserved, committed, uncommited or split up. However, the results of these calculations are actually only used when creating a native memory report. Let's not slow down the JVM unnecessarily, we can do this work at time of report instead.
> 
> In order to achieve this I've replaced the public API with a work-queueing solution. We append each work item to a `GrowableArray` and introduce the `commit_events` method to do the actual work, which we call internally when needed.
> 
> I measured the gains in efficiency through the use of Valgrind's Cachegrind tool. I ran a `linux-x64` build with the following source code:
> 
> 
> public class Test {
>     public static void main(String[] args) {
>     }
> }
> 
> 
> These are the total cycles executed by `os::commit` and `os::reserve` as estimated by Valgrind over the entire run of the program. The tests were only run once.
> 
> 
> java -XX:NativeMemoryTracking=detail Test.java
> 
> os::commit_memory
> old         | new         | old / new
> 935238      | 578979      | 1.6
> os::reserve_memory
> old         | new         | old / new
> 53628       | 21825       | 2.4
> 
> java -XX:NativeMemoryTracking=summary Test.java
> 
> os::commit_memory
> old     | new   | old/new
> 1033701 | 59974 | 17.2
> 
> os::reserve_memory
> old   | new  | old/new
> 10067 | 2016 | 5
> 
> 
> 
> In summary mode we get the largest performance gains, as no `NativeCallStack` is captured.
> 
> There should also be some memory savings, as a `MemoryEvent` is smaller (64 bytes) than a `ReservedRegion` (96 bytes). That is, until a `commit_events()` occurs.

Hi Johan,

That is an interesting idea, but I am not convinced, which is a pity since the PR is well prepared and the call graphs are nice.

A problem and a question of usefulness:

The problem is that you really don't want to defer NMT accounting to report time. At the time of the report, you may be out of memory or out of time, or both. E.g., during error reporting as a result of a native OOM. In these situations, NMT reports are very useful, but they must be fast and should not rely on malloc. Malloc may not work anymore, or it may hang. I argue that reports are too expensive today and should be made snappier and cheaper.

Another context where NMT reports are called is as part of JFR value sampling. Again, in the JFR sampler thread, you don't want to spend much time processing deferred allocations. There are other examples, at least in SAP's downstream VM. I think getting NMT numbers should be quick and painless, and possible in any situation.

The usefulness question: Either mapping management in NMT is hot, or it isn't. If it isn't, there is no point in optimizing it. 

If it *is* hot, e.g., because you call os::commit a million times (?), a queue may not work as well as you think. You now accumulate an ever-growing footprint for the queue, so you need to dump the queue at some point. If you do, you lose the advantage of deferring. If you don't, you essentially have a memory leak, and reporting will take a lot longer.

-----

I keep thinking that the real problem here is that virtual memory management in NMT is not optimal. And that we should make it more efficient and hopefully simpler.

Side note, the list of mappings should never be unbearably large. Because it mirrors mappings on the OS side, and if we have millions of them, we have an address space fragmentation problem. But usually, the list is quite small. Low hundreds or thousands of entries.

Here are some ideas for simplification and for making NMT vm registration cheaper:

- The number of mappings is probably dominated by the number of threads. Because all other users of virtual memory (GC, metaspace, ...) typically don't fragment address space too much. But if we have 10000 threads, we have 10000 stacks + 10000 guard pages. I believe we currently keep stack information piggybacked in NMT as virtual memory? If so, this is double accounting. Because we already have a perfectly valid list of threads, and the threads know where their stacks are. Instead of walking the list of mappings to account for stacks, we could just walk the original list of threads to get the stack sizes. The NMT list of mappings would be smaller, making insertion and walking cheaper.

- We keep mappings in NMT in a linked list. That is good neither for walking nor for sorted insert: walking causes cache misses, and sorted insertion is O(n). I keep thinking the best data structure would be a binary tree of address points, but that may be too complex a rewrite. We could experiment with a flat array instead. A single insert would be more expensive - you need to shift the following entries out of the way - but both inserting and walking become much more cache friendly. Even more so if we condense the mapping information. I think 16 bytes per entry should be enough: an 8-byte start pointer, a 4-byte number of pages (which gives you a total range of 8TB per mapping), and 4 bytes for the meta info (is committed, is reserved, NMT flag, ...). So you would have a flat array of 16-byte entries, which may be faster for inserting and walking than today's linked list, even with the more complex insert.

These are just some ideas; I have not tried them, and they may not pan out as I think. The original author of NMT did a lot of investigation, and I may be overlooking something. I'm interested in other opinions. Sorry for not liking your deferral idea!

Cheers, Thomas

-------------

PR: https://git.openjdk.org/jdk/pull/13088
