[External] : Re: New NMT Lightweight mode
Afshin Zafari
afshin.zafari at oracle.com
Fri Dec 1 07:55:52 UTC 2023
Thank you Thomas for your valuable comments and thoughts.
Since the design is still in progress and all the change is a prototype yet, all your comments will be considered meanwhile.
Best,
Afshin
________________________________
From: Thomas Stüfe <thomas.stuefe at gmail.com>
Sent: 01 December 2023 08:32
To: Afshin Zafari <afshin.zafari at oracle.com>; Johan Sjolen <johan.sjolen at oracle.com>
Cc: Thomas Stuefe <tstuefe at redhat.com>; hotspot-runtime-dev at openjdk.org <hotspot-runtime-dev at openjdk.org>
Subject: [External] : Re: New NMT Lightweight mode
Some more thoughts:
I wondered how your lightweight mode would track malloc sizes, and do so more cheaply than summary mode.
You cannot use malloc headers since this would increase the footprint (by a large amount in corner cases) and would be unacceptable for an always-on "lightweight" mode.
If you plan on forcing every caller of free() to hand in the size, this would be really ugly and cumbersome. In many cases it would not even work, e.g. for users of Unsafe.freeMemory().
But even if you make malloc tracking work, I suspect it will not be faster than today's summary mode since you don't do anything different compared to summary mode. You increase atomic counters, lockfree - summary mode does the same. So you will pay the same cost as summary mode. But that cost absolutely dominates - see https://stackoverflow.com/questions/73126185/what-is-overhead-of-java-native-memory-tracking-in-summary-mode/73175766#73175766<https://urldefense.com/v3/__https://stackoverflow.com/questions/73126185/what-is-overhead-of-java-native-memory-tracking-in-summary-mode/73175766*73175766__;Iw!!ACWV5N9M2RV99hQ!L6l6nbwTnoGFEoPUn0t2EGQcR_9y3xazBQY5dM87Ts2pjn8T8GciutahGgpa4-bfO4B_R18i4ausP18cVgt0NPnSGWA$>): the examples that were costly in summary were malloc-heavy scenarios. That makes sense since the number of malloc calls outweighs the number of mmap calls by orders of magnitude.
Therefore my assumption is that in malloc-heavy scenarios, the performance cost for lightweight is the same as for summary - a strong argument against enabling lightweight by default. The only performance advantage we have for processing mmap calls, which are rare in comparison; and that performance gain comes at the cost of wildly inaccurate mmap numbers.
I also think rolling out a tool that reports numbers you cannot rely on is not good. Wrong numbers are worse than no numbers.
Sorry for being so insistent about this - I really think the investment in a new lightweight mode, with the complexity it would bring and the new restrictions wrt API usage across the code base, would be better spent by making the existing summary mode faster. That mode was meant to be the lightweight mode.
Cheers, Thomas
On Thu, Nov 30, 2023 at 12:29 PM Thomas Stüfe <thomas.stuefe at gmail.com<mailto:thomas.stuefe at gmail.com>> wrote:
Hi Afshin,
I don't understand what the point of a lightweight nmt is if you cannot trust its numbers. The numbers you report would be wrong, so why do it at all?
range A = reserve 2G
range B = reserve 2G
commit 1G of range A
commit 1G of range B
release range A -> decrease by reserve size 2G? Your numbers now report "committed=0" whereas in reality, we still have 1G committed.
If its just about tracking memory allocation *requests*, that can be done a lot simpler, e.g. via logging.
Cheers, Thomas
On Thu, Nov 30, 2023 at 12:12 PM Afshin Zafari <afshin.zafari at oracle.com<mailto:afshin.zafari at oracle.com>> wrote:
Dear Thomas and Johan,
Thanks for your explanations. Some comments:
* The Lightweight mode is a new mode and does not change functionality of the summary or detail mode. Those modes can run as before.
* MEMFLAG argument is one of the missing data that if provided to NMT API, it is possible to do some tracking.
* In this state of the Lightweight design the numbers that are gathered are the requests of virtual memory reserve/commit/uncommit/release operations. The actual amount of the memory involved will be precise when case of overlapping regions is handled properly.
* With this version, end-users can find which parts of JVM has the most impact on total memory allocations.
* Becoming lightweight, enables NMT run as default.
Cheers,
Afshin
________________________________
From: hotspot-runtime-dev <hotspot-runtime-dev-retn at openjdk.org<mailto:hotspot-runtime-dev-retn at openjdk.org>> on behalf of Thomas Stüfe <thomas.stuefe at gmail.com<mailto:thomas.stuefe at gmail.com>>
Sent: 30 November 2023 08:51
To: Johan Sjolen <johan.sjolen at oracle.com<mailto:johan.sjolen at oracle.com>>; afshi.zafari at oracle.com<mailto:afshi.zafari at oracle.com> <afshi.zafari at oracle.com<mailto:afshi.zafari at oracle.com>>
Cc: Thomas Stuefe <tstuefe at redhat.com<mailto:tstuefe at redhat.com>>; hotspot-runtime-dev at openjdk.org<mailto:hotspot-runtime-dev at openjdk.org> <hotspot-runtime-dev at openjdk.org<mailto:hotspot-runtime-dev at openjdk.org>>
Subject: Re: New NMT Lightweight mode
Hi Afshin and Johan,
thank you for Afshins branch; now I have a clearer picture of what you are getting at. See remarks below.
> The summary mode today still requires the VirtualMemoryTracker to be active, but this is only necessary because of MEMFLAGS not being passed when releasing memory, only when reserving/committing it.
I think that is a misunderstanding, the MEMFLAG issue is incidental. We need the virtual memory tracker for correct accounting. You cannot just count sizes, since mmaps don't happen in a vacuum.
The most obvious example is release: you need to know the size and position of committed regions in the released region to correctly decrease commit numbers. As it is now, this proposal assumes on release that the whole range was committed. A released region can of course contain any mix of committed and uncommitted regions.
The same goes for commit: E.g. in Metaspace, we uncommit and commit ranges that may contain a mix of committed and uncommitted regions.
I still don't see how this could work with just counters.
A large concern of mine is that an emasculation of the Virtual memory tracker would be a direct blocker against the possibility of extending NMT coverage beyond the borders of hotspot, since mmap usage in other areas can be even less curated. Extending NMT would make many people happy, and it has been discussed repeatedly [1]. There is work happening in this direction; if this proposal were to go forward, it would inhibit that effort.
As a side effect, we would also lose the ability to determine MEMFLAG by pointer, so it disables recent work like https://bugs.openjdk.org/browse/JDK-8318636.
--
If this is really necessary, there are many alternatives that could improve performance while still tracking mmap regions.
Data-structure-wise, we could track regions e.g. in a BST, or, as proposed a flat, cache-optimized array. I am happy to see that you already work in that direction.
We also could do delayed bulk accounting. On mmap, journal the tracking info in a lock-free global list and process it in bulk later. This would introduce "staleness" to NMT reports and can interfere with functionality that expects tracking information inside NMT to be up-to-date. That is solveable, but let's first see if the performance gain is worth the complexity increase.
>Are you sure this is even worth optimizing? I doubt a typical jvm run involves more than a few dozen/hundred mmap calls. That is not exactly hot.
Are you sure? A ZGC enabled JVM can have a lot of NMT calls, and Metaspace does a lot of small and contiguous allocations. We do see a performance degradation with just summary mode.
Is the performance degradation during normal operation or during artificial micros?
Some time ago, I did performance measurements of NMT summary mode [2]. For normal operation, summary mode seemed not to matter. It only mattered to a small degree in a malloc-heavy benchmark, but that's because of the mallocs, not because of mmap. Therefore I am not convinced yet that optimizing the mmap path is worth the cost for this proposal. E.g. I would have thought that with ZGC, by the time we amass millions of regions, the CPU is burned elsewhere.
I am not against performance improvements. I just really dislike the idea of limiting NMTs ability to track mmap, since I think that would go into the wrong direction.
Cheers, Thomas
[1] https://mail.openjdk.org/pipermail/core-libs-dev/2022-November/096197.html
[2] https://stackoverflow.com/questions/73126185/what-is-overhead-of-java-native-memory-tracking-in-summary-mode/73175766#73175766<https://urldefense.com/v3/__https://stackoverflow.com/questions/73126185/what-is-overhead-of-java-native-memory-tracking-in-summary-mode/73175766*73175766__;Iw!!ACWV5N9M2RV99hQ!L6l6nbwTnoGFEoPUn0t2EGQcR_9y3xazBQY5dM87Ts2pjn8T8GciutahGgpa4-bfO4B_R18i4ausP18cVgt0NPnSGWA$>
On Wed, Nov 29, 2023 at 11:24 PM Johan Sjolen <johan.sjolen at oracle.com<mailto:johan.sjolen at oracle.com>> wrote:
Hi Thomas,
>You are talking about rewriting the NMT backend tracking virtual memory regions lock-free, right
Only VirtualMemorySummary, not VirtualMemoryTracker!
>Are you sure this is even worth optimizing? I doubt a typical jvm run involves more than a few dozen/hundred mmap calls. That is not exactly hot.
> Then, are you sure that swapping a lock synchronization with a bunch of atomic variable updates brings that much performance gain? Atomic updates are still expensive.
>Do we even have much contention? I would have thought that concurrently happening mmap calls are very rare.
>Reducing contention can be done a lot simpler, first by swapping ThreadCritical for a targeted mutex as we always wanted.
Unfortunately, Afshin did not list his branch in the e-mail, here's a link :https://github.com/afshin-zafari/jdk/commits/_nmt_light<https://urldefense.com/v3/__https://github.com/afshin-zafari/jdk/commits/_nmt_light__;!!ACWV5N9M2RV99hQ!L6l6nbwTnoGFEoPUn0t2EGQcR_9y3xazBQY5dM87Ts2pjn8T8GciutahGgpa4-bfO4B_R18i4ausP18cVgt0fXFD03Y$>
Let's make sure that we're not talking past each other: What Afshin is proposing is replacing the summary mode tracking. We do have some preliminary performance results showing that these are effective (see second commit in branch).
Basic idea is this:
The summary mode today still requires the VirtualMemoryTracker to be active, but this is only necessary because of MEMFLAGS not being passed when releasing memory, only when reserving/committing it. If you add MEMFLAGs as a parameter to release_memory etc, then summary mode is independent of the VMT and the rest of Afshin's patch are natural progressions from the consequences of that.
>If that is not good enough, change the underlying data structure. A linked list of regions is not that effective. There are data structures that are better suited for this, e.g. interval trees.
>But a very simple and possibly very effective change could be to change the structure from a list to an array. That's a lot more cache friendly. If you optimise it right (e.g. similar to the "CachedNMTInformation" cache here: https://github.com/tstuefe/jdk/blob/24a9916aedecbde76f5c6d79c001133b750f431d/src/hotspot/share/nmt/memMapPrinter.cpp#L84<https://urldefense.com/v3/__https://github.com/tstuefe/jdk/blob/24a9916aedecbde76f5c6d79c001133b750f431d/src/hotspot/share/nmt/memMapPrinter.cpp*L84__;Iw!!ACWV5N9M2RV99hQ!L6l6nbwTnoGFEoPUn0t2EGQcR_9y3xazBQY5dM87Ts2pjn8T8GciutahGgpa4-bfO4B_R18i4ausP18cVgt0qlUPbAY$> ) this can be orders of magnitude faster.
Yes, I have a couple of branches for this work. You can find the most current one in https://github.com/jdksjolen/jdk/tree/nmt-memory-view<https://urldefense.com/v3/__https://github.com/jdksjolen/jdk/tree/nmt-memory-view__;!!ACWV5N9M2RV99hQ!L6l6nbwTnoGFEoPUn0t2EGQcR_9y3xazBQY5dM87Ts2pjn8T8GciutahGgpa4-bfO4B_R18i4ausP18cVgt0S-akxng$>
I really like Afshin's patch because it will also help this work. Especially, if release_memory etc pass along the MEMFLAGS then we can use this information to skip scanning a lot of regions.
So, I see this patch as differentiating further between summary and detailed mode and also helping out writing a faster virtual memory tracker.
>Are you sure this is even worth optimizing? I doubt a typical jvm run involves more than a few dozen/hundred mmap calls. That is not exactly hot.
Are you sure? A ZGC enabled JVM can have a lot of NMT calls, and Metaspace does a lot of small and contiguous allocations. We do see a performance degradation with just summary mode.
I hope that this brings some clarity to the issue.
All the best,
Johan
________________________________
From: hotspot-runtime-dev <hotspot-runtime-dev-retn at openjdk.org<mailto:hotspot-runtime-dev-retn at openjdk.org>> on behalf of Thomas Stuefe <tstuefe at redhat.com<mailto:tstuefe at redhat.com>>
Sent: Wednesday, November 29, 2023 19:20
To: Afshin Zafari <afshin.zafari at oracle.com<mailto:afshin.zafari at oracle.com>>
Cc: hotspot-runtime-dev at openjdk.org<mailto:hotspot-runtime-dev at openjdk.org> <hotspot-runtime-dev at openjdk.org<mailto:hotspot-runtime-dev at openjdk.org>>
Subject: Re: New NMT Lightweight mode
Hi Ashin,
I did not see your mail, therefore I replied in JBS.
Cheers, Thomas
On Wed, Nov 29, 2023 at 6:09 PM Afshin Zafari <afshin.zafari at oracle.com<mailto:afshin.zafari at oracle.com>> wrote:
Dear all,
The JBS RFE https://bugs.openjdk.org/browse/JDK-8320977 is created for implementing a new Lightweight mode for NativeMemoryTracking (NMT) of JDK.
Since the changes for this RFE impact a large number of files and code in JDK, it would be great if I get comments while prototyping it.
Best,
Afshin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-runtime-dev/attachments/20231201/80594522/attachment-0001.htm>
More information about the hotspot-runtime-dev
mailing list