New NMT Lightweight mode

Fri Dec 1 07:32:32 UTC 2023

Some more thoughts:

I wondered how your lightweight mode would track malloc sizes, and do so
more cheaply than summary mode.

You cannot use malloc headers since this would increase the footprint (by a
large amount in corner cases) and would be unacceptable for an always-on
"lightweight" mode.

If you plan on forcing every caller of free() to hand in the size, this
would be really ugly and cumbersome. In many cases it would not even work,
e.g. for users of Unsafe.freeMemory().
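
To make the trade-off concrete, here is a minimal sketch of the two options. All names (tracked_malloc, headerless_free, g_malloc_bytes, MallocHeader) are hypothetical and only for illustration; this is not the actual HotSpot os::malloc or NMT code, just an assumption about what either approach would roughly look like.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical global counter; stands in for a per-category summary counter.
static std::atomic<std::size_t> g_malloc_bytes{0};

// Option 1: remember the size in a header placed in front of the payload.
// The header is padded to max alignment so the payload stays aligned,
// which is exactly the per-allocation footprint cost discussed above.
struct alignas(std::max_align_t) MallocHeader {
  std::size_t size;
};

void* tracked_malloc(std::size_t size) {
  void* raw = std::malloc(sizeof(MallocHeader) + size);
  if (raw == nullptr) return nullptr;
  new (raw) MallocHeader{size};
  g_malloc_bytes.fetch_add(size, std::memory_order_relaxed);
  return static_cast<char*>(raw) + sizeof(MallocHeader);
}

void tracked_free(void* payload) {
  if (payload == nullptr) return;
  auto* header = reinterpret_cast<MallocHeader*>(
      static_cast<char*>(payload) - sizeof(MallocHeader));
  // The size comes from the header, so callers need not know it.
  g_malloc_bytes.fetch_sub(header->size, std::memory_order_relaxed);
  std::free(header);
}

// Option 2: no header, but every caller of free has to hand in the size.
void* headerless_malloc(std::size_t size) {
  void* p = std::malloc(size);
  if (p != nullptr) g_malloc_bytes.fetch_add(size, std::memory_order_relaxed);
  return p;
}

void headerless_free(void* p, std::size_t size) {
  if (p == nullptr) return;
  // A wrong size from the caller silently skews the counter; this assert
  // only catches the underflow case.
  assert(size <= g_malloc_bytes.load(std::memory_order_relaxed));
  g_malloc_bytes.fetch_sub(size, std::memory_order_relaxed);
  std::free(p);
}
```

In this sketch a 1-byte allocation through tracked_malloc pays sizeof(MallocHeader) extra bytes, while headerless_free only works if the caller still knows the size, which is the part that does not fit callers like Unsafe.freeMemory().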

But even if you make malloc tracking work, I suspect it will not be faster
than today's summary mode since you don't do anything different compared to
summary mode. You increase atomic counters, lock-free; summary mode does
the same. So you will pay the same cost as summary mode. But that cost
absolutely dominates (see
https://stackoverflow.com/questions/73126185/what-is-overhead-of-java-native-memory-tracking-in-summary-mode/73175766#73175766):
the examples that were costly in summary mode were malloc-heavy scenarios. That
makes sense since the number of malloc calls outweighs the number of mmap
calls by orders of magnitude.

Therefore my assumption is that in malloc-heavy scenarios, the performance
cost for lightweight is the same as for summary - a strong argument against
enabling lightweight by default. The only performance advantage we have is
for processing mmap calls, which are rare in comparison; and that
performance gain comes at the cost of wildly inaccurate mmap numbers.
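
To make the inaccuracy concrete, here is a minimal sketch of the reserve/commit/release sequence quoted further down in this thread. CounterOnly and RegionAware are hypothetical types, not the HotSpot VirtualMemorySummary or VirtualMemoryTracker. CounterOnly assumes, as the proposal is described in this thread, that a released range was fully committed. RegionAware is deliberately simplified: it assumes commits never overlap and lie fully inside one reserved range.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

constexpr std::uint64_t G = std::uint64_t(1) << 30;  // 1 GiB

// Counter-only accounting: it knows sizes, but not where committed memory lives.
struct CounterOnly {
  std::uint64_t reserved = 0;
  std::uint64_t committed = 0;

  void reserve(std::uint64_t size) { reserved += size; }
  void commit(std::uint64_t size) { committed += size; }
  // On release only the size of the released range is known, so the whole
  // range is treated as committed (clamped at zero to avoid underflow).
  void release(std::uint64_t size) {
    reserved -= size;
    committed -= (size < committed) ? size : committed;
  }
};

// Region-aware accounting: remembers base -> size for reserved and
// committed ranges, so release can subtract what was really committed.
struct RegionAware {
  std::map<std::uint64_t, std::uint64_t> reserved_ranges;
  std::map<std::uint64_t, std::uint64_t> committed_ranges;
  std::uint64_t reserved = 0;
  std::uint64_t committed = 0;

  void reserve(std::uint64_t base, std::uint64_t size) {
    reserved_ranges[base] = size;
    reserved += size;
  }
  void commit(std::uint64_t base, std::uint64_t size) {
    committed_ranges[base] = size;  // assumption: no overlapping commits
    committed += size;
  }
  void release(std::uint64_t base) {
    auto r = reserved_ranges.find(base);
    if (r == reserved_ranges.end()) return;
    const std::uint64_t end = base + r->second;
    // Drop only the committed ranges that start inside the released range.
    auto c = committed_ranges.lower_bound(base);
    while (c != committed_ranges.end() && c->first < end) {
      assert(committed >= c->second);
      committed -= c->second;
      c = committed_ranges.erase(c);
    }
    reserved -= r->second;
    reserved_ranges.erase(r);
  }
};
```

Feeding both with the quoted sequence (reserve 2G twice, commit 1G in each range, release the first range) leaves CounterOnly at committed=0, while RegionAware still reports the 1G that is committed in the second range.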

I also think rolling out a tool that reports numbers you cannot rely on is
not good. Wrong numbers are worse than no numbers.

Sorry for being so insistent about this - I really think the investment in
a new lightweight mode, with the complexity it would bring and the new
restrictions wrt API usage across the code base, would be better spent by
making the existing summary mode faster. That mode was meant to be the
lightweight mode.

Cheers, Thomas

On Thu, Nov 30, 2023 at 12:29 PM Thomas Stüfe <thomas.stuefe at gmail.com>
wrote:

> Hi Afshin,
>
> I don't understand what the point of a lightweight nmt is if you cannot
> trust its numbers. The numbers you report would be wrong, so why do it at
> all?
>
> range A = reserve 2G
> range B = reserve 2G
> commit 1G of range A
> commit 1G of range B
> release range A -> decrease by reserve size 2G? Your numbers now report
> "committed=0" whereas in reality, we still have 1G committed.
>
> If it's just about tracking memory allocation *requests*, that can be done
> a lot simpler, e.g. via logging.
>
> Cheers, Thomas
>
> On Thu, Nov 30, 2023 at 12:12 PM Afshin Zafari <afshin.zafari at oracle.com>
> wrote:
>
>> Dear Thomas and Johan,
>> Thanks for your explanations. Some comments:
>>
>>    - The Lightweight mode is a new mode and does not change the
>>    functionality of the summary or detail mode. Those modes can run as before.
>>    - The MEMFLAG argument is one of the missing pieces of data; if it is
>>    provided to the NMT API, it is possible to do some tracking.
>>    - In this state of the Lightweight design, the numbers that are
>>    gathered are the *requests* of virtual memory
>>    reserve/commit/uncommit/release operations. The actual amount of memory
>>    involved will be precise once the case of overlapping regions is handled
>>    properly.
>>    - With this version, end users can find which parts of the JVM have the
>>    most impact on total memory allocations.
>>    - Being lightweight enables NMT to run by default.
>>
>>
>> Cheers,
>> Afshin
>> ------------------------------
>> *From:* hotspot-runtime-dev <hotspot-runtime-dev-retn at openjdk.org> on
>> behalf of Thomas Stüfe <thomas.stuefe at gmail.com>
>> *Sent:* 30 November 2023 08:51
>> *To:* Johan Sjolen <johan.sjolen at oracle.com>; afshi.zafari at oracle.com <
>> afshi.zafari at oracle.com>
>> *Cc:* Thomas Stuefe <tstuefe at redhat.com>; hotspot-runtime-dev at openjdk.org
>> <hotspot-runtime-dev at openjdk.org>
>> *Subject:* Re: New NMT Lightweight mode
>>
>> Hi Afshin and Johan,
>>
>> thank you for Afshin's branch; now I have a clearer picture of what you
>> are getting at. See remarks below.
>>
>> *> The summary mode today still requires the VirtualMemoryTracker to be
>> active, but this is only necessary because of MEMFLAGS not being passed
>> when releasing memory, only when reserving/committing it.*
>>
>> I think that is a misunderstanding; the MEMFLAG issue is incidental. We
>> need the virtual memory tracker for correct accounting. You cannot just
>> count sizes, since mmaps don't happen in a vacuum.
>>
>> The most obvious example is release: you need to know the size and
>> position of committed regions in the released region to correctly decrease
>> commit numbers. As it is now, this proposal assumes on release that the
>> whole range was committed. A released region can of course contain any mix
>> of committed and uncommitted regions.
>>
>> The same goes for commit: E.g. in Metaspace, we uncommit and commit
>> ranges that may contain a mix of committed and uncommitted regions.
>>
>> I still don't see how this could work with just counters.
>>
>> A large concern of mine is that an emasculation of the Virtual memory
>> tracker would be a direct blocker against the possibility of extending NMT
>> coverage beyond the borders of hotspot, since mmap usage in other areas can
>> be even less curated. Extending NMT would make many people happy, and it
>> has been discussed repeatedly [1]. There is work happening in this
>> direction; if this proposal were to go forward, it would inhibit that
>> effort.
>>
>> As a side effect, we would also lose the ability to determine MEMFLAG by
>> pointer, so it disables recent work like
>> https://bugs.openjdk.org/browse/JDK-8318636.
>>
>> --
>>
>> If this is really necessary, there are many alternatives that could
>> improve performance while still tracking mmap regions.
>>
>> Data-structure-wise, we could track regions e.g. in a BST, or, as
>> proposed, a flat, cache-optimized array. I am happy to see that you already
>> work in that direction.
>>
>> We also could do delayed bulk accounting. On mmap, journal the tracking
>> info in a lock-free global list and process it in bulk later. This would
>> introduce "staleness" to NMT reports and can interfere with functionality
>> that expects tracking information inside NMT to be up-to-date. That is
>> solvable, but let's first see if the performance gain is worth the
>> complexity increase.
>>
>>
>> *>Are you sure this is even worth optimizing? I doubt a typical jvm run
>> involves more than a few dozen/hundred mmap calls. That is not exactly hot.
>> Are you sure? A ZGC enabled JVM can have a lot of NMT calls, and Metaspace
>> does a lot of small and contiguous allocations. We do see a performance
>> degradation with just summary mode.*
>>
>> Is the performance degradation during normal operation or during
>> artificial micros?
>>
>> Some time ago, I did performance measurements of NMT summary mode [2].
>> For normal operation, summary mode seemed not to matter. It only mattered
>> to a small degree in a malloc-heavy benchmark, but that's because of the
>> mallocs, not because of mmap. Therefore I am not convinced yet that
>> optimizing the mmap path is worth the cost for this proposal. E.g. I would
>> have thought that with ZGC, by the time we amass millions of regions, the
>> CPU is burned elsewhere.
>>
>> I am not against performance improvements. I just really dislike the idea
>> of limiting NMT's ability to track mmap, since I think that would go in
>> the wrong direction.
>>
>> Cheers, Thomas
>>
>> [1]
>> https://mail.openjdk.org/pipermail/core-libs-dev/2022-November/096197.html
>> [2]
>> https://stackoverflow.com/questions/73126185/what-is-overhead-of-java-native-memory-tracking-in-summary-mode/73175766#73175766
>>
>>
>>
>>
>> On Wed, Nov 29, 2023 at 11:24 PM Johan Sjolen <johan.sjolen at oracle.com>
>> wrote:
>>
>> Hi Thomas,
>>
>> >You are talking about rewriting the NMT backend tracking virtual memory
>> regions lock-free, right
>> Only VirtualMemorySummary, not VirtualMemoryTracker!
>>
>> >Are you sure this is even worth optimizing? I doubt a typical jvm run
>> involves more than a few dozen/hundred mmap calls. That is not exactly hot.
>> > Then, are you sure that swapping a lock synchronization with a bunch of
>> atomic variable updates brings that much performance gain? Atomic updates
>> are still expensive.
>> >Do we even have much contention? I would have thought that concurrently
>> happening mmap calls are very rare.
>> >Reducing contention can be done a lot simpler, first by swapping
>> ThreadCritical for a targeted mutex as we always wanted.
>>
>> Unfortunately, Afshin did not list his branch in the e-mail; here's a
>> link: https://github.com/afshin-zafari/jdk/commits/_nmt_light
>>
>> Let's make sure that we're not talking past each other: What Afshin is
>> proposing is replacing the summary mode tracking. We do have some
>> preliminary performance results showing that these are effective (see
>> second commit in branch).
>> The basic idea is this:
>> The summary mode today still requires the VirtualMemoryTracker to be
>> active, but this is only necessary because of MEMFLAGS not being passed
>> when releasing memory, only when reserving/committing it. If you add
>> MEMFLAGS as a parameter to release_memory etc., then summary mode is
>> independent of the VMT and the rest of Afshin's patch is a natural
>> progression from the consequences of that.
>>
>> >If that is not good enough, change the underlying data structure. A
>> linked list of regions is not that effective. There are data structures
>> that are better suited for this, e.g. interval trees.
>> >But a very simple and possibly very effective change could be to change
>> the structure from a list to an array. That's a lot more cache friendly. If
>> you optimise it right (e.g. similar to the "CachedNMTInformation" cache
>> here:
>> https://github.com/tstuefe/jdk/blob/24a9916aedecbde76f5c6d79c001133b750f431d/src/hotspot/share/nmt/memMapPrinter.cpp#L84 )
>> this can be orders of magnitude faster.
>>
>> Yes, I have a couple of branches for this work. You can find the most
>> current one in https://github.com/jdksjolen/jdk/tree/nmt-memory-view
>> I really like Afshin's patch because it will also help this work.
>> Especially, if release_memory etc pass along the MEMFLAGS then we can use
>> this information to skip scanning a lot of regions.
>>
>> So, I see this patch as differentiating further between summary and
>> detail mode and also helping to write a faster virtual memory tracker.
>>
>> >Are you sure this is even worth optimizing? I doubt a typical jvm run
>> involves more than a few dozen/hundred mmap calls. That is not exactly hot.
>> Are you sure? A ZGC enabled JVM can have a lot of NMT calls, and
>> Metaspace does a lot of small and contiguous allocations. We do see a
>> performance degradation with just summary mode.
>>
>> I hope that this brings some clarity to the issue.
>>
>> All the best,
>> Johan
>> ------------------------------
>> *From:* hotspot-runtime-dev <hotspot-runtime-dev-retn at openjdk.org> on
>> behalf of Thomas Stuefe <tstuefe at redhat.com>
>> *Sent:* Wednesday, November 29, 2023 19:20
>> *To:* Afshin Zafari <afshin.zafari at oracle.com>
>> *Cc:* hotspot-runtime-dev at openjdk.org <hotspot-runtime-dev at openjdk.org>
>> *Subject:* Re: New NMT Lightweight mode
>>
>> Hi Afshin,
>>
>> I did not see your mail, therefore I replied in JBS.
>>
>> Cheers, Thomas
>>
>> On Wed, Nov 29, 2023 at 6:09 PM Afshin Zafari <afshin.zafari at oracle.com>
>> wrote:
>>
>> Dear all,
>> The JBS RFE https://bugs.openjdk.org/browse/JDK-8320977 is created for
>> implementing a new Lightweight mode for NativeMemoryTracking (NMT) of JDK.
>> Since the changes for this RFE impact a large number of files and code in
>> JDK, it would be great if I get comments while prototyping it.
>>
>> Best,
>> Afshin
>>
>>