RFR: JDK-8293114: GC should trim the native heap [v7]
Thomas Stuefe
stuefe at openjdk.org
Wed Feb 1 10:22:44 UTC 2023
> This RFE adds an option to auto-trim the Glibc heap as part of the GC cycle. If the VM process suffers high temporary malloc spikes (regardless of whether they come from JVM or user code), this can recover significant amounts of memory.
>
> We discussed this a year ago [1], but the item got pushed to the bottom of my work pile, therefore it took longer than I thought.
>
> ### Motivation
>
> The Glibc allocator is reluctant to return memory to the OS, much more so than other allocators. Temporary malloc spikes often carry over as permanent RSS increase.
>
> Note that C-heap retention is difficult to observe. Since it is freed memory, it does not show up in NMT; it is just part of private RSS.
>
> Theoretically, retained memory is not lost, since it will be reused by future mallocs. Retaining memory is therefore a bet on the future behavior of the app: the allocator bets on the application needing memory in the near future and satisfying that need via malloc.
>
> But an app's malloc load can fluctuate wildly, with temporary spikes and long idle periods. And if the app rolls its own custom allocators atop mmap, as hotspot does, a lot of that retained memory cannot be reused, even though it still counts toward the process footprint.
>
> To help, Glibc exports an API to trim the C-heap: `malloc_trim(3)`. With JDK 18 [2], SAP contributed a new jcmd command to *manually* trim the C-heap on Linux. This RFE adds a complementary way to trim automatically.
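>
> A minimal standalone sketch to illustrate the API (not part of the patch; assumes Linux with Glibc, and the spike and block sizes are arbitrary):
>
> ```c++
> #include <cstdio>
> #include <cstdlib>
> #include <cstring>
> #include <vector>
> #include <malloc.h>   // Glibc-specific: malloc_trim(3)
>
> int main() {
>   // Simulate a temporary malloc spike: many 32 KB blocks, small enough
>   // to be served from the main heap rather than via mmap.
>   std::vector<void*> blocks;
>   for (int i = 0; i < 32 * 1024; i++) {      // ~1 GB in total
>     void* p = malloc(32 * 1024);
>     if (p == nullptr) break;
>     memset(p, 1, 32 * 1024);                 // actually commit the pages
>     blocks.push_back(p);
>   }
>   for (void* p : blocks) {
>     free(p);
>   }
>   // Glibc tends to retain the freed memory; RSS typically stays high here.
>   // malloc_trim(0) asks Glibc to return as much of it as possible to the
>   // OS; it returns 1 if memory was actually released.
>   int released = malloc_trim(0);
>   printf("malloc_trim released memory: %s\n", released ? "yes" : "no");
>   return 0;
> }
> ```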
>
> #### Is this even a problem?
>
> Do we have high malloc spikes in the JVM process? We assume that the malloc load from hotspot is usually low, since hotspot typically clusters allocations into custom areas: metaspace, code heap, arenas.
>
> But arenas are subject to Glibc memory retention too. That surprised me, since I assumed 32k arena chunks were too big to be subject to Glibc retention. But I saw in experiments that high arena peaks often cause a lasting RSS increase.
>
> And of course, both hotspot and the JDK do a lot of fine-grained mallocs outside of custom allocators.
>
> But many of the cases of high Glibc memory retention I have seen were caused by third-party JNI code: libraries that malloc large temporary buffers. In fact, since we introduced the jcmd "System.trim_native_heap", some of our customers have started to call this command periodically in scripts to counter these issues.
>
> Therefore I think that while high malloc spikes are atypical for a JVM process, they can happen. Having a way to auto-trim the native heap makes sense.
>
> ### When should we trim?
>
> We want to trim when we know there is a lull in malloc activity coming. But we have no knowledge of the future.
>
> We could build a heuristic based on malloc frequency, but on closer inspection that is difficult. We cannot use NMT, since NMT does not have a complete picture (it only sees hotspot's allocations) and is usually disabled in production anyway. The only way to see *all* mallocs would be Glibc malloc hooks. We have used those in desperate cases at SAP, but Glibc removed malloc hooks in 2.35, and it would have been a messy solution anyway; best to avoid it.
>
> The next best thing is synchronizing with the larger C-heap users in the VM: compiler and GC. But the compiler turns out not to be such a problem, since the compiler uses arenas, and arena chunks are buffered in a free pool with a five-second delay. That means compiler activity that happens in bursts, like at VM startup, will just shuffle arena chunks from/to the arena free pool, never bothering to call malloc or free. A sketch of that pooling idea follows below.
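>
> For illustration, a minimal sketch of such a time-buffered chunk pool (hypothetical names and shape, not hotspot's actual code; it assumes a single fixed chunk size, while hotspot pools chunks per size class):
>
> ```c++
> #include <cstdlib>
> #include <ctime>
> #include <vector>
>
> // Freed chunks park in a free list; a periodic task really free()s only
> // those that have sat idle for longer than the delay.
> class ChunkPoolSketch {
>   struct Entry {
>     void*  chunk;
>     time_t freed_at;
>   };
>   std::vector<Entry> _free_list;
>   static constexpr time_t DELAY_SECS = 5;  // hotspot uses a similar delay
>
>  public:
>   void* get_chunk(size_t size) {
>     if (!_free_list.empty()) {             // reuse: no malloc call at all
>       void* c = _free_list.back().chunk;
>       _free_list.pop_back();
>       return c;
>     }
>     return std::malloc(size);              // pool empty: fall back to malloc
>   }
>
>   void return_chunk(void* chunk) {         // no free(3) call yet
>     _free_list.push_back({chunk, std::time(nullptr)});
>   }
>
>   void periodic_cleanup() {                // runs every few seconds
>     time_t now = std::time(nullptr);
>     for (size_t i = 0; i < _free_list.size(); ) {
>       if (now - _free_list[i].freed_at >= DELAY_SECS) {
>         std::free(_free_list[i].chunk);    // only now does Glibc see the free
>         _free_list[i] = _free_list.back();
>         _free_list.pop_back();
>       } else {
>         i++;
>       }
>     }
>   }
> };
> ```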
>
> That leaves the GC, which was also the experts' recommendation in last year's discussion [1]. Most GCs do uncommit, and trimming the native heap fits well into this. We also want to time the trim so it does not get in the way of a GC. Plus, integrating trims into the GC cycle lets us reuse GC logging and timing, making RSS changes caused by trim-native visible to the analyst.
>
>
> ### How it works:
>
> The patch adds two new options (experimental for now, and shared among all GCs):
>
>
> -XX:+GCTrimNativeHeap
> -XX:GCTrimNativeHeapInterval=<seconds> (defaults to 60)
>
>
> `GCTrimNativeHeap` is off by default. If enabled, it will cause the VM to trim the native heap on full GCs as well as periodically. The period is defined by `GCTrimNativeHeapInterval`. Periodic trimming can be completely switched off with `GCTrimNativeHeapInterval=0`; in that case, we will only trim on full GCs.
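>
> For example, to trim at most every 30 seconds (as experimental options, they need to be unlocked first):
>
> -XX:+UnlockExperimentalVMOptions -XX:+GCTrimNativeHeap -XX:GCTrimNativeHeapInterval=30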
>
> ### Examples:
>
> This is an artificial test that causes two high malloc spikes with long idle periods. Observe how RSS recovers with trim but stays up without trim. The trim interval was set to 15 seconds for the test, and no GC was invoked here; this is periodic trimming.
>
> ![alloc-test](http://cr.openjdk.java.net/~stuefe/other/autotrim/rss-all-collectors.png)
>
> (See here for parameters: [run script](http://cr.openjdk.java.net/~stuefe/other/autotrim/run-all.sh) )
>
> Spring Petclinic boots up, then idles. Once with and once without trim, with the trim interval at its 60-second default. Of course, if the application were actually doing something instead of idling, the trim effects would be smaller. But the point of trimming is to recover memory in idle periods.
>
> ![petclinic bootup](http://cr.openjdk.java.net/~stuefe/other/autotrim/spring-petclinic-rss-with-and-without-trim.png)
>
> (See here for parameters: [run script](http://cr.openjdk.java.net/~stuefe/other/autotrim/run-petclinic-boot.sh) )
>
>
>
> ### Implementation
>
> One problem I faced when implementing this was that trimming is not interruptible. GCs usually split uncommit work into smaller portions; that is impossible with `malloc_trim()`.
>
> So very slow trims could introduce longer GC pauses. I did not want that, so I implemented two ways to trim:
> 1) GCs can opt to trim asynchronously. In that case, a `NativeTrimmer` thread runs on behalf of the GC and takes care of all trimming. The GC just nudges the `NativeTrimmer` at the end of its GC cycle, but the trim itself runs concurrently.
> 2) GCs can do the trim inside their own thread, synchronously. It will have to wait until the trim is done.
>
> (1) has the advantage of giving us periodic trims even without GC activity (Shenandoah does this out of the box).
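>
> A condensed sketch of the pause/nudge protocol described above (hypothetical shape and names, not the patch's actual `NativeTrimmer` code):
>
> ```c++
> #include <chrono>
> #include <condition_variable>
> #include <mutex>
> #include <thread>
> #include <malloc.h>   // Glibc: malloc_trim(3)
>
> // A GC pauses the trimmer at the start of a cycle (so trims never overlap
> // GC work) and nudges it at the end of a full collection.
> class NativeTrimmerSketch {
>   std::mutex _lock;
>   std::condition_variable _cv;
>   bool _paused = false;
>   bool _nudged = false;
>   bool _stopped = false;
>   std::chrono::seconds _interval;
>   std::thread _thread;
>
>   void loop() {
>     std::unique_lock<std::mutex> ul(_lock);
>     while (!_stopped) {
>       // Sleep for one interval, but wake early when nudged or stopped.
>       _cv.wait_for(ul, _interval, [&] { return _nudged || _stopped; });
>       // Never trim while a GC holds us paused.
>       _cv.wait(ul, [&] { return !_paused || _stopped; });
>       if (_stopped) return;
>       _nudged = false;
>       ul.unlock();
>       ::malloc_trim(0);   // the trim itself runs concurrently to the GC
>       ul.lock();
>     }
>   }
>
>  public:
>   explicit NativeTrimmerSketch(int interval_secs)
>     : _interval(interval_secs), _thread(&NativeTrimmerSketch::loop, this) {}
>
>   // Called by the GC at the start of a cycle.
>   void pause() {
>     std::lock_guard<std::mutex> g(_lock);
>     _paused = true;
>   }
>   // Called by the GC when the cycle is over; the nudge triggers an
>   // immediate trim instead of waiting out the interval.
>   void resume_and_nudge() {
>     { std::lock_guard<std::mutex> g(_lock); _paused = false; _nudged = true; }
>     _cv.notify_all();
>   }
>   void stop() {
>     { std::lock_guard<std::mutex> g(_lock); _stopped = true; }
>     _cv.notify_all();
>     _thread.join();
>   }
> };
> ```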
>
> #### Serial
>
> Serial does the trimming synchronously as part of a full GC, and only then. I did not want to spawn a separate thread just for SerialGC. Serial is therefore the only GC that does not offer periodic trimming; it just trims on full GCs.
>
> #### Parallel, G1, Z
>
> All of them do the trimming asynchronously via `NativeTrimmer`. They schedule the native trim at the end of a full collection. They also pause the trimming at the beginning of a cycle to not trim during GCs.
>
> #### Shenandoah
>
> Shenandoah does the trimming synchronously in its service thread, similar to how it handles uncommits. Since the service thread already runs concurrently and continuously, it can do periodic trimming; there is no need to spin up a new thread. And this way we can reuse the Shenandoah timing classes.
>
> ### Patch details
>
> - adds three new functions to the `os` namespace:
>   - `os::trim_native_heap()`, implementing the trim
>   - `os::can_trim_native_heap()` and `os::should_trim_native_heap()`, returning whether the platform supports trimming and whether it considers trimming useful, respectively
> - replaces the implementation of the jcmd "System.trim_native_heap" with the new `os::trim_native_heap`
> - provides a new wrapper function, `os::Linux::get_mallinfo()`, hiding the tedious `mallinfo()` vs. `mallinfo2()` business (see the sketch after this list)
> - adds a GC-shared utility class, `GCTrimNative`, that takes care of trimming and GC-logging and houses the `NativeTrimmer` thread class.
> - adds a regression test
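>
> A rough sketch of what such a `mallinfo()` vs. `mallinfo2()` wrapper has to deal with (hypothetical compile-time variant, not the patch's code). Glibc 2.33 added `mallinfo2()` with `size_t` fields, while the older `mallinfo()` is deprecated and its `int` fields overflow beyond 2 GB:
>
> ```c++
> #include <malloc.h>
>
> #if defined(__GLIBC__) && defined(__GLIBC_PREREQ)
> #  if __GLIBC_PREREQ(2, 33)
> #    define HAVE_MALLINFO2 1
> #  endif
> #endif
>
> // Unified view of the allocator counters we care about.
> struct glibc_mallinfo_sketch {
>   size_t uordblks;  // total allocated space
>   size_t fordblks;  // total free space
>   size_t keepcost;  // releasable space at the top of the heap
> };
>
> static glibc_mallinfo_sketch get_mallinfo_sketch() {
>   glibc_mallinfo_sketch out;
> #ifdef HAVE_MALLINFO2
>   struct mallinfo2 mi = ::mallinfo2();
> #else
>   struct mallinfo mi = ::mallinfo();   // int fields; values may overflow
> #endif
>   out.uordblks = (size_t)mi.uordblks;
>   out.fordblks = (size_t)mi.fordblks;
>   out.keepcost = (size_t)mi.keepcost;
>   return out;
> }
> ```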
>
>
> ### Tests
>
> Tested with an older Glibc (2.31) and a newer Glibc (2.35) (`mallinfo()` vs. `mallinfo2()`), on Linux x64.
>
> The rest of the tests will be done by GHA and in our SAP nightlies.
>
>
> ### Remarks
>
> #### How about other allocators?
>
> I have seen this retention problem mainly with Glibc and the AIX libc. musl returns memory more eagerly to the OS. I also tested with jemalloc and found that it reclaims more aggressively too, so I don't think macOS or BSD are affected much by retention either.
>
> #### Trim costs?
>
> Trim-native is a tradeoff between memory and performance. We pay:
> - the cost of the trim itself, which depends on how much is trimmed; on my machine, times range from under 1ms for no-op trims to ~800ms for 32GB trims.
> - the cost of re-acquiring the memory, should it be needed again.
>
> #### Predicting malloc_trim effects?
>
> `ShenandoahUncommit` avoids uncommits if they are not necessary, thereby avoiding work and GC log spamming. I liked that and tried to follow that example: I tried to devise a way to predict the effect a trim would have, based on allocator info from mallinfo(3). That was quite frustrating, since the documentation is confusing and I had to do a lot of experimenting. In the end, I came up with a heuristic to prevent obviously pointless trim attempts; see `os::should_trim_native_heap()`. I am not completely happy with it.
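>
> For illustration, the kind of check this boils down to (my own simplification, reusing the mallinfo sketch from the patch-details section above; not the patch's exact logic):
>
> ```c++
> // Skip the trim if the trimmable top-of-heap portion is below some
> // threshold (the threshold value here is arbitrary).
> static bool should_trim_sketch(size_t threshold = 1 * 1024 * 1024) {
>   glibc_mallinfo_sketch mi = get_mallinfo_sketch();
>   // keepcost only covers the releasable top of the heap; free space in
>   // the middle of the heap (part of fordblks) may be unreleasable, which
>   // is what makes a reliable prediction so hard.
>   return mi.keepcost >= threshold;
> }
> ```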
>
> #### glibc.malloc.trim_threshold?
>
> Glibc has a tunable that looks like it could influence Glibc's willingness to return memory to the OS: the "trim_threshold". In practice, I could not get it to do anything useful; regardless of the setting, it never seemed to influence the trimming behavior. Even if it worked, I'm not sure we'd want to use it, since by calling malloc_trim manually we can space out the trims as we see fit, instead of paying the trim price on free(3).
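>
> For reference, the tunable can be set via `GLIBC_TUNABLES` in the environment or programmatically via mallopt(3); shown below for completeness, though as said it never behaved as advertised in my tests:
>
> ```c++
> #include <malloc.h>
>
> int main() {
>   // Roughly equivalent to GLIBC_TUNABLES=glibc.malloc.trim_threshold=131072.
>   // M_TRIM_THRESHOLD is the amount of releasable top-of-heap memory above
>   // which Glibc is supposed to trim automatically on free(3).
>   ::mallopt(M_TRIM_THRESHOLD, 128 * 1024);
>   return 0;
> }
> ```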
>
>
> - [1] https://mail.openjdk.org/pipermail/hotspot-dev/2021-August/054323.html
> - [2] https://bugs.openjdk.org/browse/JDK-8269345
Thomas Stuefe has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits:
- wip
- wip
- wip
- Merge branch 'master' into JDK-8293114-GC-trim-native
- wip
- Merge branch 'master' into JDK-8293114-GC-trim-native
- revamp wip
- Merge branch 'master' into JDK-8293114-GC-trim-native
- make tests for ppc more lenient
- Merge
- ... and 13 more: https://git.openjdk.org/jdk/compare/af564e46...599573d9
-------------
Changes: https://git.openjdk.org/jdk/pull/10085/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10085&range=06
Stats: 1069 lines in 23 files changed: 1065 ins; 1 del; 3 mod
Patch: https://git.openjdk.org/jdk/pull/10085.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/10085/head:pull/10085
PR: https://git.openjdk.org/jdk/pull/10085