RFR: JDK-8293114: JVM should trim the native heap [v8]

Aleksey Shipilev shade at openjdk.org
Wed Jul 12 08:44:27 UTC 2023


On Mon, 10 Jul 2023 13:53:36 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> This is a continuation of https://github.com/openjdk/jdk/pull/10085. I closed https://github.com/openjdk/jdk/pull/10085 because it had accumulated too much comment history and got confusing. For a history of this issue, see previous discussions [1] and the comment section of 10085.
>>  
>> ---------------
>> 
>> This RFE adds the option to trim the Glibc heap periodically. This can recover a significant memory footprint if the VM process suffers from high-but-rare malloc spikes. It does not matter who causes the spikes: the JDK or customer code running in the JVM process.
>> 
>> ### Background:
>> 
>> The Glibc is reluctant to return memory to the OS. Temporary malloc spikes often carry over as permanent RSS increase. Note that C-heap retention is difficult to observe. Since it is freed memory, it won't appear in NMT; it is just a part of RSS.
>> 
>> This is, effectively, caching - a performance tradeoff by the glibc. It makes a lot of sense with applications that cause high traffic on the C-heap. The JVM, however, clusters allocations and often rolls its own memory management based on virtual memory for many of its use cases.
>> 
>> To manually trim the C-heap, Glibc exposes `malloc_trim(3)`. With JDK 18 [2], we added a new jcmd command to *manually* trim the C-heap on Linux (`jcmd System.trim_native_heap`). We then observed customers running this command periodically to slim down process sizes of container-bound jvms. That is cumbersome, and the JVM can do this a lot better - among other things because it knows best when *not* to trim.
>> 
>> #### GLIBC internals
>> 
>> The following information I took from the glibc source code and experimenting.
>> 
>> ##### Why do we need to trim manually? Does the Glibc not trim on free?
>> 
>> Upon `free()`, glibc may return memory to the OS if:
>> - the returned block was mmap'ed
>> - the returned block was not added to tcache or to fastbins
>> - the returned block, possibly merged with its two immediate neighbors, had they been free, is larger than FASTBIN_CONSOLIDATION_THRESHOLD (64K) - in that case:
>>   a) for the main arena, glibc attempts to lower the brk()
>>   b) for mmap-ed heaps, glibc attempts to completely unmap or shrink the heap.
>> In both cases, (a) and (b), only the top portion of the heap is reclaimed. "Holes" in the middle of other in-use chunks are not reclaimed.
>> 
>> So: glibc *may* automatically reclaim memory. In normal configurations, with typical C-heap allocation granularity, it is unlikely.
>> 
>> To increase the ...
>
> Thomas Stuefe has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 32 additional commits since the last revision:
> 
>  - Make test spikes more pronounced
>  - Dont query procfs if logging is off
>  - rename logtag again
>  - When probing for safepoint end, use the smaller of (interval, 250ms)
>  - Remove TrimNativeHeap and expand TrimNativeHeapInterval
>  - Improve comments for non-supportive platforms
>  - Aleksey cosmetics
>  - suspend count return 16 bits
>  - Fix linker errors
>  - Merge branch 'master' into JDK-8293114-JVM-should-trim-the-native-heap
>  - ... and 22 more: https://git.openjdk.org/jdk/compare/417c1c87...15566761

Hopefully a final read. I think there are minor things left, see comments. Alternatively, apply this patch over your current PR, which contains fixes for my comments, and then some polishing: [trimnative-shipilev-1.patch](https://github.com/openjdk/jdk/files/12025768/trimnative-shipilev-1.patch)

...also, Windows builds are failing.

src/hotspot/share/runtime/trimNativeHeap.cpp line 47:

> 45: 
> 46:   // Statistics
> 47:   unsigned _num_trims_performed;

Sorry for the nit, but this is `uint16_t` too then, for consistency?

src/hotspot/share/runtime/trimNativeHeap.cpp line 75:

> 73:         SafepointSynchronize::is_synchronizing();
> 74:   }
> 75:   static constexpr int64_t safepoint_poll_ms = 250;

Let's document this a little:



// Upper limit for the backoff during pending/in-progress safepoint.
// Chosen as reasonable value to balance the overheads of waking up
// during the safepoint, which might have undesired effects on latencies,
// and the accuracy in tracking the trimming interval.
static constexpr int64_t safepoint_poll_ms = 250;

src/hotspot/share/runtime/trimNativeHeap.cpp line 90:

> 88:     assert(NativeHeapTrimmer::enabled(), "Only call if enabled");
> 89: 
> 90:     LogStartStopMark logStartStop;

Hotspot style: no camel case for local identifiers.

src/hotspot/share/runtime/trimNativeHeap.cpp line 110:

> 108:           } else if (at_or_nearing_safepoint()) {
> 109:             const int64_t wait_ms = MIN2((int64_t)TrimNativeHeapInterval, safepoint_poll_ms);
> 110:             ml.wait(safepoint_poll_ms);

`MIN2<int64_t>(..., ...)` might work better without a cast?
Also, let's actually use `wait_ms` here :)

src/hotspot/share/runtime/trimNativeHeap.cpp line 117:

> 115:           tnow = now();
> 116: 
> 117:         } while (at_or_nearing_safepoint() || is_suspended() || next_trim_time > tnow);

Maybe invert this to pre-condition while?

src/hotspot/share/runtime/trimNativeHeap.cpp line 120:

> 118:       } // Lock scope
> 119: 
> 120:       // 2 - Trim outside of lock protection.

There is no `1 -` to match this `2 -` to.

src/hotspot/share/runtime/trimNativeHeap.hpp line 46:

> 44:   static void cleanup();
> 45: 
> 46:   static uint64_t num_trims_performed();

Need this for anything? Does not seem to be implemented.

test/hotspot/jtreg/runtime/os/TestTrimNative.java line 34:

> 32:  * @build jdk.test.whitebox.WhiteBox
> 33:  * @run driver jdk.test.lib.helpers.ClassFileInstaller jdk.test.whitebox.WhiteBox
> 34:  * @run main/othervm -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI TestTrimNative trimNative

I see that we always spawn a VM with Whitebox enabled explicitly there. Do you need to enable Whitebox for these? Also, can these be just `@run driver`?

-------------

PR Review: https://git.openjdk.org/jdk/pull/14781#pullrequestreview-1522725836
PR Comment: https://git.openjdk.org/jdk/pull/14781#issuecomment-1632093967
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1258705275
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1258833848
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1258835428
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1260797898
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1260809509
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1260799630
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1258840799
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1260764210


More information about the serviceability-dev mailing list