RFR: JDK-8293114: JVM should trim the native heap [v7]

Mon Jul 10 11:12:23 UTC 2023

On Sat, 8 Jul 2023 08:01:32 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> This is a continuation of https://github.com/openjdk/jdk/pull/10085. I closed https://github.com/openjdk/jdk/pull/10085 because it had accumulated too much comment history and got confusing. For a history of this issue, see previous discussions [1] and the comment section of 10085.
>>  
>> ---------------
>> 
>> This RFE adds the option to trim the Glibc heap periodically. This can recover a significant memory footprint if the VM process suffers from high-but-rare malloc spikes. It does not matter who causes the spikes: the JDK or customer code running in the JVM process.
>> 
>> ### Background:
>> 
>> The Glibc is reluctant to return memory to the OS. Temporary malloc spikes often carry over as permanent RSS increase. Note that C-heap retention is difficult to observe. Since it is freed memory, it won't appear in NMT; it is just a part of RSS.
>> 
>> This is, effectively, caching - a performance tradeoff by the glibc. It makes a lot of sense with applications that cause high traffic on the C-heap. The JVM, however, clusters allocations and often rolls its own memory management based on virtual memory for many of its use cases.
>> 
>> To manually trim the C-heap, Glibc exposes `malloc_trim(3)`. With JDK 18 [2], we added a new jcmd command to *manually* trim the C-heap on Linux (`jcmd System.trim_native_heap`). We then observed customers running this command periodically to slim down process sizes of container-bound jvms. That is cumbersome, and the JVM can do this a lot better - among other things because it knows best when *not* to trim.
>> 
>> #### GLIBC internals
>> 
>> The following information I took from the glibc source code and experimenting.
>> 
>> ##### Why do we need to trim manually? Does the Glibc not trim on free?
>> 
>> Upon `free()`, glibc may return memory to the OS if:
>> - the returned block was mmap'ed
>> - the returned block was not added to tcache or to fastbins
>> - the returned block, possibly merged with its two immediate neighbors, had they been free, is larger than FASTBIN_CONSOLIDATION_THRESHOLD (64K) - in that case:
>>   a) for the main arena, glibc attempts to lower the brk()
>>   b) for mmap-ed heaps, glibc attempts to completely unmap or shrink the heap.
>> In both cases, (a) and (b), only the top portion of the heap is reclaimed. "Holes" in the middle of other in-use chunks are not reclaimed.
>> 
>> So: glibc *may* automatically reclaim memory. In normal configurations, with typical C-heap allocation granularity, it is unlikely.
>> 
>> To increase the ...
>
> Thomas Stuefe has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Add test with 1ms trim interval
>  - No need for atomics

This is a nice progress! Another read follow.

Bikeshedding: I think we also need to decide how we call this thing (and related symbols): "native heap trim" or "trim native heap". AFAICT, from there, the log tag name follows, the options name follow, the VM symbol names follow.

src/hotspot/share/logging/logTag.hpp line 199:

> 197:   LOG_TAG(tlab) \
> 198:   LOG_TAG(tracking) \
> 199:   LOG_TAG(trimnh) /* trim native heap */ \

`nh` is confusing. `trimnative`, `nativetrim`, or `nativeheaptrim`? I think there is a precedent for long multi-word tags with `valuebasedclasses` :)

src/hotspot/share/runtime/globals.hpp line 1990:

> 1988:           "If TrimNativeHeap is enabled: interval, in ms, at which "        \
> 1989:           "the to trim the native heap.")                                   \
> 1990:           range(1, UINT_MAX)                                                \

I have a suggestion that simplifies UX, I think. In other cases where we do these kinds of intervals, we just use `0` as the "off" value. See for example `AsyncDeflationInterval`, `GuaranteedSafepointInterval`. This would allow users to supply one option only. 

We would then need to decide if we turn this thing on by default (probably with large interval), or we default to `0` for "off". I would prefer to go for `0`. I understand that would force users to decide on proper trim interval when enabling, but I think that's a feature, not a bug. We would not have to go into discussions if the 60 second default is good enough or not.

Something like this:

  product(uintx, TrimNativeHeapInterval, 0, EXPERIMENTAL,                   \
          “Attempt to trim the native heap every so many milliseconds, “    \
          “if platform supports it. Lower values provide better footprint “ \
          “under native allocation spikes, while higher values come with “  \
          “less overhead. Use 0 to disable trimming.”                       \

src/hotspot/share/runtime/trimNativeHeap.cpp line 54:

> 52:   }
> 53: 
> 54:   unsigned inc_suspend_count() {

If `_suspend_count` is `uint16_t`, the methods that use it should also return `uint16_t`?

src/hotspot/share/runtime/trimNativeHeap.cpp line 79:

> 77:   // in seconds
> 78:   static double now() { return os::elapsedTime(); }
> 79:   static double to_ms(double seconds) { return seconds * 1000.0; }

Would you like to just do it in `int64_t` representing milliseconds? The common way to get it is `nanos_to_millis(os::elapsed_counter())`.

Turning this to integer would also obviate stuff like `MAX2(1.0, now - time)`.

src/hotspot/share/runtime/trimNativeHeap.cpp line 88:

> 86: 
> 87:   void run() override {
> 88:     LogStartStop logStartStop;

Suggestion:

    LogStartStopMark lssm;

src/hotspot/share/runtime/trimNativeHeap.cpp line 92:

> 90:     for (;;) {
> 91:       double tnow = now();
> 92:       const double interval_secs = (double)TrimNativeHeapInterval / 1000;

This division can be outside the loop.

src/hotspot/share/runtime/trimNativeHeap.cpp line 106:

> 104:             ml.wait(wait_ms);
> 105:           } else if (at_or_nearing_safepoint()) {
> 106:             ml.wait(safepoint_poll_ms);

OK, so here is a little problem. Suppose I want to run trims very often, like every 10ms. This loop would stall for 250ms when safepoint is detected, which throws off this guarantee. Can we instead go and sleep for `TrimNativeHeapInterval`? AFAICs, this plays nicely with heuristic guidance (short intervals -> more interference), and it would best-effort stall for twice the interval when safepoint interjects.

-------------

PR Review: https://git.openjdk.org/jdk/pull/14781#pullrequestreview-1521753659
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1258038321
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1258056866
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1258061662
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1258074902
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1258079941
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1258063727
PR Review Comment: https://git.openjdk.org/jdk/pull/14781#discussion_r1258086409