Proposal: On Linux, add option to periodically trim the glibc heap
Thomas Stüfe
thomas.stuefe at gmail.com
Fri Aug 27 09:05:20 UTC 2021
Hi all,
I would like to gauge opinions on the following idea.
The glibc can be very reluctant to return released C-heap memory to the OS.
Memory returned to it via free(3) is typically cached within the glibc and
still counts toward the process RSS. It may be reused for
future allocations, but for now it is idling around. This manifests in the
process RSS not recovering from C-heap usage spikes, e.g. after intense
compilations. This effect is most pronounced with many small allocations
that have actually been touched; large allocations are mapped and unmapped
correctly on free(3).
I do not know of any other libc that has this problem to this extent. The AIX
libc is somewhat bad too - in our proprietary VM we use manual mmap as
backing for Arenas for that reason - but glibc is worse.
However, the glibc offers a very simple API to shake the memory loose and
return it to the OS: `malloc_trim(3)` [1].
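To make the effect concrete, here is a minimal standalone C++ sketch
(glibc-specific, my own illustration rather than VM code; the allocation
counts are arbitrary). It frees a large number of small, touched blocks - the
pattern glibc likes to retain - and then asks glibc to hand the cached memory
back via malloc_trim(0), which returns 1 if memory was actually released:
```
// Standalone illustration (glibc-specific): free()d small blocks stay cached
// inside glibc and keep RSS high until malloc_trim() releases them to the OS.
#include <malloc.h>   // malloc_trim (glibc extension)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <vector>

int main() {
  std::vector<void*> blocks;
  blocks.reserve(1000000);

  // Many small, touched allocations - the pattern described above.
  for (int i = 0; i < 1000000; i++) {
    void* p = malloc(64);
    memset(p, 1, 64);                 // touch the memory so it counts toward RSS
    blocks.push_back(p);
  }
  for (void* p : blocks) {
    free(p);                          // cached inside glibc; RSS stays high
  }

  // Ask glibc to return as much cached memory as possible to the OS.
  int released = malloc_trim(0);      // returns 1 if memory was released
  printf("malloc_trim released memory: %s\n", released ? "yes" : "no");
  return 0;
}
```
Watching the process RSS before and after the malloc_trim(0) call should show
the same kind of drop that the jcmd command below reports.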
With JDK-8269345 [2], SAP contributed a new jcmd command to *manually* trim
the C-heap on Linux: `jcmd System.trim_native_heap` will force the glibc to
release excess memory and print the resulting RSS shrinkage. For
example, executing it against my running Eclipse CDT reduces RSS by more than 400 MB:
```
thomas@starfish$ jcmd eclipse System.trim_native_heap
...
RSS before: 2229376k, after: 1784572k, (-444804k)
```
I see similar reductions in other apps. E.g., with the Spring Boot
petclinic, right after startup:
```
thomas@starfish$ jcmd petclinic System.trim_native_heap
...
RSS before: 649424k, after: 471052k, (-178372k)
```
JDK-8269345 explicitly avoided any form of automatic trimming. The intent
of the command was to analyze high-memory situations - especially to find
out what part of RSS is glibc-cached memory.
-----
As a second step, I'd like to introduce some sort of automatic glibc
trimming. Implementing this - automatically calling `malloc_trim(3)`
periodically, or tied to some VM event - is technically easy. The hard
part is deciding whether it makes sense.
When we trim, we release cached memory to the OS. Trimming itself costs a
bit, but the highest cost is re-acquiring that memory should we need it
again, which is especially expensive if we are under memory pressure. But
that is also the scenario where releasing excess memory makes the most
sense: memory cached within the glibc can only be used by this VM process,
only for subsequent malloc() calls, and only if those calls fit the sizes
of the freed blocks.
Releasing it means it can be repurposed for some other consumer, which may
include the VM itself.
As with any cache, we would have to weigh the cost of re-acquiring the
memory against the benefit of making it available to other parts of the
system that need it more urgently. This is difficult, since we would need
to know how much malloc activity the VM process will see in the near
future.
There are some shaky heuristics. For example, the JIT is one of the
heaviest C-heap consumers. Most of the compilations happen at startup, so
for some applications, we have a C-heap usage spike at the start, and not
much after that. So after the bulk of compilations has been done, it would
be a good time to call malloc_trim.
But relying on VM-internal information is fragile considering that non-VM
code may also call malloc within our process. Non-VM code may include our
JDK native code, third-party JNI code, and system libraries. All
these may cause rare spikes in C-heap usage, or spike frequently, or have a
constant churn of malloc/free calls. We have no idea.
Therefore, I propose a simple solution: let's add an
optional parameter, off by default, `-XX:+TrimNativeHeap`, coupled with a
second parameter `-XX:TrimNativeHeapPeriod=<seconds>`. When enabled, we
would periodically trim the glibc heap. We would choose the default period
to be reasonably long, e.g. 10 seconds.
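For illustration only, here is a minimal standalone sketch of what such
periodic trimming boils down to. It uses a plain std::thread instead of the
VM's own periodic-task machinery, and the two globals merely stand in for
the proposed flags; none of this is actual HotSpot code:
```
// Standalone sketch of the proposed periodic trimming. The globals stand in
// for -XX:+TrimNativeHeap and -XX:TrimNativeHeapPeriod; not actual VM code.
#include <malloc.h>   // malloc_trim (glibc extension)
#include <stdio.h>
#include <atomic>
#include <chrono>
#include <thread>

static std::atomic<bool> trim_native_heap{true};  // -XX:+TrimNativeHeap
static int trim_period_seconds = 10;              // proposed default period

static void trim_loop() {
  while (trim_native_heap.load()) {
    std::this_thread::sleep_for(std::chrono::seconds(trim_period_seconds));
    // Return glibc-cached memory to the OS; trimming itself has some cost,
    // the larger cost is re-acquiring the memory if it is needed again.
    if (malloc_trim(0)) {
      printf("Trimmed native heap\n");
    }
  }
}

int main() {
  std::thread trimmer(trim_loop);
  // ... application work that mallocs and frees would go here ...
  std::this_thread::sleep_for(std::chrono::seconds(25));
  trim_native_heap.store(false);   // corresponds to switching the flag off
  trimmer.join();                  // may wait up to one period
  return 0;
}
```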
This approach hands the instrument to the user but also leaves the
responsibility with them. If the user observes RSS bouncing back after
each trim, forming a steep sawtooth pattern, it means the process really
needs that memory and is re-acquiring it each time. In that case, releasing
it probably hurts more than it helps. OTOH, if RSS stays flat after the
spike, at least for a number of periods, releasing the memory did make
sense.
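As a side note on how one could watch for that pattern: RSS can be sampled
from /proc/<pid>/status between trims. A small Linux-only sketch (my own
illustration, not part of the proposal) reading VmRSS from within the
process:
```
// Standalone Linux-only sketch: sample the process RSS (VmRSS) from
// /proc/self/status, e.g. to see whether RSS bounces back after each trim.
#include <stdio.h>
#include <stdlib.h>
#include <fstream>
#include <string>

// Returns VmRSS in kB, or -1 if the field could not be read.
static long current_rss_kb() {
  std::ifstream status("/proc/self/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.compare(0, 6, "VmRSS:") == 0) {
      return strtol(line.c_str() + 6, nullptr, 10);   // skips the whitespace
    }
  }
  return -1;
}

int main() {
  printf("RSS: %ldk\n", current_rss_kb());
  return 0;
}
```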
----
We could add VM-internal knowledge to this logic, but I am not sure that
complexity is warranted, since we can never reliably guess the future
allocation behavior of either VM or non-VM code:
- trimming when (big) arenas are released or when the arena pool is cleaned
out
- trimming periodically, but skipping the trim if more than a certain number
of os::malloc calls happened in the recent past (see the sketch below)
- trimming if the delta between current and peak allocation exceeds some
threshold X (this would require NMT to be always on)
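To make the second idea concrete, here is a rough standalone sketch;
counted_malloc and the threshold are hypothetical stand-ins for whatever
bookkeeping os::malloc could provide:
```
// Sketch of "trim periodically, but skip the trim after a busy interval".
// counted_malloc is a hypothetical wrapper, not actual VM code.
#include <malloc.h>   // malloc_trim (glibc extension)
#include <stdlib.h>
#include <atomic>

static std::atomic<size_t> g_malloc_count{0};

// Hypothetical allocation wrapper that keeps the counter up to date.
static void* counted_malloc(size_t size) {
  g_malloc_count.fetch_add(1, std::memory_order_relaxed);
  return malloc(size);
}

// Called once per trim period (from a single thread); trims only if the
// interval since the last call was quiet enough.
static void maybe_trim(size_t activity_threshold) {
  static size_t last_count = 0;
  const size_t now = g_malloc_count.load(std::memory_order_relaxed);
  const size_t recent_mallocs = now - last_count;
  last_count = now;
  if (recent_mallocs < activity_threshold) {
    malloc_trim(0);   // quiet interval: release glibc-cached memory to the OS
  }
}

int main() {
  for (int i = 0; i < 100000; i++) {  // simulate a busy interval
    free(counted_malloc(64));
  }
  maybe_trim(1000);   // 100000 recent mallocs: skip the trim
  maybe_trim(1000);   // nothing allocated since: quiet interval, trim now
  return 0;
}
```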
What do you think?
Thanks, Thomas
-----------
[1] https://man7.org/linux/man-pages/man3/malloc_trim.3.html
[2] https://bugs.openjdk.java.net/browse/JDK-8269345