RFR: JDK-8293114: GC should trim the native heap [v10]
Thomas Stuefe
stuefe at openjdk.org
Wed Jul 5 17:28:22 UTC 2023
On Mon, 3 Jul 2023 10:25:49 GMT, Volker Simonis <simonis at openjdk.org> wrote:
>> Thomas Stuefe has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 34 commits:
>>
>> - wip
>> - Merge branch 'master' into JDK-8293114-GC-trim-native
>> - wip
>> - merge master
>> - wip
>> - wip
>> - rename GCTrimNative TrimNative
>> - rename NativeTrimmer
>> - rename
>> - src/hotspot/share/gc/shared/gcTrimNativeHeap.cpp
>> - ... and 24 more: https://git.openjdk.org/jdk/compare/99f5687e...5d41312e
>
> My main concern with this change is increased latency. You wrote "*..concurrent malloc/frees are usually not blocked while trimming if they are satisfied from the local arena..*". Not sure what "*usually*" means here and how many mallocs are satisfied from a local arena. But introducing pauses of up to a second seems significant for some applications.
>
> The other question is that I still don't understand whether glibc-malloc will ever call `malloc_trim()` automatically (and in that case introduce the latency anyway). The manpage says that `malloc_trim()` "*..is automatically called by free(3) in certain circumstances; see the discussion of `M_TOP_PAD` and `M_TRIM_THRESHOLD` in `mallopt(3)`..*" but you reported that you couldn't observe any cleanup effect when playing around with `M_TRIM_THRESHOLD`. In the end, calling `malloc_trim()` periodically might even help to decrease latency if it prevents rare but longer automatic invocations of `malloc_trim()` by glibc itself.
@simonis @robehn Thanks for thinking this through.
>My main concern with this change is increased latency. You wrote "..concurrent malloc/frees are usually not blocked while trimming if they are satisfied from the local arena..". Not sure what "usually" means here and how many mallocs are satisfied from a local arena. But introducing pauses of up to a second seems significant for some applications.
Looking at the sources (glibc 2.31), I see that `malloc_trim` iterates over all arenas and locks each one while trimming it. I also see this lock being taken when:
- creating or re-assigning arenas
- gathering statistics (`malloc_stats`, `mallinfo2`, `malloc_info`)
- calling `arena_get_retry()`, which AFAICS "steals" from a neighboring arena when the current arena is full
- on the free path, when adding chunks to the fast bin, but guarded by `__builtin_expect(.., 0)`, so probably a very rare path
- realloc'ing non-mmapped chunks.
So, `malloc_trim` will inconvenience concurrent reallocs and, more rarely, frees, as well as allocations that cause arena stealing or the creation of new arenas. I may have missed some cases, but it makes sense that glibc attempts to avoid locking as much as possible.
About the "up to a second": this was measured on my machine with ~32 GB of reclaimable memory. Having that much floating garbage in the C-heap will hopefully be rare.
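For context, here is a minimal, hypothetical micro-benchmark (mine, not part of the PR) for this kind of measurement: worker threads churn small allocations, which after warm-up are mostly served from their tcache, while the main thread times a single `malloc_trim(0)`. The thread count, block size and the `worker` helper are arbitrary choices:

```c
/* Hypothetical micro-benchmark, not part of the PR: time malloc_trim(0)
 * while worker threads keep allocating and freeing concurrently.
 * Linux/glibc only; compile with: gcc -O2 trim_bench.c -lpthread */
#include <malloc.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static atomic_int stop;

static void *worker(void *arg) {
    (void)arg;
    while (!atomic_load(&stop)) {
        /* Small block; after warm-up this is usually served from the
         * thread's tcache, so it should rarely contend with the trim. */
        void *p = malloc(2048);
        free(p);
    }
    return NULL;
}

int main(void) {
    enum { NTHREADS = 10 };
    pthread_t threads[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);

    /* In a real measurement one would first create floating garbage
     * (allocate and free a large volume) so the trim has work to do. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    malloc_trim(0);   /* walks all arenas, locking each one in turn */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3
              + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("malloc_trim(0) took %.3f ms\n", ms);

    atomic_store(&stop, 1);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```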
>The other question is that I still don't understand whether glibc-malloc will ever call malloc_trim() automatically (and in that case introduce the latency anyway). The manpage says that malloc_trim() "..is automatically called by free(3) in certain circumstances; see the discussion of M_TOP_PAD and M_TRIM_THRESHOLD in mallopt(3).." but you reported that you couldn't observe any cleanup effect when playing around with M_TRIM_THRESHOLD. In the end, calling malloc_trim() periodically might even help to decrease latency if it prevents rare but longer automatic invocations of malloc_trim() by glibc itself.
Looking at the sources, glibc trims on free as follows:
- The returned chunk may go into the thread-local cache (tcache) or into the fastbin. In both cases the chunk still counts as used, and nothing happens.
- Otherwise, the returned chunk gets merged with its immediate neighbors (once, not recursively). If the resulting chunk is larger than 64K, glibc trims, but only the arena that contains the chunk.
As you can see, trimming only happens sometimes. I experimented with allocating, then freeing, 64K blocks 30000 times (a reconstruction of the experiment is sketched after the list):
1) done from 10 threads, this leaves a remainder of 300-500 MB of unreclaimed RSS after everything is freed.
2) done from 1 thread, or with `MALLOC_ARENA_MAX=1`, most of the memory ends up being reclaimed.
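The exact test program isn't shown here, but a plausible reconstruction (the names `churn` and `print_rss` are made up, and the block and thread counts just mirror the numbers above) could look like this; note it peaks at roughly 1.9 GB of live memory per thread:

```c
/* Hypothetical reconstruction of the experiment: N threads each allocate
 * 30000 blocks of 64K, touch them, then free everything; afterwards we
 * print VmRSS. Compile with: gcc -O2 rss_test.c -lpthread
 * Caution: ~1.9 GB live per thread at peak, so ~19 GB with N = 10. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { N = 10, COUNT = 30000, SZ = 64 * 1024 };

static void *churn(void *arg) {
    (void)arg;
    void **blocks = malloc(COUNT * sizeof(void *));
    for (int i = 0; i < COUNT; i++) {
        blocks[i] = malloc(SZ);
        memset(blocks[i], 1, SZ);   /* touch pages so they count toward RSS */
    }
    for (int i = 0; i < COUNT; i++)
        free(blocks[i]);
    free(blocks);
    return NULL;
}

static void print_rss(void) {
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    while (f && fgets(line, sizeof line, f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            fputs(line, stdout);    /* resident set after all frees */
    if (f) fclose(f);
}

int main(void) {
    pthread_t t[N];
    for (int i = 0; i < N; i++) pthread_create(&t[i], NULL, churn, NULL);
    for (int i = 0; i < N; i++) pthread_join(t[i], NULL);
    print_rss();   /* with N = 10, expect hundreds of MB left unreclaimed */
    return 0;
}
```

Running the same program with `N = 1`, or with `MALLOC_ARENA_MAX=1` set in the environment, should show most of the RSS returned, matching 2) above.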
Unfortunately, most C-heap allocations are a lot finer-grained than 64K.
*Update*
I see we also lock on the malloc path if we cannot pull the chunk from the tcache...
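For illustration, here is a minimal standalone sketch of the basic idea behind the PR, a background thread that periodically calls `malloc_trim(0)`. The actual NativeTrimmer in the PR is of course more involved, and the interval below is arbitrary:

```c
/* Minimal standalone sketch of a periodic native-heap trimmer, outside the
 * JVM; the interval and structure are arbitrary, for illustration only.
 * Linux/glibc only; compile with: gcc -O2 trimmer.c -lpthread */
#include <malloc.h>
#include <pthread.h>
#include <unistd.h>

static void *trimmer(void *arg) {
    unsigned interval_s = *(unsigned *)arg;
    for (;;) {
        sleep(interval_s);
        malloc_trim(0);   /* hand free arena memory back to the OS */
    }
    return NULL;
}

int main(void) {
    static unsigned interval_s = 5;   /* arbitrary interval for the sketch */
    pthread_t t;
    pthread_create(&t, NULL, trimmer, &interval_s);
    pthread_detach(t);
    sleep(30);   /* application work would run here */
    return 0;
}
```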
-------------
PR Comment: https://git.openjdk.org/jdk/pull/10085#issuecomment-1619638641