Proposal: Extend Native Memory Tracking across the whole process via interposition

Nathan Reynolds numeralnathan at gmail.com
Tue Dec 5 17:35:45 UTC 2023


Fragmentation is a problem that all memory allocators have to deal with.
Each allocator deals with fragmentation in a different way. Java's heap
(not to be confused with the native memory allocator) relocates objects to
deal with fragmentation.

I am not surprised that changing allocators will fix memory growth
problems.  A different allocator will be able to handle the allocation
pattern differently and prevent fragmentation.  I've had to replace the
default C allocator many times due to fragmentation and lock contention.

On Tue, Dec 5, 2023 at 9:23 AM Brice Dutheil <brice.dutheil at gmail.com>
wrote:

> > let us not derail this discussion.
>
> Last comment from me on the topic.
>
> I have seen this on a workload at my previous employer using ~4 GiB: I was
> able to reduce native memory from ~1200 MiB to ~400 MiB; the excess was
> likely due to arenas and fragmentation. And the worst part is that native
> memory was increasing at a very slow but steady pace; I don't think it was
> a leak, but I cannot guarantee that either. That said, changing the native
> allocator removed this bad behavior.
>
> My current job is no longer about production, so I don't follow
> everything, but I've seen that colleagues have similar issues, and when they
> tried another allocator their problem was gone.
>
> I have not dug into what other language runtimes experience, but I
> regularly saw advice to change the default allocator.
>
> -- Brice
>
>
> On Tue, Dec 5, 2023 at 4:12 PM Thomas Stüfe <thomas.stuefe at gmail.com>
> wrote:
>
>>
>>
>> On Tue, Dec 5, 2023 at 3:36 PM Brice Dutheil <brice.dutheil at gmail.com>
>> wrote:
>>
>>> > If it is just about using a standard replacement like jemalloc.
>>>
>>> In my experience, and I believe this is what Johan was asking about as
>>> well, it is indeed that.
>>>
>>> Deployments of workloads that need this usually rely on "installing" an
>>> allocator library that is configured via `LD_PRELOAD`. This usually gives
>>> the option to change the allocator depending on multiple criteria: the
>>> workload itself, the CPU architecture. Sometimes jemalloc is better,
>>> sometimes tcmalloc is better (I have not tried mimalloc), so the
>>> flexibility to tweak that is important.
>>> _All are better than glibc's malloc (arena "recycling" is quite bad in
>>> containerized envs and with multiple threads, leading to many dirty pages
>>> and higher RSS)._
>>>
>>>
>> I always wondered how much of that is urban legend. I measured myself a
>> while ago (maybe I can dig up the results somewhere), and IIRC, I could
>> produce artificial scenarios with way more overhead for the glibc case, but
>> in the practical cases, it seemed not to matter. I even saw cases where
>> glibc was better.
>>
>> In any case, let us not derail this discussion. If jemalloc compatibility
>> is required, I don't think it would be a show-stopper.
>>
>>
>>
>>> So that's why I was envisioning a "standard" use of the preload ability
>>> of the linker, e.g.
>>> `LD_PRELOAD=path/to/jdk/lib/libjnmt.so:/path/to/tcmalloc.so`.
>>> ...assuming it can work.
>>>
>>>
>>>
>>> -- Brice
>>>
>>>
>>> On Tue, Dec 5, 2023 at 1:50 PM Thomas Stuefe <tstuefe at redhat.com> wrote:
>>>
>>>> Hi Brice,
>>>>
>>>> On Tue, Dec 5, 2023 at 12:49 AM Brice Dutheil <brice.dutheil at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Johan,
>>>>>
>>>>> Thomas will correct me, as he proposed the idea and is much more
>>>>> experienced; I'm a mere reader of this ML.
>>>>>
>>>>> So, I have not toyed with the code, but I believe this should work, at
>>>>> least on Linux, if the linker has no restrictions.
>>>>>
>>>>> Typically, interception happens because a function with the right
>>>>> signature is preloaded (via `LD_PRELOAD`), so the linker finds it
>>>>> first. The magic works because the interposed function can still do the
>>>>> real work by looking up the next definition with `dlsym(RTLD_NEXT, name)`
>>>>> and invoking it. That should resolve to the next library in the
>>>>> `LD_PRELOAD` list, or to the system library, as the linker should process
>>>>> `LD_PRELOAD` from left to right.
>>>>>
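>>>>> ```
>>>>> #define _GNU_SOURCE  // for RTLD_NEXT
>>>>> #include <dlfcn.h>
>>>>> #include <stddef.h>
>>>>>
>>>>> void *malloc(size_t size) {
>>>>>   // resolve the next "malloc" in lookup order (the real allocator)
>>>>>   void *(*p_malloc)(size_t) = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
>>>>>
>>>>>   // report back mem operation
>>>>>
>>>>>   return p_malloc(size);
>>>>> }
>>>>> ```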
>>>>>
>>>>> https://man7.org/linux/man-pages/man8/ld.so.8.html
>>>>> https://www.man7.org/linux/man-pages/man3/dlsym.3.html
>>>>>
>>>>> That said, avoiding loops might be tricky if a function called by the
>>>>> interceptor itself calls `malloc`.
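One common way to deal with the loop problem mentioned above is a per-thread re-entrancy flag. The following is only a minimal sketch, not code from the prototype: `real_malloc`, `in_hook`, `report_allocation` and `tracked_malloc_calls` are hypothetical names, glibc on Linux with GCC or Clang is assumed, and the "report" step merely counts calls instead of talking to NMT.

```c
#define _GNU_SOURCE            /* for RTLD_NEXT */
#include <assert.h>
#include <dlfcn.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names; glibc on Linux and GCC/Clang are assumed. */
static void *(*real_malloc)(size_t);   /* next malloc in lookup order */
static atomic_ulong tracked_calls;     /* stand-in for reporting to a tracker */

/* Per-thread flag. initial-exec TLS avoids a lazy TLS allocation (which
   could itself call malloc) when this is built as a shared object. */
static __thread int in_hook __attribute__((tls_model("initial-exec")));

static void report_allocation(void *ptr, size_t size) {
  (void)ptr;
  (void)size;
  /* A real tracker might allocate here, re-entering malloc below. */
  atomic_fetch_add(&tracked_calls, 1);
}

unsigned long tracked_malloc_calls(void) {
  return atomic_load(&tracked_calls);
}

void *malloc(size_t size) {
  if (in_hook) {
    /* Nested call from our own code: pass through, do not track. */
    return real_malloc != NULL ? real_malloc(size) : NULL;
  }
  in_hook = 1;
  if (real_malloc == NULL) {
    real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    assert(real_malloc != NULL);
  }
  void *p = real_malloc(size);
  if (p != NULL) {
    report_allocation(p, size);
  }
  in_hook = 0;
  return p;
}
```

If `dlsym` itself allocated through `malloc` during the first lookup, the nested call in this sketch would simply fail with NULL; a more complete interposer would probably serve such early requests from a small static buffer. A full library would also need to cover `calloc`, `realloc` and `free` in the same way.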
>>>>>
>>>>
>>>> I think a simpler way would be to just add a way for libjnmt.so to use
>>>> custom allocators. If it is just about using a standard replacement like
>>>> jemalloc, a custom-tailored solution for that would be a lot simpler. But,
>>>> again, not sure about the use case.
>>>>
>>>> Cheers, Thomas
>>>>
>>>>
>>>>>
>>>>> Also I suppose this could work on macOS via `DYLD_INSERT_LIBRARIES`, but
>>>>> I am unsure, since macOS has some restrictions.
>>>>>
>>>>> --
>>>>> Brice
>>>>>
>>>>>
>>>>> On Mon, Dec 4, 2023 at 13:14 Johan Sjölén <johan.sjolen at oracle.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Thomas,
>>>>>>
>>>>>> If a user would like to switch out the malloc which a JVM is using,
>>>>>> would they be able to do that while simultaneously using your interception
>>>>>> library?
>>>>>>
>>>>>> Thank you,
>>>>>> Johan
>>>>>>
>>>>>> Hi, community,
>>>>>>
>>>>>> I experimented with extending Native Memory Tracking across the whole
>>>>>> process. I want to share my findings and propose a new JDK feature to allow
>>>>>> us to do that.
>>>>>>
>>>>>> TL;DR
>>>>>>
>>>>>> Proposed is a "native memory interposition library" shipped with the
>>>>>> JDK that would intercept all native memory calls from everywhere and
>>>>>> redirect them to NMT.
>>>>>>
>>>>>> Motivation:
>>>>>>
>>>>>> NMT is very useful but limited in its coverage. It only covers
>>>>>> Hotspot and a select few sites from the JDK. Most of the JDK, third-party
>>>>>> native code, and system libraries are not covered. This is a large hole in
>>>>>> our observability. I have seen people do (and have done myself, e.g. [1]) strange
>>>>>> and weird things to hunt memory leaks in native code. This is especially
>>>>>> tricky in locked-down customer scenarios.
>>>>>>
>>>>>> But NMT is a capable tracker. We could use it for much more than just
>>>>>> tracking Hotspot.
>>>>>>
>>>>>> In the past, developers have attempted to extend NMT instrumentation
>>>>>> over parts of the JDK (e.g. [2]), which met resistance from Oracle. This is
>>>>>> understandable: a naive extension would require libraries to link against
>>>>>> the libjvm and instrument their code. That introduces new dependencies
>>>>>> nobody wants.
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> I propose a different way that works without instrumenting any caller
>>>>>> code. I hope this proposal proves less controversial than brute-force NMT
>>>>>> instrumentation of the JDK. And it would allow introspection of non-JDK
>>>>>> parts too.
>>>>>>
>>>>>> We could ship an interception library (a "libjnmt.so") within the
>>>>>> JDK. That library, if preloaded, would redirect native memory requests to
>>>>>> NMT. A customer who wants to analyze the native memory footprint of its
>>>>>> apps could start the JVM with LD_PRELOAD=libjnmt and then use NMT
>>>>>> for introspection.
>>>>>>
>>>>>> Oracle and we continuously improve NMT; extending its reach across
>>>>>> the whole process would leverage that investment nicely.
>>>>>>
>>>>>> It also meshes well with other improvements. For example, we report
>>>>>> NMT numbers via JFR since [4] - with interposition, we could now expose
>>>>>> third-party native allocations via JFR. The new jcmd "System.map" would
>>>>>> automatically show memory mappings from outside Hotspot. There is a
>>>>>> precedent (libjsig), so shipping interposition libraries is not that
>>>>>> strange.
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> I have a Linux-based POC that works and looks promising [3]. With
>>>>>> that prototype, I can see:
>>>>>>
>>>>>> - allocations from the JDK - e.g., now I finally see mapped byte
>>>>>> buffers.
>>>>>> - allocations from third-party user code
>>>>>> - most allocations from system libraries, e.g., from the system zlib
>>>>>> - allocations via the new FFI interface
>>>>>>
>>>>>> The prototype tracks both mmap and malloc. Technically, the tricky
>>>>>> part was handling the initialization window: correctly handling
>>>>>> allocations from the start of the process's C++ initialization while
>>>>>> dynamically handing over allocations to the libjvm once it is loaded and
>>>>>> NMT is initialized. Another tricky problem was preventing circularities
>>>>>> stemming from call interception. The prototype solves these problems and is
>>>>>> already stable enough to be used.
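To make the initialization window more concrete, here is a minimal sketch of one possible handover scheme. It is an assumption for illustration, not the prototype's actual design: allocations are accounted locally until a tracker callback is registered, and forwarded afterwards. All names (`track_fn`, `interposer_register_tracker`, `interposer_note_allocation`, `example_tracker`) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical callback the JVM side would register once NMT is ready. */
typedef void (*track_fn)(void *ptr, size_t size);

static _Atomic(track_fn) nmt_tracker;   /* NULL until the handover */
static atomic_size_t early_bytes;       /* bytes seen before the handover */
static atomic_size_t early_count;       /* allocations seen before the handover */

/* Would be called by the JVM after NMT initialization (assumption). */
void interposer_register_tracker(track_fn fn) {
  assert(fn != NULL);
  atomic_store(&nmt_tracker, fn);
}

/* Would be called by the interposed malloc after a successful allocation. */
void interposer_note_allocation(void *ptr, size_t size) {
  track_fn fn = atomic_load(&nmt_tracker);
  if (fn != NULL) {
    fn(ptr, size);                      /* after handover: forward */
  } else {
    atomic_fetch_add(&early_bytes, size);  /* before handover: local totals */
    atomic_fetch_add(&early_count, 1);
  }
}

size_t interposer_early_bytes(void) { return atomic_load(&early_bytes); }
size_t interposer_early_count(void) { return atomic_load(&early_count); }

/* Stand-in for the JVM side, only here to exercise the sketch. */
static size_t forwarded_bytes;
static void example_tracker(void *ptr, size_t size) {
  (void)ptr;
  forwarded_bytes += size;
}
```

In this sketch, early allocations only show up as a lump sum. A real implementation would also have to decide what happens when memory allocated before the handover is freed after it.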
>>>>>>
>>>>>> Note that the patch is not complex or large. Some small interaction
>>>>>> with the JVM is needed, though, so this cannot be done just with an outside
>>>>>> library.
>>>>>>
>>>>>> The prototype was developed and tested on Linux x64 and with glibc
>>>>>> 2.31. It seems stable so far, but of course, the work is in an early stage,
>>>>>> and bugs may exist. If you want to play with the prototype, build it [3]
>>>>>> and then call:
>>>>>>
>>>>>> LD_PRELOAD=${JDK_DIR}/lib/server/libjnmt.so ${JDK_DIR}/bin/java
>>>>>> -XX:NativeMemoryTracking=detail <program> <args>
>>>>>>
>>>>>> Example: quarkus with "third-party code" injected that leaks
>>>>>> periodically [5]:
>>>>>>
>>>>>> LEAK_MALLOC=1 LEAK_MMAP=1 LD_PRELOAD=${JDK_DIR}/lib/server/libjnmt.so
>>>>>> ${JDK_DIR}/bin/java -agentpath:/shared/projects/jvmti-leak/leaker.so
>>>>>> -XX:NativeMemoryTracking=detail -jar
>>>>>> ./quarkus-profiling-workshop/target/quarkus-app/quarkus-run.jar
>>>>>>
>>>>>> In Summary mode, we see the slowly growing leaks:
>>>>>>
>>>>>> -External (via interposition) (reserved=82216KB, committed=82216KB)
>>>>>>                             (malloc=81588KB #585) (at peak)
>>>>>>                             (mmap: reserved=628KB, committed=628KB,
>>>>>> at peak)
>>>>>>
>>>>>>
>>>>>> and in Detail mode, their call stacks:
>>>>>>
>>>>>> [0x00007ff067ee7000 - 0x00007ff067ee8000] reserved and committed 4KB
>>>>>> for External (via interposition) from
>>>>>>     [0x00007ff067ef5056]the_mmap(void*, unsigned long, int, int, int,
>>>>>> long)+0x66 in libjnmt.so
>>>>>>     [0x00007ff067ef5781]mmap+0x71 in libjnmt.so
>>>>>>     [0x00007ff067ee955a]leak_mmap+0x3f in leaker.so
>>>>>>     [0x00007ff067ee95b1]leakleak+0x1c in leaker.so
>>>>>>     [0x00007ff067ee95c6]leakleakleak+0x12 in leaker.so
>>>>>>     [0x00007ff067ee95db]leakabit+0x12 in leaker.so
>>>>>>     [0x00007ff067ee95f8]leaky_thread+0x1a in leaker.so
>>>>>>
>>>>>>
>>>>>> [0x00007ff067ef5166]the_malloc(unsigned long)+0x106 in libjnmt.so
>>>>>> [0x00007ff067ee94ae]do_malloc+0xb8 in leaker.so
>>>>>> [0x00007ff067ee9518]leak_malloc+0x20 in leaker.so
>>>>>> [0x00007ff067ee95a7]leakleak+0x12 in leaker.so
>>>>>> [0x00007ff067ee95c6]leakleakleak+0x12 in leaker.so
>>>>>> [0x00007ff067ee95db]leakabit+0x12 in leaker.so
>>>>>> [0x00007ff067ee95f8]leaky_thread+0x1a in leaker.so
>>>>>>                              (malloc=17679KB type=External (via
>>>>>> interposition) #34) (at peak)
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> What about MEMFLAGS?
>>>>>>
>>>>>> The prototype does not extend MEMFLAGS apart from introducing a new
>>>>>> "External" category that tracks allocations done via interposition. The
>>>>>> question of MEMFLAGS - in particular, opening it up to outside extension -
>>>>>> has been contentious. It is orthogonal to this proposal - nice but not
>>>>>> required.
>>>>>>
>>>>>> This proposal makes external allocations visible under the new
>>>>>> "External" tag:
>>>>>> - in NMT summary mode, we only have the "External" total, which is
>>>>>> already useful even as a lump sum: it shows the footprint non-hotspot
>>>>>> libraries contribute to RSS. An RSS increase that is reflected neither by
>>>>>> hotspot allocations nor by "External" can only stem from a select few
>>>>>> places, e.g. from libc malloc retention.
>>>>>> - In NMT detail mode, this proposal shows us the call stacks to
>>>>>> foreign call sites, pinpointing at least the libraries involved.
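The lump-sum reasoning above boils down to simple arithmetic: take the process RSS and subtract what NMT attributes to Hotspot and to "External". A rough sketch, assuming Linux and `/proc/self/status`; the function names are hypothetical, and the two NMT totals would have to be taken from an NMT report by hand:

```c
#include <assert.h>
#include <stdio.h>

/* Resident set size of this process in KB, or -1 if unavailable (Linux only). */
long read_vm_rss_kb(void) {
  FILE *f = fopen("/proc/self/status", "r");
  if (f == NULL) {
    return -1;
  }
  char line[256];
  long kb = -1;
  while (fgets(line, sizeof line, f) != NULL) {
    /* The line of interest looks like "VmRSS:    12345 kB". */
    if (sscanf(line, "VmRSS: %ld kB", &kb) == 1) {
      break;
    }
  }
  fclose(f);
  return kb;
}

/* RSS not accounted for by the two NMT totals; all values in KB. */
long unexplained_kb(long rss_kb, long hotspot_kb, long external_kb) {
  assert(rss_kb >= 0 && hotspot_kb >= 0 && external_kb >= 0);
  return rss_kb - hotspot_kb - external_kb;
}
```

Since NMT reports committed memory, which need not be resident, the result is only a rough indicator and can even be negative. What matters is whether it grows over time.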
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> What do you think, does this make sense?
>>>>>>
>>>>>> Thanks, Thomas
>>>>>>
>>>>>>
>>>>>> [1] https://github.com/SAP/SapMachine/wiki/SapMachine-MallocTracer
>>>>>> [2] https://mail.openjdk.org/pipermail/core-libs-dev/2022-November/096197.html
>>>>>> [3] https://github.com/tstuefe/jdk/tree/libjnmt
>>>>>> [4] https://bugs.openjdk.org/browse/JDK-8157023
>>>>>> [5] https://github.com/tstuefe/jvmti_leak
>>>>>>
>>>>>>


More information about the jdk-dev mailing list