<div dir="ltr"><div>> let us not derail this discussion. <br></div><div><br></div><div>Last comment from me on the topic. </div><div><br></div><div>I saw this on a workload at my previous employer that used ~4 GB: I was able to reduce native memory from ~1200 MiB to ~400 MiB, and the difference was likely due to arenas and fragmentation. Worse, native memory was increasing slowly but steadily; I don't think it was a leak, but I cannot guarantee that either. That said, changing the native allocator removed this bad behavior.</div><div><br></div><div>My current job no longer involves production, so I don't follow everything, but I've seen colleagues hit similar issues, and when they tried another allocator the problem went away.</div><div><br></div><div>I have not dug into what other language runtimes experience, but I have regularly seen advice to change the default allocator.<br></div><div><br></div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div>-- Brice</div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Dec 5, 2023 at 4:12 PM Thomas Stüfe <<a href="mailto:thomas.stuefe@gmail.com">thomas.stuefe@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Dec 5, 2023 at 3:36 PM Brice Dutheil <<a href="mailto:brice.dutheil@gmail.com" target="_blank">brice.dutheil@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>> If it is just about using a standard replacement like jemalloc.</div><div><br></div><div>In my experience, and I believe this is what Johan was asking about as well, it is indeed that.</div><div><br></div><div>Deployment of workloads that need that, usually 
rely on "installing" an allocator library that is configured via `LD_PRELOAD`. This usually gives the option to change the allocator depending on multiple criteria: the workload itself, the CPU architecture. Sometimes jemalloc is better, sometimes tcmalloc is better (I have not tried minimalloc), so the flexibility to tweak that is important. </div><div>_All are better than glibc's malloc (arena "recycling" is quite bad in containerized envs and with multiple threads, leading to many dirty pages and higher RSS)._<br></div><div><br></div></div></blockquote><div><br></div><div>I always wondered how much of that is urban legend. I measured it myself a while ago (maybe I can dig up the results somewhere), and IIRC, I could produce artificial scenarios with way more overhead for the glibc case, but in practical cases it did not seem to matter. I even saw cases where glibc was better. <br></div><div><br></div><div>In any case, let us not derail this discussion. If jemalloc compatibility is required, I don't think it would be a show-stopper.<br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div></div><div>So that's why I was envisioning a "standard" use of the preload ability of the linker, e.g. 
`LD_PRELOAD="/path/to/jdk/lib/libjnmt.so /path/to/tcmalloc.so"`.</div><div>...assuming it can work.<br></div><div><br></div><div><br></div><div><br></div><div><div><div dir="ltr" class="gmail_signature"><div>-- Brice</div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Dec 5, 2023 at 1:50 PM Thomas Stuefe <<a href="mailto:tstuefe@redhat.com" target="_blank">tstuefe@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">Hi Brice,<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Dec 5, 2023 at 12:49 AM Brice Dutheil <<a href="mailto:brice.dutheil@gmail.com" target="_blank">brice.dutheil@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><div dir="auto">Hi Johan,</div><div dir="auto"><br></div><div dir="auto">Thomas will correct me, as he proposed the idea and is much more experienced; I'm a mere reader of this ML. </div><div dir="auto"><br></div><div dir="auto">So, I have not toyed with the code, but I believe this should work, at least on Linux, if the linker has no restrictions.</div><div dir="auto"><br></div><div dir="auto">Typically, interception happens because there is a preloaded function with the right signature (via <span style="float:none;display:inline;background-color:rgba(0,0,0,0);border-color:rgb(0,0,0);color:rgb(0,0,0)">`LD_PRELOAD`) t</span>hat the linker will look up. The magic works because, in order to do the real work, the wrapper invokes the right function further down the line using `dlsym(RTLD_NEXT, name)`. 
That should be the next library in the preload list, or the system library, since the linker processes `LD_PRELOAD` from left to right.</div><div dir="auto"><br></div><div dir="auto">```</div><div dir="auto">#define _GNU_SOURCE // for RTLD_NEXT; must come before any #include</div><div dir="auto">#include <dlfcn.h></div><div dir="auto">#include <stddef.h></div><div dir="auto"><br></div><div dir="auto"><div dir="auto">void *malloc(size_t size) {</div></div><div dir="auto"><div dir="auto"><div dir="auto"><div dir="auto"> void *(*p_malloc)(size_t) = dlsym(RTLD_NEXT, "malloc");</div></div></div></div></div><div></div></div><div><div><div dir="auto"><br></div><div dir="auto"> // report the memory operation back to the tracker here</div><div dir="auto"><br></div><div dir="auto"> return p_malloc(size);</div><div dir="auto">}</div><div dir="auto">```</div><div dir="auto"><br></div><div dir="auto"><div><a href="https://man7.org/linux/man-pages/man8/ld.so.8.html" target="_blank">https://man7.org/linux/man-pages/man8/ld.so.8.html</a></div><div><a href="https://www.man7.org/linux/man-pages/man3/dlsym.3.html" target="_blank">https://www.man7.org/linux/man-pages/man3/dlsym.3.html</a></div><br></div><div dir="auto">That said, avoiding loops might be tricky if a function called from the wrapper itself calls `malloc`.</div></div></div></blockquote><div><br></div><div>I think a simpler way would be to just add a way for libjnmt.so to use custom allocators. If it is just about using a standard replacement like jemalloc, a custom-tailored solution for that would be a lot simpler. 
But, again, I am not sure about the use case.<br></div><div><br></div><div>Cheers, Thomas<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><div dir="auto"><br></div><div dir="auto">Also, I suppose this could work on macOS via `DYLD_INSERT_LIBRARIES`, but I am unsure, since macOS has some restrictions.</div><div dir="auto"><br clear="all">-- <br clear="all"><div dir="auto"><div dir="ltr" class="gmail_signature">Brice</div></div></div><div><br></div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Dec 4, 2023 at 13:14 Johan Sjölén <<a href="mailto:johan.sjolen@oracle.com" target="_blank">johan.sjolen@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><u></u>
<div>
<p>Hi Thomas,<br>
</p>
<p>If a user would like to switch out the malloc which a JVM is
using, would they be able to do that while simultaneously using
your interception library?<br>
<br>
Thank you,<br>
Johan<br>
</p></div><div>
<blockquote type="cite">
<div dir="ltr">Hi, community,<br>
<br>
I experimented with extending Native Memory Tracking across the
whole process. I want to share my findings and propose a new JDK
feature to allow us to do that.<br>
<br>
TL;DR<br>
<br>
Proposed is a "native memory interposition library" shipped with
the JDK that would intercept all native memory calls from
everywhere and redirect them to NMT.<br>
<br>
Motivation:<br>
<br>
NMT is very useful but limited in its coverage. It only covers
Hotspot and a select few sites from the JDK. Most of the JDK,
third-party native code, and system libraries are not covered.
This is a large hole in our observability. I have seen people do
(and done myself! eg [1]) strange and weird things to hunt
memory leaks in native code. This is especially tricky in
locked-down customer scenarios.<br>
<br>
But NMT is a capable tracker. We could use it for much more than
just tracking Hotspot.<br>
<br>
In the past, developers have attempted to extend NMT
instrumentation over parts of the JDK (e.g. [2]), which met
resistance from Oracle. This is understandable: a naive
extension would require libraries to link against the libjvm and
instrument their code. That introduces new dependencies nobody
wants.<br>
<br>
---<br>
<br>
I propose a different way that works without instrumenting any
caller code. I hope this proposal proves less controversial than
brute-force NMT instrumentation of the JDK. And it would allow
introspection of non-JDK parts too.<br>
<br>
<div>We could ship an interception library (a "libjnmt.so")
within the JDK. That library, if preloaded, would redirect
native memory requests to NMT. A customer who wants to analyze
the native memory footprint of their apps could start the JVM
with <span style="font-family:monospace">LD_PRELOAD=libjnmt</span>
and then use NMT for introspection.</div>
<div><br>
</div>
Oracle and we continuously improve NMT; extending its reach
across the whole process would leverage that investment nicely.<br>
<br>
<div>It also meshes well with other improvements. For example,
we report NMT numbers via JFR since [4] - with interposition,
we could now expose third-party native allocations via JFR.
The new jcmd "System.map" would automatically show memory
mappings from outside Hotspot. There is a precedent (libjsig),
so shipping interposition libraries is not that strange.<br>
</div>
<br>
---<br>
<br>
I have a Linux-based POC that works and looks promising [3].
With that prototype, I can see:<br>
<br>
- allocations from the JDK - e.g., now I finally see mapped byte
buffers.<br>
- allocations from third-party user code<br>
- most allocations from system libraries, e.g., from the system
zlib<br>
- allocations via the new FFI interface<br>
<br>
The prototype tracks both mmap and malloc. Technically, the
tricky part was to handle the initialization window: being able
to correctly handle allocations starting at the process C++
initialization while dynamically handing over allocations to the
libjvm once it is loaded and NMT is initialized. Another tricky
problem was to prevent circularities stemming from call
interception. The prototype solves these problems and is already
stable enough to be used.<br>
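<br>
The two problems described above (the initialization window and circular calls) can be illustrated with a minimal sketch in C. This is not the prototype's code, only an illustration under stated assumptions: <span style="font-family:monospace">hooked_malloc</span>, <span style="font-family:monospace">hand_over</span>, <span style="font-family:monospace">example_tracker</span> and all counters are hypothetical names, <span style="font-family:monospace">hooked_malloc</span> merely stands in for an interposed malloc, and the early ledger here only keeps totals instead of replaying them into the tracker.<br>
<br>
<pre>
```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical tracker callback; a non-NULL value stands in for
   "the JVM is loaded and NMT is initialized". */
typedef void (*tracker_fn)(size_t size);

static tracker_fn tracker = NULL;      /* NULL during the initialization window */
size_t early_count = 0;                /* allocations seen before hand-over */
size_t early_bytes = 0;
size_t tracked_count = 0;              /* allocations reported to the tracker */
size_t tracked_bytes = 0;
static _Thread_local int in_hook = 0;  /* set while the tracker itself is running */

/* Stand-in for an interposed malloc: allocate first, then account. */
void *hooked_malloc(size_t size) {
    void *p = malloc(size);
    if (p == NULL || in_hook) {
        return p;                      /* nested call from the tracker: skip accounting */
    }
    if (tracker == NULL) {
        early_count++;                 /* tracker not ready yet: use a static ledger */
        early_bytes += size;
    } else {
        in_hook = 1;                   /* guard against re-entering the tracker */
        tracker(size);
        in_hook = 0;
    }
    return p;
}

/* Called once the tracker is ready; later allocations are forwarded to it. */
void hand_over(tracker_fn fn) {
    tracker = fn;
}

/* Example tracker that allocates through the hook itself, as real
   bookkeeping code might. Without the guard this would recurse forever. */
void example_tracker(size_t size) {
    void *scratch = hooked_malloc(16);
    free(scratch);
    tracked_count++;
    tracked_bytes += size;
}
```
</pre>
Calling <span style="font-family:monospace">hooked_malloc</span> before <span style="font-family:monospace">hand_over(example_tracker)</span> only updates the early ledger; afterwards each call reaches the tracker exactly once, even though the tracker allocates on its own. A thread-local flag is one simple way to break the cycle; whether the prototype uses the same approach is not stated here.<br>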
<br>
Note that the patch is not complex or large. Some small
interaction with the JVM is needed, though, so this cannot be
done just with an outside library.<br>
<br>
The prototype was developed and tested on Linux x64 and with
glibc 2.31. It seems stable so far, but of course, the work is
in an early stage, and bugs may exist. If you want to play with
the prototype, build it [3] and then call:<br>
<br>
<span style="font-family:monospace">LD_PRELOAD=${JDK_DIR}/lib/</span><span style="font-family:monospace">server/libjnmt.so
${JDK_DIR}/bin/java -XX:NativeMemoryTracking=</span><span style="font-family:monospace">detail <program>
<args></span><br>
<br>
Example: quarkus with "third-party code" injected that leaks
periodically [5]:<br>
<br>
<span style="font-family:monospace">LEAK_MALLOC=1 LEAK_MMAP=1
LD_PRELOAD=${JDK_DIR}/lib/</span><span style="font-family:monospace">server/libjnmt.so
${JDK_DIR}/bin/java -agentpath:/shared/projects/</span><span style="font-family:monospace">jvmti-leak/leaker.so
-XX:NativeMemoryTracking=</span><span style="font-family:monospace">detail -jar
./quarkus-profiling-workshop/</span><span style="font-family:monospace">target/quarkus-app/quarkus-</span><span style="font-family:monospace">run.jar</span><br>
<br>
In Summary mode, we see the slowly growing leaks:<br>
<br>
<span style="font-family:monospace">-External (via
interposition) (reserved=82216KB, committed=82216KB)<br>
(malloc=81588KB #585) (at peak)<br>
(mmap: reserved=628KB,
committed=628KB, at peak)</span><br>
<br>
<br>
and in Detail mode, their call stacks:<br>
<br>
<div><span style="font-family:monospace">[0x00007ff067ee7000 -
0x00007ff067ee8000] reserved and committed 4KB for External
(via interposition) from</span></div>
<div><span style="font-family:monospace">
[0x00007ff067ef5056]the_mmap(void*, unsigned long, int, int,
int, long)+0x66 in libjnmt.so</span></div>
<div><span style="font-family:monospace"></span></div>
<span style="font-family:monospace">
[0x00007ff067ef5781]mmap+0x71 in libjnmt.so<br>
[0x00007ff067ee955a]leak_mmap+0x3f in leaker.so<br>
[0x00007ff067ee95b1]leakleak+0x1c in leaker.so<br>
[0x00007ff067ee95c6]leakleakleak+0x12 in leaker.so<br>
[0x00007ff067ee95db]leakabit+0x12 in leaker.so<br>
[0x00007ff067ee95f8]leaky_thread+0x1a in leaker.so</span><br>
<br>
<span style="font-family:monospace"><br>
[0x00007ff067ef5166]the_malloc(unsigned long)+0x106 in
libjnmt.so<br>
[0x00007ff067ee94ae]do_malloc+0xb8 in leaker.so<br>
[0x00007ff067ee9518]leak_malloc+0x20 in leaker.so<br>
[0x00007ff067ee95a7]leakleak+0x12 in leaker.so<br>
[0x00007ff067ee95c6]leakleakleak+0x12 in leaker.so<br>
[0x00007ff067ee95db]leakabit+0x12 in leaker.so<br>
[0x00007ff067ee95f8]leaky_thread+0x1a in leaker.so<br>
(malloc=17679KB type=External
(via interposition) #34) (at peak)</span><br>
<br>
---<br>
<br>
What about MEMFLAGS?<br>
<br>
The prototype does not extend MEMFLAGS apart from introducing a
new "External" category that tracks allocations done via
interposition. The question of MEMFLAGS - in particular, opening
it up to outside extension - has been contentious. It is
orthogonal to this proposal - nice but not required.<br>
<br>
This proposal makes external allocations visible under the new
"External" tag:<br>
- in NMT summary mode, we only have the "External" total, which
is already useful even as a lump sum: it shows the footprint
non-hotspot libraries contribute to RSS. An RSS increase that is
reflected neither by hotspot allocations nor by "External" can
only stem from a select few places, e.g. from libc malloc
retention.<br>
- In NMT detail mode, this proposal shows us the call stacks to
foreign call sites, pinpointing at least the libraries involved.<br>
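<br>
The summary-mode reasoning above amounts to simple arithmetic, sketched here in C. The function name <span style="font-family:monospace">unexplained_rss_kb</span> is hypothetical, and the result is only a rough estimate: committed memory is not necessarily resident, so the subtraction can come out negative and is clamped to zero.<br>
<br>
<pre>
```c
/* All values in KB. Returns the part of RSS that is explained neither by
   hotspot allocations nor by the "External" category (hypothetical helper). */
long unexplained_rss_kb(long rss_kb, long hotspot_committed_kb,
                        long external_committed_kb) {
    long explained = hotspot_committed_kb + external_committed_kb;
    /* Committed memory may exceed resident memory, so clamp at zero. */
    return rss_kb > explained ? rss_kb - explained : 0;
}
```
</pre>
For example, with the 82216KB "External" figure from the summary output above and made-up values of 1000000KB for RSS and 800000KB for hotspot, 117784KB would remain unexplained and would point at places such as libc malloc retention.<br>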
<br>
---<br>
<br>
What do you think, does this make sense?<br>
<br>
Thanks, Thomas<br>
<br>
<br>
[1] <a href="https://github.com/SAP/SapMachine/wiki/SapMachine-MallocTracer" target="_blank">https://github.com/SAP/SapMachine/wiki/SapMachine-MallocTracer</a><br>
[2] <a href="https://mail.openjdk.org/pipermail/core-libs-dev/2022-November/096197.html" target="_blank">https://mail.openjdk.org/pipermail/core-libs-dev/2022-November/096197.html</a><br>
[3] <a href="https://github.com/tstuefe/jdk/tree/libjnmt" target="_blank">https://github.com/tstuefe/jdk/tree/libjnmt</a><br>
[4] <a href="https://bugs.openjdk.org/browse/JDK-8157023" target="_blank">https://bugs.openjdk.org/browse/JDK-8157023</a><br>
[5] <a href="https://github.com/tstuefe/jvmti_leak" target="_blank">https://github.com/tstuefe/jvmti_leak</a></div>
</blockquote>
</div>
</blockquote></div></div>
</div>
</div>
</blockquote></div></div>
</blockquote></div>
</blockquote></div></div>
</blockquote></div>