<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>Thanks Matthias, adding loom-dev</p>
<p>Maurizio<br>
</p>
<div class="moz-cite-prefix">On 17/01/2025 13:37, Matthias Ernst
wrote:<br>
</div>
<blockquote type="cite" cite="mid:CAKJ3wwEh-XH7+Azsz4YacOReMw8YkDxx6EmD2+WmFVUG7+1Adw@mail.gmail.com">
<div dir="ltr">
<div dir="ltr"><br>
</div>
<br>
<div class="gmail_quote gmail_quote_container">
<div dir="ltr" class="gmail_attr">On Fri, Jan 17, 2025 at
1:21 PM Matthias Ernst &lt;<a href="mailto:matthias@mernst.org" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">On Fri, Jan 17, 2025 at 1:09 AM Matthias
Ernst &lt;<a href="mailto:matthias@mernst.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>&gt;
wrote:</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">Thanks very much for the feedback,
Jorn!
<div>I've incorporated the two-element cache, and
avoid using a shared session now (Unsafe instead).</div>
<div><br>
</div>
<div>I'm not sure about using non-carrier
ThreadLocals. I think a defining quality is that
you can only have as many (root) foreign function
invocations as you have carrier threads, so
carrier-local storage is fitting. With virtual threads you might allocate
xxxx such buffers for nought.</div>
<div><br>
</div>
<div>As to the confined session: it fits nicely into
the implementation, but I observe that it destroys
one very nice property of the patch. Without it,
at least my test downcall seems to become
allocation-free (I see zero GC activity in the
benchmark), i.e. the "BoundedArena" and the buffer
slices seem to get completely scalar-replaced. As
soon as I add a per-call Arena.ofConfined() into
the picture, I see plenty of GC activity and the
call overhead goes up (though it is still far lower
than with malloc involved). I haven't looked in detail
into why that might be (I'm not very good with the
EA logs). I could argue this either way: an
allocation-free foreign call seems like a nice
property, and I'm reasonably sure these tmp
buffers cannot escape the call. Could the confined
session maybe be enabled only with a debug flag?</div>
</div>
</blockquote>
<div><br>
</div>
<div>I looked at this in more detail. The great news: I
got the confined session on top of the carrier-local
cache to be properly scalar-replaced. This now does
everything we want: lock-free buffer acquisition, two
cache entries, confinement while borrowed from the
cache, and everything allocation-free, with an 8 ns
round trip. I've updated the branch accordingly.</div>
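<div><br>
</div>
<div>A minimal sketch of the two-entry cache idea described above, not the code from the branch. The class name TwoSlotCache, the 256-byte buffer size, and the use of direct ByteBuffers as stand-ins for native buffers are all assumptions made for illustration. The sketch assumes each instance is owned by exactly one thread, so no locks or atomics are needed; how an instance is bound to a carrier thread (and kept there while a buffer is borrowed) is left out.</div>
<pre>
```java
import java.nio.ByteBuffer;

/** Hypothetical two-slot buffer cache, meant to be owned by a single thread. */
public class TwoSlotCache {
    static final int CACHED_BUFFER_SIZE = 256; // assumed size, not taken from the patch

    // A slot holds a buffer while it is available and null while it is borrowed.
    private ByteBuffer slot0 = ByteBuffer.allocateDirect(CACHED_BUFFER_SIZE);
    private ByteBuffer slot1 = ByteBuffer.allocateDirect(CACHED_BUFFER_SIZE);

    /** Returns a cached buffer, or null when both are borrowed (caller must allocate). */
    public ByteBuffer acquire() {
        ByteBuffer b = slot0;
        if (b != null) {
            slot0 = null;
            return b;
        }
        b = slot1;
        if (b != null) {
            slot1 = null;
        }
        return b;
    }

    /** Puts a previously acquired buffer back into a free slot. */
    public void release(ByteBuffer b) {
        b.clear();
        if (slot0 == null) {
            slot0 = b;
        } else {
            slot1 = b;
        }
    }

    public static void main(String[] args) {
        TwoSlotCache cache = new TwoSlotCache();
        ByteBuffer downcallBuf = cache.acquire(); // outer downcall
        ByteBuffer upcallBuf = cache.acquire();   // upcall nested inside it
        ByteBuffer deeper = cache.acquire();      // deeper nesting misses the cache
        System.out.println("third acquire missed: " + (deeper == null));
        cache.release(upcallBuf);
        cache.release(downcallBuf);
        System.out.println("buffer available again: " + (cache.acquire() != null));
    }
}
```
</pre>
<div>With two slots, a downcall and an upcall nested inside it can both be served from the cache; anything nested deeper falls back to a regular allocation.</div>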
<div><br>
</div>
<div>The less great news: I seem to be running into a
bug in escape analysis. </div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>That one was easy to repro: calling
Continuation.pin/unpin in a constructor seems to confuse
escape analysis. Please see:</div>
<div><a href="https://github.com/mernst-github/repro/tree/main/escape-analysis" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/mernst-github/repro/tree/main/escape-analysis</a></div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_quote">
<div>Depending on where I place the ".ofConfined()" call
in the BoundedArena constructor I get either:</div>
<div>proper scalar replacement, but a crash in
fastdebug:</div>
<div> # Internal Error
(/Users/mernst/IdeaProjects/jdk/src/hotspot/share/opto/escape.cpp:4767),
pid=85070, tid=26115</div>
# assert(false) failed: EA: missing memory path<br>
<div>OR</div>
<div> fastdebug works, but fails to scalar replace the
confined session.</div>
<div><br>
</div>
<div>See comments in <a href="https://github.com/openjdk/jdk/pull/23142/files#diff-80b3987494fdd3ed20ced0248adbf6097432e24db8a2fb8476bbf2143bd0a2c3R401-R409" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/pull/23142/files#diff-80b3987494fdd3ed20ced0248adbf6097432e24db8a2fb8476bbf2143bd0a2c3R401-R409</a>:<br>
<br>
</div>
<div> public BoundedArena(long size) {<br>
// When here, works in fastdebug, but not scalar-replaced:<br>
// scope = Arena.ofConfined(); &lt;======================<br>
<br>
MemorySegment cached = size &lt;= BufferCache.CACHED_BUFFER_SIZE ? BufferCache.acquire() : null;<br>
<br>
// When here, works in release build, but fastdebug crashes:<br>
// # Internal Error (/Users/mernst/IdeaProjects/jdk/src/hotspot/share/opto/escape.cpp:4767), pid=85070, tid=26115<br>
// # assert(false) failed: EA: missing memory path<br>
scope = Arena.ofConfined(); &lt;======================<br>
</div>
<div><br>
</div>
<div>Crash logs are attached.</div>
<div><br>
</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div><br>
</div>
<div>Matthias</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Jan 16,
2025 at 6:26 PM Jorn Vernee &lt;<a href="mailto:jorn.vernee@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">jorn.vernee@oracle.com</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hello Matthias,</p>
<p>We've been exploring this direction
internally as well. As you've found, downcall
handles/upcall stubs sometimes need to
allocate memory. The return buffer case that
you've run into is one such case, others are:
when a struct that does not fit into a single
register is passed by value on Windows, we
need to create a copy. When a struct is passed
by value to an upcall stub, we need to
allocate memory to hold the value.</p>
<p>I took a look at your patch. One of the
problems I see with a one-element cache is
that some upcall stubs might never benefit
from it, since a preceding downcall already
claimed the cache. Though, I believe a chain
of downcalls and upcalls is comparatively
rare. A two element cache might be better.
That way a sequence of downcall -> upcall,
that both use by-value structs, will be able
to benefit from the cache.</p>
<p>Having a cache per carrier thread is probably
a good idea. A cache per thread is also
possibly an option, if the overhead seems
acceptable (the cache is only initialized for
threads that actually call native code after
all). This would also be a little faster, I
think.<br>
</p>
<p>One thing that's unfortunate is the use of a
shared arena, even in the fallback case, since
closing that is very slow. Another problem <span><span>is
that with your current implementation, we
are no longer tracking the lifetime of the
memory correctly, and it is possible to
access memory that was already returned to
the cache. Using a proper lifetime (i.e.
creating/closing a new arena per call) has
helped to catch bugs in the past. If we
want to keep doing that, we'd have to
re-wrap the memory of the cache with a new
arena (using MemorySegment::reinterpret),
which we then close after a downcall, to
return elements to the cache. I suggest
restructuring the code so that it always
creates a new confined arena, as today,
but then either: 1) grabs a memory segment
from the cache, and attaches that to the
new confined arena (using
MS::reinterpret), or 2) in the case of a
cache miss, just allocates a new segment
from the confined arena we created.</span></span></p>
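<p>A minimal sketch of the restructuring suggested above, under stated assumptions: it needs Java 22 or later (final FFM API), and the names CallBufferSketch, cachedSlot, acquire, release and allocateFor, the single-slot cache, and the 256-byte size are hypothetical and only serve the illustration. MemorySegment::reinterpret is a restricted method, so the JDK may print a native-access warning when this runs.</p>
<pre>
```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical names throughout; requires Java 22 or later.
public class CallBufferSketch {
    static final long CACHED_BUFFER_SIZE = 256; // assumed size

    // Stand-in for the cache: one slot, null while its buffer is borrowed.
    static MemorySegment cachedSlot = Arena.global().allocate(CACHED_BUFFER_SIZE, 16);

    static MemorySegment acquire() {
        MemorySegment s = cachedSlot;
        cachedSlot = null;
        return s;
    }

    static void release(MemorySegment s) {
        cachedSlot = s;
    }

    // Always hands out memory whose lifetime is bound to callArena.
    static MemorySegment allocateFor(Arena callArena, long size) {
        MemorySegment cached = size > CACHED_BUFFER_SIZE ? null : acquire();
        if (cached == null) {
            // Cache miss, or request too large: allocate from the per-call arena.
            return callArena.allocate(size, 16);
        }
        // Cache hit: re-wrap the cached memory so it is tied to callArena.
        // The cleanup action runs when callArena closes and returns the buffer.
        return cached.reinterpret(size, callArena, seg -> release(cached));
    }

    public static void main(String[] args) {
        MemorySegment tmp;
        try (Arena callArena = Arena.ofConfined()) {
            tmp = allocateFor(callArena, 32);
            tmp.set(ValueLayout.JAVA_DOUBLE, 0, 1.5);
            System.out.println("value read back: " + tmp.get(ValueLayout.JAVA_DOUBLE, 0));
            System.out.println("slot empty while borrowed: " + (cachedSlot == null));
        }
        // After close, the re-wrapped segment is no longer accessible.
        System.out.println("segment alive after close: " + tmp.scope().isAlive());
        System.out.println("buffer back in cache: " + (cachedSlot != null));
    }
}
```
</pre>
<p>Because both paths return a segment scoped to the per-call arena, any access after the call has finished fails in the same way regardless of whether the memory came from the cache.</p>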
<p><span><span>WDYT?<br>
</span></span></p>
<p><span><span>Jorn<br>
</span></span></p>
<div>On 16-1-2025 11:00, Matthias Ernst wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">Hi, I noticed a source of
overhead when calling foreign functions
with small aggregate return values.<br>
<br>
<div>For example, a function returning a <font face="monospace">struct Vector2D {
double x; double y }</font> will cause
a malloc/free inside the downcall handle
on every call. On my machine, this
accounts for about 80% of the call
overhead.</div>
<div><br>
</div>
<div>A representative stack trace:</div>
<div>
<pre style="box-sizing:border-box;font-size:11.9px;margin-top:0px;overflow:auto;line-height:1.45;color:rgb(31,35,40);border-radius:6px"><code style="box-sizing:border-box;padding:0px;margin:0px;background:transparent;border-radius:6px;word-break:normal;border:0px;display:inline;overflow:visible;line-height:inherit"> java.lang.Thread.State: RUNNABLE
<b> at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
</b>...
<b> at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
</b> at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
</code></pre>
</div>
<div>While it might be difficult to
eliminate these intermediate buffers, I
would propose to try reusing them.</div>
<div><br>
</div>
<div>
<div>
<div>What's happening here:</div>
</div>
</div>
<div>* the ARM64 ABI returns such a struct
in two 128-bit registers v0/v1 [0]</div>
<div>* the VM stub calling convention
around this expects an output buffer to
copy v0/v1 into: [1]</div>
<div><font face="monospace">stub(out) {
... out[0..16) = v0; out[16..32) = v1;
}</font></div>
<div>* the FFM downcall calling convention
OTOH expects a user-provided
SegmentAllocator to allocate a 16-byte
StructLayout(JAVA_DOUBLE, JAVA_DOUBLE).
The generated method handle that adapts to
the stub looks roughly like this [2]:</div>
<div> ffm(allocator) {</div>
<div><b> tmp = malloc(32)</b></div>
<div> stub(tmp)</div>
<div> result = allocator.allocate(16)</div>
<div> result[0..8) = tmp[0..8)</div>
<div> result[8..16) = tmp[16..24)</div>
<b> free(tmp)</b></div>
<div dir="ltr"> return result<br>
<div>}</div>
<div><br>
</div>
<div>Now there's an easy way around this
for the user by using a different native
signature:</div>
<div>
<div><font face="monospace">void
g(Vector2D *out) { *out = f(); }</font></div>
<div>This eliminates the intermediate
buffer altogether.</div>
<div><br>
</div>
<div>
<div>However, if we wanted to optimize
the return-by-value path, I can
think of three options:</div>
<div>* enhance the stub calling
conventions to directly copy only
the narrowed output registers into
the result buffer. This looks
rather involved.</div>
<div>* allocate the tmp buffer using
the user's allocator as well (e.g.
in conjunction with the result +
slicing). The Linker API is somewhat
lenient about exactly how `allocator`
will be invoked: "used by the
linker runtime to allocate the
memory region associated with the
struct returned by the downcall
method handle". However, this may
be surprising to the caller.</div>
<div>* keep the tmp buffer allocation
internal, but optimize it. This is
what I'm proposing here.</div>
<div><br>
</div>
</div>
<div>A possible counter-argument could
be "this is just one allocation out of
two". However, the user has control
over `allocator`, and may re-use the
same segment across calls, but they
have no control over the tmp
allocation.</div>
<div><br>
</div>
</div>
<div>
<div>I've worked on a patch that takes
this last route, using a one-element
thread-local cache: <a href="https://github.com/openjdk/jdk/pull/23142" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/pull/23142</a>.
It reduces call time from 36 to 8 ns/op
on my machine, and I no longer observe
any GCs.</div>
</div>
<div><br>
</div>
<div>Would there be interest in pursuing
this?</div>
<div><br>
</div>
<div>Thx</div>
<div>Matthias</div>
<div><br>
</div>
<div><br>
</div>
<div>[0] <a href="https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values</a></div>
<div>[1] <a href="https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97</a></div>
<div>[2] "binding context": <a href="https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296</a></div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</body>
</html>