<div dir="ltr"><div dir="ltr">Hi, I noticed a source of overhead when calling foreign functions with small aggregate return values.<br><br><div>For example, a function returning a <font face="monospace">struct Vector2D { double x; double y; }</font> causes a malloc/free inside the downcall handle on every call. On my machine, this accounts for about 80% of the call overhead.</div><div><br></div><div>A representative stack trace:</div><div><pre style="box-sizing:border-box;font-size:11.9px;margin-top:0px;overflow:auto;line-height:1.45;color:rgb(31,35,40);border-radius:6px"><code style="box-sizing:border-box;padding:0px;margin:0px;background:transparent;border-radius:6px;word-break:normal;border:0px;display:inline;overflow:visible;line-height:inherit">   java.lang.Thread.State: RUNNABLE
<b>        at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
</b>        ...
<b>        at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
</b>        at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
        at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
</code></pre></div><div>While it might be difficult to eliminate these intermediate buffers, I propose reusing them.</div><div><br></div><div><div><div>What's happening here:</div></div></div><div>* the ARM64 ABI returns such a struct in two 128-bit registers v0/v1 [0]</div><div>* the VM stub calling convention around this expects an output buffer to copy v0/v1 into [1]:</div><div><font face="monospace">stub(out) { ... out[0..16) = v0; out[16..32) = v1; }</font></div><div>* the FFM downcall calling convention, on the other hand, expects a user-provided SegmentAllocator to allocate a 16-byte StructLayout(JAVA_DOUBLE, JAVA_DOUBLE). The generated method handle that adapts to the stub looks roughly like this [2]:</div><div> ffm(allocator) {</div><div><b>  tmp = malloc(32)</b></div><div>  stub(tmp)</div><div>  result = allocator.allocate(16)</div><div>  result[0..8) = tmp[0..8)</div><div>  result[8..16) = tmp[16..24)</div><b>  free(tmp)</b></div><div dir="ltr">  return result<br class="gmail-Apple-interchange-newline"><div>}</div><div><br></div><div>There is an easy way around this for the user: use a different native signature.</div><div><div><font face="monospace">void g(Vector2D *out) { *out = f(); }</font></div><div>This eliminates the intermediate buffer altogether.</div><div><br></div><div><div>However, if we wanted to optimize the return-by-value path, I can think of three options:</div><div>* enhance the stub calling conventions to copy only the narrowed output registers directly into the result buffer. This looks rather involved.</div><div>* allocate the tmp buffer using the user's allocator as well (e.g. in one allocation together with the result, then slicing). The Linker API is somewhat lenient about exactly how `allocator` is invoked: "used by the linker runtime to allocate the memory region associated with the struct returned by the downcall method handle".
However, this may be surprising to the caller.</div><div>* keep the tmp buffer allocation internal, but optimize it. This is what I'm proposing here.</div><div><br></div></div><div>A possible counter-argument could be "this is just one allocation out of two". However, the user has control over `allocator` and may reuse the same segment across calls, whereas they have no control over the tmp allocation.</div><div><br></div></div><div><div>I've worked on a patch that takes this last route, using a one-element thread-local cache: <a href="https://github.com/openjdk/jdk/pull/23142">https://github.com/openjdk/jdk/pull/23142</a>. It reduces call time from 36 ns to 8 ns per op on my machine, and I no longer observe any GCs.</div></div><div><br></div><div>Would there be interest in pursuing this?</div><div><br></div><div>Thx</div><div>Matthias</div><div><br></div><div><br></div><div>[0] <a href="https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values" target="_blank">https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values</a></div><div>[1] <a href="https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97" target="_blank">https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97</a></div><div>[2] "binding context": <a href="https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296">https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296</a></div></div>
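<div><br></div><div>P.S. To illustrate the idea of a one-element thread-local cache for the tmp buffer, here is a minimal sketch written against the public java.lang.foreign API (Java 22+). It is not the code from the PR: the names ScratchCache, Slot, Lease and CACHED_SIZE, the 256-byte size and the 16-byte alignment are all assumptions made for illustration, and it ignores questions such as virtual threads.</div>

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

// Hypothetical sketch: each thread keeps one reusable scratch buffer.
final class ScratchCache {
    // Assumed upper bound for "small" return buffers; not taken from the JDK.
    static final long CACHED_SIZE = 256;

    // Per-thread slot holding the cached buffer and an in-use flag.
    private static final class Slot {
        // GC-managed arena, so the buffer is reclaimed once the thread dies.
        final MemorySegment segment = Arena.ofAuto().allocate(CACHED_SIZE, 16);
        boolean inUse;
    }

    private static final ThreadLocal<Slot> SLOT = ThreadLocal.withInitial(Slot::new);

    // A leased buffer; close() hands it back to the cache or frees it.
    static final class Lease implements AutoCloseable {
        private final MemorySegment segment;
        private final Slot slot;      // non-null if the buffer came from the cache
        private final Arena fallback; // non-null if the buffer was freshly allocated

        private Lease(MemorySegment segment, Slot slot, Arena fallback) {
            this.segment = segment;
            this.slot = slot;
            this.fallback = fallback;
        }

        MemorySegment segment() {
            return segment;
        }

        @Override
        public void close() {
            if (slot != null) {
                slot.inUse = false; // make the cached buffer available again
            } else {
                fallback.close();   // frees the one-off native allocation
            }
        }
    }

    // Returns the cached buffer when it is free and large enough,
    // otherwise falls back to a fresh allocation (the slow path).
    static Lease acquire(long size) {
        Slot slot = SLOT.get();
        if (!slot.inUse && size <= CACHED_SIZE) {
            slot.inUse = true;
            return new Lease(slot.segment.asSlice(0, size), slot, null);
        }
        Arena arena = Arena.ofConfined();
        return new Lease(arena.allocate(size, 16), null, arena);
    }
}
```

<div>In the pseudocode above, tmp = malloc(32) / free(tmp) would become a try-with-resources around ScratchCache.acquire(32). The in-use flag covers nested or reentrant use on the same thread: a second acquire while the first lease is still open simply takes the slow path.</div>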

</div>