[FFM performance] Intermediate buffer allocation when returning structs
Matthias Ernst
matthias at mernst.org
Thu Jan 16 10:00:32 UTC 2025
Hi, I noticed a source of overhead when calling foreign functions with
small aggregate return values.
For example, a function returning a struct Vector2D { double x; double y; }
will cause a malloc/free inside the downcall handle on every call. On my
machine, this accounts for about 80% of the call overhead.
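For context, here is a minimal FFM sketch of such a by-value struct return (JDK 22+). Since Vector2D/f aren't in any loaded library, libc's div(), which returns a small struct div_t { int quot; int rem; }, stands in; the field order shown matches common ABIs (glibc, musl, macOS), though the C standard leaves it unspecified:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class StructReturn {
    // div_t { int quot; int rem; } -- a small struct returned by value,
    // analogous to the Vector2D example (field order as on common ABIs)
    static final StructLayout DIV_T = MemoryLayout.structLayout(
            ValueLayout.JAVA_INT.withName("quot"),
            ValueLayout.JAVA_INT.withName("rem"));

    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        MethodHandle div = linker.downcallHandle(
                linker.defaultLookup().find("div").orElseThrow(),
                FunctionDescriptor.of(DIV_T, ValueLayout.JAVA_INT, ValueLayout.JAVA_INT));

        try (Arena arena = Arena.ofConfined()) {
            // struct-returning handles take a leading SegmentAllocator,
            // which the linker runtime uses for the returned segment --
            // this is where the hidden tmp buffer enters the picture
            MemorySegment r = (MemorySegment) div.invokeExact((SegmentAllocator) arena, 7, 2);
            int quot = r.get(ValueLayout.JAVA_INT, 0);
            int rem = r.get(ValueLayout.JAVA_INT, 4);
            System.out.println(quot + " " + rem); // 3 1
        }
    }
}
```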
Sample stack:

  java.lang.Thread.State: RUNNABLE
        at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
        ...
        at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
        at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
        at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
While it might be difficult to eliminate these intermediate buffers, I
would propose to try reusing them.
What's happening here:
* the ARM64 ABI returns such a struct in two 128-bit registers v0/v1 [0]
* the VM stub calling convention around this expects an output buffer to
copy v0/v1 into: [1]
stub(out) { ... out[0..16) = v0; out[16..32) = v1; }
* the FFM downcall calling convention OTOH expects a user-provided
SegmentAllocator to allocate a 16 byte StructLayout(JAVA_DOUBLE,
JAVA_DOUBLE). The generated method handle to adapt to the stub looks
roughly like this [2]:
ffm(allocator) {
    tmp = malloc(32)
    stub(tmp)
    result = allocator.allocate(16)
    result[0..8) = tmp[0..8)
    result[8..16) = tmp[16..24)
    free(tmp)
    return result
}
Now there's an easy way around this for the user by using a different
native signature:
void g(Vector2D *out) { *out = f(); }
This eliminates the intermediate buffer altogether.
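A sketch of that out-parameter shape from the Java side (JDK 22+): libc's time(time_t *) stands in for the hypothetical g, since Vector2D/f/g aren't in any loaded library. The caller allocates the out-buffer once and can reuse it on every call, so neither side allocates per call:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class OutParam {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // time_t time(time_t *tloc) -- result delivered through an
        // out-pointer as well as the return value; plays the role of
        // void g(Vector2D *out) from the workaround above
        MethodHandle time = linker.downcallHandle(
                linker.defaultLookup().find("time").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

        try (Arena arena = Arena.ofConfined()) {
            // one out-buffer, reused across calls -- no per-call allocation
            MemorySegment out = arena.allocate(ValueLayout.JAVA_LONG);
            long t = (long) time.invokeExact(out);
            // time() stores the same value it returns
            System.out.println(t == out.get(ValueLayout.JAVA_LONG, 0)); // true
        }
    }
}
```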
However, if we wanted to optimize the return-by-value path, I can think of
three options:
* enhance the stub calling conventions to directly copy only the narrowed
output registers into the result buffer. This looks rather involved.
* allocate the tmp buffer using the user's allocator as well (e.g. in
conjunction with the result, plus slicing). The Linker API is somewhat lenient
about exactly how `allocator` will be invoked: "used by the linker runtime
to allocate the memory region associated with the struct returned by the
downcall method handle". However, this may be surprising to the caller.
* keep the tmp buffer allocation internal, but optimize it. This is what
I'm proposing here.
A possible counter-argument could be "this is just one allocation out of
two". However, the user has control over `allocator`, and may re-use the
same segment across calls, but they have no control over the tmp allocation.
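To illustrate that asymmetry: the user-controlled half can already be amortized with SegmentAllocator.prefixAllocator, which recycles one pre-allocated slot on every call, while the internal tmp malloc/free still happens per call. Again using libc's div() as a struct-returning stand-in (div_t field order as on common ABIs):

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class ReusedAllocator {
    static final StructLayout DIV_T = MemoryLayout.structLayout(
            ValueLayout.JAVA_INT.withName("quot"),
            ValueLayout.JAVA_INT.withName("rem"));

    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        MethodHandle div = linker.downcallHandle(
                linker.defaultLookup().find("div").orElseThrow(),
                FunctionDescriptor.of(DIV_T, ValueLayout.JAVA_INT, ValueLayout.JAVA_INT));

        try (Arena arena = Arena.ofConfined()) {
            // one pre-allocated slot, handed back for every allocation
            // request: the user-visible allocation is amortized, but the
            // internal tmp buffer is still malloc'd/free'd on each call
            MemorySegment slot = arena.allocate(DIV_T);
            SegmentAllocator reuse = SegmentAllocator.prefixAllocator(slot);
            long sum = 0;
            for (int i = 1; i <= 3; i++) {
                MemorySegment r = (MemorySegment) div.invokeExact(reuse, 10, i);
                sum += r.get(ValueLayout.JAVA_INT, 0); // quot: 10, 5, 3
            }
            System.out.println(sum); // 18
        }
    }
}
```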
I've worked on a patch that takes this last route, using a one-element
thread-local cache: https://github.com/openjdk/jdk/pull/23142. On my machine
it reduces call time from 36 ns to 8 ns per op, and I observe no more GCs.
Would there be interest in pursuing this?
Thx
Matthias
[0]
https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values
[1]
https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97
[2] "binding context":
https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296
More information about the panama-dev mailing list