[FFM performance] Intermediate buffer allocation when returning structs
Matthias Ernst
matthias at mernst.org
Thu Jan 16 10:00:32 UTC 2025
Hi, I noticed a source of overhead when calling foreign functions with
small aggregate return values.
For example, a function returning a struct Vector2D { double x; double y; }
will cause a malloc/free inside the downcall handle on every call. On my
machine, this accounts for about 80% of the call overhead.
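For context, here is a minimal FFM sketch of such a by-value struct return (JDK 22+). Since Vector2D/f aren't in any loaded library, libc's div(), which returns a small struct div_t { int quot; int rem; }, stands in; the field order shown matches common ABIs (glibc, musl, macOS), though the C standard leaves it unspecified:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class StructReturn {
    // div_t { int quot; int rem; } -- a small struct returned by value,
    // analogous to the Vector2D example (field order as on common ABIs)
    static final StructLayout DIV_T = MemoryLayout.structLayout(
            ValueLayout.JAVA_INT.withName("quot"),
            ValueLayout.JAVA_INT.withName("rem"));

    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        MethodHandle div = linker.downcallHandle(
                linker.defaultLookup().find("div").orElseThrow(),
                FunctionDescriptor.of(DIV_T, ValueLayout.JAVA_INT, ValueLayout.JAVA_INT));

        try (Arena arena = Arena.ofConfined()) {
            // struct-returning handles take a leading SegmentAllocator,
            // which the linker runtime uses for the returned segment --
            // this is where the hidden tmp buffer enters the picture
            MemorySegment r = (MemorySegment) div.invokeExact((SegmentAllocator) arena, 7, 2);
            int quot = r.get(ValueLayout.JAVA_INT, 0);
            int rem = r.get(ValueLayout.JAVA_INT, 4);
            System.out.println(quot + " " + rem); // 3 1
        }
    }
}
```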
Sample stack:

  java.lang.Thread.State: RUNNABLE
        at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
        ...
        at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
        at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
        at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
While it might be difficult to eliminate these intermediate buffers, I
would propose to try reusing them.
What's happening here:
* the ARM64 ABI returns such a struct in two 128-bit registers v0/v1 [0]
* the VM stub calling convention around this expects an output buffer to
copy v0/v1 into: [1]
stub(out) { ... out[0..16) = v0; out[16..32) = v1; }
* the FFM downcall calling convention OTOH expects a user-provided
SegmentAllocator to allocate a 16 byte StructLayout(JAVA_DOUBLE,
JAVA_DOUBLE). The generated method handle to adapt to the stub looks
roughly like this [2]:
ffm(allocator) {
    tmp = malloc(32)
    stub(tmp)
    result = allocator.allocate(16)
    result[0..8) = tmp[0..8)
    result[8..16) = tmp[16..24)
    free(tmp)
    return result
}
Now there's an easy way around this for the user by using a different
native signature:
void g(Vector2D *out) { *out = f(); }
This eliminates the intermediate buffer altogether.
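A sketch of that out-parameter shape from the Java side (JDK 22+): libc's time(time_t *) stands in for the hypothetical g, since Vector2D/f/g aren't in any loaded library. The caller allocates the out-buffer once and can reuse it on every call, so neither side allocates per call:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class OutParam {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // time_t time(time_t *tloc) -- result delivered through an
        // out-pointer as well as the return value; plays the role of
        // void g(Vector2D *out) from the workaround above
        MethodHandle time = linker.downcallHandle(
                linker.defaultLookup().find("time").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

        try (Arena arena = Arena.ofConfined()) {
            // one out-buffer, reused across calls -- no per-call allocation
            MemorySegment out = arena.allocate(ValueLayout.JAVA_LONG);
            long t = (long) time.invokeExact(out);
            // time() stores the same value it returns
            System.out.println(t == out.get(ValueLayout.JAVA_LONG, 0)); // true
        }
    }
}
```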
However, if we wanted to optimize the return-by-value path, I can think of
three options:
* enhance the stub calling conventions to directly copy only the narrowed
output registers into the result buffer. This looks rather involved.
* allocate the tmp buffer using the user's allocator as well (e.g. in
conjunction with the result, plus slicing). The Linker API is somewhat lenient
about exactly how `allocator` will be invoked: "used by the linker runtime
to allocate the memory region associated with the struct returned by the
downcall method handle". However, this may be surprising to the caller.
* keep the tmp buffer allocation internal, but optimize it. This is what
I'm proposing here.
A possible counter-argument could be "this is just one allocation out of
two". However, the user has control over `allocator`, and may re-use the
same segment across calls, but they have no control over the tmp allocation.
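To illustrate that asymmetry: the user-controlled half can already be amortized with SegmentAllocator.prefixAllocator, which recycles one pre-allocated slot on every call, while the internal tmp malloc/free still happens per call. Again using libc's div() as a struct-returning stand-in (div_t field order as on common ABIs):

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class ReusedAllocator {
    static final StructLayout DIV_T = MemoryLayout.structLayout(
            ValueLayout.JAVA_INT.withName("quot"),
            ValueLayout.JAVA_INT.withName("rem"));

    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        MethodHandle div = linker.downcallHandle(
                linker.defaultLookup().find("div").orElseThrow(),
                FunctionDescriptor.of(DIV_T, ValueLayout.JAVA_INT, ValueLayout.JAVA_INT));

        try (Arena arena = Arena.ofConfined()) {
            // one pre-allocated slot, handed back for every allocation
            // request: the user-visible allocation is amortized, but the
            // internal tmp buffer is still malloc'd/free'd on each call
            MemorySegment slot = arena.allocate(DIV_T);
            SegmentAllocator reuse = SegmentAllocator.prefixAllocator(slot);
            long sum = 0;
            for (int i = 1; i <= 3; i++) {
                MemorySegment r = (MemorySegment) div.invokeExact(reuse, 10, i);
                sum += r.get(ValueLayout.JAVA_INT, 0); // quot: 10, 5, 3
            }
            System.out.println(sum); // 18
        }
    }
}
```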
I've worked on a patch that takes this last route, using a one-element
thread-local cache: https://github.com/openjdk/jdk/pull/23142. On my machine
it reduces call time from 36 ns to 8 ns per op, and I observe no more GCs.
Would there be interest in pursuing this?
Thx
Matthias
[0]
https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values
[1]
https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97
[2] "binding context":
https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296
More information about the panama-dev mailing list