[FFM performance] Intermediate buffer allocation when returning structs
Per-Ake Minborg
per-ake.minborg at oracle.com
Thu Jan 16 11:29:05 UTC 2025
Hi Matthias,
Thank you for your mail!
We are working on what seems to be a closely related issue: reusing a MemorySegment for "system calls" (e.g. open() and socket()) that capture the error number in the reused segment. We are seeing performance improvements similar to the ones you have identified. Here, we are careful to cover the case where a virtual thread is unmounted from its carrier during a potentially lengthy system call and another virtual thread is mounted on the original platform thread and subsequently invokes another system call.
Here is the PR: https://github.com/openjdk/jdk/pull/22391
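For readers following along, here is a minimal sketch of the pattern being discussed, reusing one capture-state segment per thread for errno (illustrative only: the class and helper names are made up, and this is not the code in the PR):

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.VarHandle;

class ErrnoCapture {
    private static final Linker LINKER = Linker.nativeLinker();
    private static final StructLayout CAPTURE_LAYOUT = Linker.Option.captureStateLayout();
    private static final VarHandle ERRNO =
            CAPTURE_LAYOUT.varHandle(MemoryLayout.PathElement.groupElement("errno"));

    // One capture-state segment per thread, reused across calls. A plain ThreadLocal is a
    // simplification; the PR deals with virtual threads being unmounted during the call.
    private static final ThreadLocal<MemorySegment> CAPTURE =
            ThreadLocal.withInitial(() -> Arena.ofAuto().allocate(CAPTURE_LAYOUT));

    // int open(const char *path, int flags);
    private static final MethodHandle OPEN = LINKER.downcallHandle(
            LINKER.defaultLookup().find("open").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_INT),
            Linker.Option.captureCallState("errno"));

    static int open(String path, int flags) throws Throwable {
        MemorySegment capture = CAPTURE.get();
        try (Arena arena = Arena.ofConfined()) {
            int fd = (int) OPEN.invokeExact(capture, arena.allocateFrom(path), flags);
            if (fd < 0) {
                throw new RuntimeException("open failed, errno=" + (int) ERRNO.get(capture, 0L));
            }
            return fd;
        }
    }
}

The interesting part is how that per-thread segment is obtained and reused safely, which is what the PR above addresses.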
We are also exploring a reuse construct that is perhaps even more performant than TerminatingThreadLocal. This may or may not be included in the JDK down the line.
Please take a look at the PR above and feel free to comment there too.
I think there are synergies to harvest here.
Best, Per
________________________________
From: panama-dev <panama-dev-retn at openjdk.org> on behalf of Matthias Ernst <matthias at mernst.org>
Sent: Thursday, January 16, 2025 11:00 AM
To: panama-dev at openjdk.org <panama-dev at openjdk.org>
Subject: [FFM performance] Intermediate buffer allocation when returning structs
Hi, I noticed a source of overhead when calling foreign functions with small aggregate return values.
For example, a function returning a struct Vector2D { double x; double y; } will cause a malloc/free inside the downcall handle on every call. On my machine, this accounts for about 80% of the call overhead.
Choice stack:
java.lang.Thread.State: RUNNABLE
at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
...
at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
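For concreteness, binding such a function with the FFM API might look like this (a minimal sketch; the symbol name vec2d_f and the loader lookup are assumptions):

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

import static java.lang.foreign.ValueLayout.JAVA_DOUBLE;

class Vec2dCall {
    static final StructLayout VECTOR2D = MemoryLayout.structLayout(
            JAVA_DOUBLE.withName("x"), JAVA_DOUBLE.withName("y"));

    // struct Vector2D f(void);
    static final MethodHandle F = Linker.nativeLinker().downcallHandle(
            SymbolLookup.loaderLookup().find("vec2d_f").orElseThrow(),
            FunctionDescriptor.of(VECTOR2D));

    static MemorySegment callF(SegmentAllocator allocator) throws Throwable {
        // 'allocator' provides the 16-byte result segment; the intermediate
        // buffer discussed below is allocated internally on every call.
        return (MemorySegment) F.invokeExact(allocator);
    }
}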
While it might be difficult to eliminate these intermediate buffers, I propose trying to reuse them.
What's happening here:
* the ARM64 ABI returns such a struct in two 128-bit registers v0/v1 [0]
* the VM stub calling convention around this expects an output buffer to copy v0/v1 into: [1]
stub(out) { ... out[0..16) = v0; out[16..32) = v1; }
* the FFM downcall calling convention OTOH expects a user-provided SegmentAllocator to allocate a 16-byte StructLayout(JAVA_DOUBLE, JAVA_DOUBLE). The generated method handle that adapts to the stub looks roughly like this [2]:
ffm(allocator) {
    tmp = malloc(32)
    stub(tmp)
    result = allocator.allocate(16)
    result[0..8) = tmp[0..8)
    result[8..16) = tmp[16..24)
    free(tmp)
    return result
}
Now there's an easy way around this for the user by using a different native signature:
void g(Vector2D *out) { *out = f(); }
This eliminates the intermediate buffer altogether.
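On the Java side, the binding for this out-parameter variant could look roughly as follows (a sketch; the symbol name vec2d_g is an assumption):

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

class Vec2dOutParam {
    // void g(Vector2D *out);
    static final MethodHandle G = Linker.nativeLinker().downcallHandle(
            SymbolLookup.loaderLookup().find("vec2d_g").orElseThrow(),
            FunctionDescriptor.ofVoid(ValueLayout.ADDRESS));

    static void callG(MemorySegment out) throws Throwable {
        // 'out' is a caller-managed 16-byte segment that can be reused across calls.
        G.invokeExact(out);
    }
}

Since the caller owns 'out', it can be allocated once and reused, which is what makes this shape allocation-free per call.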
However, if we wanted to optimize the return-by-value path, I can think of three options:
* enhance the stub calling conventions to directly copy only the narrowed output registers into the result buffer. This looks rather involved.
* allocate the tmp buffer using the user's allocator as well (e.g. in conjunction with the result + slicing). The Linker API is somewhat lenient about exactly how `allocator` will be invoked: "used by the linker runtime to allocate the memory region associated with the struct returned by the downcall method handle". However, this may be surprising to the caller.
* keep the tmp buffer allocation internal, but optimize it. This is what I'm proposing here.
A possible counter-argument could be "this is just one allocation out of two". However, the user has control over `allocator` and may reuse the same segment across calls, whereas they have no control over the tmp allocation.
I've worked on a patch that takes this last route, using a one-element thread-local cache: https://github.com/openjdk/jdk/pull/23142. It reduces call time from 36 ns to 8 ns per op on my machine, and I observe no more GCs.
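To give an idea of the general shape without opening the PR, a one-element thread-local cache could look something like the sketch below (purely illustrative; this is not the patch's code, and the class name, sizing policy and use of Arena.ofAuto are assumptions):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

final class TempBufferCache {
    private static final long CACHE_BYTES = 64;

    // At most one cached segment per thread.
    private static final ThreadLocal<MemorySegment> CACHE = new ThreadLocal<>();

    static MemorySegment acquire(long byteSize) {
        if (byteSize <= CACHE_BYTES) {
            MemorySegment cached = CACHE.get();
            if (cached != null) {
                CACHE.set(null);                    // taken; the slot is empty until release()
                return cached;
            }
            return Arena.ofAuto().allocate(CACHE_BYTES);
        }
        return Arena.ofAuto().allocate(byteSize);   // too large to cache
    }

    static void release(MemorySegment segment) {
        if (segment.byteSize() == CACHE_BYTES && CACHE.get() == null) {
            CACHE.set(segment);                     // keep it for the next downcall on this thread
        }
    }
}

A single element per thread is sufficient in this sketch because the buffer is acquired and released within one downcall.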
Would there be interest in pursuing this?
Thx
Matthias
[0] https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values
[1] https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97
[2] "binding context": https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296