[FFM performance] Intermediate buffer allocation when returning structs
Matthias Ernst
matthias at mernst.org
Fri Jan 17 00:09:17 UTC 2025
Thanks very much for the feedback, Jorn!
I've incorporated the two-element cache and avoid using a shared
session now (Unsafe instead).
I'm not sure about using non-carrier ThreadLocals: a defining quality
of carrier threads is that you can only have as many (root) foreign
function invocations in flight as you have carrier threads, so a
per-carrier-thread cache fits naturally. With virtual threads you might
allocate a great many such buffers for nought.
As to the confined session: it fits nicely into the implementation, but I
observe that it destroys one very nice property of the patch. Without it,
at least my test downcall seems to become allocation-free (I see zero GC
activity in the benchmark), i.e. the "BoundedArena" and the buffer slices
seem to get completely scalar-replaced. As soon as I add a per-call
Arena.ofConfined() into the picture, I see plenty of GC activity and the
call overhead goes up (though still far less than with malloc involved). I
haven't looked in detail into why that might be (I'm not very good with the
EA logs). I could argue this either way, but an allocation-free foreign
call seems like a nice property, and I'm reasonably sure these tmp
buffers cannot escape the call. Could the per-call arena perhaps be
enabled only under a debug flag?
Matthias
On Thu, Jan 16, 2025 at 6:26 PM Jorn Vernee <jorn.vernee at oracle.com> wrote:
> Hello Matthias,
>
> We've been exploring this direction internally as well. As you've found,
> downcall handles/upcall stubs sometimes need to allocate memory. The return
> buffer case that you've run into is one such case; others are: when a
> struct that does not fit into a single register is passed by value on
> Windows, we need to create a copy, and when a struct is passed by value to
> an upcall stub, we need to allocate memory to hold the value.
>
> I took a look at your patch. One of the problems I see with a one-element
> cache is that some upcall stubs might never benefit from it, since a
> preceding downcall has already claimed the cache. That said, I believe a
> chain of downcalls and upcalls is comparatively rare. A two-element cache
> might be better: that way a sequence of downcall -> upcall, where both use
> by-value structs, will be able to benefit from the cache.
>
> Having a cache per carrier thread is probably a good idea. A cache per
> thread is also possibly an option, if the overhead seems acceptable (the
> cache is only initialized for threads that actually call native code after
> all). This would also be a little faster, I think.
>
> One thing that's unfortunate is the use of a shared arena, even in the
> fallback case, since closing that is very slow. Another problem is that
> with your current implementation, we are no longer tracking the lifetime of
> the memory correctly, and it is possible to access memory that was already
> returned to the cache. Using a proper lifetime (i.e. creating/closing a new
> arena per call) has helped to catch bugs in the past. If we want to keep
> doing that, we'd have to re-wrap the memory of the cache with a new arena
> (using MemorySegment::reinterpret), which we then close after a downcall,
> to return elements to the cache. I suggest restructuring the code so that
> it always creates a new confined arena, as today, but then either: 1)
> grabs a memory segment from the cache, and attaches that to the new
> confined arena (using MS::reinterpret), or 2) in the case of a cache miss,
> just allocates a new segment from the confined arena we created.
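>
> In rough code, such a restructuring might look like the sketch below (the
> cache type, its poll/offer methods, and the surrounding variables are
> placeholders, not existing JDK APIs):
>
>     try (Arena callArena = Arena.ofConfined()) {
>         MemorySegment cached = cache.poll(size);   // placeholder cache API
>         MemorySegment buf;
>         if (cached != null) {
>             // attach the cached memory to the per-call arena; the cleanup action
>             // runs when the arena is closed and returns the buffer to the cache
>             buf = cached.reinterpret(size, callArena, seg -> cache.offer(cached));
>         } else {
>             buf = callArena.allocate(size);        // cache miss: allocate as today
>         }
>         // ... perform the downcall using buf as the intermediate buffer ...
>     }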
>
> WDYT?
>
> Jorn
> On 16-1-2025 11:00, Matthias Ernst wrote:
>
> Hi, I noticed a source of overhead when calling foreign functions with
> small aggregate return values.
>
> For example, a function returning a struct Vector2D { double x; double y; }
> will cause a malloc/free inside the downcall handle on every call. On my
> machine, this accounts for about 80% of the call overhead.
>
> Choice stack:
>
> java.lang.Thread.State: RUNNABLE
>     at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
>     ...
>     at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
>     at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
>     at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
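>
> For context, a minimal FFM setup that exhibits this looks roughly as follows
> (a sketch only: the symbol name "f" and the loader lookup are placeholders,
> not taken from my actual benchmark; imports and the Throwable thrown by
> invokeExact are omitted):
>
>     // struct Vector2D { double x; double y; } f(void);
>     StructLayout VECTOR2D = MemoryLayout.structLayout(
>             ValueLayout.JAVA_DOUBLE.withName("x"),
>             ValueLayout.JAVA_DOUBLE.withName("y"));
>     MethodHandle f = Linker.nativeLinker().downcallHandle(
>             SymbolLookup.loaderLookup().find("f").orElseThrow(),
>             FunctionDescriptor.of(VECTOR2D));
>     try (Arena arena = Arena.ofConfined()) {
>         // struct-returning handles take a leading SegmentAllocator argument
>         MemorySegment result = (MemorySegment) f.invokeExact((SegmentAllocator) arena);
>         double x = result.get(ValueLayout.JAVA_DOUBLE, 0);
>         double y = result.get(ValueLayout.JAVA_DOUBLE, 8);
>     }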
>
> While it might be difficult to eliminate these intermediate buffers, I
> would propose to try reusing them.
>
> What's happening here:
> * the ARM64 ABI returns such a struct in two 128 bit registers v0/v1 [0]
> * the VM stub calling convention around this expects an output buffer to
> copy v0/v1 into: [1]
> stub(out) { ... out[0..16) = v0; out[16..32) = v1; }
> * the FFM downcall calling convention OTOH expects a user-provided
> SegmentAllocator to allocate a 16 byte StructLayout(JAVA_DOUBLE,
> JAVA_DOUBLE). The generated method handle to adapt to the stub looks
> roughly like this [2]:
>     ffm(allocator) {
>         tmp = malloc(32)
>         stub(tmp)
>         result = allocator.allocate(16)
>         result[0..8)  = tmp[0..8)
>         result[8..16) = tmp[16..24)
>         free(tmp)
>         return result
>     }
>
> Now there's an easy way around this for the user by using a different
> native signature:
>     void g(Vector2D *out) { *out = f(); }
> This eliminates the intermediate buffer altogether.
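>
> On the Java side such a wrapper could be bound roughly like this (again a
> sketch; "g" and the lookup are placeholders, and VECTOR2D is the struct
> layout from the snippet above):
>
>     MethodHandle g = Linker.nativeLinker().downcallHandle(
>             SymbolLookup.loaderLookup().find("g").orElseThrow(),
>             FunctionDescriptor.ofVoid(ValueLayout.ADDRESS));
>     try (Arena arena = Arena.ofConfined()) {
>         MemorySegment out = arena.allocate(VECTOR2D);  // caller-provided result buffer
>         g.invokeExact(out);                            // no hidden tmp buffer on the Java side
>         double x = out.get(ValueLayout.JAVA_DOUBLE, 0);
>     }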
>
> However, if we wanted to optimize the return-by-value path, I can think of
> three options:
> * enhance the stub calling conventions to directly copy only the narrowed
> output registers into the result buffer. This looks rather involved.
> * allocate the tmp buffer using the user's allocator as well (e.g. in
> conjunction with the result, plus slicing). The Linker API is somewhat
> lenient about exactly how `allocator` will be invoked: "used by the linker
> runtime to allocate the memory region associated with the struct returned
> by the downcall method handle". However, this may be surprising to the
> caller.
> * keep the tmp buffer allocation internal, but optimize it. This is what
> I'm proposing here.
>
> A possible counter-argument could be "this is just one allocation out of
> two". However, the user has control over `allocator` and may re-use the
> same segment across calls, whereas they have no control over the tmp
> allocation.
>
> I've worked on a patch that takes this last route, using a one-element
> thread-local cache: https://github.com/openjdk/jdk/pull/23142. It reduces
> call time from 36 ns to 8 ns per op on my machine, and I observe no more GCs.
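>
> In outline, the caching idea is something like the following (a hypothetical
> sketch only, not the code in the PR; the class and method names are made up):
>
>     final class ReturnBufferCache {
>         private static final ThreadLocal<MemorySegment> CACHED = new ThreadLocal<>();
>
>         static MemorySegment acquire(long size) {
>             MemorySegment seg = CACHED.get();
>             if (seg != null && seg.byteSize() >= size) {
>                 CACHED.set(null);                  // claim it; a nested call simply misses
>                 return seg;
>             }
>             return Arena.ofAuto().allocate(size);  // cache miss: GC-managed fallback
>         }
>
>         static void release(MemorySegment seg) {
>             CACHED.set(seg);                       // keep for the next call on this thread
>         }
>     }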
>
> Would there be interest in pursuing this?
>
> Thx
> Matthias
>
>
> [0]
> https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values
> [1]
> https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97
> [2] "binding context":
> https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296
>
>