[FFM performance] Intermediate buffer allocation when returning structs
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Fri Jan 17 13:48:20 UTC 2025
Thanks Matthias, adding loom-dev
Maurizio
On 17/01/2025 13:37, Matthias Ernst wrote:
>
>
> On Fri, Jan 17, 2025 at 1:21 PM Matthias Ernst <matthias at mernst.org>
> wrote:
>
> On Fri, Jan 17, 2025 at 1:09 AM Matthias Ernst
> <matthias at mernst.org> wrote:
>
> Thanks very much for the feedback, Jorn!
> I've incorporated the two-element cache, and avoid using a
> shared session now (Unsafe instead).
>
> I'm not sure about using non-carrier ThreadLocals. I think a
> defining quality of carrier threads is that you can only have
> as many (root) foreign function invocations in flight as you
> have carrier threads, so a carrier-local cache is fitting. With
> plain thread-locals and virtual threads you might allocate any
> number of such buffers for nought.
>
> As to the confined session: it fits nicely into the
> implementation, but I observe that it destroys one very nice
> property of the patch: without it, at least my test downcall
> seems to become allocation-free (I see zero GC activity in the
> benchmark), i.e. the "BoundedArena" and the buffer slices seem
> to get completely scalar-replaced. As soon as I add a per-call
> Arena.ofConfined() into the picture, I see plenty of GC
> activity and the call overhead goes up (though still way less
> than with malloc involved). I haven't looked in detail into
> why that might be (I'm not very good with the EA logs). I
> could argue this either way, but an allocation-free foreign
> call seems like a nice property, and I'm reasonably sure
> these tmp buffers cannot escape the call. Is that maybe
> something that could be enabled only with a debug flag?
>
>
> I looked at this in more detail. The great news: I got the
> confined session on top of the carrier-local cache to be properly
> scalar-replaced. This now does everything we want: lock-free
> buffer acquisition, two cache entries, confinement while borrowed
> from the cache, and everything allocation-free at an 8 ns roundtrip.
> I've updated the branch accordingly.
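>
> The shape is roughly the following (a simplified sketch, not the
> exact code in the PR: the real version uses a carrier-thread-local
> rather than a plain ThreadLocal, allocates the buffers off-heap via
> Unsafe, and pins the continuation while a buffer is borrowed):
>
>     import java.lang.foreign.Arena;
>     import java.lang.foreign.MemorySegment;
>
>     final class BufferCache {
>         static final long CACHED_BUFFER_SIZE = 256;   // illustrative size
>
>         // two slots, so that a downcall and a nested upcall can both hit the cache
>         private static final ThreadLocal<MemorySegment[]> SLOTS =
>                 ThreadLocal.withInitial(() -> new MemorySegment[2]);
>
>         // borrow a buffer: take a cached one if available, otherwise allocate
>         static MemorySegment acquire() {
>             MemorySegment[] slots = SLOTS.get();
>             for (int i = 0; i < slots.length; i++) {
>                 MemorySegment s = slots[i];
>                 if (s != null) {
>                     slots[i] = null;
>                     return s;
>                 }
>             }
>             return Arena.global().allocate(CACHED_BUFFER_SIZE, 16);
>         }
>
>         // return a buffer to an empty slot; drop it if both slots are occupied
>         static void release(MemorySegment segment) {
>             MemorySegment[] slots = SLOTS.get();
>             for (int i = 0; i < slots.length; i++) {
>                 if (slots[i] == null) {
>                     slots[i] = segment;
>                     return;
>                 }
>             }
>         }
>     }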
>
> The less great news: I seem to be running into a bug in escape
> analysis.
>
>
> That one was easy to repro: calling Continuation.pin/unpin in a
> constructor seems to confuse escape analysis. Please see:
> https://github.com/mernst-github/repro/tree/main/escape-analysis
>
> Depending on where I place the ".ofConfined()" call in the
> BoundedArena constructor I get either
>
> * proper scalar replacement, but a crash in fastdebug:
>
>       # Internal Error
>       (/Users/mernst/IdeaProjects/jdk/src/hotspot/share/opto/escape.cpp:4767),
>       pid=85070, tid=26115
>       # assert(false) failed: EA: missing memory path
>
> or
>
> * fastdebug works, but fails to scalar-replace the confined session.
>
> See comments in
> https://github.com/openjdk/jdk/pull/23142/files#diff-80b3987494fdd3ed20ced0248adbf6097432e24db8a2fb8476bbf2143bd0a2c3R401-R409:
>
>     public BoundedArena(long size) {
>         // When here, works in fastdebug, but not scalar-replaced:
>         // scope = Arena.ofConfined(); <======================
>
>         MemorySegment cached = size <= BufferCache.CACHED_BUFFER_SIZE
>                 ? BufferCache.acquire() : null;
>
>         // When here, works in release build, but fastdebug crashes:
>         // # Internal Error (/Users/mernst/IdeaProjects/jdk/src/hotspot/share/opto/escape.cpp:4767), pid=85070, tid=26115
>         // # assert(false) failed: EA: missing memory path
>         scope = Arena.ofConfined(); <======================
>
> Crash logs are attached.
>
>
>
> Matthias
>
>
> On Thu, Jan 16, 2025 at 6:26 PM Jorn Vernee
> <jorn.vernee at oracle.com> wrote:
>
> Hello Matthias,
>
> We've been exploring this direction internally as well. As
> you've found, downcall handles/upcall stubs sometimes need
> to allocate memory. The return buffer case that you've run
> into is one such case; others are: when a struct that does
> not fit into a single register is passed by value on
> Windows, we need to create a copy, and when a struct is
> passed by value to an upcall stub, we need to allocate
> memory to hold the value.
>
> I took a look at your patch. One of the problems I see
> with a one-element cache is that some upcall stubs might
> never benefit from it, since a preceding downcall already
> claimed the cache. Though, I believe a chain of downcalls
> and upcalls is comparatively rare. A two-element cache
> might be better: that way a downcall -> upcall sequence
> in which both use by-value structs will still be able to
> benefit from the cache.
>
> Having a cache per carrier thread is probably a good idea.
> A cache per thread is also possibly an option, if the
> overhead seems acceptable (the cache is only initialized
> for threads that actually call native code after all).
> This would also be a little faster, I think.
>
> One thing that's unfortunate is the use of a shared arena,
> even in the fallback case, since closing that is very
> slow. Another problem is that with your current
> implementation, we are no longer tracking the lifetime of
> the memory correctly, and it is possible to access memory
> that was already returned to the cache. Using a proper
> lifetime (i.e. creating/closing a new arena per call) has
> helped to catch bugs in the past. If we want to keep doing
> that, we'd have to re-wrap the memory of the cache with a
> new arena (using MemorySegment::reinterpret), which we
> then close after a downcall, to return elements to the
> cache. I suggest restructuring the code so that it always
> creates a new confined arena, as today, but then either:
> 1) grabs a memory segment from the cache, and attaches
> that to the new confined arena (using MS::reinterpret), or
> 2) in the case of a cache miss, just allocates a new
> segment from the confined arena we created (roughly as
> sketched below).
>
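> In code, the per-downcall part would look roughly like this (just a
> sketch: `size` is the required buffer size, BufferCache stands in for
> whatever the cache ends up looking like, with acquire() returning
> null on a miss, and MemorySegment::reinterpret is a restricted
> method):
>
>     try (Arena arena = Arena.ofConfined()) {
>         MemorySegment cached = size <= BufferCache.CACHED_BUFFER_SIZE
>                 ? BufferCache.acquire() : null;
>         MemorySegment buffer = cached != null
>                 // re-wrap the cached memory so its lifetime is the confined
>                 // arena's; closing the arena returns it to the cache
>                 ? cached.reinterpret(size, arena, s -> BufferCache.release(cached))
>                 // cache miss: just allocate from the confined arena
>                 : arena.allocate(size);
>         // ... perform the downcall using 'buffer' ...
>     }
>
> That way every access still goes through a confined, per-call
> lifetime, and closing the arena is what returns the buffer to the
> cache.
>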
> WDYT?
>
> Jorn
>
> On 16-1-2025 11:00, Matthias Ernst wrote:
>> Hi, I noticed a source of overhead when calling foreign
>> functions with small aggregate return values.
>>
>> For example, a function returning a struct Vector2D {
>> double x; double y } will cause a malloc/free inside the
>> downcall handle on every call. On my machine, this
>> accounts for about 80% of the call overhead.
>>
>> Choice stack:
>>     java.lang.Thread.State: RUNNABLE
>>         at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
>>         ...
>>         at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
>>         at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
>>         at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
>> While it might be difficult to eliminate these
>> intermediate buffers, I would propose to try reusing them.
>>
>> What's happening here:
>> * the ARM64 ABI returns such a struct in two 128 bit
>> registers v0/v1 [0]
>> * the VM stub calling convention around this expects an
>> output buffer to copy v0/v1 into [1]:
>>
>>     stub(out) { ... out[0..16) = v0; out[16..32) = v1; }
>>
>> * the FFM downcall calling convention OTOH expects a
>> user-provided SegmentAllocator to allocate a 16 byte
>> StructLayout(JAVA_DOUBLE, JAVA_DOUBLE). The generated
>> method handle to adapt to the stub looks roughly like
>> this [2]:
>>
>>     ffm(allocator) {
>>         tmp = malloc(32)
>>         stub(tmp)
>>         result = allocator.allocate(16)
>>         result[0..8) = tmp[0..8)
>>         result[8..16) = tmp[16..24)
>>         free(tmp)
>>         return result
>>     }
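>>
>> (For context, a downcall that ends up on this path can be set up
>> roughly like this; a sketch, with the symbol lookup of "f" and the
>> error handling purely illustrative:)
>>
>>     import java.lang.foreign.*;
>>     import java.lang.invoke.MethodHandle;
>>     import static java.lang.foreign.ValueLayout.JAVA_DOUBLE;
>>
>>     StructLayout VECTOR2D = MemoryLayout.structLayout(
>>             JAVA_DOUBLE.withName("x"), JAVA_DOUBLE.withName("y"));
>>
>>     Linker linker = Linker.nativeLinker();
>>     MethodHandle f = linker.downcallHandle(
>>             SymbolLookup.loaderLookup().find("f").orElseThrow(),
>>             FunctionDescriptor.of(VECTOR2D));   // small struct returned by value
>>
>>     try (Arena arena = Arena.ofConfined()) {
>>         // the leading SegmentAllocator argument only controls the result
>>         // segment; the 32-byte tmp buffer is allocated internally per call
>>         MemorySegment v = (MemorySegment) f.invokeExact((SegmentAllocator) arena);
>>         double x = v.get(JAVA_DOUBLE, 0);
>>         double y = v.get(JAVA_DOUBLE, 8);
>>     }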
>>
>> Now there's an easy way around this for the user by using
>> a different native signature:
>> void g(Vector2D *out) { *out = f(); }
>> This eliminates the intermediate buffer altogether.
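>>
>> In FFM terms that alternative binding looks roughly like this
>> (sketch, reusing the linker and the VECTOR2D layout from the
>> snippet above):
>>
>>     MethodHandle g = linker.downcallHandle(
>>             SymbolLookup.loaderLookup().find("g").orElseThrow(),
>>             FunctionDescriptor.ofVoid(ValueLayout.ADDRESS));
>>
>>     try (Arena arena = Arena.ofConfined()) {
>>         MemorySegment out = arena.allocate(VECTOR2D);   // caller-owned, the only buffer
>>         g.invokeExact(out);
>>         double x = out.get(JAVA_DOUBLE, 0);
>>         double y = out.get(JAVA_DOUBLE, 8);
>>     }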
>>
>> However, if we wanted to optimize the return-by-value
>> path, I can think of three options:
>> * enhance the stub calling conventions to directly copy
>> only the narrowed output registers into the result
>> buffer. This looks rather involved.
>> * allocate the tmp buffer using the user's allocator as
>> well (e.g. in conjunction with the result + slicing; see
>> the sketch after this list). The Linker API is somewhat
>> lenient about how exactly `allocator` will be invoked:
>> "used by the linker runtime to allocate the memory region
>> associated with the struct returned by the downcall method
>> handle". However, this may be surprising to the caller.
>> * keep the tmp buffer allocation internal, but optimize
>> it. This is what I'm proposing here.
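>>
>> (The second option could look roughly like the following; just a
>> sketch of the idea with illustrative names, where `stub` stands in
>> for the VM stub invoker, not how the binding specializer actually
>> emits it:)
>>
>>     // one allocation from the user's allocator holds both the user-visible
>>     // result and the raw 32-byte register buffer that the stub fills
>>     static MemorySegment adapt(SegmentAllocator allocator, MethodHandle stub)
>>             throws Throwable {
>>         MemorySegment block = allocator.allocate(16 + 32, 16);
>>         MemorySegment result = block.asSlice(0, 16);
>>         MemorySegment tmp = block.asSlice(16, 32);
>>         stub.invokeExact(tmp);                        // stub writes v0/v1 into tmp
>>         MemorySegment.copy(tmp, 0, result, 0, 8);     // x
>>         MemorySegment.copy(tmp, 16, result, 8, 8);    // y
>>         return result;
>>     }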
>>
>> A possible counter-argument could be "this is just one
>> allocation out of two". However, the user has control
>> over `allocator`, and may re-use the same segment across
>> calls, but they have no control over the tmp allocation.
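>>
>> (For example, something like this lets the caller recycle one result
>> buffer across calls; a sketch, reusing the handle `f` and the
>> VECTOR2D layout from the first snippet:)
>>
>>     MemorySegment resultBuf = Arena.ofAuto().allocate(VECTOR2D);
>>     SegmentAllocator recycling = SegmentAllocator.prefixAllocator(resultBuf);
>>     for (int i = 0; i < 1_000; i++) {
>>         // each call writes its result into resultBuf (no per-call result
>>         // allocation), but the internal tmp buffer is still malloc'ed/freed
>>         MemorySegment v = (MemorySegment) f.invokeExact(recycling);
>>         double x = v.get(JAVA_DOUBLE, 0);
>>     }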
>>
>> I've worked on a patch that takes this last route, using
>> a one-element thread-local cache:
>> https://github.com/openjdk/jdk/pull/23142. It reduces
>> call time from 36 ns to 8 ns per op on my machine, and I
>> observe no more GCs.
>>
>> Would there be interest in pursuing this?
>>
>> Thx
>> Matthias
>>
>>
>> [0]
>> https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values
>> [1]
>> https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97
>> [2] "binding context":
>> https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296
>