[FFM performance] Intermediate buffer allocation when returning structs

Matthias Ernst matthias at mernst.org
Fri Jan 17 13:37:07 UTC 2025


On Fri, Jan 17, 2025 at 1:21 PM Matthias Ernst <matthias at mernst.org> wrote:

> On Fri, Jan 17, 2025 at 1:09 AM Matthias Ernst <matthias at mernst.org>
> wrote:
>
>> Thanks very much for the feedback, Jorn!
>> I've incorporated the two-element cache and now avoid using a shared
>> session (I allocate via Unsafe instead).
>>
>> I'm not sure about using non-carrier ThreadLocals. A defining property
>> here is that there can only be as many (root) foreign function
>> invocations in flight as there are carrier threads, so a carrier-local
>> cache fits naturally. With virtual threads you might allocate a great
>> many such buffers for naught.
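>>
>> For the record, the cache has roughly the following shape (a sketch, not
>> the exact patch: it uses a plain ThreadLocal where the real code would
>> use the JDK-internal carrier-thread-local mechanism, and UNSAFE stands
>> for jdk.internal.misc.Unsafe):
>>
>>     // Two-slot cache of fixed-size native buffers, one per thread.
>>     static final class BufferCache {
>>         static final long CACHED_BUFFER_SIZE = 256;
>>         private static final ThreadLocal<BufferCache> CACHE =
>>                 ThreadLocal.withInitial(BufferCache::new);
>>         private MemorySegment slot0, slot1;
>>
>>         static MemorySegment acquire() {
>>             BufferCache c = CACHE.get();
>>             MemorySegment s;
>>             if (c.slot0 != null)      { s = c.slot0; c.slot0 = null; }
>>             else if (c.slot1 != null) { s = c.slot1; c.slot1 = null; }
>>             else {
>>                 // miss: back a fresh segment with raw memory
>>                 long addr = UNSAFE.allocateMemory(CACHED_BUFFER_SIZE);
>>                 s = MemorySegment.ofAddress(addr).reinterpret(CACHED_BUFFER_SIZE);
>>             }
>>             return s;
>>         }
>>
>>         static void release(MemorySegment s) {
>>             BufferCache c = CACHE.get();
>>             if (c.slot0 == null)      c.slot0 = s;
>>             else if (c.slot1 == null) c.slot1 = s;
>>             else UNSAFE.freeMemory(s.address()); // both slots taken: drop
>>         }
>>     }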
>>
>> As to the confined session: it fits nicely into the implementation, but
>> I observe that it destroys one very nice property of the patch. Without
>> it, at least my test downcall becomes allocation-free (I see zero GC
>> activity in the benchmark), i.e. the "BoundedArena" and the buffer
>> slices appear to be completely scalar-replaced. As soon as I add a
>> per-call Arena.ofConfined() into the picture, I see plenty of GC
>> activity and the call overhead goes up (though still way less than with
>> malloc involved). I haven't looked in detail into why that might be
>> (I'm not very good with the EA logs). I could argue this either way,
>> but an allocation-free foreign call seems like a nice property, and I'm
>> reasonably sure these tmp buffers cannot escape the call. Is that maybe
>> something that could be enabled only with a debug flag?
>>
>
> I looked at this in more detail. The great news: I got the confined
> session on top of the carrier-local cache to be properly scalar-replaced.
> This now does everything we want: lock-free buffer acquisition, two cache
> entries, confinement while borrowed from the cache, and everything
> allocation-free at an 8ns roundtrip. I've updated the branch accordingly.
>
> The less great news: I seem to be running into a bug in escape analysis.
>

That one was easy to repro: calling Continuation.pin/unpin in a constructor
seems to confuse escape analysis. Please see:
https://github.com/mernst-github/repro/tree/main/escape-analysis
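
The core of it looks roughly like this (a sketch; the actual repro is in
the linked repo, and Continuation.pin()/unpin() refer to the static
methods on jdk.internal.vm.Continuation):

    // A non-escaping object whose constructor pins the current
    // continuation. One would expect EA to scalar-replace it, but the
    // pin/unpin calls appear to prevent that.
    static final class Pinned implements AutoCloseable {
        Pinned() { Continuation.pin(); }
        @Override public void close() { Continuation.unpin(); }
    }

    static long work() {
        try (Pinned p = new Pinned()) { // never escapes work()
            return 42;
        }
    }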


> Depending on where I place the ".ofConfined()" call in the BoundedArena
> constructor, I get either proper scalar replacement but a crash in
> fastdebug:
>
>   #  Internal Error
>   (/Users/mernst/IdeaProjects/jdk/src/hotspot/share/opto/escape.cpp:4767),
>   pid=85070, tid=26115
>   #  assert(false) failed: EA: missing memory path
>
> or a fastdebug build that works but fails to scalar-replace the confined
> session.
>
> See comments in
> https://github.com/openjdk/jdk/pull/23142/files#diff-80b3987494fdd3ed20ced0248adbf6097432e24db8a2fb8476bbf2143bd0a2c3R401-R409
> :
>
>         public BoundedArena(long size) {
>             // When here, works in fastdebug, but is not scalar-replaced:
>             //  scope = Arena.ofConfined();    <======================
>
>             MemorySegment cached = size <= BufferCache.CACHED_BUFFER_SIZE
>                     ? BufferCache.acquire() : null;
>
>             // When here, works in a release build, but fastdebug crashes
>             // with the "EA: missing memory path" assert shown above:
>             scope = Arena.ofConfined();    // <======================
>
> Crash logs are attached.
>
>
>
>> Matthias
>>
>>
>> On Thu, Jan 16, 2025 at 6:26 PM Jorn Vernee <jorn.vernee at oracle.com>
>> wrote:
>>
>>> Hello Matthias,
>>>
>>> We've been exploring this direction internally as well. As you've found,
>>> downcall handles/upcall stubs sometimes need to allocate memory. The
>>> return buffer case that you've run into is one such case; others are:
>>> when a struct that does not fit into a single register is passed by value
>>> on Windows, we need to create a copy, and when a struct is passed by
>>> value to an upcall stub, we need to allocate memory to hold the value.
>>>
>>> I took a look at your patch. One of the problems I see with a
>>> one-element cache is that some upcall stubs might never benefit from it,
>>> since a preceding downcall already claimed the cache. Though, I believe a
>>> chain of downcalls and upcalls is comparatively rare. A two element cache
>>> might be better. That way a sequence of downcall -> upcall, that both use
>>> by-value structs, will be able to benefit from the cache.
>>>
>>> Having a cache per carrier thread is probably a good idea. A cache per
>>> thread is also possibly an option, if the overhead seems acceptable (the
>>> cache is only initialized for threads that actually call native code after
>>> all). This would also be a little faster, I think.
>>>
>>> One thing that's unfortunate is the use of a shared arena, even in the
>>> fallback case, since closing that is very slow. Another problem is that
>>> with your current implementation, we are no longer tracking the lifetime of
>>> the memory correctly, and it is possible to access memory that was already
>>> returned to the cache. Using a proper lifetime (i.e. creating/closing a new
>>> arena per call) has helped to catch bugs in the past. If we want to keep
>>> doing that, we'd have to re-wrap the memory of the cache with a new arena
>>> (using MemorySegment::reinterpret), which we then close after a downcall,
>>> to return elements to the cache. I suggest restructuring the code so that
>>> it always creates a new confined arena, as today, but then either: 1)
>>> grabs a memory segment from the cache and attaches it to the new confined
>>> arena (using MemorySegment::reinterpret), or 2) in the case of a cache
>>> miss, just allocates a new segment from the confined arena we created.
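>>>
>>> Concretely, I'm thinking of something along these lines (a sketch only;
>>> the BufferCache names are from your patch, and the reinterpret overload
>>> is the standard java.lang.foreign one):
>>>
>>>     Arena arena = Arena.ofConfined();
>>>     MemorySegment cached = BufferCache.acquire();
>>>     MemorySegment buf = cached != null
>>>             // bind the cached memory to the per-call arena; the cleanup
>>>             // action returns it to the cache when the arena is closed
>>>             ? cached.reinterpret(size, arena, seg -> BufferCache.release(cached))
>>>             // cache miss: plain allocation with the same lifetime
>>>             : arena.allocate(size);
>>>     try {
>>>         // ... perform the downcall using buf ...
>>>     } finally {
>>>         arena.close(); // invalidates buf, returns cached memory
>>>     }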
>>>
>>> WDYT?
>>>
>>> Jorn
>>> On 16-1-2025 11:00, Matthias Ernst wrote:
>>>
>>> Hi, I noticed a source of overhead when calling foreign functions with
>>> small aggregate return values.
>>>
>>> For example, a function returning a struct Vector2D { double x; double
>>> y; } will cause a malloc/free inside the downcall handle on every call.
>>> On my machine, this accounts for about 80% of the call overhead.
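>>>
>>> For concreteness, the shape of such a call is roughly the following
>>> sketch (the symbol name "f", the lookup, and `arena` are illustrative):
>>>
>>>     StructLayout VECTOR2D = MemoryLayout.structLayout(
>>>             ValueLayout.JAVA_DOUBLE.withName("x"),
>>>             ValueLayout.JAVA_DOUBLE.withName("y"));
>>>     MethodHandle f = Linker.nativeLinker().downcallHandle(
>>>             SymbolLookup.loaderLookup().find("f").orElseThrow(),
>>>             FunctionDescriptor.of(VECTOR2D));
>>>     // Struct-returning handles take a leading SegmentAllocator that
>>>     // allocates the result; the tmp buffer below is purely internal.
>>>     MemorySegment v = (MemorySegment) f.invokeExact((SegmentAllocator) arena);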
>>>
>>> Choice stack:
>>>
>>>    java.lang.Thread.State: RUNNABLE
>>>         at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
>>>         ...
>>>         at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
>>>         at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
>>>         at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
>>>
>>> While it might be difficult to eliminate these intermediate buffers, I
>>> would propose to try reusing them.
>>>
>>> What's happening here:
>>> * the ARM64 ABI returns such a struct in two 128-bit registers v0/v1 [0]
>>> * the VM stub calling convention around this expects an output buffer to
>>> copy v0/v1 into [1]:
>>>   stub(out) { ... out[0..16) = v0; out[16..32) = v1; }
>>> * the FFM downcall calling convention OTOH expects a user-provided
>>> SegmentAllocator to allocate a 16-byte StructLayout(JAVA_DOUBLE,
>>> JAVA_DOUBLE). The generated method handle that adapts to the stub looks
>>> roughly like this [2]:
>>>
>>>   ffm(allocator) {
>>>     tmp = malloc(32)
>>>     stub(tmp)
>>>     result = allocator.allocate(16)
>>>     result[0..8)  = tmp[0..8)    // low 8 bytes of v0
>>>     result[8..16) = tmp[16..24)  // low 8 bytes of v1
>>>     free(tmp)
>>>     return result
>>>   }
>>>
>>> Now there's an easy way around this for the user, by choosing a different
>>> native signature:
>>>   void g(Vector2D *out) { *out = f(); }
>>> This eliminates the intermediate buffer altogether.
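>>>
>>> (On the Java side, such a wrapper would be linked roughly like the
>>> following sketch - the symbol name "g" and the lookup are illustrative:
>>>
>>>     MethodHandle g = Linker.nativeLinker().downcallHandle(
>>>             SymbolLookup.loaderLookup().find("g").orElseThrow(),
>>>             FunctionDescriptor.ofVoid(ValueLayout.ADDRESS));
>>>     try (Arena arena = Arena.ofConfined()) {
>>>         MemorySegment out = arena.allocate(16); // sizeof(Vector2D)
>>>         g.invokeExact(out);
>>>         double x = out.get(ValueLayout.JAVA_DOUBLE, 0);
>>>         double y = out.get(ValueLayout.JAVA_DOUBLE, 8);
>>>     }
>>>
>>> No temporary buffer is involved: the caller-provided segment is passed
>>> straight through to the stub.)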
>>>
>>> However, if we wanted to optimize the return-by-value path, I can think
>>> of three options:
>>> * enhance the stub calling conventions to copy only the narrowed output
>>> registers directly into the result buffer. This looks rather involved.
>>> * allocate the tmp buffer using the user's allocator as well (e.g. in
>>> conjunction with the result, plus slicing). The Linker API is somewhat
>>> lenient about how `allocator` will be invoked exactly: "used by the linker
>>> runtime to allocate the memory region associated with the struct returned
>>> by the downcall method handle". However, this may be surprising to the
>>> caller.
>>> * keep the tmp buffer allocation internal, but optimize it. This is what
>>> I'm proposing here.
>>>
>>> A possible counter-argument could be "this is just one allocation out of
>>> two". However, the user has control over `allocator` and may reuse the
>>> same segment across calls, whereas they have no control over the tmp
>>> allocation.
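>>>
>>> (A caller can already recycle the result memory today, e.g. with the
>>> standard SegmentAllocator.prefixAllocator - `f` and `arena` as in the
>>> sketch above:
>>>
>>>     // Allocates every result at the start of the same segment.
>>>     MemorySegment result = arena.allocate(16);
>>>     SegmentAllocator recycling = SegmentAllocator.prefixAllocator(result);
>>>     MemorySegment v = (MemorySegment) f.invokeExact(recycling);
>>>
>>> Nothing comparable exists for the internal tmp buffer.)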
>>>
>>> I've worked on a patch that takes this last route, using a one-element
>>> thread-local cache: https://github.com/openjdk/jdk/pull/23142. It
>>> reduces call time from 36ns to 8ns per op on my machine, and I observe
>>> no more GCs.
>>>
>>> Would there be interest in pursuing this?
>>>
>>> Thx
>>> Matthias
>>>
>>>
>>> [0]
>>> https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values
>>> [1]
>>> https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97
>>> [2] "binding context":
>>> https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296
>>>
>>>