[FFM performance] Intermediate buffer allocation when returning structs

Matthias Ernst matthias at mernst.org
Fri Jan 17 12:21:35 UTC 2025


On Fri, Jan 17, 2025 at 1:09 AM Matthias Ernst <matthias at mernst.org> wrote:

> Thanks very much for the feedback, Jorn!
> I've incorporated the two-element cache, and avoid using a shared
> session now (Unsafe instead).
>
> I'm not sure about using non-carrier ThreadLocals: a defining quality of
> carrier threads is that you can only have as many (root) foreign function
> invocations in flight as you have carrier threads, so a carrier-local cache
> is fitting. With virtual threads you might allocate xxxx such buffers for
> nought.
>
> As to the confined session: it fits nicely into the implementation, but I
> observe that it destroys one very nice property of the patch: without it,
> at least my test downcall seems to become allocation-free (I see zero GC
> activity in the benchmark), i.e. the "BoundedArena" and the buffer slices
> seem to get completely scalar-replaced. As soon as I add a per-call
> Arena.ofConfined() into the picture, I see plenty of GC activity and the
> call-overhead goes up (but still way less than with malloc involved). I
> haven't looked in detail into why that might be (I'm not very good with the
> EA logs). I could argue this either way, but an allocation-free foreign
> call seems like a nice property, and I'm reasonably sure these tmp buffers
> cannot escape the call. Is the confined session maybe something that could
> be enabled only with a debug flag?
>

I looked at this in more detail. The great news: I got the confined session
on top of the carrier-local cache to be properly scalar-replaced. This now
does everything we want: lock-free buffer acquisition, two cache entries,
confinement while borrowed from the cache, and everything allocation-free at
an 8 ns round trip. I've updated the branch accordingly.
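
In rough terms, the cache has this shape (a simplified, self-contained
sketch only, not the code in the branch: a plain ThreadLocal and
Arena.global() stand in here for the carrier-thread-local and the Unsafe
allocation mentioned above, the lock-free acquire is elided, and all names
are illustrative):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;

    // Sketch: a per-thread, two-slot cache of fixed-size scratch buffers.
    final class BufferCacheSketch {
        static final long CACHED_BUFFER_SIZE = 256;  // illustrative threshold

        private static final class Slots {
            MemorySegment slot0, slot1;   // only touched by the owning thread here
        }

        // Initialized lazily, only for threads that actually reach a downcall.
        private static final ThreadLocal<Slots> CACHE = ThreadLocal.withInitial(() -> {
            Slots s = new Slots();
            s.slot0 = Arena.global().allocate(CACHED_BUFFER_SIZE);
            s.slot1 = Arena.global().allocate(CACHED_BUFFER_SIZE);
            return s;
        });

        // Borrow a cached buffer, or null if both slots are already borrowed
        // (the caller then falls back to a fresh allocation).
        static MemorySegment acquire() {
            Slots s = CACHE.get();
            MemorySegment m;
            if ((m = s.slot0) != null) { s.slot0 = null; return m; }
            if ((m = s.slot1) != null) { s.slot1 = null; return m; }
            return null;
        }

        // Return a previously borrowed buffer to a free slot.
        static void release(MemorySegment m) {
            Slots s = CACHE.get();
            if (s.slot0 == null) { s.slot0 = m; }
            else if (s.slot1 == null) { s.slot1 = m; }
        }
    }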

The less great news: I seem to be running into a bug in escape analysis.
Depending on where I place the ".ofConfined()" call in the BoundedArena
constructor, I get either:
  proper scalar replacement, but a crash in fastdebug:
    #  Internal Error (/Users/mernst/IdeaProjects/jdk/src/hotspot/share/opto/escape.cpp:4767), pid=85070, tid=26115
    #  assert(false) failed: EA: missing memory path
OR
  fastdebug works, but fails to scalar-replace the confined session.

See the comments in
https://github.com/openjdk/jdk/pull/23142/files#diff-80b3987494fdd3ed20ced0248adbf6097432e24db8a2fb8476bbf2143bd0a2c3R401-R409:

        public BoundedArena(long size) {
            // When here, works in fastdebug, but not scalar-replaced:
            //  scope = Arena.ofConfined();    <======================

            MemorySegment cached = size <= BufferCache.CACHED_BUFFER_SIZE ? BufferCache.acquire() : null;

            // When here, works in the release build, but fastdebug crashes:
            // #  Internal Error (/Users/mernst/IdeaProjects/jdk/src/hotspot/share/opto/escape.cpp:4767), pid=85070, tid=26115
            // #  assert(false) failed: EA: missing memory path
            scope = Arena.ofConfined();    // <======================
Crash logs are attached.



> Matthias
>
>
> On Thu, Jan 16, 2025 at 6:26 PM Jorn Vernee <jorn.vernee at oracle.com>
> wrote:
>
>> Hello Matthias,
>>
>> We've been exploring this direction internally as well. As you've found,
>> downcall handles/upcall stubs sometimes need to allocate memory. The return
>> buffer case that you've run into is one such case; others are: when a
>> struct that does not fit into a single register is passed by value on
>> Windows, we need to create a copy, and when a struct is passed by value to
>> an upcall stub, we need to allocate memory to hold the value.
>>
>> I took a look at your patch. One of the problems I see with a one-element
>> cache is that some upcall stubs might never benefit from it, since a
>> preceding downcall already claimed the cache. Though, I believe a chain of
>> downcalls and upcalls is comparatively rare. A two element cache might be
>> better. That way a sequence of downcall -> upcall, that both use by-value
>> structs, will be able to benefit from the cache.
>>
>> Having a cache per carrier thread is probably a good idea. A cache per
>> thread is also possibly an option, if the overhead seems acceptable (the
>> cache is only initialized for threads that actually call native code after
>> all). This would also be a little faster, I think.
>>
>> One thing that's unfortunate is the use of a shared arena, even in the
>> fallback case, since closing that is very slow. Another problem is that
>> with your current implementation, we are no longer tracking the lifetime of
>> the memory correctly, and it is possible to access memory that was already
>> returned to the cache. Using a proper lifetime (i.e. creating/closing a new
>> arena per call) has helped to catch bugs in the past. If we want to keep
>> doing that, we'd have to re-wrap the memory of the cache with a new arena
>> (using MemorySegment::reinterpret), which we then close after a downcall,
>> to return elements to the cache. I suggest restructuring the code so that
>> it always creates a new confined arena, as today, but then either: 1) grabs
>> a memory segment from the cache and attaches that to the new confined arena
>> (using MS::reinterpret), or 2) in the case of a cache miss, just allocates a
>> new segment from the confined arena we created.
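>>
>> Roughly this shape, for illustration only (BufferCache.acquire/release and
>> `size` are hypothetical placeholders, not actual code):
>>
>>     Arena arena = Arena.ofConfined();
>>     MemorySegment cached = BufferCache.acquire();      // null on a cache miss
>>     MemorySegment buf = (cached != null)
>>             // re-wrap the cached memory so it is only accessible through,
>>             // and only for the lifetime of, this confined arena; closing the
>>             // arena runs the cleanup and returns the segment to the cache
>>             ? cached.reinterpret(size, arena, seg -> BufferCache.release(cached))
>>             // cache miss: just allocate from the confined arena, as today
>>             : arena.allocate(size);
>>     try {
>>         // ... perform the downcall using buf ...
>>     } finally {
>>         arena.close();
>>     }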
>>
>> WDYT?
>>
>> Jorn
>> On 16-1-2025 11:00, Matthias Ernst wrote:
>>
>> Hi, I noticed a source of overhead when calling foreign functions with
>> small aggregate return values.
>>
>> For example, a function returning a struct Vector2D { double x; double y; }
>> will cause a malloc/free inside the downcall handle on every call. On
>> my machine, this accounts for about 80% of the call overhead.
>>
>> Choice stack:
>>
>>    java.lang.Thread.State: RUNNABLE
>>      at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
>>      ...
>>      at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
>>      at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
>>      at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
>>
>> While it might be difficult to eliminate these intermediate buffers, I
>> would propose to try reusing them.
>>
>> What's happening here:
>> * the ARM64 ABI returns such a struct in two 128-bit registers v0/v1 [0]
>> * the VM stub calling convention around this expects an output buffer to
>>   copy v0/v1 into [1]:
>>     stub(out) { ... out[0..16) = v0; out[16..32) = v1; }
>> * the FFM downcall calling convention OTOH expects a user-provided
>>   SegmentAllocator to allocate a 16-byte StructLayout(JAVA_DOUBLE,
>>   JAVA_DOUBLE). The generated method handle to adapt to the stub looks
>>   roughly like this [2]:
>>     ffm(allocator) {
>>       tmp = malloc(32)
>>       stub(tmp)
>>       result = allocator.allocate(16)
>>       result[0..8)  = tmp[0..8)
>>       result[8..16) = tmp[16..24)
>>       free(tmp)
>>       return result
>>     }
>>
>> Now there's an easy way around this for the user by using a different
>> native signature:
>> void g(Vector2D *out) { *out = f(); }
>> This eliminates the intermediate buffer altogether.
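>>
>> In FFM terms, the two bindings look roughly like this (fragment only,
>> assuming a SymbolLookup `lookup`, a confined Arena `arena`, and the usual
>> java.lang.foreign imports / ValueLayout static imports are in scope):
>>
>>     StructLayout VECTOR2D = MemoryLayout.structLayout(JAVA_DOUBLE, JAVA_DOUBLE);
>>     Linker linker = Linker.nativeLinker();
>>
>>     // f: struct returned by value -> the handle takes a SegmentAllocator for
>>     // the result, and the linker runtime allocates a temp buffer per call
>>     MethodHandle f = linker.downcallHandle(
>>             lookup.find("f").orElseThrow(), FunctionDescriptor.of(VECTOR2D));
>>     MemorySegment v = (MemorySegment) f.invokeExact((SegmentAllocator) arena);
>>
>>     // g: the caller passes the output buffer explicitly -> no hidden allocation
>>     MethodHandle g = linker.downcallHandle(
>>             lookup.find("g").orElseThrow(), FunctionDescriptor.ofVoid(ADDRESS));
>>     MemorySegment out = arena.allocate(VECTOR2D);
>>     g.invokeExact(out);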
>>
>> However, if we wanted to optimize the return-by-value path, I can think
>> of three options:
>> * enhance the stub calling conventions to directly copy only the narrowed
>> output registers into the result buffer.  This looks rather involved.
>> * allocate the tmp buffer using the user's allocator as well (e.g. in
>> conjunction with the result + slicing; see the rough sketch after this
>> list). The Linker API is somewhat lenient about how `allocator` will be
>> exactly invoked: "used by the linker runtime to allocate the memory region
>> associated with the struct returned by the downcall method handle".
>> However, this may be surprising to the caller.
>> * keep the tmp buffer allocation internal, but optimize it. This is what
>> I'm proposing here.
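>>
>> As a rough sketch of the second option (sizes as in the Vector2D example
>> above; purely illustrative, not actual linker code):
>>
>>     // one allocation through the user's allocator, then sliced internally
>>     MemorySegment block  = allocator.allocate(32 + 16, 16);
>>     MemorySegment tmp    = block.asSlice(0, 32);   // filled by the VM stub
>>     MemorySegment result = block.asSlice(32, 16);  // handed back to the caller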
>>
>> A possible counter-argument could be "this is just one allocation out of
>> two". However, the user has control over `allocator`, and may re-use the
>> same segment across calls, but they have no control over the tmp allocation.
>>
>> I've worked on a patch that takes this last route, using a one-element
>> thread-local cache: https://github.com/openjdk/jdk/pull/23142. It reduces
>> call time from 36 ns to 8 ns per op on my machine, and I observe no more
>> GCs.
>>
>> Would there be interest in pursuing this?
>>
>> Thx
>> Matthias
>>
>>
>> [0]
>> https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values
>> [1]
>> https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97
>> [2] "binding context":
>> https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296
>>
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hs_err_pid54442.log
Type: application/octet-stream
Size: 97168 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20250117/2100848e/hs_err_pid54442-0001.log>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: replay_pid54442.log
Type: application/octet-stream
Size: 408220 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20250117/2100848e/replay_pid54442-0001.log>

