[FFM performance] Intermediate buffer allocation when returning structs
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Fri Jan 17 13:48:20 UTC 2025
Thanks Matthias, adding loom-dev
Maurizio
On 17/01/2025 13:37, Matthias Ernst wrote:
>
>
> On Fri, Jan 17, 2025 at 1:21 PM Matthias Ernst <matthias at mernst.org>
> wrote:
>
> On Fri, Jan 17, 2025 at 1:09 AM Matthias Ernst
> <matthias at mernst.org> wrote:
>
> Thanks very much for the feedback, Jorn!
> I've incorporated the two-element cache, and avoid using a
> shared session now (Unsafe instead).
>
> I'm not sure about using non-carrier ThreadLocals. I think a
> defining quality of carrier threads is that you can only have
> as many (root) foreign function invocations in flight as you
> have carrier threads, so a carrier-local cache is fitting. With
> plain thread-locals and virtual threads you might allocate any
> number of such buffers for nought.
>
> As to the confined session: it fits nicely into the
> implementation, but I observe that it destroys one very nice
> property of the patch: without it, at least my test downcall
> seems to become allocation-free (I see zero GC activity in the
> benchmark), i.e. the "BoundedArena" and the buffer slices seem
> to get completely scalar-replaced. As soon as I add a per-call
> Arena.ofConfined() into the picture, I see plenty of GC
> activity and the call overhead goes up (though still way less
> than with malloc involved). I haven't looked in detail into
> why that might be (I'm not very good with the EA logs). I
> could argue this either way, but an allocation-free foreign
> call seems like a nice property, and I'm reasonably sure
> these tmp buffers cannot escape the call. Is that maybe
> something that could be enabled only with a debug flag?
>
>
> I looked at this in more detail. The great news: I got the
> confined session on top of the carrier-local cache to be properly
> scalar-replaced. This now does everything we want: lock-free
> buffer acquisition, two cache entries, confinement while borrowed
> from the cache, and everything allocation-free at an 8 ns roundtrip.
> I've updated the branch accordingly.
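>
> The shape is roughly the following (a simplified sketch, not the
> exact code in the PR: the real version uses a carrier-thread-local
> rather than a plain ThreadLocal, allocates the buffers off-heap via
> Unsafe, and pins the continuation while a buffer is borrowed):
>
>     import java.lang.foreign.Arena;
>     import java.lang.foreign.MemorySegment;
>
>     final class BufferCache {
>         static final long CACHED_BUFFER_SIZE = 256;   // illustrative size
>
>         // two slots, so that a downcall and a nested upcall can both hit the cache
>         private static final ThreadLocal<MemorySegment[]> SLOTS =
>                 ThreadLocal.withInitial(() -> new MemorySegment[2]);
>
>         // borrow a buffer: take a cached one if available, otherwise allocate
>         static MemorySegment acquire() {
>             MemorySegment[] slots = SLOTS.get();
>             for (int i = 0; i < slots.length; i++) {
>                 MemorySegment s = slots[i];
>                 if (s != null) {
>                     slots[i] = null;
>                     return s;
>                 }
>             }
>             return Arena.global().allocate(CACHED_BUFFER_SIZE, 16);
>         }
>
>         // return a buffer to an empty slot; drop it if both slots are occupied
>         static void release(MemorySegment segment) {
>             MemorySegment[] slots = SLOTS.get();
>             for (int i = 0; i < slots.length; i++) {
>                 if (slots[i] == null) {
>                     slots[i] = segment;
>                     return;
>                 }
>             }
>         }
>     }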
>
> The less great news: I seem to be running into a bug in escape
> analysis.
>
>
> That one was easy to repro: calling Continuation.pin/unpin in a
> constructor seems to confuse escape analysis. Please see:
> https://github.com/mernst-github/repro/tree/main/escape-analysis
>
> Depending on where I place the ".ofConfined()" call in the
> BoundedArena constructor I get either
>
> * proper scalar replacement, but a crash in fastdebug:
>
>       # Internal Error
>       (/Users/mernst/IdeaProjects/jdk/src/hotspot/share/opto/escape.cpp:4767),
>       pid=85070, tid=26115
>       # assert(false) failed: EA: missing memory path
>
> or
>
> * fastdebug works, but fails to scalar-replace the confined session.
>
> See comments in
> https://github.com/openjdk/jdk/pull/23142/files#diff-80b3987494fdd3ed20ced0248adbf6097432e24db8a2fb8476bbf2143bd0a2c3R401-R409:
>
>     public BoundedArena(long size) {
>         // When here, works in fastdebug, but not scalar-replaced:
>         // scope = Arena.ofConfined(); <======================
>
>         MemorySegment cached = size <= BufferCache.CACHED_BUFFER_SIZE
>                 ? BufferCache.acquire() : null;
>
>         // When here, works in release build, but fastdebug crashes:
>         // # Internal Error (/Users/mernst/IdeaProjects/jdk/src/hotspot/share/opto/escape.cpp:4767), pid=85070, tid=26115
>         // # assert(false) failed: EA: missing memory path
>         scope = Arena.ofConfined(); <======================
>
> Crash logs are attached.
>
>
>
> Matthias
>
>
> On Thu, Jan 16, 2025 at 6:26 PM Jorn Vernee
> <jorn.vernee at oracle.com> wrote:
>
> Hello Matthias,
>
> We've been exploring this direction internally as well. As
> you've found, downcall handles/upcall stubs sometimes need
> to allocate memory. The return buffer case that you've run
> into is one such case; others are: when a struct that does
> not fit into a single register is passed by value on
> Windows, we need to create a copy, and when a struct is
> passed by value to an upcall stub, we need to allocate
> memory to hold the value.
>
> I took a look at your patch. One of the problems I see
> with a one-element cache is that some upcall stubs might
> never benefit from it, since a preceding downcall already
> claimed the cache. Though, I believe a chain of downcalls
> and upcalls is comparatively rare. A two-element cache
> might be better: that way a downcall -> upcall sequence
> in which both use by-value structs will still be able to
> benefit from the cache.
>
> Having a cache per carrier thread is probably a good idea.
> A cache per thread is also possibly an option, if the
> overhead seems acceptable (the cache is only initialized
> for threads that actually call native code after all).
> This would also be a little faster, I think.
>
> One thing that's unfortunate is the use of a shared arena,
> even in the fallback case, since closing that is very
> slow. Another problem is that with your current
> implementation, we are no longer tracking the lifetime of
> the memory correctly, and it is possible to access memory
> that was already returned to the cache. Using a proper
> lifetime (i.e. creating/closing a new arena per call) has
> helped to catch bugs in the past. If we want to keep doing
> that, we'd have to re-wrap the memory of the cache with a
> new arena (using MemorySegment::reinterpret), which we
> then close after a downcall, to return elements to the
> cache. I suggest restructuring the code so that it always
> creates a new confined arena, as today, but then either:
> 1) grabs a memory segment from the cache, and attaches
> that to the new confined arena (using MS::reinterpret), or
> 2) in the case of a cache miss, just allocates a new
> segment from the confined arena we created (roughly as
> sketched below).
>
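> In code, the per-downcall part would look roughly like this (just a
> sketch: `size` is the required buffer size, BufferCache stands in for
> whatever the cache ends up looking like, with acquire() returning
> null on a miss, and MemorySegment::reinterpret is a restricted
> method):
>
>     try (Arena arena = Arena.ofConfined()) {
>         MemorySegment cached = size <= BufferCache.CACHED_BUFFER_SIZE
>                 ? BufferCache.acquire() : null;
>         MemorySegment buffer = cached != null
>                 // re-wrap the cached memory so its lifetime is the confined
>                 // arena's; closing the arena returns it to the cache
>                 ? cached.reinterpret(size, arena, s -> BufferCache.release(cached))
>                 // cache miss: just allocate from the confined arena
>                 : arena.allocate(size);
>         // ... perform the downcall using 'buffer' ...
>     }
>
> That way every access still goes through a confined, per-call
> lifetime, and closing the arena is what returns the buffer to the
> cache.
>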
> WDYT?
>
> Jorn
>
> On 16-1-2025 11:00, Matthias Ernst wrote:
>> Hi, I noticed a source of overhead when calling foreign
>> functions with small aggregate return values.
>>
>> For example, a function returning a struct Vector2D {
>> double x; double y } will cause a malloc/free inside the
>> downcall handle on every call. On my machine, this
>> accounts for about 80% of the call overhead.
>>
>> Choice stack:
>>     java.lang.Thread.State: RUNNABLE
>>         at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
>>         ...
>>         at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
>>         at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
>>         at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
>> While it might be difficult to eliminate these
>> intermediate buffers, I would propose to try reusing them.
>>
>> What's happening here:
>> * the ARM64 ABI returns such a struct in two 128 bit
>> registers v0/v1 [0]
>> * the VM stub calling convention around this expects an
>> output buffer to copy v0/v1 into [1]:
>>
>>     stub(out) { ... out[0..16) = v0; out[16..32) = v1; }
>>
>> * the FFM downcall calling convention OTOH expects a
>> user-provided SegmentAllocator to allocate a 16 byte
>> StructLayout(JAVA_DOUBLE, JAVA_DOUBLE). The generated
>> method handle to adapt to the stub looks roughly like
>> this [2]:
>>
>>     ffm(allocator) {
>>         tmp = malloc(32)
>>         stub(tmp)
>>         result = allocator.allocate(16)
>>         result[0..8) = tmp[0..8)
>>         result[8..16) = tmp[16..24)
>>         free(tmp)
>>         return result
>>     }
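>>
>> (For context, a downcall that ends up on this path can be set up
>> roughly like this; a sketch, with the symbol lookup of "f" and the
>> error handling purely illustrative:)
>>
>>     import java.lang.foreign.*;
>>     import java.lang.invoke.MethodHandle;
>>     import static java.lang.foreign.ValueLayout.JAVA_DOUBLE;
>>
>>     StructLayout VECTOR2D = MemoryLayout.structLayout(
>>             JAVA_DOUBLE.withName("x"), JAVA_DOUBLE.withName("y"));
>>
>>     Linker linker = Linker.nativeLinker();
>>     MethodHandle f = linker.downcallHandle(
>>             SymbolLookup.loaderLookup().find("f").orElseThrow(),
>>             FunctionDescriptor.of(VECTOR2D));   // small struct returned by value
>>
>>     try (Arena arena = Arena.ofConfined()) {
>>         // the leading SegmentAllocator argument only controls the result
>>         // segment; the 32-byte tmp buffer is allocated internally per call
>>         MemorySegment v = (MemorySegment) f.invokeExact((SegmentAllocator) arena);
>>         double x = v.get(JAVA_DOUBLE, 0);
>>         double y = v.get(JAVA_DOUBLE, 8);
>>     }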
>>
>> Now there's an easy way around this for the user by using
>> a different native signature:
>> void g(Vector2D *out) { *out = f(); }
>> This eliminates the intermediate buffer altogether.
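>>
>> In FFM terms that alternative binding looks roughly like this
>> (sketch, reusing the linker and the VECTOR2D layout from the
>> snippet above):
>>
>>     MethodHandle g = linker.downcallHandle(
>>             SymbolLookup.loaderLookup().find("g").orElseThrow(),
>>             FunctionDescriptor.ofVoid(ValueLayout.ADDRESS));
>>
>>     try (Arena arena = Arena.ofConfined()) {
>>         MemorySegment out = arena.allocate(VECTOR2D);   // caller-owned, the only buffer
>>         g.invokeExact(out);
>>         double x = out.get(JAVA_DOUBLE, 0);
>>         double y = out.get(JAVA_DOUBLE, 8);
>>     }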
>>
>> However, if we wanted to optimize the return-by-value
>> path, I can think of three options:
>> * enhance the stub calling conventions to directly copy
>> only the narrowed output registers into the result
>> buffer. This looks rather involved.
>> * allocate the tmp buffer using the user's allocator as
>> well (e.g. in conjunction with the result + slicing; see
>> the sketch after this list). The Linker API is somewhat
>> lenient about how exactly `allocator` will be invoked:
>> "used by the linker runtime to allocate the memory region
>> associated with the struct returned by the downcall method
>> handle". However, this may be surprising to the caller.
>> * keep the tmp buffer allocation internal, but optimize
>> it. This is what I'm proposing here.
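>>
>> (The second option could look roughly like the following; just a
>> sketch of the idea with illustrative names, where `stub` stands in
>> for the VM stub invoker, not how the binding specializer actually
>> emits it:)
>>
>>     // one allocation from the user's allocator holds both the user-visible
>>     // result and the raw 32-byte register buffer that the stub fills
>>     static MemorySegment adapt(SegmentAllocator allocator, MethodHandle stub)
>>             throws Throwable {
>>         MemorySegment block = allocator.allocate(16 + 32, 16);
>>         MemorySegment result = block.asSlice(0, 16);
>>         MemorySegment tmp = block.asSlice(16, 32);
>>         stub.invokeExact(tmp);                        // stub writes v0/v1 into tmp
>>         MemorySegment.copy(tmp, 0, result, 0, 8);     // x
>>         MemorySegment.copy(tmp, 16, result, 8, 8);    // y
>>         return result;
>>     }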
>>
>> A possible counter-argument could be "this is just one
>> allocation out of two". However, the user has control
>> over `allocator`, and may re-use the same segment across
>> calls, but they have no control over the tmp allocation.
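>>
>> (For example, something like this lets the caller recycle one result
>> buffer across calls; a sketch, reusing the handle `f` and the
>> VECTOR2D layout from the first snippet:)
>>
>>     MemorySegment resultBuf = Arena.ofAuto().allocate(VECTOR2D);
>>     SegmentAllocator recycling = SegmentAllocator.prefixAllocator(resultBuf);
>>     for (int i = 0; i < 1_000; i++) {
>>         // each call writes its result into resultBuf (no per-call result
>>         // allocation), but the internal tmp buffer is still malloc'ed/freed
>>         MemorySegment v = (MemorySegment) f.invokeExact(recycling);
>>         double x = v.get(JAVA_DOUBLE, 0);
>>     }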
>>
>> I've worked on a patch that takes this last route, using
>> a one-element thread-local cache:
>> https://github.com/openjdk/jdk/pull/23142. It reduces
>> call time from 36 ns to 8 ns per op on my machine, and I
>> observe no more GCs.
>>
>> Would there be interest in pursuing this?
>>
>> Thx
>> Matthias
>>
>>
>> [0]
>> https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values
>> [1]
>> https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97
>> [2] "binding context":
>> https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296
>