[FFM performance] Intermediate buffer allocation when returning structs
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Fri Jan 17 10:09:32 UTC 2025
Hi,
I believe there are two main ways in which allocation can be improved,
but they are relatively orthogonal:
* the pooling allocator you proposed, which I agree is useful, seems
  perfect for cases where you want the same pool to be reused by
  multiple threads. In that sense, the pool acts as a sort of more
  general, full-blown allocator (e.g. like malloc). I believe the main
  use case here is I/O.
* in some specialized cases, you want a good thread-local allocator -
  possibly one that is able to recycle memory between repeated calls
  to the same functions. This comes up with the Linker, with errno,
  and in all those cases where a Java method wants to allocate a
  segment, pass the segment to a native call, only to then destroy
  the segment.
I'm very skeptical that there exists /one good allocator/ that is
equally proficient in both use cases. For instance, a shared memory
pool unavoidably requires synchronization (CAS and the like) to make
sure that different threads cannot acquire the same region of memory.
All these problems simply do not exist with a thread-local allocator.
Moreover, because of the more "structured" (stack-confined) way in
which thread-local allocators are used, they are typically much simpler
to implement: a bump allocator that can be recycled (e.g. by resetting
the start pointer) is enough, as there's no contention.
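To make the idea concrete, here is a minimal, hypothetical sketch of
such a recyclable bump allocator (class name, block handling and sizes
are made up; the branch linked just below is a more complete
implementation):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.SegmentAllocator;

    // Thread-confined bump allocator: hands out slices of one
    // pre-allocated block, and is recycled by resetting its offset.
    final class BumpAllocator implements SegmentAllocator, AutoCloseable {
        private final Arena backing = Arena.ofConfined(); // owns the block
        private final MemorySegment block;
        private long offset;

        BumpAllocator(long blockSize) {
            block = backing.allocate(blockSize, 16);      // 16-byte aligned block
        }

        @Override
        public MemorySegment allocate(long byteSize, long byteAlignment) {
            long start = (offset + byteAlignment - 1) & -byteAlignment; // power-of-two alignment
            if (start + byteSize > block.byteSize()) {
                throw new OutOfMemoryError("bump allocator exhausted");
            }
            offset = start + byteSize;
            return block.asSlice(start, byteSize);
        }

        // recycle: no free list, no CAS, just reset the pointer
        void reset() { offset = 0; }

        @Override
        public void close() { backing.close(); }
    }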
This is an example of how to implement something like what I described
above (we've been toying with it for quite some time):
https://github.com/openjdk/jdk/compare/master...mcimadamore:jdk:stack_allocator?expand=1
For instance, with the stack pool implemented in this class, we can do
strlen as follows:
    Stack stack = Stack.newStack();
    ...

    int strlen(String s) {
        try (Arena arena = stack.push()) {
            return (int) STRLEN.invokeExact(arena.allocateFrom(s));
        }
    }
This seems to work extremely well. The above branch adds a new
benchmark to StrLenTest - here's how it compares with the baseline
version (which uses a vanilla confined arena):
    Benchmark                        (size)  Mode  Cnt   Score   Error  Units
    StrLenTest.panama_strlen_alloc      100  avgt   30  37.338 ± 0.777  ns/op
    StrLenTest.panama_strlen_stack      100  avgt   30  18.206 ± 0.138  ns/op
Using the stack pool is 2x faster, and there's no GC activity
(allocations of intermediate arenas are fully escape-analyzed away).
IMHO, an allocator like this would be ideal for something like the
Linker. So, why isn't something like this in the JDK already?
The remaining problem we need to face is how to connect this allocator
to a given thread. Ideally, each thread should have one such pool,
which clients can obtain and use at will. While there are several ways
of doing this, none feels "right":
* using a ThreadLocal can be wasteful, especially if (a) each thread
  allocates a lot of memory from the pool and (b) there is a large
  number of virtual threads (a minimal sketch of this variant follows
  the list).
* ThreadLocal doesn't allow for resources to be cleaned up, meaning
  that when a thread eventually exits, its pool will remain alive.
  Again, this could pose issues. (And allowing custom cleanup on /all/
  thread locals is a much bigger discussion.)
* a "carrier" ThreadLocal seems a better fit for the job, because the
  pool is associated with a carrier thread, not a virtual thread. And
  carrier ThreadLocals allow for custom clean-up policies. But there is
  still the potential for two virtual threads to stomp on each other
  (e.g. thread 1 acquires an arena from the stack, is then unmounted,
  and then thread 2 acquires an arena from the same stack). So you end
  up back with some synchronization requirements (unless you can "pin"
  the virtual thread for the duration of the allocation activity).
* a scope local is also a possibility, but it requires more radical
  changes to the code base. E.g. the scope local stack needs to be set
  up early in the application initialization, so that other clients
  can take advantage of it.
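For illustration, the simplest (and, per the first bullet, potentially
wasteful) ThreadLocal variant could look like the hypothetical sketch
below, reusing the Stack class and the STRLEN handle from the example
above:

    // Hypothetical sketch: one pool per thread via a plain ThreadLocal.
    // Note the caveats above: no cleanup when a thread exits, and one
    // pool per (virtual) thread.
    static final ThreadLocal<Stack> STACKS =
            ThreadLocal.withInitial(Stack::newStack);

    static int strlen(String s) throws Throwable {
        try (Arena arena = STACKS.get().push()) {
            return (int) STRLEN.invokeExact(arena.allocateFrom(s));
        }
    }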
So, while I think we have essentially solved the "allocation
performance" side of the equation, I believe some work still needs to
be done on the "how do we attach allocators to threads" side. My
feeling is that we will probably end up combining the class I showed
above with some of the functionality of (platform?) ThreadLocal, to
allow for efficient reuse.
Experiments like the one Per mentioned on errno, and the discussion in
this thread about how to make Linker allocation faster, will, I think,
help us tremendously in figuring out how best to associate allocators
with threads.
Cheers
Maurizio
On 17/01/2025 09:11, rsmogura at icloud.com wrote:
> Hi all,
>
> Let me jump in here. During the incubator phase of Panama, I proposed
> the pooling allocator [1] & [2]; it was discussed, but it was later
> decided not to include it in the Panama / JDK code base.
>
> I wonder if it's maybe a good time to think about it again, as it has
> the properties mentioned by Jorn (tracking allocation) and Matthias
> (fast allocation and deallocation).
>
> The code itself has not been updated for a long time, but it used a
> double-Arena approach, where one arena was used as the allocator and
> pool, and a second, temporary arena was only used for returning
> segments back to the pool.
>
> [1] https://mail.openjdk.org/pipermail/panama-dev/2021-April/013512.html
> [2] https://github.com/openjdk/panama-foreign/pull/509
>
> Best regards,
> Radosław Smogura
>
>> On 17 Jan 2025, at 01:09, Matthias Ernst <matthias at mernst.org>
>> wrote:
>>
>> Thanks very much for the feedback, Jorn!
>> I've incorporated the two-element cache, and avoid using a shared
>> session now (Unsafe instead).
>>
>> I'm not sure about using non-carrier ThreadLocals; I think a defining
>> quality is that you can only have as many (root) foreign function
>> invocations as you have carrier threads, so it is fitting. With
>> virtual threads you might allocate xxxx such buffers for nought.
>>
>> As to the confined session: it fits nicely into the implementation,
>> but I observe that it destroys one very nice property of the patch:
>> without it, at least my test downcall seems to become allocation-free
>> (I see zero GC activity in the benchmark), i.e. the "BoundedArea" and
>> the buffer slices seem to get completely scalar-replaced. As soon as
>> I add a per-call Arena.ofConfined() into the picture, I see plenty of
>> GC activity and the call-overhead goes up (but still way less than
>> with malloc involved). I haven't looked in detail into why that might
>> be (I'm not very good with the EA logs). I could argue this either
>> way, but an allocation-free foreign call seems like a nice property,
>> whereas I'm reasonably sure these tmp buffers cannot escape the call?
>> Is that maybe something that could be enabled only with a debug flag?
>>
>> Matthias
>>
>>
>> On Thu, Jan 16, 2025 at 6:26 PM Jorn Vernee <jorn.vernee at oracle.com>
>> wrote:
>>
>> Hello Matthias,
>>
>> We've been exploring this direction internally as well. As you've
>> found, downcall handles/upcall stubs sometimes need to allocate
>> memory. The return buffer case that you've run into is one such
>> case, others are: when a struct that does not fit into a single
>> register is passed by value on Windows, we need to create a copy.
>> When a struct is passed by value to an upcall stub, we need to
>> allocate memory to hold the value.
>>
>> I took a look at your patch. One of the problems I see with a
>> one-element cache is that some upcall stubs might never benefit
>> from it, since a preceding downcall already claimed the cache.
>> Though, I believe a chain of downcalls and upcalls is
>> comparatively rare. A two element cache might be better. That way
>> a sequence of downcall -> upcall, that both use by-value structs,
>> will be able to benefit from the cache.
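>>
>> For illustration, a hypothetical two-slot, per-thread cache could
>> look something like this (not the actual patch; the slot size and
>> names are made up):
>>
>>     import java.lang.foreign.Arena;
>>     import java.lang.foreign.MemorySegment;
>>
>>     // Hypothetical sketch: each slot holds a fixed-size native block;
>>     // a downcall -> upcall chain can hold both slots at once.
>>     // acquire() returns null when the cache is exhausted.
>>     final class BufferCache {
>>         static final ThreadLocal<BufferCache> CACHE =
>>                 ThreadLocal.withInitial(BufferCache::new);
>>         private static final long SLOT_SIZE = 64;   // illustrative only
>>         private final MemorySegment[] slots = new MemorySegment[2];
>>         private final boolean[] inUse = new boolean[2];
>>
>>         MemorySegment acquire() {
>>             for (int i = 0; i < slots.length; i++) {
>>                 if (!inUse[i]) {
>>                     if (slots[i] == null) {
>>                         slots[i] = Arena.global().allocate(SLOT_SIZE);
>>                     }
>>                     inUse[i] = true;
>>                     return slots[i];
>>                 }
>>             }
>>             return null;   // both slots taken: caller falls back
>>         }
>>
>>         void release(MemorySegment segment) {
>>             for (int i = 0; i < slots.length; i++) {
>>                 if (slots[i] != null
>>                         && slots[i].address() == segment.address()) {
>>                     inUse[i] = false;
>>                     return;
>>                 }
>>             }
>>         }
>>     }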
>>
>> Having a cache per carrier thread is probably a good idea. A
>> cache per thread is also possibly an option, if the overhead
>> seems acceptable (the cache is only initialized for threads that
>> actually call native code after all). This would also be a little
>> faster, I think.
>>
>> One thing that's unfortunate is the use of a shared arena, even
>> in the fallback case, since closing that is very slow. Another
>> problem is that with your current implementation, we are no
>> longer tracking the lifetime of the memory correctly, and it is
>> possible to access memory that was already returned to the cache.
>> Using a proper lifetime (i.e. creating/closing a new arena per
>> call) has helped to catch bugs in the past. If we want to keep
>> doing that, we'd have to re-wrap the memory of the cache with a
>> new arena (using MemorySegment::reinterpret), which we then close
>> after a downcall, to return elements to the cache. I suggest
>> restructuring the code so that it always creates a new confined
>> arena, as today, but then either: 1) grabs a memory segment from
>> the cache, and attaches that to the new confined arena (using
>> MS::reinterpret), or 2) in the case of a cache miss, just
>> allocates a new segment from the confined arena we created.
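>>
>> In other words, something along these lines (hypothetical sketch;
>> 'cache' is an assumed per-thread cache such as the one sketched
>> above, and MemorySegment::reinterpret is a restricted method):
>>
>>     // Always create a confined arena per call, and attach cached
>>     // memory to it so that lifetime checks still apply. A real
>>     // version would also check that 'size' fits in a cached slot.
>>     static MemorySegment returnBuffer(Arena callArena, BufferCache cache,
>>                                       long size) {
>>         MemorySegment cached = cache.acquire();
>>         return cached != null
>>                 // handed back to the cache when callArena is closed
>>                 ? cached.reinterpret(size, callArena, cache::release)
>>                 // cache miss: plain per-call allocation
>>                 : callArena.allocate(size);
>>     }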
>>
>> WDYT?
>>
>> Jorn
>>
>> On 16-1-2025 11:00, Matthias Ernst wrote:
>>> Hi, I noticed a source of overhead when calling foreign
>>> functions with small aggregate return values.
>>>
>>> For example, a function returning a struct Vector2D { double x;
>>> double y; } will cause a malloc/free inside the downcall handle
>>> on every call. On my machine, this accounts for about 80% of the
>>> call overhead.
>>>
>>> Choice stack:
>>>     java.lang.Thread.State: RUNNABLE
>>>         at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
>>>         ...
>>>         at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
>>>         at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
>>>         at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
>>> While it might be difficult to eliminate these intermediate
>>> buffers, I would propose to try reusing them.
>>>
>>> What's happening here:
>>> * the ARM64 ABI returns such a struct in two 128-bit registers
>>>   v0/v1 [0]
>>> * the VM stub calling convention around this expects an output
>>>   buffer to copy v0/v1 into [1]:
>>>       stub(out) { ... out[0..16) = v0; out[16..32) = v1; }
>>> * the FFM downcall calling convention OTOH expects a
>>>   user-provided SegmentAllocator to allocate a 16-byte
>>>   StructLayout(JAVA_DOUBLE, JAVA_DOUBLE). The generated method
>>>   handle to adapt to the stub looks roughly like this [2]:
>>>       ffm(allocator) {
>>>           tmp = malloc(32)
>>>           stub(tmp)
>>>           result = allocator.allocate(16)
>>>           result[0..8)  = tmp[0..8)
>>>           result[8..16) = tmp[16..24)
>>>           free(tmp)
>>>           return result
>>>       }
>>>
>>> Now there's an easy way around this for the user, by using a
>>> different native signature:
>>>     void g(Vector2D *out) { *out = f(); }
>>> This eliminates the intermediate buffer altogether.
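>>>
>>> (For completeness, the Java side of that workaround could look like
>>> the hypothetical sketch below; the symbol lookup and the names are
>>> assumed.)
>>>
>>>     import java.lang.foreign.*;
>>>     import java.lang.invoke.MethodHandle;
>>>
>>>     class Vector2DCalls {
>>>         static final StructLayout VECTOR2D = MemoryLayout.structLayout(
>>>                 ValueLayout.JAVA_DOUBLE.withName("x"),
>>>                 ValueLayout.JAVA_DOUBLE.withName("y"));
>>>         // void g(Vector2D *out)
>>>         static final MethodHandle G = Linker.nativeLinker().downcallHandle(
>>>                 SymbolLookup.loaderLookup().find("g").orElseThrow(),
>>>                 FunctionDescriptor.ofVoid(ValueLayout.ADDRESS));
>>>
>>>         static MemorySegment callG(SegmentAllocator allocator) throws Throwable {
>>>             MemorySegment out = allocator.allocate(VECTOR2D); // caller controls this
>>>             G.invokeExact(out);   // no hidden tmp buffer for the return value
>>>             return out;
>>>         }
>>>     }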
>>>
>>> However, if we wanted to optimize the return-by-value path, I
>>> can think of three options:
>>> * enhance the stub calling conventions to directly copy only the
>>> narrowed output registers into the result buffer. This looks
>>> rather involved.
>>> * allocate the tmp buffer using the user's allocator as well
>>> (e.g. in conjunction with the result + slicing). The Linker API
>>> is somewhat lenient about how exactly `allocator` will be
>>> invoked: "used by the linker runtime to allocate the memory
>>> region associated with the struct returned by the downcall
>>> method handle". However, this may be surprising to the caller.
>>> * keep the tmp buffer allocation internal, but optimize it. This
>>> is what I'm proposing here.
>>>
>>> A possible counter-argument could be "this is just one
>>> allocation out of two". However, the user has control over
>>> `allocator` and may re-use the same segment across calls, whereas
>>> they have no control over the tmp allocation.
>>>
>>> I've worked on a patch that takes this last route, using a
>>> one-element thread-local cache:
>>> https://github.com/openjdk/jdk/pull/23142. It reduces call time
>>> from 36 to 8 ns/op on my machine, and I observe no more GCs.
>>>
>>> Would there be interest in pursuing this?
>>>
>>> Thx
>>> Matthias
>>>
>>>
>>> [0]
>>> https://learn.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-170#return-values
>>> [1]
>>> https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/CallingSequenceBuilder.java#L97
>>> [2] "binding context":
>>> https://github.com/openjdk/jdk/blob/9c430c92257739730155df05f340fe144fd24098/src/java.base/share/classes/jdk/internal/foreign/abi/BindingSpecializer.java#L296
>>
>