RFR: 8287788: reuse intermediate segments allocated during FFM stub invocations [v9]

Wed Jan 22 10:45:39 UTC 2025

On Wed, 22 Jan 2025 10:12:13 GMT, Matthias Ernst <duke at openjdk.org> wrote:

>> Certain signatures for foreign function calls (e.g. HVA return by value) require allocation of an intermediate buffer to adapt the FFM calling convention to that of the native stub. In the current implementation, this buffer is malloced and freed on every FFM invocation, which adds non-negligible overhead.
>> 
>> Sample stack trace:
>> 
>>    java.lang.Thread.State: RUNNABLE
>> 	at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
>> ...
>> 	at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
>> 	at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
>> ...
>> 	at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@25-ea/Invokers$Holder)
>> 
>> 
>> To alleviate this, this PR remembers and reuses up to two small intermediate buffers per carrier-thread in subsequent calls.
>> 
>> Performance (MBA M3):
>> 
>> 
>> Before:
>> Benchmark                    Mode  Cnt   Score   Error  Units
>> CallOverheadByValue.byPtr    avgt   10   3.333 ± 0.152  ns/op
>> CallOverheadByValue.byValue  avgt   10  33.892 ± 0.034  ns/op
>> 
>> After:
>> Benchmark                    Mode  Cnt   Score   Error  Units
>> CallOverheadByValue.byPtr    avgt   10   3.291 ± 0.031  ns/op
>> CallOverheadByValue.byValue  avgt   10   5.464 ± 0.007  ns/op
>> 
>> 
>> `-prof gc` also shows that the new call path is fully scalar-replaced, versus 160 bytes allocated per call before.
>
> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision:
> 
>   --unnecessary annotations

src/java.base/share/classes/jdk/internal/foreign/abi/BufferStack.java line 38:

> 36:         @SuppressWarnings("restricted")
> 37:         public MemorySegment allocate(long byteSize, long byteAlignment) {
> 38:             MemorySegment slice = backingSegment.asSlice(offset, byteSize, byteAlignment);

You have re-implemented a slicing allocator (`SegmentAllocator::slicingAllocator`). I think that's probably ok, but I'll point out that the slicing allocator also takes care of alignment, and throws exceptions when called with nonsensical arguments.

A more robust way to do this would be to:
1. have `reserve` pass the reserved size to `Frame`
2. have `Frame` slice the backing segment according to offset/size
3. create a slicing allocator based on that slice
4. use the slicing allocator to implement `allocate`

In our tests, something like this did not add any extra overhead (the allocation of the slicing allocator is escape-analyzed away).
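
To make the four steps concrete, here is a rough sketch of what I have in mind. This is not the code in the PR: `FrameSketch`, the `Frame` constructor parameters and the `segment()` accessor are made up for illustration, the backing segment comes from a plain confined arena rather than the per-thread buffer, and it assumes a JDK where the FFM API is final (22 or later). Only `MemorySegment::asSlice` and `SegmentAllocator::slicingAllocator` are the real API.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SegmentAllocator;

public class FrameSketch {

    // Hypothetical stand-in for the PR's Frame; not the actual BufferStack code.
    public static final class Frame implements SegmentAllocator {
        private final MemorySegment frameSegment;
        private final SegmentAllocator allocator;

        // Steps 1-3: take the reserved size, slice the backing segment at
        // offset/size, and wrap that slice in a slicing allocator.
        public Frame(MemorySegment backingSegment, long offset, long reservedSize) {
            this.frameSegment = backingSegment.asSlice(offset, reservedSize);
            this.allocator = SegmentAllocator.slicingAllocator(frameSegment);
        }

        // Step 4: delegate, so alignment padding and argument checking
        // are handled by the slicing allocator.
        @Override
        public MemorySegment allocate(long byteSize, long byteAlignment) {
            return allocator.allocate(byteSize, byteAlignment);
        }

        // Exposed only so the example can inspect the frame's bounds.
        public MemorySegment segment() {
            return frameSegment;
        }
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // Assumption: a 64-byte, 16-byte-aligned backing buffer.
            MemorySegment backing = arena.allocate(64, 16);
            Frame frame = new Frame(backing, 16, 32);

            MemorySegment a = frame.allocate(4, 4);
            // The allocator pads up to the next 8-byte boundary here.
            MemorySegment b = frame.allocate(8, 8);
            System.out.println("gap between slices: " + (b.address() - a.address()));

            try {
                frame.allocate(32, 1); // more than what is left in the frame
            } catch (IndexOutOfBoundsException e) {
                System.out.println("frame exhausted");
            }
        }
    }
}
```

With this shape, a request that does not fit in the reserved frame fails with `IndexOutOfBoundsException` instead of spilling into the rest of the backing segment, and a bad alignment (e.g. not a power of two) is rejected with `IllegalArgumentException`, without `Frame` having to check either case itself.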

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23142#discussion_r1925099911