RFR: 8287788: reuse intermediate segments allocated during FFM stub invocations [v7]
Jorn Vernee
jvernee at openjdk.org
Tue Jan 21 17:07:45 UTC 2025
On Mon, 20 Jan 2025 18:43:54 GMT, Matthias Ernst <duke at openjdk.org> wrote:
>> Certain signatures for foreign function calls (e.g. HVA return by value) require allocation of an intermediate buffer to adapt the FFM's to the native stub's calling convention. In the current implementation, this buffer is malloced and freed on every FFM invocation, a non-negligible overhead.
>>
>> Sample stack trace:
>>
>>    java.lang.Thread.State: RUNNABLE
>>         at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
>>         ...
>>         at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
>>         at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
>>         ...
>>         at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@25-ea/Invokers$Holder)
>>
>>
>> To alleviate this, this PR remembers and reuses up to two small intermediate buffers per carrier-thread in subsequent calls.
>>
>> Performance (MBA M3):
>>
>>
>> Before:
>> Benchmark                    Mode  Cnt   Score   Error  Units
>> CallOverheadByValue.byPtr    avgt   10   3.333 ± 0.152  ns/op
>> CallOverheadByValue.byValue  avgt   10  33.892 ± 0.034  ns/op
>>
>> After:
>> Benchmark                    Mode  Cnt   Score   Error  Units
>> CallOverheadByValue.byPtr    avgt   10   3.291 ± 0.031  ns/op
>> CallOverheadByValue.byValue  avgt   10   5.464 ± 0.007  ns/op
>>
>>
>> `-prof gc` also shows that the new call path is fully scalar-replaced, versus 160 bytes allocated per call before.
>
> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision:
>
> restore 3 forks
I talked to Maurizio offline, and we realized that if we just pin the continuation when we acquire the buffer and unpin it when we release the buffer, we don't have to worry about buffers floating between threads between acquire and release. We can also reuse the buffer in consecutive calls (like a bump allocator), so we need just a single buffer instead of a two-element cache, and we might be able to use it for more than 2 calls. Pinning the continuation wouldn't be a problem, since we're about to do a native call anyway, which will also pin it.
We would need to wait until https://bugs.openjdk.org/browse/JDK-8347997 is fixed, which seems like a good idea either way, so that we have more options when implementing this.
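
To make the bump-allocator idea above a bit more concrete, here is a rough sketch, not the code from the PR. The names `BumpBuffer`, `Frame`, `acquire`, `offset` and `used` are made up for illustration, a direct `ByteBuffer` stands in for the native segment, and the pin/unpin steps appear only as comments because the continuation pinning API is JDK-internal. It also assumes frames are released in reverse order of acquisition, and that the per-thread lookup of the buffer happens elsewhere.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: one reusable buffer handed out bump-allocator style. */
final class BumpBuffer {
    private final ByteBuffer backing;
    private int top = 0; // offset of the first free byte in 'backing'

    BumpBuffer(int capacity) {
        this.backing = ByteBuffer.allocateDirect(capacity);
    }

    /** One acquired region; closing it pops the region off the bump stack. */
    final class Frame implements AutoCloseable {
        final ByteBuffer bytes;
        private final int savedTop; // value of 'top' before this acquire
        private final int offset;   // offset inside 'backing', or -1 for a fallback

        private Frame(ByteBuffer bytes, int savedTop, int offset) {
            this.bytes = bytes;
            this.savedTop = savedTop;
            this.offset = offset;
        }

        int offset() {
            return offset;
        }

        @Override
        public void close() {
            top = savedTop; // assumes LIFO release
            // A real implementation would unpin the continuation here.
        }
    }

    Frame acquire(int size) {
        // A real implementation would pin the continuation here, so the
        // thread cannot change carriers between acquire and release.
        int savedTop = top;
        if (size > backing.capacity() - top) {
            // Request does not fit: fall back to a fresh allocation.
            return new Frame(ByteBuffer.allocateDirect(size), savedTop, -1);
        }
        ByteBuffer view = backing.duplicate();
        view.position(top);
        view.limit(top + size);
        top += size; // bump
        return new Frame(view.slice(), savedTop, savedTop);
    }

    int used() {
        return top;
    }
}
```

With something like this, a call that needs a buffer while an outer call still holds one just bumps further into the same backing buffer, and oversized requests take the slow path as before.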
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23142#issuecomment-2605284762