RFR: 8287788: reuse intermediate segments allocated during FFM stub invocations [v8]
Matthias Ernst
duke at openjdk.org
Wed Jan 22 10:04:40 UTC 2025
On Wed, 22 Jan 2025 09:57:15 GMT, Matthias Ernst <duke at openjdk.org> wrote:
>> Certain signatures for foreign function calls (e.g. HVA return by value) require allocation of an intermediate buffer to adapt the FFM's to the native stub's calling convention. In the current implementation, this buffer is malloced and freed on every FFM invocation, a non-negligible overhead.
>>
>> Sample stack trace:
>>
>> java.lang.Thread.State: RUNNABLE
>> at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
>> ...
>> at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
>> at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
>> ...
>> at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@25-ea/Invokers$Holder)
>>
>>
>> To alleviate this, this PR remembers and reuses up to two small intermediate buffers per carrier-thread in subsequent calls.
>>
>> Performance (MBA M3):
>>
>>
>> Before:
>> Benchmark                    Mode  Cnt   Score   Error  Units
>> CallOverheadByValue.byPtr    avgt   10   3.333 ± 0.152  ns/op
>> CallOverheadByValue.byValue  avgt   10  33.892 ± 0.034  ns/op
>>
>> After:
>> Benchmark                    Mode  Cnt  Score   Error  Units
>> CallOverheadByValue.byPtr    avgt   10  3.291 ± 0.031  ns/op
>> CallOverheadByValue.byValue  avgt   10  5.464 ± 0.007  ns/op
>>
>>
>> `-prof gc` also shows that the new call path is fully scalar-replaced vs 160 byte/call before.
>
> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision:
>
> Back buffer allocation with a single carrier-local segment.
> just need a single buffer
> Alternatively we can use locking
I think these are really great suggestions, thank you!
They simplify things tremendously, and I've pushed a version that implements them.
As you say, the errno / state capture piece can probably just use it, too.
The extra atomics for acquiring/releasing don't seem to cost much, so this still has excellent performance (and is also allocation-free):
Benchmark                    Mode  Cnt  Score   Error  Units
CallOverheadByValue.byPtr    avgt   30  3.375 ± 0.138  ns/op
CallOverheadByValue.byValue  avgt   30  6.625 ± 0.057  ns/op
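To illustrate the acquire/release idea, here is a minimal sketch. It is not the code from the PR: the names `BufferCache`, `Slot`, `Lease` and `CACHED_SIZE` are made up, a plain `ThreadLocal` stands in for the JDK-internal carrier-thread local, and a direct `ByteBuffer` stands in for the native segment. The atomic flag marks the single cached buffer as taken; if it is already in use (for example by a nested call), or the request is too large, the caller falls back to a fresh allocation.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: one cached buffer per thread, guarded by an atomic "in use" flag. */
final class BufferCache {
    /** Assumed upper bound for requests served from the cache. */
    static final int CACHED_SIZE = 256;

    /** Per-thread slot: the reusable buffer plus its ownership flag. */
    private static final class Slot {
        final ByteBuffer buffer = ByteBuffer.allocateDirect(CACHED_SIZE);
        final AtomicBoolean inUse = new AtomicBoolean(false);
    }

    // Stand-in for a carrier-thread local, which is not a public API.
    private static final ThreadLocal<Slot> SLOT = ThreadLocal.withInitial(Slot::new);

    /** A leased buffer; close() hands a cached buffer back to its slot. */
    static final class Lease implements AutoCloseable {
        final ByteBuffer buffer;
        private final Slot owner; // null when the buffer was freshly allocated

        Lease(ByteBuffer buffer, Slot owner) {
            this.buffer = buffer;
            this.owner = owner;
        }

        boolean isCached() {
            return owner != null;
        }

        @Override
        public void close() {
            if (owner != null) {
                owner.inUse.set(false); // release the cached buffer
            }
        }
    }

    /** Returns the cached buffer if it fits and is free, else a fresh allocation. */
    static Lease acquire(int size) {
        Slot slot = SLOT.get();
        if (size <= CACHED_SIZE && slot.inUse.compareAndSet(false, true)) {
            slot.buffer.clear();
            slot.buffer.limit(size);
            return new Lease(slot.buffer, slot);
        }
        // Slow path: cache busy or request too large.
        return new Lease(ByteBuffer.allocateDirect(size), null);
    }
}
```

A caller would wrap each use in try-with-resources, e.g. `try (BufferCache.Lease l = BufferCache.acquire(64)) { ... }`, so the flag is cleared even if the call throws.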
I'll leave this here for inspiration and add a few unit tests for the stack, but feel free to just close it in favor of related work.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23142#issuecomment-2606794554