RFR: 8287788: reuse intermediate segments allocated during FFM stub invocations [v2]

Mon Jan 20 15:06:38 UTC 2025

On Mon, 20 Jan 2025 07:30:17 GMT, Matthias Ernst <duke at openjdk.org> wrote:

>> Certain signatures for foreign function calls (e.g. HVA return by value) require allocation of an intermediate buffer to adapt the FFM's to the native stub's calling convention. In the current implementation, this buffer is malloced and freed on every FFM invocation, a non-negligible overhead.
>> 
>> Sample stack trace:
>> 
>>    java.lang.Thread.State: RUNNABLE
>> 	at jdk.internal.misc.Unsafe.allocateMemory0(java.base at 25-ea/Native Method)
>> ...
>> 	at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base at 25-ea/SharedUtils.java:386)
>> 	at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base at 25-ea/Unknown Source)
>> ...
>> 	at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base at 25-ea/Invokers$Holder)
>> 
>> 
>> To alleviate this, this PR remembers and reuses up to two small intermediate buffers per carrier-thread in subsequent calls.
>> 
>> Performance (MBA M3):
>> 
>> 
>> master at 764d70b7df18e288582e616c62b0d7078f1ff3aa
>> Benchmark                         Mode  Cnt    Score    Error  Units
>> PointsAlloc.circle_by_ptr         avgt   30    9.197 ?  0.037  ns/op
>> PointsAlloc.circle_by_value       avgt   30   42.195 ?  0.088  ns/op   <= #######
>> PointsAlloc.jni_ByteBuffer_alloc  avgt   30  226.127 ? 35.378  ns/op
>> PointsAlloc.jni_long_alloc        avgt   30   25.297 ?  2.457  ns/op
>> PointsAlloc.panama_alloc          avgt   30   27.053 ?  1.915  ns/op
>> 
>> After:
>> Benchmark                         Mode  Cnt    Score    Error  Units
>> PointsAlloc.circle_by_ptr         avgt   30    9.156 ?  0.021  ns/op
>> PointsAlloc.circle_by_value       avgt   30   11.995 ?  0.051  ns/op   <= #######
>> PointsAlloc.jni_ByteBuffer_alloc  avgt   30  211.161 ? 23.284  ns/op
>> PointsAlloc.jni_long_alloc        avgt   30   24.885 ?  2.461  ns/op
>> PointsAlloc.panama_alloc          avgt   30   26.905 ?  1.935  ns/op
>> 
>> 
>> `-prof gc` also shows that the new call path is fully scalar-replaced vs 160 byte/call before.
>
> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Implementation notes.

src/java.base/share/classes/jdk/internal/foreign/abi/CallBufferCache.java line 75:

> 73: 
> 74:     @SuppressWarnings("restricted")
> 75:     public static long allocate(long size) {

I don't think we should expose allocation/release methods (see related comment in `SharedUtils`)

src/java.base/share/classes/jdk/internal/foreign/abi/CallBufferCache.java line 97:

> 95:     public static long acquire() {
> 96:         // Protect against vthread unmount.
> 97:         Continuation.pin();

Other code that does this, first checks if we are in a virtual thread.
Note that we have two options for dealing with problematic cases:
* pin/unpin as soon as we see a virtual thread
* if we have a virtual thread and we detect a race in accessing the cache (e.g. because of how the virtual threads have been scheduled to the same underlying carrier thread), we could avoid pinning, but just allocate a fresh new segment (thus avoiding recycling issues)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23142#discussion_r1922533447
PR Review Comment: https://git.openjdk.org/jdk/pull/23142#discussion_r1922536958