RFR: 8287788: reuse intermediate segments allocated during FFM stub invocations
Matthias Ernst
duke at openjdk.org
Sun Jan 19 21:09:15 UTC 2025
Certain signatures for foreign function calls require the allocation of an intermediate buffer to adapt the FFM calling convention to the native stub's ("needsReturnBuffer"). In the current implementation, this buffer is malloc'ed and freed on every FFM invocation, a non-negligible overhead.
Sample stack trace:
  java.lang.Thread.State: RUNNABLE
    at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
    at jdk.internal.misc.Unsafe.allocateMemory(java.base@25-ea/Unsafe.java:636)
    at jdk.internal.foreign.SegmentFactories.allocateMemoryWrapper(java.base@25-ea/SegmentFactories.java:215)
    at jdk.internal.foreign.SegmentFactories.allocateSegment(java.base@25-ea/SegmentFactories.java:193)
    at jdk.internal.foreign.ArenaImpl.allocateNoInit(java.base@25-ea/ArenaImpl.java:55)
    at jdk.internal.foreign.ArenaImpl.allocate(java.base@25-ea/ArenaImpl.java:60)
    at jdk.internal.foreign.ArenaImpl.allocate(java.base@25-ea/ArenaImpl.java:34)
    at java.lang.foreign.SegmentAllocator.allocate(java.base@25-ea/SegmentAllocator.java:645)
    at jdk.internal.foreign.abi.SharedUtils$2.<init>(java.base@25-ea/SharedUtils.java:388)
    at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
    at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
    at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
    at java.lang.invoke.LambdaForm$MH/0x000001f00109a400.invoke(java.base@25-ea/LambdaForm$MH)
    at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@25-ea/Invokers$Holder)
When does this happen? A fairly easy way to trigger it is to return a small aggregate like the following:
struct Vector2D {
    double x, y;
};

Vector2D Origin() {
    return {0, 0};
}
On AArch64, such a struct is returned in two 128-bit registers, v0/v1.
The VM's calling convention for the native stub consequently expects a 32-byte output segment argument.
The FFM downcall method handle, on the other hand, is expected to create a 16-byte result segment through the application-provided SegmentAllocator, and needs to perform an appropriate adaptation, roughly like so:
MemorySegment downcallMH(SegmentAllocator a) {
    MemorySegment tmp = SharedUtils.allocate(32);   // intermediate return buffer
    try {
        nativeStub.invoke(tmp);                     // leaves v0, v1 in tmp
        MemorySegment result = a.allocate(16);
        result.set(JAVA_DOUBLE, 0, tmp.get(JAVA_DOUBLE, 0));   // x from v0
        result.set(JAVA_DOUBLE, 8, tmp.get(JAVA_DOUBLE, 16));  // y from v1
        return result;
    } finally {
        free(tmp);
    }
}
You might argue that this cost is no worse than the allocation performed through the result allocator anyway. However, the application has control over the latter and may provide a segment-reusing allocator in a loop, roughly like so (a fuller sketch follows below):
MemorySegment result = allocate(resultLayout);
SegmentAllocator allocator = (_, _) -> result;
for (...) {
    mh.invoke(allocator);   // would like to avoid hidden allocations in here
}
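For illustration, a complete variant of that loop using the public API could look as follows, with SegmentAllocator.prefixAllocator recycling a single result segment. The library name ("libpoints.dylib"), the loop count, and the assumption that Origin() is exported unmangled (e.g. declared extern "C") are made up for the example:

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class ReuseResultSegment {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();

        // Layout matching 'struct Vector2D { double x, y; }'
        StructLayout vector2D = MemoryLayout.structLayout(
                ValueLayout.JAVA_DOUBLE.withName("x"),
                ValueLayout.JAVA_DOUBLE.withName("y"));

        try (Arena arena = Arena.ofConfined()) {
            // Hypothetical library; run with --enable-native-access to avoid warnings.
            SymbolLookup points = SymbolLookup.libraryLookup("libpoints.dylib", arena);
            MethodHandle origin = linker.downcallHandle(
                    points.find("Origin").orElseThrow(),
                    FunctionDescriptor.of(vector2D));

            // One 16-byte result segment, reused by every invocation.
            MemorySegment result = arena.allocate(vector2D);
            SegmentAllocator allocator = SegmentAllocator.prefixAllocator(result);

            for (int i = 0; i < 1_000_000; i++) {
                MemorySegment v = (MemorySegment) origin.invokeExact(allocator);
                // v is backed by 'result'; yet, before this change, every call still
                // malloc'ed and freed a hidden 32-byte return buffer internally.
            }
        }
    }
}

prefixAllocator answers every allocation request with a slice of the same segment starting at offset 0, so the only per-call native allocation left in this loop is the hidden return buffer addressed by this PR.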
To alleviate this, this PR remembers one such intermediate buffer per carrier thread and reuses it in subsequent calls, very similar to what happens in sun.nio.ch.Util.BufferCache or sun.nio.fs.NativeBuffers, which face similar issues.
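A minimal sketch of the caching idea follows; the class name CallBufferCache appears in the commit log, but the method names, the 64-byte threshold, and the use of a plain ThreadLocal (standing in for the internal carrier-thread-local) are illustrative assumptions, not the PR's actual code:

import jdk.internal.misc.Unsafe;  // JDK-internal; only accessible from within java.base

// Each thread keeps at most one previously malloc'ed return buffer around and hands it
// back out to the next downcall that needs a buffer of at most the cacheable size.
final class CallBufferCache {
    private static final Unsafe UNSAFE = Unsafe.getUnsafe();

    // Assumed fixed upper bound for cacheable return buffers.
    private static final long CACHED_SIZE = 64;

    // Single-slot cache holding a raw address; 0 means "empty".
    private static final ThreadLocal<long[]> CACHE = ThreadLocal.withInitial(() -> new long[1]);

    static long acquire(long size) {
        if (size <= CACHED_SIZE) {
            long[] slot = CACHE.get();
            long cached = slot[0];
            if (cached != 0) {
                slot[0] = 0;                            // hand out the cached buffer
                return cached;
            }
            return UNSAFE.allocateMemory(CACHED_SIZE);  // miss: allocate a cacheable buffer
        }
        return UNSAFE.allocateMemory(size);             // too large to cache: plain malloc
    }

    static void release(long address, long size) {
        if (size <= CACHED_SIZE) {
            long[] slot = CACHE.get();
            if (slot[0] == 0) {
                slot[0] = address;                      // keep it for the next call on this thread
                return;
            }
        }
        UNSAFE.freeMemory(address);                     // cache occupied, or buffer too large
    }

    private CallBufferCache() {}
}

Caching a raw address rather than a segment object mirrors the commit note below that address-based caching measured slightly faster.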
Performance (MBA M3):

Baseline:
# VM version: JDK 25-ea, OpenJDK 64-Bit Server VM, 25-ea+3-283
Benchmark                                        Mode  Cnt      Score     Error   Units
PointsAlloc.circle_by_ptr                        avgt    5      8.964 ±   0.351   ns/op
PointsAlloc.circle_by_ptr:·gc.alloc.rate         avgt    5     95.301 ±   3.665  MB/sec
PointsAlloc.circle_by_ptr:·gc.alloc.rate.norm    avgt    5      0.224 ±   0.001    B/op
PointsAlloc.circle_by_ptr:·gc.count              avgt    5      2.000            counts
PointsAlloc.circle_by_ptr:·gc.time               avgt    5      3.000                ms
PointsAlloc.circle_by_value                      avgt    5     46.498 ±   2.336   ns/op
PointsAlloc.circle_by_value:·gc.alloc.rate       avgt    5  13141.578 ± 650.425  MB/sec
PointsAlloc.circle_by_value:·gc.alloc.rate.norm  avgt    5    160.224 ±   0.001    B/op
PointsAlloc.circle_by_value:·gc.count            avgt    5    116.000            counts
PointsAlloc.circle_by_value:·gc.time             avgt    5     44.000                ms

With this PR:
# VM version: JDK 25-internal, OpenJDK 64-Bit Server VM, 25-internal-adhoc.mernst.jdk
Benchmark                                        Mode  Cnt      Score     Error   Units
PointsAlloc.circle_by_ptr                        avgt    5      9.108 ±   0.477   ns/op
PointsAlloc.circle_by_ptr:·gc.alloc.rate         avgt    5     93.792 ±   4.898  MB/sec
PointsAlloc.circle_by_ptr:·gc.alloc.rate.norm    avgt    5      0.224 ±   0.001    B/op
PointsAlloc.circle_by_ptr:·gc.count              avgt    5      2.000            counts
PointsAlloc.circle_by_ptr:·gc.time               avgt    5      4.000                ms
PointsAlloc.circle_by_value                      avgt    5     13.180 ±   0.611   ns/op
PointsAlloc.circle_by_value:·gc.alloc.rate       avgt    5     64.816 ±   2.964  MB/sec
PointsAlloc.circle_by_value:·gc.alloc.rate.norm  avgt    5      0.224 ±   0.001    B/op
PointsAlloc.circle_by_value:·gc.count             avgt    5      2.000            counts
PointsAlloc.circle_by_value:·gc.time              avgt    5      5.000                ms
-------------
Commit messages:
- tiny stylistic changes
- Storing segment addresses instead of objects in the cache appears to be slightly faster. Write barrier?
- (c)
- unit test
- move CallBufferCache out
- shave off a couple more nanos
- Add comparison benchmark for out-parameter.
- copyright header
- Benchmark:
- move pinned cache lookup out of constructor.
- ... and 20 more: https://git.openjdk.org/jdk/compare/8460072f...4a2210df
Changes: https://git.openjdk.org/jdk/pull/23142/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23142&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8287788
Stats: 402 lines in 7 files changed: 377 ins; 0 del; 25 mod
Patch: https://git.openjdk.org/jdk/pull/23142.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23142/head:pull/23142
PR: https://git.openjdk.org/jdk/pull/23142