RFR: 8347408: Create an internal method handle adapter for system calls with errno
Per Minborg
pminborg at openjdk.org
Wed Feb 26 12:58:23 UTC 2025
As we convert older JDK code to use the relatively new FFM API, system calls that report errors via `errno` and the like must explicitly allocate a `MemorySegment` to capture potential error states. Unless designed carefully, this can hurt performance, and it also introduces unnecessary code complexity.
Hence, this PR proposes adding a JDK-internal method handle adapter that can be used to handle system calls with `errno`, `GetLastError`, and `WSAGetLastError`.
It relies on an efficient carrier-thread-local cache of memory regions to elide allocations.
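To illustrate the idea of hiding the capture-state argument behind an adapter, here is a minimal sketch that uses only `java.lang.invoke`. It is not the code in this PR: `CaptureAdapterSketch`, `fakeClose`, the `int[]` stand-in for a native capture segment, and the convention of returning the negated error code on failure are all assumptions made for illustration.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class CaptureAdapterSketch {

    // Assumption: a per-thread int[] stands in for the pooled native capture segment.
    private static final ThreadLocal<int[]> CAPTURE =
            ThreadLocal.withInitial(() -> new int[1]);

    // Adapted handle of type (int)int; the capture parameter is no longer visible.
    private static final MethodHandle ADAPTED;

    static {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodHandle raw = lookup.findStatic(CaptureAdapterSketch.class, "fakeClose",
                    MethodType.methodType(int.class, int[].class, int.class));
            MethodHandle wrapper = lookup.findStatic(CaptureAdapterSketch.class, "invokeWithCapture",
                    MethodType.methodType(int.class, MethodHandle.class, int.class));
            // Binding the raw handle leaves only the "real" argument exposed.
            ADAPTED = wrapper.bindTo(raw);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Hypothetical "system call": fails with error code 9 when fd is negative.
    static int fakeClose(int[] capture, int fd) {
        if (fd < 0) {
            capture[0] = 9; // plays the role of errno
            return -1;
        }
        return 0;
    }

    // Supplies the thread-local capture buffer and folds the error code into the result.
    static int invokeWithCapture(MethodHandle target, int fd) throws Throwable {
        int[] capture = CAPTURE.get();
        int result = (int) target.invokeExact(capture, fd);
        return result >= 0 ? result : -capture[0];
    }

    // Convenience entry point so that callers need not handle Throwable.
    public static int call(int fd) {
        try {
            return (int) ADAPTED.invokeExact(fd);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(call(3));  // prints 0
        System.out.println(call(-1)); // prints -9
    }
}
```

The caller sees a plain `(int)int` handle and never allocates anything for the error state, which is the property the proposed adapter aims for.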
Here are some benchmarks run on virtual threads (`OfVirtual`) and on a platform thread, respectively (M1 Mac):
Benchmark                                                  Mode  Cnt   Score   Error  Units
CaptureStateUtilBench.OfVirtual.adaptedSysCallFail         avgt   30  24.330 ± 0.820  ns/op
CaptureStateUtilBench.OfVirtual.adaptedSysCallSuccess      avgt   30   8.257 ± 0.117  ns/op
CaptureStateUtilBench.OfVirtual.explicitAllocationFail     avgt   30  41.415 ± 1.013  ns/op
CaptureStateUtilBench.OfVirtual.explicitAllocationSuccess  avgt   30  21.720 ± 0.463  ns/op
CaptureStateUtilBench.OfVirtual.tlAllocationFail           avgt   30  23.636 ± 0.182  ns/op
CaptureStateUtilBench.OfVirtual.tlAllocationSuccess        avgt   30   8.234 ± 0.156  ns/op
CaptureStateUtilBench.adaptedSysCallFail                   avgt   30  23.918 ± 0.487  ns/op
CaptureStateUtilBench.adaptedSysCallSuccess                avgt   30   4.946 ± 0.089  ns/op
CaptureStateUtilBench.explicitAllocationFail               avgt   30  42.280 ± 1.128  ns/op
CaptureStateUtilBench.explicitAllocationSuccess            avgt   30  21.809 ± 0.413  ns/op
CaptureStateUtilBench.tlAllocationFail                     avgt   30  24.422 ± 0.673  ns/op
CaptureStateUtilBench.tlAllocationSuccess                  avgt   30   5.182 ± 0.152  ns/op
Adapted system call:
```
return (int) ADAPTED_HANDLE.invoke(0, 0); // Uses a MH-internal pool
```
Explicit allocation:
```
try (var arena = Arena.ofConfined()) {
    return (int) HANDLE.invoke(arena.allocate(4), 0, 0);
}
```
Thread Local allocation:
```
try (var arena = POOLS.take()) {
    return (int) HANDLE.invoke(arena.allocate(4), 0, 0); // Uses a manually specified pool
}
```
The adapted system call exhibits a ~4x performance improvement over the existing "explicit allocation" scheme for the happy path on platform threads. Because there needs to be sharing across threads for virtual-thread-capable carrier threads, these are a bit slower ("only" ~3x faster).
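The quoted speedups can be checked against the `adaptedSysCall*` and `explicitAllocation*` rows of the first benchmark table. The following sketch does the arithmetic; the class and method names are hypothetical, and only the scores come from the table.

```java
final class SpeedupSketch {

    // Ratio of baseline time to candidate time; values above 1 mean the candidate is faster.
    static double speedup(double baselineNsPerOp, double candidateNsPerOp) {
        return baselineNsPerOp / candidateNsPerOp;
    }

    public static void main(String[] args) {
        // Happy path, platform thread: explicitAllocationSuccess vs adaptedSysCallSuccess
        System.out.println(speedup(21.809, 4.946)); // roughly 4.4
        // Happy path, virtual thread: OfVirtual.explicitAllocationSuccess vs OfVirtual.adaptedSysCallSuccess
        System.out.println(speedup(21.720, 8.257)); // roughly 2.6
        // Failure path, platform thread: explicitAllocationFail vs adaptedSysCallFail
        System.out.println(speedup(42.280, 23.918)); // roughly 1.8
    }
}
```

The failure path gains less in relative terms, since both variants pay the same fixed cost for handling the error on top of the allocation.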

Here are some benchmarks for the underlying ArenaPool (M1 Mac):
Benchmark                                   (ELEM_SIZE)  Mode  Cnt   Score    Error  Units
ArenaPoolBench.OfVirtual.confined                     4  avgt   30  23.543 ±  0.168  ns/op
ArenaPoolBench.OfVirtual.confined                    64  avgt   30  27.384 ±  0.167  ns/op
ArenaPoolBench.OfVirtual.confined2                    4  avgt   30  47.811 ±  0.220  ns/op
ArenaPoolBench.OfVirtual.confined2                   64  avgt   30  55.404 ±  0.286  ns/op
ArenaPoolBench.OfVirtual.pooled                       4  avgt   30   8.210 ±  0.043  ns/op
ArenaPoolBench.OfVirtual.pooled                      64  avgt   30  45.525 ± 52.525  ns/op
ArenaPoolBench.OfVirtual.pooled2                      4  avgt   30  50.670 ±  0.778  ns/op
ArenaPoolBench.OfVirtual.pooled2                     64  avgt   30  85.846 ±  2.304  ns/op
ArenaPoolBench.confined                               4  avgt   30  23.286 ±  0.184  ns/op
ArenaPoolBench.confined                              64  avgt   30  27.026 ±  0.111  ns/op
ArenaPoolBench.confined2                              4  avgt   30  48.301 ±  0.942  ns/op
ArenaPoolBench.confined2                             64  avgt   30  57.512 ±  5.373  ns/op
ArenaPoolBench.pooled                                 4  avgt   30   5.085 ±  0.048  ns/op
ArenaPoolBench.pooled                                64  avgt   30  29.621 ±  0.440  ns/op
ArenaPoolBench.pooled2                                4  avgt   30  10.610 ±  0.339  ns/op
ArenaPoolBench.pooled2                               64  avgt   30  60.815 ±  1.046  ns/op
ArenaPoolFromBench.OfVirtual.confinedInt            N/A  avgt   30  21.944 ±  0.122  ns/op
ArenaPoolFromBench.OfVirtual.confinedSting          N/A  avgt   30  26.190 ±  0.193  ns/op
ArenaPoolFromBench.OfVirtual.pooledInt              N/A  avgt   30   8.217 ±  0.043  ns/op
ArenaPoolFromBench.OfVirtual.pooledString           N/A  avgt   30   9.271 ±  0.056  ns/op
ArenaPoolFromBench.confinedInt                      N/A  avgt   30  21.892 ±  0.139  ns/op
ArenaPoolFromBench.confinedSting                    N/A  avgt   30  26.012 ±  0.058  ns/op
ArenaPoolFromBench.pooledInt                        N/A  avgt   30   5.056 ±  0.034  ns/op
ArenaPoolFromBench.pooledString                     N/A  avgt   30   6.419 ±  0.037  ns/op
Note: The pool size for the above benchmarks was 32 bytes.
This PR relates to https://github.com/openjdk/jdk/pull/23391, which we had to back out. It attempts to ensure that the problems encountered there do not resurface here.
For platform threads, the arena pool is able to share recyclable memory across several arenas.
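As a rough illustration of what sharing recyclable memory across several arenas can look like, here is a simplified, thread-confined sketch. It is not the `ArenaPool` from this PR: `PooledArenaSketch` and `pooledBytesInUse` are hypothetical names, a direct `ByteBuffer` stands in for a native `MemorySegment`, and the policy of recycling the region only once every open arena on the thread has been closed is an assumption chosen to keep out-of-order closes safe.

```java
import java.nio.ByteBuffer;

final class PooledArenaSketch implements AutoCloseable {

    static final int POOL_SIZE = 32; // same size as noted for the benchmarks above

    // Per-thread recyclable region, shared by all arenas taken on that thread.
    private static final class Slot {
        final ByteBuffer region = ByteBuffer.allocateDirect(POOL_SIZE);
        int used;       // bytes of the region currently handed out
        int openArenas; // arenas taken on this thread and not yet closed
    }

    private static final ThreadLocal<Slot> SLOT = ThreadLocal.withInitial(Slot::new);

    private final Slot slot;
    private boolean closed;

    private PooledArenaSketch(Slot slot) {
        this.slot = slot;
    }

    // Takes an arena backed by the current thread's recyclable region.
    public static PooledArenaSketch take() {
        Slot slot = SLOT.get();
        slot.openArenas++;
        return new PooledArenaSketch(slot);
    }

    // Serves small requests from the shared region; larger ones fall back to a fresh allocation.
    public ByteBuffer allocate(int size) {
        if (closed) {
            throw new IllegalStateException("arena is closed");
        }
        if (size <= POOL_SIZE - slot.used) {
            ByteBuffer view = slot.region.duplicate();
            view.position(slot.used);
            view.limit(slot.used + size);
            slot.used += size;
            return view.slice(); // note: recycled bytes are not zeroed in this sketch
        }
        return ByteBuffer.allocateDirect(size);
    }

    // Bytes of the current thread's region that are handed out right now.
    public static int pooledBytesInUse() {
        return SLOT.get().used;
    }

    // The region becomes reusable once the last open arena on this thread is closed.
    @Override
    public void close() {
        if (closed) {
            return;
        }
        closed = true;
        if (--slot.openArenas == 0) {
            slot.used = 0;
        }
    }
}
```

Two arenas taken on the same thread draw from the same 32-byte region, a 64-byte request bypasses it, and closing the arenas in any order leaves the region fully reusable afterwards.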
This PR passes tier1, tier2, and tier3 testing.
-------------
Commit messages:
- Use lazy initialization of method handles
- Clean up visibility
- Merge branch 'master' into errno-util3
- Add @ForceInline annotations
- Add out of order test for VTs
- Allow memory reuse for several arenas
- Remove file
- Use more frequent allocations
- Merge branch 'master' into errno-util3
- Add unsafe variant
- ... and 21 more: https://git.openjdk.org/jdk/compare/037e4711...907329e9
Changes: https://git.openjdk.org/jdk/pull/23765/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23765&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8347408
Stats: 1902 lines in 13 files changed: 1891 ins; 2 del; 9 mod
Patch: https://git.openjdk.org/jdk/pull/23765.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23765/head:pull/23765
PR: https://git.openjdk.org/jdk/pull/23765