RFR: 8287788: reuse intermediate segments allocated during FFM stub invocations
Matthias Ernst
duke at openjdk.org
Sun Jan 19 21:09:15 UTC 2025
Certain signatures for foreign function calls require the allocation of an intermediate buffer to adapt the FFM calling convention to the native stub's ("needsReturnBuffer"). In the current implementation, this buffer is malloc'ed and freed on every FFM invocation, a non-negligible overhead.
Sample stack trace:
  java.lang.Thread.State: RUNNABLE
    at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
    at jdk.internal.misc.Unsafe.allocateMemory(java.base@25-ea/Unsafe.java:636)
    at jdk.internal.foreign.SegmentFactories.allocateMemoryWrapper(java.base@25-ea/SegmentFactories.java:215)
    at jdk.internal.foreign.SegmentFactories.allocateSegment(java.base@25-ea/SegmentFactories.java:193)
    at jdk.internal.foreign.ArenaImpl.allocateNoInit(java.base@25-ea/ArenaImpl.java:55)
    at jdk.internal.foreign.ArenaImpl.allocate(java.base@25-ea/ArenaImpl.java:60)
    at jdk.internal.foreign.ArenaImpl.allocate(java.base@25-ea/ArenaImpl.java:34)
    at java.lang.foreign.SegmentAllocator.allocate(java.base@25-ea/SegmentAllocator.java:645)
    at jdk.internal.foreign.abi.SharedUtils$2.<init>(java.base@25-ea/SharedUtils.java:388)
    at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
    at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
    at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
    at java.lang.invoke.LambdaForm$MH/0x000001f00109a400.invoke(java.base@25-ea/LambdaForm$MH)
    at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@25-ea/Invokers$Holder)
When does this happen? A fairly easy way to trigger it is to return a small aggregate like the following:
struct Vector2D {
    double x, y;
};

Vector2D Origin() {
    return {0, 0};
}
On AArch64, such a struct is returned in two 128-bit registers, v0/v1.
The VM's calling convention for the native stub consequently expects a 32-byte output segment argument.
The FFM downcall method handle, on the other hand, is expected to create a 16-byte result segment through the application-provided SegmentAllocator, and needs to perform an appropriate adaptation, roughly like so:
MemorySegment downcallMH(SegmentAllocator a) {
    MemorySegment tmp = SharedUtils.allocate(32);   // intermediate return buffer
    try {
        nativeStub.invoke(tmp);                     // leaves v0, v1 in tmp
        MemorySegment result = a.allocate(16);
        result.set(JAVA_DOUBLE, 0, tmp.get(JAVA_DOUBLE, 0));   // x from v0
        result.set(JAVA_DOUBLE, 8, tmp.get(JAVA_DOUBLE, 16));  // y from v1
        return result;
    } finally {
        free(tmp);
    }
}
You might argue that this cost is no worse than the allocation performed through the result allocator anyway. However, the application has control over the latter and may provide a segment-reusing allocator in a loop, roughly like so (a fuller sketch follows below):
MemorySegment result = allocate(resultLayout);
SegmentAllocator allocator = (_, _) -> result;
for (...) {
    mh.invoke(allocator);   // would like to avoid hidden allocations in here
}
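For illustration, a complete variant of that loop using the public API could look as follows, with SegmentAllocator.prefixAllocator recycling a single result segment. The library name ("libpoints.dylib"), the loop count, and the assumption that Origin() is exported unmangled (e.g. declared extern "C") are made up for the example:

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class ReuseResultSegment {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();

        // Layout matching 'struct Vector2D { double x, y; }'
        StructLayout vector2D = MemoryLayout.structLayout(
                ValueLayout.JAVA_DOUBLE.withName("x"),
                ValueLayout.JAVA_DOUBLE.withName("y"));

        try (Arena arena = Arena.ofConfined()) {
            // Hypothetical library; run with --enable-native-access to avoid warnings.
            SymbolLookup points = SymbolLookup.libraryLookup("libpoints.dylib", arena);
            MethodHandle origin = linker.downcallHandle(
                    points.find("Origin").orElseThrow(),
                    FunctionDescriptor.of(vector2D));

            // One 16-byte result segment, reused by every invocation.
            MemorySegment result = arena.allocate(vector2D);
            SegmentAllocator allocator = SegmentAllocator.prefixAllocator(result);

            for (int i = 0; i < 1_000_000; i++) {
                MemorySegment v = (MemorySegment) origin.invokeExact(allocator);
                // v is backed by 'result'; yet, before this change, every call still
                // malloc'ed and freed a hidden 32-byte return buffer internally.
            }
        }
    }
}

prefixAllocator answers every allocation request with a slice of the same segment starting at offset 0, so the only per-call native allocation left in this loop is the hidden return buffer addressed by this PR.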
To alleviate this, this PR remembers one such intermediate buffer per carrier thread and reuses it in subsequent calls, very similar to what happens in sun.nio.ch.Util.BufferCache or sun.nio.fs.NativeBuffers, which face similar issues.
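A minimal sketch of the caching idea follows; the class name CallBufferCache appears in the commit log, but the method names, the 64-byte threshold, and the use of a plain ThreadLocal (standing in for the internal carrier-thread-local) are illustrative assumptions, not the PR's actual code:

import jdk.internal.misc.Unsafe;  // JDK-internal; only accessible from within java.base

// Each thread keeps at most one previously malloc'ed return buffer around and hands it
// back out to the next downcall that needs a buffer of at most the cacheable size.
final class CallBufferCache {
    private static final Unsafe UNSAFE = Unsafe.getUnsafe();

    // Assumed fixed upper bound for cacheable return buffers.
    private static final long CACHED_SIZE = 64;

    // Single-slot cache holding a raw address; 0 means "empty".
    private static final ThreadLocal<long[]> CACHE = ThreadLocal.withInitial(() -> new long[1]);

    static long acquire(long size) {
        if (size <= CACHED_SIZE) {
            long[] slot = CACHE.get();
            long cached = slot[0];
            if (cached != 0) {
                slot[0] = 0;                            // hand out the cached buffer
                return cached;
            }
            return UNSAFE.allocateMemory(CACHED_SIZE);  // miss: allocate a cacheable buffer
        }
        return UNSAFE.allocateMemory(size);             // too large to cache: plain malloc
    }

    static void release(long address, long size) {
        if (size <= CACHED_SIZE) {
            long[] slot = CACHE.get();
            if (slot[0] == 0) {
                slot[0] = address;                      // keep it for the next call on this thread
                return;
            }
        }
        UNSAFE.freeMemory(address);                     // cache occupied, or buffer too large
    }

    private CallBufferCache() {}
}

Caching a raw address rather than a segment object mirrors the commit note below that address-based caching measured slightly faster.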
Performance (MBA M3):

Baseline:
# VM version: JDK 25-ea, OpenJDK 64-Bit Server VM, 25-ea+3-283
Benchmark                                        Mode  Cnt      Score     Error   Units
PointsAlloc.circle_by_ptr                        avgt    5      8.964 ±   0.351   ns/op
PointsAlloc.circle_by_ptr:·gc.alloc.rate         avgt    5     95.301 ±   3.665  MB/sec
PointsAlloc.circle_by_ptr:·gc.alloc.rate.norm    avgt    5      0.224 ±   0.001    B/op
PointsAlloc.circle_by_ptr:·gc.count              avgt    5      2.000            counts
PointsAlloc.circle_by_ptr:·gc.time               avgt    5      3.000                ms
PointsAlloc.circle_by_value                      avgt    5     46.498 ±   2.336   ns/op
PointsAlloc.circle_by_value:·gc.alloc.rate       avgt    5  13141.578 ± 650.425  MB/sec
PointsAlloc.circle_by_value:·gc.alloc.rate.norm  avgt    5    160.224 ±   0.001    B/op
PointsAlloc.circle_by_value:·gc.count            avgt    5    116.000            counts
PointsAlloc.circle_by_value:·gc.time             avgt    5     44.000                ms

With this PR:
# VM version: JDK 25-internal, OpenJDK 64-Bit Server VM, 25-internal-adhoc.mernst.jdk
Benchmark                                        Mode  Cnt      Score     Error   Units
PointsAlloc.circle_by_ptr                        avgt    5      9.108 ±   0.477   ns/op
PointsAlloc.circle_by_ptr:·gc.alloc.rate         avgt    5     93.792 ±   4.898  MB/sec
PointsAlloc.circle_by_ptr:·gc.alloc.rate.norm    avgt    5      0.224 ±   0.001    B/op
PointsAlloc.circle_by_ptr:·gc.count              avgt    5      2.000            counts
PointsAlloc.circle_by_ptr:·gc.time               avgt    5      4.000                ms
PointsAlloc.circle_by_value                      avgt    5     13.180 ±   0.611   ns/op
PointsAlloc.circle_by_value:·gc.alloc.rate       avgt    5     64.816 ±   2.964  MB/sec
PointsAlloc.circle_by_value:·gc.alloc.rate.norm  avgt    5      0.224 ±   0.001    B/op
PointsAlloc.circle_by_value:·gc.count             avgt    5      2.000            counts
PointsAlloc.circle_by_value:·gc.time              avgt    5      5.000                ms
-------------
Commit messages:
- tiny stylistic changes
- Storing segment addresses instead of objects in the cache appears to be slightly faster. Write barrier?
- (c)
- unit test
- move CallBufferCache out
- shave off a couple more nanos
- Add comparison benchmark for out-parameter.
- copyright header
- Benchmark:
- move pinned cache lookup out of constructor.
- ... and 20 more: https://git.openjdk.org/jdk/compare/8460072f...4a2210df
Changes: https://git.openjdk.org/jdk/pull/23142/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23142&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8287788
Stats: 402 lines in 7 files changed: 377 ins; 0 del; 25 mod
Patch: https://git.openjdk.org/jdk/pull/23142.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23142/head:pull/23142
PR: https://git.openjdk.org/jdk/pull/23142