Improve scaling of downcalls using MemorySegments allocated with shared arenas

Stuart Monteith stuart.monteith at arm.com
Fri Dec 5 19:52:04 UTC 2025


Hello,
	I have encountered a scaling problem with java.lang.foreign.MemorySegment when segments are passed to native
code. When native methods are called from more than about 8 cores with MemorySegments allocated from an arena created
by Arena.ofShared(), scaling can be sublinear under contention.

From profiling, it is apparent that excessive time is being spent in jdk.internal.foreign.SharedSession's acquire0 and
release0 methods. These methods check and increment or decrement the acquireCount field, which is used to prevent the
SharedSession from being closed while the acquireCount is non-zero. It is also used to prevent acquire0 from succeeding
if the session is already closed.
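For context, the existing scheme is conceptually a single shared counter that every thread CASes. The following is a
simplified sketch of that logic, not the exact JDK code:

    import java.util.concurrent.atomic.AtomicInteger;

    // Simplified sketch of the current single-counter scheme (illustrative).
    // Every acquire0/release0 from every thread CASes the same word, so the
    // cache line holding the count ping-pongs between cores under contention.
    final class SingleCounterSession {
        private static final int CLOSED = -1;   // sentinel: session closed
        private final AtomicInteger acquireCount = new AtomicInteger();

        void acquire0() {
            int value;
            do {
                value = acquireCount.get();
                if (value == CLOSED)
                    throw new IllegalStateException("Already closed");
            } while (!acquireCount.compareAndSet(value, value + 1));
        }

        void release0() {
            acquireCount.getAndDecrement();
        }

        boolean tryClose() {
            // Close only succeeds when no acquires are outstanding.
            return acquireCount.compareAndSet(0, CLOSED);
        }
    }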

The issue can be demonstrated with the microbenchmark:
   org.openjdk.bench.java.lang.foreign.CallOverheadConstant.panama_identity_memory_address_shared_3
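For reference, it can be run from a JDK checkout along these lines (the exact invocation may vary per setup; the
thread count is JMH's -t option):

    make test TEST="micro:org.openjdk.bench.java.lang.foreign.CallOverheadConstant.panama_identity_memory_address_shared_3" \
        MICRO="OPTIONS=-t 48"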

It produces the following results on the Neoverse-N1, N2, V1 and V2, Intel Xeon 8375c and the AMD Epyc 9654.

Each machine has >48 cores, and the results are in nanoseconds:

Threads    N1        N2        V1        V2        Xeon       Epyc
1          30.88     32.15     33.54     32.82     27.46      8.45
2         142.56    134.48    132.01    131.50    116.68     46.53
4         310.18    282.75    287.59    271.82    251.88     86.11
8         702.02    710.29    736.72    670.63    533.46    194.60
16      1,436.17  1,684.80  1,833.69  1,782.78  1,100.15    827.28
24      2,185.55  2,508.86  2,732.22  2,815.26  1,646.09  1,530.28
32      2,942.48  3,432.84  3,643.64  3,782.23  2,236.81  2,278.52
48      4,466.56  5,174.72  5,401.95  5,621.41  4,926.30  3,026.58

The microbenchmark is repeatedly calling a small native method, so the timings demonstrate the worst case. With perfect 
scaling, the timing should be the same for 48 threads on 48 cores as it is for 1 thread on 1 core.

The solution I came up with is here: https://github.com/openjdk/jdk/pull/28575
It was suggested that it would have been better to discuss the issue here first, but it was useful for me to try to
solve the issue first in order to understand the code.

The fundamental issue is producing a counter that can scale while ensuring that races between acquires and closes
behave correctly. To address this, I opted to use multiple counters, with at least as many counters as there are CPUs
on the machine. The current thread's hashcode is used to select a counter. The counters are allocated in an
AtomicIntegerArray, and are spaced such that each lies on its own cache line.
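A minimal sketch of the striping idea (names and constants here are my own illustrations, not the PR's actual code):

    import java.util.concurrent.atomic.AtomicIntegerArray;

    // Illustrative sketch of the striped acquire count.
    final class StripedCounters {
        // Assuming 64-byte cache lines: spacing counters 16 ints apart puts
        // each live counter on its own line, avoiding false sharing.
        private static final int STRIDE = 64 / Integer.BYTES;
        // At least one stripe per CPU, rounded up to a power of two so a
        // cheap mask can select a stripe.
        private static final int NSTRIPES = Integer.highestOneBit(
                Math.max(1, Runtime.getRuntime().availableProcessors() - 1)) * 2;

        private final AtomicIntegerArray counts =
                new AtomicIntegerArray(NSTRIPES * STRIDE);

        // Returns the stripe index as a "ticket" so the matching release
        // decrements the same counter, even from another thread.
        int acquire0() {
            int stripe = Thread.currentThread().hashCode() & (NSTRIPES - 1);
            counts.getAndIncrement(stripe * STRIDE);  // close/race handling omitted
            return stripe;
        }

        void release0(int ticket) {
            counts.getAndDecrement(ticket * STRIDE);
        }
    }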

The API has been changed such that acquire0 returns an int - a ticket that must be passed to the corresponding
release0(int) method. This was necessary because there is code where the acquire and release are executed on separate
threads, which would otherwise result in one counter being incremented and a different one decremented. The changes
are quite widespread because of this. If the original release0() signature is kept, and the counter recalculated
instead of being passed as a ticket, there are fewer changes, but perhaps an increased risk of bugs being introduced.
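For example, with the sketch above, a hand-off between threads would look like the following; without the ticket,
thread B's hash would select a different stripe:

    // Thread A acquires, thread B releases: the ticket carries the stripe
    // between them (hypothetical usage; session and executor are assumed).
    int ticket = session.acquire0();
    executor.execute(() -> session.release0(ticket));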

The SharedSession.justClose() method will either close successfully, in which case racing acquires fail, or vice
versa. To do this, there are two phases: first, the counters are put into a "closing" state if possible. Racing
acquires spin while a counter is in this state. If the close encounters a counter in use, it returns the counters
already in the "closing" state to open, and then fails. Otherwise, all of the counters in the "closing" state are
committed to the "closed" state, and any threads spinning in acquire receive the IllegalStateException indicating that
the SharedSession has been closed.
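Roughly, the close looks like this over the striped layout sketched above (again illustrative, with assumed sentinel
values, not the PR's code):

    // A counter holds its acquire count when open, or one of these sentinels:
    private static final int CLOSING = Integer.MIN_VALUE;
    private static final int CLOSED  = Integer.MIN_VALUE + 1;

    boolean tryClose() {
        // Phase 1: flip every idle (zero) counter to CLOSING.
        for (int i = 0; i < NSTRIPES; i++) {
            if (!counts.compareAndSet(i * STRIDE, 0, CLOSING)) {
                // A counter is in use: roll back the ones already flipped
                // and report failure to the caller.
                for (int j = 0; j < i; j++) {
                    counts.set(j * STRIDE, 0);
                }
                return false;
            }
        }
        // Phase 2: commit. Acquires spinning on CLOSING now observe CLOSED
        // and throw IllegalStateException.
        for (int i = 0; i < NSTRIPES; i++) {
            counts.set(i * STRIDE, CLOSED);
        }
        return true;
    }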

The PR above produces the following results on the same benchmark and machines:

Threads    N1      N2        V1        V2        Xeon      Epyc
1          32.41   32.11     34.43     31.32     27.94     9.82
2          32.64   33.72     35.11     31.30     28.02     9.81
4          32.71   36.84     34.67     31.35     28.12    10.49
8          58.22   31.60     36.87     31.72     47.09    16.52
16         70.15   47.76     52.37     47.26     70.91    14.53
24         77.38   78.14     81.67     71.98     87.20    21.70
32         87.54   98.01     84.73     86.79    109.25    22.65
48        121.54  128.14    120.51    104.35    175.08    26.85

In a more complex workload, we were seeing up to 3x better performance rather than 40x, as contention will typically
be lower than in the microbenchmark presented here.

The PR changes code well outside its immediate area, and adds some complexity, different timing, and more memory
overhead, particularly on machines with large core counts. There is also a reliance on the threads involved hashing
well to avoid contention. Closing may also take more time on machines with large core counts. I wouldn't anticipate
threads spinning in acquire0 during closing being an issue, as well-written programs should only close the
SharedSession when it is unused. There is a small cost in the uncontended, single-threaded benchmark, on the order of
1-2 ns.

Programmers might instead be advised to use Arena.ofAuto() if contention is the biggest issue; however, this may
introduce a delay in freeing resources.
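For illustration, the change at the allocation site is small (an automatic arena cannot be closed explicitly; its
memory is reclaimed once the GC finds it unreachable):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;

    // Explicitly-closed shared arena: downcalls pay the acquire/release cost.
    try (Arena shared = Arena.ofShared()) {
        MemorySegment seg = shared.allocate(64);
        // ... pass seg to native code ...
    }

    // Automatic arena: no explicit close; freed when the GC determines it is
    // unreachable, so deallocation timing is not under program control.
    MemorySegment seg = Arena.ofAuto().allocate(64);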


In terms of alternatives in the literature, the paper "Scalable Reader-Writer Locks" (Lev, Luchangco, Olszewski, 2009)
introduces C-SNZI, a "Closeable, Scalable Non-Zero Indicator". It does require more complex data structures that have
to be traversed in a fashion that may be slower than the SharedSession implementation today.

I look forward to hearing people's thoughts.

Stuart





