Improve scaling of downcalls using MemorySegments allocated with shared arenas
Stuart Monteith
stuart.monteith at arm.com
Mon Dec 8 17:05:33 UTC 2025
On 08/12/2025 12:13, Maurizio Cimadamore wrote:
> Hi Stuart,
> I believe your suggested approach is a good idea to consider. Splitting the acquire/release counters seems like a good
> idea and one that, to some extent, has also been used elsewhere (e.g. LongAdder, ConcurrentHashMap) to improve throughput
> under contention.
>
> The "tricky bit" is to make sure we can do all this while retaining correctness, as this is an already very tricky part
> of the code.
Yes, the heart of the issue is maintaining correctness while scaling. My compromise was to produce something that
resembles transactional memory, recognising that accounting for the state of multiple counters is necessarily spread
across multiple memory operations.
>
> In the next few weeks we'll look at the code you wrote in more detail and try to flesh out potential issues.
>
Thanks, I very much appreciate that.
> This problem is quite similar to a read/write lock scenario (as you also mention):
>
> * the threads doing the acquires/releases are effectively expressing a desire to "read" a segment in a given piece of
> code. So, multiple readers can co-exist.
> * the thread doing the close is effectively expressing a desire to "write" the segment -- so it should only be allowed
> to do so when there's no readers.
>
> In principle, something like this
>
> https://docs.oracle.com/en/java/javase/25/docs/api//java.base/java/util/concurrent/locks/StampedLock.html
>
I experimented with StampedLock, but found that it scaled more or less the same as the existing implementation, with
acquire0() calling tryReadLock(), release0() calling tryUnlockRead(), and justClose() calling tryWriteLock(). The
compare-and-swap operation appears to be the bottleneck.
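For reference, the mapping I tried was roughly as follows (a simplified sketch rather than the actual patch; the class
name and exception messages are illustrative):

    import java.util.concurrent.locks.StampedLock;

    // Simplified sketch of mapping the session operations onto a StampedLock.
    class StampedSession {
        private final StampedLock lock = new StampedLock();

        void acquire0() {
            // An acquire behaves like taking a read lock: many readers may coexist,
            // but it must fail once a close (the write lock) has succeeded.
            if (lock.tryReadLock() == 0L) {
                throw new IllegalStateException("Already closed");
            }
        }

        void release0() {
            lock.tryUnlockRead();
        }

        void justClose() {
            // A close behaves like taking the write lock: it only succeeds when
            // there are no outstanding readers (acquires), and it is never released.
            if (lock.tryWriteLock() == 0L) {
                throw new IllegalStateException("Session is still in use");
            }
        }
    }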
> should work quite well for this use case. Or, even using a LongAdder as an acquire/release counter:
>
> https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/util/concurrent/atomic/LongAdder.html
>
I found LongAdder/Striped64 interesting because, if necessary, we could look at how it handles memory. The SharedSession
code I wrote allocates memory upfront, so a scenario with one thread uses as much memory as one with 128 threads on
128 cores. Beyond that, getting the sum from a LongAdder is not atomic, and neither is acting upon it. I also
experimented with an AtomicLong, with the close operation subtracting a very large value to force the counter negative,
but that wasn't much different from the current implementation, as it still depends on atomic reads/writes to a single
memory location.
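For what it's worth, the AtomicLong variant looked roughly like this (an illustrative sketch, not the patch itself;
it glosses over some edge cases, such as an acquire racing with a close that ultimately fails):

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch: close() biases the counter by a large constant so it goes negative,
    // which racing acquires interpret as "closed". All traffic still hits one word,
    // so it contends much like the current single-counter scheme.
    class BiasedCounterSession {
        private static final long CLOSED_BIAS = 1L << 40;
        private final AtomicLong acquireCount = new AtomicLong();

        void acquire0() {
            if (acquireCount.incrementAndGet() < 0) {    // close() has biased the counter
                acquireCount.decrementAndGet();
                throw new IllegalStateException("Already closed");
            }
        }

        void release0() {
            acquireCount.decrementAndGet();
        }

        void justClose() {
            if (acquireCount.addAndGet(-CLOSED_BIAS) != -CLOSED_BIAS) {
                acquireCount.addAndGet(CLOSED_BIAS);     // outstanding acquires: undo and fail
                throw new IllegalStateException("Session is still in use");
            }
        }
    }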
> Note the javadoc on this class:
>
> > This class is usually preferable to AtomicLong
> > <https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html>
> > when multiple threads update a common sum that is used for purposes such as collecting statistics, not for
> > fine-grained synchronization control. Under low update contention, the two classes have similar characteristics.
> > But under high contention, expected throughput of this class is significantly higher, at the expense of higher
> > space consumption.
>
> This should in principle guarantee superior performance under contention (similar to what you did). But, no matter the
> road taken, doing so requires separating the liveness bit from the acquire count -- and that's what needs to be analyzed
> more carefully (e.g. this means making the acquire fail when the segment is closed, or about to be closed).
>
Indeed, atomically coordinating the count and the liveness state is really the issue - you can't atomically check and
act on the sum of a LongAdder.
> Thanks!
> Maurizio
>
Thanks,
Stuart
> On 05/12/2025 19:52, Stuart Monteith wrote:
>> Hello,
>> I have encountered a scaling problem with java.lang.foreign.MemorySegments when they are passed to native code.
>> When native methods are called from, say, more than 8 cores, with MemorySegments allocated from an Arena created
>> with Arena.ofShared(), scaling can be sublinear under contention.
>>
>> From profiling, it is apparent that excessive time is being spent in jdk.internal.foreign.SharedSession's acquire0 and
>> release0 methods. These methods check and increment or decrement the acquireCount field, which is used to prevent the
>> SharedSession from being closed while the acquireCount is non-zero. It is also used to prevent acquire0 from succeeding
>> if the SharedSession is already closed.
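>> In outline, the hot path is something like the following (paraphrased rather than the exact JDK source; an
>> AtomicInteger stands in for the field and VarHandle plumbing in the real code):
>>
>>     import java.util.concurrent.atomic.AtomicInteger;
>>
>>     // Paraphrased view of the current single-counter scheme. Every acquire and
>>     // release from every thread CASes the same word, so under contention the cache
>>     // line holding it bounces between cores and the CAS retries pile up.
>>     class SingleCounterSession {
>>         private final AtomicInteger acquireCount = new AtomicInteger(); // < 0 means closed
>>
>>         void acquire0() {
>>             int value;
>>             do {
>>                 value = acquireCount.get();
>>                 if (value < 0) {
>>                     throw new IllegalStateException("Already closed");
>>                 }
>>             } while (!acquireCount.compareAndSet(value, value + 1));
>>         }
>>
>>         void release0() {
>>             acquireCount.decrementAndGet();
>>         }
>>     }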
>>
>> The issue can be demonstrated with the micro benchmark:
>> org.openjdk.bench.java.lang.foreign.CallOverheadConstant.panama_identity_memory_address_shared_3
>>
>> It produces the following results on the Neoverse-N1, N2, V1 and V2, Intel Xeon 8375c and the AMD Epyc 9654.
>>
>> Each machine has >48 cores, and the results are in nanoseconds:
>>
>> Threads        N1        N2        V1        V2      Xeon      Epyc
>>       1     30.88     32.15     33.54     32.82     27.46      8.45
>>       2    142.56    134.48    132.01    131.50    116.68     46.53
>>       4    310.18    282.75    287.59    271.82    251.88     86.11
>>       8    702.02    710.29    736.72    670.63    533.46    194.60
>>      16  1,436.17  1,684.80  1,833.69  1,782.78  1,100.15    827.28
>>      24  2,185.55  2,508.86  2,732.22  2,815.26  1,646.09  1,530.28
>>      32  2,942.48  3,432.84  3,643.64  3,782.23  2,236.81  2,278.52
>>      48  4,466.56  5,174.72  5,401.95  5,621.41  4,926.30  3,026.58
>>
>> The microbenchmark is repeatedly calling a small native method, so the timings demonstrate the worst case. With
>> perfect scaling, the timing should be the same for 48 threads on 48 cores as it is for 1 thread on 1 core.
>>
>> The solution I came up with is here: https://github.com/openjdk/jdk/pull/28575
>> It was suggested that it would have been better to have first discussed the issue here, but it was useful for me to
>> try to solve the issue first in order to understand the code.
>>
>> The fundamental issue is producing a counter that can scale while ensuring races between acquires and closes behave
>> correctly. To address this, I opted to use multiple counters, with at least as many counters as there are CPUs on the
>> machine. The current thread's hashcode is used to select a counter. The counters are allocated in an
>> AtomicIntegerArray, and are spaced such that each lies on its own cache line.
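>> A minimal sketch of the slot selection and padding (illustrative; the PR's exact layout and hash spreading differ):
>>
>>     import java.util.concurrent.atomic.AtomicIntegerArray;
>>
>>     // Sketch: one counter per CPU, spread STRIDE ints apart so each sits on its
>>     // own cache line, with the current thread's hash code picking the slot.
>>     class StripedCounters {
>>         private static final int STRIDE = 16;   // 16 * 4 bytes = 64 bytes, a typical cache line
>>         private final int slots = Runtime.getRuntime().availableProcessors();
>>         private final AtomicIntegerArray counters = new AtomicIntegerArray(slots * STRIDE);
>>
>>         int slotFor(Thread t) {
>>             int h = t.hashCode();
>>             h ^= (h >>> 16);                     // spread the hash a little
>>             return Math.floorMod(h, slots) * STRIDE;
>>         }
>>     }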
>>
>> The API has been changed such that acquire0 returns an int - a ticket that needs to be passed to the corresponding
>> release0(int) method. This was necessary as there is code where the acquire and release are executed on separate
>> threads, which would otherwise result in separate counters being incremented and then decremented. The changes are
>> quite widespread because of this. If the original release0() signature is kept, and the counter is recalculated
>> instead of being passed as a ticket, there are fewer changes, but perhaps an increased risk of bugs being introduced.
>>
>> The SharedSession.justClose() method will either successfully close, with racing acquires failing, or vice versa. To
>> achieve this there are two phases: first, the counters are put into a "closing" state where possible. Racing acquires
>> spin on this state. If the close comes across a counter in use, it sets the counters that are in the "closing" state
>> back to open, and then fails. Otherwise, all of the counters in the "closing" state are committed to the "closed"
>> state, and any threads spinning in acquire receive the IllegalStateException indicating that the SharedSession has
>> been closed.
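>> Putting the ticket and the two-phase close together, the overall shape is roughly the following (an illustrative
>> sketch only; the state encoding, spin policy and memory layout in the PR differ):
>>
>>     import java.util.concurrent.atomic.AtomicIntegerArray;
>>
>>     // Sketch of the two-phase close protocol. Counter states: 0 = open/unused,
>>     // n > 0 = n outstanding acquires, CLOSING and CLOSED are negative sentinels.
>>     class TwoPhaseClose {
>>         private static final int CLOSING = -1;
>>         private static final int CLOSED  = -2;
>>         private static final int STRIDE  = 16;            // one counter per cache line
>>         private final int slots = Runtime.getRuntime().availableProcessors();
>>         private final AtomicIntegerArray counters = new AtomicIntegerArray(slots * STRIDE);
>>
>>         int acquire() {
>>             int slot = Math.floorMod(Thread.currentThread().hashCode(), slots) * STRIDE;
>>             while (true) {
>>                 int v = counters.get(slot);
>>                 if (v == CLOSED) {
>>                     throw new IllegalStateException("Already closed");
>>                 } else if (v == CLOSING) {
>>                     Thread.onSpinWait();                  // a close is racing; wait for its outcome
>>                 } else if (counters.compareAndSet(slot, v, v + 1)) {
>>                     return slot;                          // the ticket for release(int)
>>                 }
>>             }
>>         }
>>
>>         void release(int slot) {
>>             counters.decrementAndGet(slot);
>>         }
>>
>>         void close() {
>>             // Phase 1: move every idle counter to CLOSING; bail out if any is in use.
>>             for (int i = 0; i < slots; i++) {
>>                 if (!counters.compareAndSet(i * STRIDE, 0, CLOSING)) {
>>                     for (int j = 0; j < i; j++) {         // roll back and fail
>>                         counters.set(j * STRIDE, 0);
>>                     }
>>                     throw new IllegalStateException("Session is still in use");
>>                 }
>>             }
>>             // Phase 2: commit; spinning acquirers now observe CLOSED and fail.
>>             for (int i = 0; i < slots; i++) {
>>                 counters.set(i * STRIDE, CLOSED);
>>             }
>>         }
>>     }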
>>
>> The PR above produces the following results on the same benchmark and machines:
>>
>> Threads      N1      N2      V1      V2    Xeon    Epyc
>>       1   32.41   32.11   34.43   31.32   27.94    9.82
>>       2   32.64   33.72   35.11   31.30   28.02    9.81
>>       4   32.71   36.84   34.67   31.35   28.12   10.49
>>       8   58.22   31.60   36.87   31.72   47.09   16.52
>>      16   70.15   47.76   52.37   47.26   70.91   14.53
>>      24   77.38   78.14   81.67   71.98   87.20   21.70
>>      32   87.54   98.01   84.73   86.79  109.25   22.65
>>      48  121.54  128.14  120.51  104.35  175.08   26.85
>>
>> In a more complex workload, we were seeing up to 3x better performance, rather than 40x, as contention will typically
>> be lower than it is in the microbenchmark presented here.
>>
>> The PR changes code well outside its immediate area, and adds some complexity, different timing, and more memory
>> overhead, particularly on machines with large core counts. There is also a reliance on the threads involved hashing
>> well to avoid contention. Closing may also take more time on machines with large core counts. I wouldn't anticipate
>> threads spinning in acquire0 during closing being an issue, as well-written programs should only close when the
>> SharedSession is unused. There is a small cost in the uncontended, single-threaded benchmark, on the order of 1-2 ns.
>>
>> Programmers might instead be advised to use Arena.ofAuto() if contention is the biggest issue, although this may
>> introduce a delay when freeing resources.
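>> For completeness, that alternative is simply the following (the class and method names are illustrative):
>>
>>     import java.lang.foreign.Arena;
>>     import java.lang.foreign.MemorySegment;
>>     import java.lang.foreign.ValueLayout;
>>
>>     // Automatic arenas have no explicit close(); the segment is freed some time
>>     // after the GC determines the arena is unreachable - the delay mentioned above.
>>     class AutoArenaExample {
>>         static MemorySegment allocateLong() {
>>             Arena arena = Arena.ofAuto();
>>             return arena.allocate(ValueLayout.JAVA_LONG);
>>         }
>>     }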
>>
>>
>> In terms of alternatives in the literature, the paper "Scalable Reader-Writer Locks" (Lev, Luchangco, Olszewski,
>> 2009) introduces C-SNZI, a "Closeable, Scalable Non-Zero Indicator". However, it requires more complex data structures
>> that have to be traversed in a fashion that may be slower than the SharedSession implementation today.
>>
>> I look forward to hearing people's thoughts.
>>
>> Stuart
>>
>>
>>
>>