Improve scaling of downcalls using MemorySegments allocated with shared arenas
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Dec 8 12:13:50 UTC 2025
Hi Stuart,
I believe your suggested approach is a good idea to consider. Splitting
the acquire/release counters seems like a sound strategy, and one that
has, to some extent, also been used elsewhere (e.g. LongAdder,
ConcurrentHashMap) to improve throughput under contention.
The "tricky bit" is to make sure we can do all this while retaining
correctness, as this is an already very tricky part of the code.
In the next few weeks we'll look at the code you wrote in more details
and try to flesh out potential issues.
This problem is quite similar to a read/write lock scenario (as you also
mention):
* the threads doing the acquire/release are effectively expressing a
desire to "read" a segment in a given piece of code. So, multiple
readers can co-exist.
* the thread doing the close is effectively expressing a desire to
"write" the segment -- so it should only be allowed to do so when
there are no readers.
In principle, something like this
https://docs.oracle.com/en/java/javase/25/docs/api//java.base/java/util/concurrent/locks/StampedLock.html
should work quite well for this use case. Or, we could even use a
LongAdder as an acquire/release counter:
https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/util/concurrent/atomic/LongAdder.html
Note the javadoc on this class:
>This class is usually preferable to AtomicLong
<https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html>
when multiple threads update a common sum that is used for purposes such
as collecting statistics, not for fine-grained synchronization control.
Under low update contention, the two classes have similar
characteristics. But under high contention, expected throughput of this
class is significantly higher, at the expense of higher space consumption.
This should in principle guarantee superior performance under contention
(similar to what you did). But, no matter the road taken, doing so
requires separating the liveness bit from the acquire count -- and
that's what needs to be analyzed more carefully (e.g. this means making
the acquire fail when the segment is closed, or about to be closed).
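
To make the analogy concrete, here's a rough, untested sketch of what a
StampedLock-based session could look like (illustrative names only -- not
the actual SharedSession code):

import java.util.concurrent.locks.StampedLock;

// Rough sketch only; names are illustrative, this is not the actual
// SharedSession code.
final class StampedSession {
    private final StampedLock lock = new StampedLock();
    private volatile boolean closed;

    // "acquire": take a read lock, then re-check liveness. The returned
    // stamp acts as a ticket that must be passed back to release().
    long acquire() {
        long stamp = lock.readLock();
        if (closed) {
            lock.unlockRead(stamp);
            throw new IllegalStateException("Already closed");
        }
        return stamp;
    }

    // "release": give the read lock back.
    void release(long stamp) {
        lock.unlockRead(stamp);
    }

    // "close": only succeeds when there are no outstanding readers.
    boolean close() {
        long stamp = lock.tryWriteLock();
        if (stamp == 0L) {
            return false;                 // still acquired by some thread
        }
        closed = true;                    // flip liveness under the write lock
        lock.unlockWrite(stamp);
        return true;
    }
}

(This is just to illustrate the read/write mapping; a real implementation
would of course need to fit the existing acquire0/release0 call sites.)
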
Thanks!
Maurizio
On 05/12/2025 19:52, Stuart Monteith wrote:
> Hello,
> I have encountered a scaling problem with
> java.lang.foreign.MemorySegments when they are passed to native code.
> When native methods are called from more than around 8 cores, with
> MemorySegments allocated from an Arena created with Arena.ofShared(),
> scaling can be sublinear under contention.
>
> From profiling, it is apparent that excessive time is being spent in
> jdk.internal.foreign.SharedSession's acquire0 and release0 methods.
> These methods check and increment or decrement the acquireCount field,
> which is used to prevent the SharedSession from being closed while the
> acquireCount is non-zero. It is also used to prevent acquire0 from
> succeeding if the session is already closed.
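>
> For illustration, the hot path behaves conceptually like a single
> CAS-contended counter, along these lines (a simplified sketch, not the
> literal JDK code):
>
> import java.util.concurrent.atomic.AtomicInteger;
>
> // Simplified illustration: one shared counter, where a negative value
> // stands for "closed". Every acquire/release from every thread hits the
> // same cache line, so throughput collapses under contention.
> final class SingleCounterSession {
>     private final AtomicInteger acquireCount = new AtomicInteger();
>
>     void acquire() {
>         int v;
>         do {
>             v = acquireCount.get();
>             if (v < 0) {
>                 throw new IllegalStateException("Already closed");
>             }
>         } while (!acquireCount.compareAndSet(v, v + 1));
>     }
>
>     void release() {
>         acquireCount.decrementAndGet();
>     }
> }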
>
> The issue can be demonstrated with the microbenchmark:
> org.openjdk.bench.java.lang.foreign.CallOverheadConstant.panama_identity_memory_address_shared_3
>
> It produces the following results on the Neoverse-N1, N2, V1 and V2,
> Intel Xeon 8375c and the AMD Epyc 9654.
>
> Each machine has >48 cores, and the results are in nanoseconds:
>
> Threads         N1         N2         V1         V2       Xeon       Epyc
>       1      30.88      32.15      33.54      32.82      27.46       8.45
>       2     142.56     134.48     132.01     131.50     116.68      46.53
>       4     310.18     282.75     287.59     271.82     251.88      86.11
>       8     702.02     710.29     736.72     670.63     533.46     194.60
>      16   1,436.17   1,684.80   1,833.69   1,782.78   1,100.15     827.28
>      24   2,185.55   2,508.86   2,732.22   2,815.26   1,646.09   1,530.28
>      32   2,942.48   3,432.84   3,643.64   3,782.23   2,236.81   2,278.52
>      48   4,466.56   5,174.72   5,401.95   5,621.41   4,926.30   3,026.58
>
> The microbenchmark is repeatedly calling a small native method, so the
> timings demonstrate the worst case. With perfect scaling, the timing
> should be the same for 48 threads on 48 cores as it is for 1 thread on
> 1 core.
>
> The solution I came up with is here:
> https://github.com/openjdk/jdk/pull/28575
> It was suggested that it would have been better to have discussed the
> issue here first, but it was useful for me to try to solve the issue
> first in order to understand the code.
>
> The fundamental issue is producing a counter that can scale while
> ensuring races between acquires and closes behave correctly. To
> address this, I opted to use multiple counters, with at least as many
> counters as there are CPUs on the machine. The current thread's
> hashcode is used to select a counter. The counters are allocated in an
> AtomicIntegerArray, and are spaced such that each lies on its own
> cache line.
>
> The API has been changed such that acquire0 returns an int - a ticket
> that needs to be passed to the corresponding release0(int) method.
> This was necessary because there is code where the acquire and the
> release are executed on separate threads, which would result in one
> counter being incremented and a different one decremented. The changes
> are quite widespread because of this.
> If the original release0() method is kept, and the counter
> recalculated instead of being passed as a ticket, there are fewer
> changes, but perhaps an increased risk of bugs being introduced.
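>
> To give a flavour of the approach, here is a simplified sketch (the
> actual PR differs in details and naming):
>
> import java.util.concurrent.atomic.AtomicIntegerArray;
>
> // Simplified sketch of striped acquire/release; not the actual PR code.
> final class StripedCounters {
>     static final int PADDING = 16;  // 16 ints = 64 bytes, one cache line per slot
>     static final int STRIPES = Runtime.getRuntime().availableProcessors();
>     static final int CLOSING = -1;  // close in progress, acquires spin
>     static final int CLOSED  = -2;  // close committed, acquires fail
>
>     final AtomicIntegerArray counters = new AtomicIntegerArray(STRIPES * PADDING);
>
>     // Returns the stripe index as the ticket to pass back to release(int).
>     int acquire() {
>         int stripe = (Thread.currentThread().hashCode() & 0x7fffffff) % STRIPES;
>         int slot = stripe * PADDING;
>         while (true) {
>             int v = counters.get(slot);
>             if (v >= 0) {
>                 if (counters.compareAndSet(slot, v, v + 1)) {
>                     return stripe;
>                 }
>             } else if (v == CLOSED) {
>                 throw new IllegalStateException("Already closed");
>             }
>             // v == CLOSING: spin until the close commits or rolls back
>         }
>     }
>
>     void release(int ticket) {
>         counters.decrementAndGet(ticket * PADDING);
>     }
> }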
>
> The SharedSession.justClose() method will either close successfully,
> in which case racing acquires fail, or vice versa. In order to do this,
> there are two phases: the counters are first put into a "closing"
> state, if they can be. Racing acquires will spin on this state. If the
> close comes across a counter in use, it sets the counters already in
> the "closing" state back to open, and then fails. Otherwise all of the
> counters in the "closing" state are committed to the "closed" state,
> and any threads spinning in acquire will receive the
> IllegalStateException indicating that the SharedSession has been closed.
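>
> Continuing the hypothetical StripedCounters sketch above, the two-phase
> close could look roughly like this (again simplified):
>
>     // Phase 1: CAS every idle counter from 0 to CLOSING; if any counter
>     // is found in use, roll the CLOSING ones back to 0 and fail.
>     // Phase 2: commit every CLOSING counter to CLOSED, at which point
>     // any spinning acquire fails with an IllegalStateException.
>     boolean justClose() {
>         for (int i = 0; i < STRIPES; i++) {
>             if (!counters.compareAndSet(i * PADDING, 0, CLOSING)) {
>                 for (int j = 0; j < i; j++) {
>                     counters.set(j * PADDING, 0);   // roll back phase 1
>                 }
>                 return false;                       // still acquired somewhere
>             }
>         }
>         for (int i = 0; i < STRIPES; i++) {
>             counters.set(i * PADDING, CLOSED);      // commit
>         }
>         return true;
>     }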
>
> The PR above produces the following results on the same benchmark and
> machines:
>
> Threads         N1         N2         V1         V2       Xeon       Epyc
>       1      32.41      32.11      34.43      31.32      27.94       9.82
>       2      32.64      33.72      35.11      31.30      28.02       9.81
>       4      32.71      36.84      34.67      31.35      28.12      10.49
>       8      58.22      31.60      36.87      31.72      47.09      16.52
>      16      70.15      47.76      52.37      47.26      70.91      14.53
>      24      77.38      78.14      81.67      71.98      87.20      21.70
>      32      87.54      98.01      84.73      86.79     109.25      22.65
>      48     121.54     128.14     120.51     104.35     175.08      26.85
>
> In a more complex workload, we were seeing up to 3x better
> performance, rather than 40x, as contention will typically be lower
> than in the microbenchmark presented here.
>
> The PR changes code well outside its immediate area, and adds some
> complexity, different timing, and more memory overhead, particularly on
> machines with large core counts. There is also a reliance on the
> threads involved hashing well to avoid contention. Closing may also
> take more time on machines with large core counts. I wouldn't
> anticipate threads spinning in acquire0 during closing being an issue,
> as well-written programs should close the SharedSession only when it is
> unused. There is a small cost in the uncontended, single-threaded
> benchmark, on the order of 1-2 ns.
>
> Programmers might instead be advised to use Arena.ofAuto() if
> contention is the biggest issue; however, this may introduce a delay
> in freeing resources.
>
>
> In terms of alternatives in the literature, the paper "Scalable
> Reader-Writer Locks" (Lev, Luchangco, Olszewski, 2009) introduces
> C-SNZI, the concept of a "Closeable, Scalable Non-Zero Indicator". It
> does require more complex data structures that have to be traversed in
> a fashion that may be slower than the SharedSession implementation
> today.
>
> I look forward to hearing people's thoughts.
>
> Stuart