<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>Hi Stuart,<br>
I believe your suggested approach is worth considering.
Splitting the acquire/release counters seems like a good idea, and
one that, to some extent, has also been used elsewhere (e.g.
LongAdder, ConcurrentHashMap) to improve throughput under
contention.</p>
<p>The "tricky bit" is to make sure we can do all this while
retaining correctness, as this is an already very tricky part of
the code.</p>
<p>In the next few weeks we'll look at the code you wrote in more
detail and try to flesh out potential issues.</p>
<p>This problem is quite similar to a read/write lock scenario (as
you also mention):</p>
<p>* the threads doing the acquires/releases are effectively
expressing a desire to "read" a segment in a given piece of code,
so multiple readers can co-exist.<br>
* the thread doing the close is effectively expressing a desire to
"write" the segment -- so it should only be allowed to do so when
there are no readers.</p>
<p>In principle, something like this<br>
</p>
<p><a class="moz-txt-link-freetext" href="https://docs.oracle.com/en/java/javase/25/docs/api//java.base/java/util/concurrent/locks/StampedLock.html">https://docs.oracle.com/en/java/javase/25/docs/api//java.base/java/util/concurrent/locks/StampedLock.html</a></p>
<p>should work quite well for this use case. Or, even using a
LongAdder as an acquire/release counter:</p>
<p><a class="moz-txt-link-freetext" href="https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/util/concurrent/atomic/LongAdder.html">https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/util/concurrent/atomic/LongAdder.html</a></p>
<p>Note the javadoc on this class:</p>
<p>>This class is usually preferable to <a href="https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html" title="class in java.util.concurrent.atomic"><code>AtomicLong</code></a>
when multiple threads update a common sum that is used for purposes
such as collecting statistics, not for fine-grained synchronization
control. Under low update contention, the two classes have similar
characteristics. But under high contention, expected throughput of
this class is significantly higher, at the expense of higher space
consumption.</p>
<p>This should in principle guarantee superior performance under
contention (similar to what you did). But, no matter the road
taken, doing so requires separating the liveness bit from the
acquire count -- and that's what needs to be analyzed more
carefully (e.g. this means making the acquire fail when the
segment is closed, or about to be closed).<br>
</p>
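<p>For illustration, here's a minimal sketch of the read/write-lock
framing using StampedLock -- the wrapper class and all its names are
made up for this example, and it ignores the performance question;
it just shows the liveness bit kept separate from the acquire path:</p>
<pre>
import java.util.concurrent.locks.StampedLock;

// Hypothetical wrapper, not JDK code: acquire/release map to the read
// lock, close maps to an exclusive write lock that fails fast.
final class StampedSession {
    private final StampedLock lock = new StampedLock();
    private volatile boolean closed;

    long acquire() {
        long stamp = lock.readLock();   // many readers may hold the segment
        if (closed) {                   // liveness check under the read lock
            lock.unlockRead(stamp);
            throw new IllegalStateException("Already closed");
        }
        return stamp;                   // caller passes this back to release
    }

    void release(long stamp) {
        lock.unlockRead(stamp);
    }

    void close() {
        long stamp = lock.tryWriteLock();  // fails while any reader is active
        if (stamp == 0L) {
            throw new IllegalStateException("Segment is in use");
        }
        try {
            closed = true;                 // liveness bit, separate from the count
        } finally {
            lock.unlockWrite(stamp);
        }
    }
}
</pre>
<p>Since closed is only ever set while the write lock is held, a reader
that subsequently acquires the read lock is guaranteed to observe it.</p>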
<p>Thanks!<br>
Maurizio<br>
</p>
<div class="moz-cite-prefix">On 05/12/2025 19:52, Stuart Monteith
wrote:<br>
</div>
<blockquote type="cite" cite="mid:3506dd6c-ef3d-4d4b-972b-fe83bb5debd8@arm.com">Hello,
<br>
I have encountered a scaling problem with
java.lang.foreign.MemorySegments when they are passed to native
code. When native methods are called from, say, more than around 8
cores with MemorySegments allocated from an Arena obtained from
Arena.ofShared(), scaling can be sublinear under contention.
<br>
<br>
From profiling, it is apparent that excessive time is being spent
in jdk.internal.foreign.SharedSession's acquire0 and release0
methods. These methods check and increment or decrement the
acquireCount field, which is used to prevent the SharedSession
from being closed while the acquireCount is not zero. It is also
used to prevent acquire0 from succeeding if the session is
already closed.
<br>
<br>
The issue can be demonstrated with the microbenchmark:
<br>
org.openjdk.bench.java.lang.foreign.CallOverheadConstant.panama_identity_memory_address_shared_3
<br>
<br>
It produces the following results on the Neoverse-N1, N2, V1 and
V2, Intel Xeon 8375c and the AMD Epyc 9654.
<br>
<br>
Each machine has >48 cores, and the results are in nanoseconds:
<br>
<br>
<pre>
Threads        N1        N2        V1        V2      Xeon      Epyc
      1     30.88     32.15     33.54     32.82     27.46      8.45
      2    142.56    134.48    132.01    131.50    116.68     46.53
      4    310.18    282.75    287.59    271.82    251.88     86.11
      8    702.02    710.29    736.72    670.63    533.46    194.60
     16  1,436.17  1,684.80  1,833.69  1,782.78  1,100.15    827.28
     24  2,185.55  2,508.86  2,732.22  2,815.26  1,646.09  1,530.28
     32  2,942.48  3,432.84  3,643.64  3,782.23  2,236.81  2,278.52
     48  4,466.56  5,174.72  5,401.95  5,621.41  4,926.30  3,026.58
</pre>
<br>
<br>
The microbenchmark is repeatedly calling a small native method, so
the timings demonstrate the worst case. With perfect scaling, the
timing should be the same for 48 threads on 48 cores as it is for
1 thread on 1 core.
<br>
<br>
The solution I came up with is here:
<a class="moz-txt-link-freetext" href="https://github.com/openjdk/jdk/pull/28575">https://github.com/openjdk/jdk/pull/28575</a>
<br>
It was suggested that it would have been better to have first
discussed the issue here, but it was useful for me to try and
solve the issue first in order to understand the code.
<br>
<br>
The fundamental issue is producing a counter that can scale while
ensuring races between acquires and closes behave correctly. To
address this, I opted to use multiple counters, with at least as
many counters as there are CPUs on the machine. The current
thread's hashcode is used to select a counter. The counters are
allocated in an AtomicIntegerArray, and are spaced such that each
lies on its own cache line.
<br>
<br>
The API has been changed such that acquire0 returns an int -- a
ticket that needs to be passed to the corresponding release0(int)
method. This was necessary as there is code where the two are
executed on separate threads, which would result in separate
counters being incremented and then decremented. The changes are
quite widespread because of this.
<br>
If the original release0() method were kept, with the counter
recalculated instead of being passed as a ticket, there would be
fewer changes, but perhaps an increased risk of bugs being
introduced.
<br>
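As a rough illustration of the striping and the ticket (the names
and cache-line sizing here are made up, not the PR's actual code,
and the "closing" states described next are elided):
<br>
<pre>
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch of striped acquire/release counters. Each stripe sits on its
// own 64-byte cache line to avoid false sharing between cores.
final class StripedAcquireCounters {
    private static final int STRIDE = 64 / Integer.BYTES; // 16 ints per line
    private final int stripes;            // power of two, at least the CPU count
    private final AtomicIntegerArray counters;

    StripedAcquireCounters() {
        int cpus = Runtime.getRuntime().availableProcessors();
        this.stripes = Integer.highestOneBit(Math.max(cpus - 1, 1)) * 2;
        this.counters = new AtomicIntegerArray(stripes * STRIDE);
    }

    // acquire0 analogue: the stripe index doubles as the "ticket", so the
    // matching release can decrement the same counter from any thread.
    int acquire() {
        int idx = (Thread.currentThread().hashCode() & (stripes - 1)) * STRIDE;
        counters.incrementAndGet(idx);
        return idx;
    }

    // release0(int) analogue: must be handed the ticket from the acquire.
    void release(int ticket) {
        counters.decrementAndGet(ticket);
    }
}
</pre>
<br>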
The SharedSession.justClose() method will either successfully
close while racing acquires fail, or vice versa. To achieve this,
the close happens in two phases: first, the counters are put into
a "closing" state if possible. Racing acquires will spin on this
state. If the close comes across a counter in use, it resets the
counters already in the "closing" state back to open, and then
fails. Otherwise, all of the counters in the "closing" state are
committed to the "closed" state, and any threads spinning in
acquire will receive the IllegalStateException indicating that
the SharedSession has been closed.
<br>
<br>
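To make the two phases concrete, here is a rough sketch layered on
striped counters like the ones sketched above -- the sentinel
values and names are invented for illustration and differ from the
PR's actual encoding:
<br>
<pre>
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch only. A stripe holds its acquire count (zero or more), or one
// of two sentinels while a close is racing with acquires.
final class TwoPhaseClose {
    static final int CLOSING = -1;   // stripe provisionally closed
    static final int CLOSED  = -2;   // close has committed

    // Phase 1: CAS every idle (zero) stripe to CLOSING; on meeting a busy
    // stripe, roll the CLOSING stripes back to open and fail the close.
    // Phase 2: with every stripe at CLOSING, commit them all to CLOSED.
    static boolean tryClose(AtomicIntegerArray counters, int stride) {
        int n = counters.length() / stride;
        for (int i = 0; i < n; i++) {
            if (!counters.compareAndSet(i * stride, 0, CLOSING)) {
                for (int j = 0; j < i; j++) {
                    counters.set(j * stride, 0);   // reopen; close fails
                }
                return false;
            }
        }
        for (int i = 0; i < n; i++) {
            counters.set(i * stride, CLOSED);      // commit
        }
        return true;
    }

    // Acquire side: spin while a close is in flight, fail once it commits.
    static int acquire(AtomicIntegerArray counters, int idx) {
        while (true) {
            int v = counters.get(idx);
            if (v == CLOSED) throw new IllegalStateException("Already closed");
            if (v == CLOSING) { Thread.onSpinWait(); continue; }
            if (counters.compareAndSet(idx, v, v + 1)) return idx;
        }
    }
}
</pre>
<br>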
The PR above produces the following results on the same benchmark
and machines:
<br>
<br>
<pre>
Threads      N1      N2      V1      V2    Xeon    Epyc
      1   32.41   32.11   34.43   31.32   27.94    9.82
      2   32.64   33.72   35.11   31.30   28.02    9.81
      4   32.71   36.84   34.67   31.35   28.12   10.49
      8   58.22   31.60   36.87   31.72   47.09   16.52
     16   70.15   47.76   52.37   47.26   70.91   14.53
     24   77.38   78.14   81.67   71.98   87.20   21.70
     32   87.54   98.01   84.73   86.79  109.25   22.65
     48  121.54  128.14  120.51  104.35  175.08   26.85
</pre>
<br>
<br>
In a more complex workload, we were seeing up to 3x better
performance, rather than 40x, as contention there is typically
lower than in the microbenchmark presented here.
<br>
<br>
The PR changes code well outside its immediate area, and adds
some complexity, different timing behaviour, and more memory
overhead, particularly on machines with large core counts. There
is also a reliance on the threads involved hashing well to avoid
contention. Closing may also take more time on machines with
large core counts. I wouldn't anticipate threads spinning in
acquire0 during closing to be an issue, as well-written programs
should close the SharedSession only when it is unused. There is a
small cost in the uncontended, single-threaded benchmark, on the
order of 1-2 ns.
<br>
<br>
Programmers might instead be advised to use Arena.ofAuto() if
contention is the biggest issue; however, this may introduce a
delay when freeing resources.
<br>
<br>
<br>
In terms of alternatives in the literature, the paper "Scalable
Reader-Writer Locks" (Lev, Luchangco, Olszewski, 2009) introduces
C-SNZI, a "Closeable, Scalable Non-Zero Indicator". It does,
however, require more complex data structures that have to be
traversed in a fashion that may be slower than the SharedSession
implementation today.
<br>
<br>
I look forward to hearing people's thoughts.
<br>
<br>
Stuart
<br>
</blockquote>
</body>
</html>