<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>Hi Stuart,<br>
I believe your suggested approach is worth considering.
Splitting the acquire/release counters seems like a good idea, and
one that, to some extent, has also been used elsewhere (e.g.
LongAdder, ConcurrentHashMap) to improve throughput under
contention.</p>
<p>The "tricky bit" is to make sure we can do all this while
retaining correctness, as this is an already very tricky part of
the code.</p>
<p>In the next few weeks we'll look at the code you wrote in more
detail and try to flesh out potential issues.</p>
<p>This problem is quite similar to a read/write lock scenario (as
you also mention):</p>
<p>* the threads doing the acquires/releases are effectively
expressing a desire to "read" a segment in a given piece of code,
so multiple readers can co-exist.<br>
* the thread doing the close is effectively expressing a desire to
"write" the segment -- so it should only be allowed to do so when
there are no readers.</p>
<p>In principle, something like this<br>
</p>
<p><a class="moz-txt-link-freetext" href="https://docs.oracle.com/en/java/javase/25/docs/api//java.base/java/util/concurrent/locks/StampedLock.html">https://docs.oracle.com/en/java/javase/25/docs/api//java.base/java/util/concurrent/locks/StampedLock.html</a></p>
<p>should work quite well for this use case. Or, even using a
LongAdder as an acquire/release counter:</p>
<p><a class="moz-txt-link-freetext" href="https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/util/concurrent/atomic/LongAdder.html">https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/util/concurrent/atomic/LongAdder.html</a></p>
<p>Note the javadoc on this class:</p>
<p>>This class is usually preferable to <a href="https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/util/concurrent/atomic/AtomicLong.html" title="class in java.util.concurrent.atomic"><code>AtomicLong</code></a>
when multiple threads update a common sum that is used for purposes
such as collecting statistics, not for fine-grained synchronization
control. Under low update contention, the two classes have similar
characteristics. But under high contention, expected throughput of
this class is significantly higher, at the expense of higher space
consumption.</p>
<p>This should in principle guarantee superior performance under
contention (similar to what you did). But, no matter the road
taken, doing so requires separating the liveness bit from the
acquire count -- and that's what needs to be analyzed more
carefully (e.g. this means making the acquire fail when the
segment is closed, or about to be closed).<br>
</p>
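<p>For illustration, here's a minimal sketch of the read/write-lock
framing using StampedLock -- the wrapper class and all its names are
made up for this example, and it ignores the performance question;
it just shows the liveness bit kept separate from the acquire path:</p>
<pre>
import java.util.concurrent.locks.StampedLock;

// Hypothetical wrapper, not JDK code: acquire/release map to the read
// lock, close maps to an exclusive write lock that fails fast.
final class StampedSession {
    private final StampedLock lock = new StampedLock();
    private volatile boolean closed;

    long acquire() {
        long stamp = lock.readLock();   // many readers may hold the segment
        if (closed) {                   // liveness check under the read lock
            lock.unlockRead(stamp);
            throw new IllegalStateException("Already closed");
        }
        return stamp;                   // caller passes this back to release
    }

    void release(long stamp) {
        lock.unlockRead(stamp);
    }

    void close() {
        long stamp = lock.tryWriteLock();  // fails while any reader is active
        if (stamp == 0L) {
            throw new IllegalStateException("Segment is in use");
        }
        try {
            closed = true;                 // liveness bit, separate from the count
        } finally {
            lock.unlockWrite(stamp);
        }
    }
}
</pre>
<p>Since closed is only ever set while the write lock is held, a reader
that subsequently acquires the read lock is guaranteed to observe it.</p>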
<p>Thanks!<br>
Maurizio<br>
</p>
<div class="moz-cite-prefix">On 05/12/2025 19:52, Stuart Monteith
wrote:<br>
</div>
<blockquote type="cite" cite="mid:3506dd6c-ef3d-4d4b-972b-fe83bb5debd8@arm.com">Hello,
<br>
I have encountered a scaling problem with
java.lang.foreign.MemorySegments when they are passed to native
code. When native methods are called from, say, more than around 8
cores with MemorySegments allocated from an Arena obtained from
Arena.ofShared(), scaling can be sublinear under contention.
<br>
<br>
From profiling, it is apparent that excessive time is being spent
in jdk.internal.foreign.SharedSession's acquire0 and release0
methods. These methods check and increment or decrement the
acquireCount field, which is used to prevent the SharedSession
from being closed while the acquireCount is not zero. It is also
used to prevent acquire0 from succeeding if the session is
already closed.
<br>
<br>
The issue can be demonstrated with the microbenchmark:
<br>
org.openjdk.bench.java.lang.foreign.CallOverheadConstant.panama_identity_memory_address_shared_3
<br>
<br>
It produces the following results on the Neoverse-N1, N2, V1 and
V2, Intel Xeon 8375c and the AMD Epyc 9654.
<br>
<br>
Each machine has >48 cores, and the results are in nanoseconds:
<br>
<br>
<pre>
Threads        N1        N2        V1        V2      Xeon      Epyc
      1     30.88     32.15     33.54     32.82     27.46      8.45
      2    142.56    134.48    132.01    131.50    116.68     46.53
      4    310.18    282.75    287.59    271.82    251.88     86.11
      8    702.02    710.29    736.72    670.63    533.46    194.60
     16  1,436.17  1,684.80  1,833.69  1,782.78  1,100.15    827.28
     24  2,185.55  2,508.86  2,732.22  2,815.26  1,646.09  1,530.28
     32  2,942.48  3,432.84  3,643.64  3,782.23  2,236.81  2,278.52
     48  4,466.56  5,174.72  5,401.95  5,621.41  4,926.30  3,026.58
</pre>
<br>
<br>
The microbenchmark is repeatedly calling a small native method, so
the timings demonstrate the worst case. With perfect scaling, the
timing should be the same for 48 threads on 48 cores as it is for
1 thread on 1 core.
<br>
<br>
The solution I came up with is here:
<a class="moz-txt-link-freetext" href="https://github.com/openjdk/jdk/pull/28575">https://github.com/openjdk/jdk/pull/28575</a>
<br>
It was suggested that it would have been better to have first
discussed the issue here, but it was useful for me to try and
solve the issue first in order to understand the code.
<br>
<br>
The fundamental issue is producing a counter that can scale while
ensuring races between acquires and closes behave correctly. To
address this, I opted to use multiple counters, with at least as
many counters as there are CPUs on the machine. The current
thread's hashcode is used to select a counter. The counters are
allocated in an AtomicIntegerArray, and are spaced such that each
lies on its own cache line.
<br>
<br>
The API has been changed such that acquire0 returns an int -- a
ticket that needs to be passed to the corresponding release0(int)
method. This was necessary as there is code where the two are
executed on separate threads, which would result in separate
counters being incremented and then decremented. The changes are
quite widespread because of this.
<br>
If the original release0() method were kept, with the counter
recalculated instead of being passed as a ticket, there would be
fewer changes, but perhaps an increased risk of bugs being
introduced.
<br>
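As a rough illustration of the striping and the ticket (the names
and cache-line sizing here are made up, not the PR's actual code,
and the "closing" states described next are elided):
<br>
<pre>
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch of striped acquire/release counters. Each stripe sits on its
// own 64-byte cache line to avoid false sharing between cores.
final class StripedAcquireCounters {
    private static final int STRIDE = 64 / Integer.BYTES; // 16 ints per line
    private final int stripes;            // power of two, at least the CPU count
    private final AtomicIntegerArray counters;

    StripedAcquireCounters() {
        int cpus = Runtime.getRuntime().availableProcessors();
        this.stripes = Integer.highestOneBit(Math.max(cpus - 1, 1)) * 2;
        this.counters = new AtomicIntegerArray(stripes * STRIDE);
    }

    // acquire0 analogue: the stripe index doubles as the "ticket", so the
    // matching release can decrement the same counter from any thread.
    int acquire() {
        int idx = (Thread.currentThread().hashCode() & (stripes - 1)) * STRIDE;
        counters.incrementAndGet(idx);
        return idx;
    }

    // release0(int) analogue: must be handed the ticket from the acquire.
    void release(int ticket) {
        counters.decrementAndGet(ticket);
    }
}
</pre>
<br>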
The SharedSession.justClose() method will either successfully
close while racing acquires fail, or vice versa. To achieve this,
the close happens in two phases: first, the counters are put into
a "closing" state if possible. Racing acquires will spin on this
state. If the close comes across a counter in use, it resets the
counters already in the "closing" state back to open, and then
fails. Otherwise, all of the counters in the "closing" state are
committed to the "closed" state, and any threads spinning in
acquire will receive the IllegalStateException indicating that
the SharedSession has been closed.
<br>
<br>
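To make the two phases concrete, here is a rough sketch layered on
striped counters like the ones sketched above -- the sentinel
values and names are invented for illustration and differ from the
PR's actual encoding:
<br>
<pre>
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch only. A stripe holds its acquire count (zero or more), or one
// of two sentinels while a close is racing with acquires.
final class TwoPhaseClose {
    static final int CLOSING = -1;   // stripe provisionally closed
    static final int CLOSED  = -2;   // close has committed

    // Phase 1: CAS every idle (zero) stripe to CLOSING; on meeting a busy
    // stripe, roll the CLOSING stripes back to open and fail the close.
    // Phase 2: with every stripe at CLOSING, commit them all to CLOSED.
    static boolean tryClose(AtomicIntegerArray counters, int stride) {
        int n = counters.length() / stride;
        for (int i = 0; i < n; i++) {
            if (!counters.compareAndSet(i * stride, 0, CLOSING)) {
                for (int j = 0; j < i; j++) {
                    counters.set(j * stride, 0);   // reopen; close fails
                }
                return false;
            }
        }
        for (int i = 0; i < n; i++) {
            counters.set(i * stride, CLOSED);      // commit
        }
        return true;
    }

    // Acquire side: spin while a close is in flight, fail once it commits.
    static int acquire(AtomicIntegerArray counters, int idx) {
        while (true) {
            int v = counters.get(idx);
            if (v == CLOSED) throw new IllegalStateException("Already closed");
            if (v == CLOSING) { Thread.onSpinWait(); continue; }
            if (counters.compareAndSet(idx, v, v + 1)) return idx;
        }
    }
}
</pre>
<br>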
The PR above produces the following results on the same benchmark
and machines:
<br>
<br>
<pre>
Threads      N1      N2      V1      V2    Xeon    Epyc
      1   32.41   32.11   34.43   31.32   27.94    9.82
      2   32.64   33.72   35.11   31.30   28.02    9.81
      4   32.71   36.84   34.67   31.35   28.12   10.49
      8   58.22   31.60   36.87   31.72   47.09   16.52
     16   70.15   47.76   52.37   47.26   70.91   14.53
     24   77.38   78.14   81.67   71.98   87.20   21.70
     32   87.54   98.01   84.73   86.79  109.25   22.65
     48  121.54  128.14  120.51  104.35  175.08   26.85
</pre>
<br>
<br>
In a more complex workload, we were seeing up to 3x better
performance, rather than 40x, as contention there is typically
lower than in the microbenchmark presented here.
<br>
<br>
The PR changes code well outside its immediate area, and adds
some complexity, different timing behaviour, and more memory
overhead, particularly on machines with large core counts. There
is also a reliance on the threads involved hashing well to avoid
contention. Closing may also take more time on machines with
large core counts. I wouldn't anticipate threads spinning in
acquire0 during closing to be an issue, as well-written programs
should close the SharedSession only when it is unused. There is a
small cost in the uncontended, single-threaded benchmark, on the
order of 1-2 ns.
<br>
<br>
Programmers might instead be advised to use Arena.ofAuto() if
contention is the biggest issue; however, this may introduce a
delay when freeing resources.
<br>
<br>
<br>
In terms of alternatives in the literature, the paper "Scalable
Reader-Writer Locks" (Lev, Luchangco, Olszewski, 2009) introduces
C-SNZI, a "Closeable, Scalable Non-Zero Indicator". It does,
however, require more complex data structures that have to be
traversed in a fashion that may be slower than the SharedSession
implementation today.
<br>
<br>
I look forward to hearing people's thoughts.
<br>
<br>
Stuart
<br>
</blockquote>
</body>
</html>