Improve scaling of downcalls using MemorySegments allocated with shared arenas

Stuart Monteith stuart.monteith at arm.com
Tue Dec 23 15:17:49 UTC 2025


On 15/12/2025 13:33, Maurizio Cimadamore wrote:
> There's also plenty of literature around a multi-word CAS (MCAS) that feature the "disjoint-access parallel" property 
> you want -- example
> 
> https://arxiv.org/pdf/2008.02527

Thanks for the link, that is interesting.

> 
> They all seem to split the work in two phases:
> 
> * first, they create a descriptor for the MCAS operation they want to perform, and atomically set (with CAS) that into 
> each memory address affected by the MCAS operation
> * _after_ all affected locations have been correctly acquired with the same descriptor, we proceed to actually perform 
> the desired CAS on all of them
> 

In our case, we'd be setting one word in acquire0/release0. Only the justClose0 method would acquire all of the 
counters. For Java, I'd expect we'd be operating on arrays of objects - we're not going to swap longs and Object 
references - maybe autoboxed Integers and WordDescriptors. This would change the common case of a load and a CAS to 
something more complex involving an allocation and multiple additional stores and loads.

> This is general, but also very complex, as now we need some logic to allocate new descriptors and deallocate them when 
> no longer needed -- which likely undoes any speedup achieved by using this more complex algorithm.
> 

That is my concern - allocating before you even begin to increment/decrement a counter is going to be costly. There are 
also the levels of indirection this would involve. Having threads help other threads progress is interesting, but in 
this instance, we really are just incrementing/decrementing counters - by the time you've checked if you ought to help, 
I suspect it's better to just retry.

> Of course we can take inspiration from this -- the closing thread could use a special value to signal that he wants to 
> "own" all the counters (this can be done with a CAS from the expected non-acquired value to the special value).
> 

In the PR the special value is -1 "CNT_CLOSING", which isn't held in a separate state, but in each counter. I suppose it 
is possible to have a separate value (perhaps the existing MemorySessionImpl.state variable), but that would add an 
additional read to acquire0. It would be an alternative way of controlling concurrent calls to justClose() which I 
worked around by making the method synchronized. I considered using java.util.concurrent.locks.ReentrantLock, but there 
is unlikely to be any contention.

Having a global value might help address one concern I have, where a global counter is used in the uncontended state, 
but that is changed to indicate when the state is contended and the multiple counters have been allocated and used 
instead. I haven't prototyped such a thing yet - someone would have to win the race to perform the expansion, and 
managing the counters when a global counter is >0 would be tricky - I suspect the transition would need to occur if the 
counter ever transitioned back to 0. We don't have global state to place the acquire0'd counters through the expanded 
multicounters.

> Only if the closing thread manages to successfully set this special value on all the counters will it proceed to close 
> the arena.
> 
> If something went wrong, and the special value cannot be set on some counters, the close fails (that means some acquire 
> operation managed to sneak in between), and the counter state has to be reset for all the updated counters.
> 
> (perhaps this same logic can also be used to avoid close vs. close races, as one of the closing threads will see some 
> counters _already_ in the special state, which means some other thread is attempting to close).
> 
> This seems quite similar to the approach Stuart tried to implement here:
> 

Yes, while setting the counters to CNT_CLOSING, acquire0 calls will spin until their counter is changed back to "0" if 
the close fails, or the acquire0 will fail if the counter is set to CNT_CLOSED. The closing thread will succeed if all 
of the counters are set from 0 to CNT_CLOSING, in which case they are set to CNT_CLOSED. Otherwise, the closing thread 
will change the counters it set from CNT_CLOSING to 0, and then fail itself. So, yes, very similar.
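
A minimal sketch of that protocol, for illustration only. This is not the code in the PR: the class name MultiCounter, the stripe parameter, the boolean return values and the value chosen for CNT_CLOSED are all assumptions made for the example (only CNT_CLOSING = -1 is stated above).

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical sketch of the multi-counter close protocol described above.
final class MultiCounter {
    static final int CNT_CLOSING = -1; // value given in the thread
    static final int CNT_CLOSED = -2;  // assumed value, not stated in the thread

    private final AtomicIntegerArray counters;

    MultiCounter(int stripes) {
        counters = new AtomicIntegerArray(stripes);
    }

    // Returns true if the stripe was acquired, false if the session is closed.
    boolean acquire(int stripe) {
        while (true) {
            int v = counters.get(stripe);
            if (v == CNT_CLOSED) {
                return false;               // close succeeded: acquire fails
            }
            if (v == CNT_CLOSING) {
                Thread.onSpinWait();        // close in progress: wait for the outcome
                continue;
            }
            if (counters.compareAndSet(stripe, v, v + 1)) {
                return true;
            }
        }
    }

    // Must be called on the same stripe that was acquired. A held stripe is > 0,
    // so it can never be in CNT_CLOSING or CNT_CLOSED here.
    void release(int stripe) {
        counters.decrementAndGet(stripe);
    }

    // synchronized only serialises concurrent closers, as in the workaround above.
    synchronized boolean close() {
        int n = counters.length();
        for (int i = 0; i < n; i++) {
            if (!counters.compareAndSet(i, 0, CNT_CLOSING)) {
                // Some counter was not 0: undo the counters claimed so far and fail.
                for (int j = 0; j < i; j++) {
                    counters.set(j, 0);
                }
                return false;
            }
        }
        // Every counter went from 0 to CNT_CLOSING: commit the close.
        for (int i = 0; i < n; i++) {
            counters.set(i, CNT_CLOSED);
        }
        return true;
    }
}
```

In this sketch the acquire path still touches a single word; only close() walks every counter.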

> https://github.com/openjdk/jdk/pull/28575
> 
> Cheers
> Maurizio
> 
> 
> On 15/12/2025 12:29, Maurizio Cimadamore wrote:
>> Another possibility would be to split the 64 bit state into, say 8-bit words.
>>
>> Each word would act as a counter (up to 256 acquires, theoretically).
>>
>> This would allow acquire/release/close to work on different parts of the state field (e.g. by issuing a byte-level CAS 
>> at the correct offset), while still allowing the close operation to atomically CAS the entire counter.
>>
>> But, I'm not sure this would allow for better contention, as I'd assume that byte-level CAS will probably translate to 
>> a 64-bit CAS with some extra bit masking logic on top...
>>
>> Maurizio
>>
>> On 14/12/2025 23:07, Stuart Monteith wrote:
>>> What I'm finding with the getAndAdd version is there is often an improvement, but the split counting, the 
>>> multicounting as I called it, is much better in terms of performance (I'll share them in a week). I've tried to avoid 
>>> weird issues with the split counting by having the code as simple as I could make it. Keeping the states consistent 
>>> is important - if the code is in the middle of closing, it is important that getting the state of the counter pauses 
>>> while that is decided.
>>>
>>> BR,
>>>     Stuart
>>>
>>>
>>>
>>> On 12/12/2025 19:25, Chris Vest wrote:
>>>> Yeah, we previously also tried split counting, but reverted it because we observed some weird rare issues, and got 
>>>> suspicious of it.
>>>>
>>>> On Wed, Dec 10, 2025 at 8:21 AM Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>
>>>>     What I like (a lot) about this is that now we're back to using the same
>>>>     "bit" of information for both liveness and acquire count (IIUC). If
>>>>     that's the case, it would be much simpler to convince ourselves this is
>>>>     correct.
>>>>
>>>>     Thanks
>>>>     Maurizio
>>>>
>>>>     On 10/12/2025 14:48, Stuart Monteith wrote:
>>>>      > Thanks Chris,
>>>>      >     I've taken a look and implemented SharedSession with something
>>>>      > similar to your RefCnt. One of the differences with SharedSession is
>>>>      > that we have a separate close method. I can implement acquire0 with
>>>>      > getAndAdd(2), release0 with getAndAdd(-2) and close with
>>>>      > compareAndSwap(0, 1). With the additional tests against 0x80000001 for
>>>>      > acquire0 and release0, I have something that passes the unit tests for
>>>>      > java/foreign.
>>>>      >
>>>>      > The benchmarking is quite promising, but I'll need to look more
>>>>      > closely at it - it doesn't scale better on all platforms.
>>>>      >
>>>>      > Thanks,
>>>>      >     Stuart
>>>>      >
>>>>      >
>>>>      >
>>>>      >
>>>>      > On 08/12/2025 19:45, Chris Vest wrote:
>>>>      >> For what it's worth, in Netty we implement our reference counting
>>>>      >> with incrementing by 2 instead of 1, and use the low odd bit to
>>>>      >> indicate the released state.
>>>>      >> This allows us to acquire using getAndAdd, which scales much better
>>>>      >> than a CAS loop.
>>>>      >> Unfortunately we still need to use a CAS loop when implementing
>>>>      >> release, so that still has contention problems.
>>>>      >>
>>>>      >> For reference:
>>>>      >> https://github.com/netty/netty/blob/2b29b5e87656203fecd1732ffb472a366a1918cc/common/src/main/java/io/netty/util/internal/RefCnt.java#L258-L295
>>>>      >>
>>>>      >> On Mon, Dec 8, 2025 at 10:42 AM Maurizio Cimadamore
>>>>      >> <maurizio.cimadamore at oracle.com> wrote:
>>>>      >>
>>>>      >>
>>>>      >>      > sum() is really just a snapshot, it adds up the counters
>>>>      >> (Cells), so
>>>>      >>      > it wouldn't ensure the counter was at zero. Immediately after
>>>>      >>      > returning zero a thread could have already incremented it.
>>>>      >>     Yes. What I mean is: you can check if close() should throw
>>>>      >> because of
>>>>      >>     pending acquires. But, as I said, we can't use that in any way to
>>>>      >> "block"
>>>>      >>     other acquires from happening in case we _do_ want to close. Which
>>>>      >>     leaves us exposed.
>>>>      >>      >
>>>>      >>      >
>>>>      >>      >> For the purpose of implementation clarity -- would it be
>>>>      >> useful to
>>>>      >>      >> wrap the various counters plus logic to acquire/ release (and
>>>>      >>      >> "closing" state) into a separate abstraction, which is then
>>>>      >> used by
>>>>      >>      >> SharedMemorySession? A sort of "atomic" LongAdder, if you
>>>>      >> will :-)
>>>>      >>      >>
>>>>      >>      >> That might make it easier to verify the correctness of the
>>>>      >>      >> implementation, by validating each aspect (the atomic long
>>>>      >> adder, and
>>>>      >>      >> its use from SharedMemorySession) separately.
>>>>      >>      >
>>>>      >>      > Sure, that would be a bit cleaner, thanks.
>>>>      >>
>>>>      >>     Thanks.
>>>>      >>
>>>>      >>
>>>>      >>     Maurizio
>>>>      >>
>>>>      >
>>>>
>>>
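
For reference, the getAndAdd-based variant quoted above (acquire0 with getAndAdd(2), release0 with getAndAdd(-2), close with a compare-and-set from 0 to 1) could look roughly like the sketch below. It is an illustration under assumptions, not the SharedSession code: the class and method names are hypothetical, and the additional tests against 0x80000001 are left out because their exact form is not given in the thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the count is kept in the upper bits and moves in steps
// of 2, while the low bit marks the closed state.
final class EvenOddCounter {
    // Even value: open, with (value / 2) outstanding acquires. Odd value: closed.
    private final AtomicInteger state = new AtomicInteger();

    // Returns true if acquired, false if the counter was already closed.
    boolean acquire() {
        int prev = state.getAndAdd(2);
        if ((prev & 1) != 0) {
            state.getAndAdd(-2);   // undo the increment; the value stays odd throughout
            return false;
        }
        return true;
    }

    void release() {
        state.getAndAdd(-2);
    }

    // Succeeds only when there are no outstanding acquires and it is not yet closed.
    boolean close() {
        return state.compareAndSet(0, 1);
    }

    boolean isClosed() {
        return (state.get() & 1) != 0;
    }
}
```

Since liveness and the acquire count share one word, a close can only succeed at a moment when the count is exactly zero. Overflow handling is omitted here.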


