Improve scaling of downcalls using MemorySegments allocated with shared arenas

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Mon Dec 15 13:33:04 UTC 2025


There's also plenty of literature on multi-word CAS (MCAS) algorithms 
that feature the "disjoint-access parallel" property you want -- for example:

https://arxiv.org/pdf/2008.02527

They all seem to split the work into two phases:

* first, they create a descriptor for the MCAS operation they want to 
perform, and atomically set (with CAS) that into each memory address 
affected by the MCAS operation
* _after_ all affected locations have been correctly acquired with the 
same descriptor, they proceed to actually perform the desired CAS on all 
of them

This is general, but also very complex, as now we need some logic to 
allocate new descriptors and deallocate them when no longer needed -- 
which likely undoes any speedup achieved by using this more complex 
algorithm.

Of course we can take inspiration from this -- the closing thread could 
use a special value to signal that it wants to "own" all the counters 
(this can be done with a CAS from the expected non-acquired value to the 
special value).

Only if the closing thread manages to successfully set this special 
value on all the counters will it proceed to close the arena.

If something goes wrong, and the special value cannot be set on some 
counters, the close fails (that means some acquire operation managed to 
sneak in between), and the counter state has to be reset for all the 
updated counters.

(perhaps this same logic can also be used to avoid close vs. close 
races, as one of the closing threads will see some counters _already_ in 
the special state, which means some other thread is attempting to close).

This seems quite similar to the approach Stuart tried to implement here:

https://github.com/openjdk/jdk/pull/28575
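
To make the special-value idea described above a bit more concrete, here 
is a rough sketch. It is not the code from the PR: the class name 
StripedCounter, the CLOSING/CLOSED constants and the stripe index 
parameter are all made up for illustration. It assumes two special 
values rather than one, so that an acquire which races with a close 
attempt waits for the outcome instead of failing spuriously, and it 
assumes a release always targets the same stripe as its acquire.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch only; names and layout are assumptions.
final class StripedCounter {
    private static final long CLOSING = -1L; // a closer tentatively owns this stripe
    private static final long CLOSED  = -2L; // close succeeded, no more acquires

    private final AtomicLongArray stripes;

    StripedCounter(int nStripes) {
        stripes = new AtomicLongArray(nStripes);
    }

    /** Returns false once closed; spins while a close attempt is undecided. */
    boolean acquire(int stripe) {
        while (true) {
            long v = stripes.get(stripe);
            if (v == CLOSED) {
                return false;
            }
            if (v == CLOSING) {
                Thread.onSpinWait(); // wait until the closer commits or rolls back
                continue;
            }
            if (stripes.compareAndSet(stripe, v, v + 1)) {
                return true;
            }
        }
    }

    /** Must be called on the same stripe that was acquired. */
    void release(int stripe) {
        stripes.decrementAndGet(stripe);
    }

    /** Tries to take every stripe; rolls back if any stripe is not at zero. */
    boolean tryClose() {
        int n = stripes.length();
        for (int i = 0; i < n; i++) {
            if (!stripes.compareAndSet(i, 0L, CLOSING)) {
                // An acquire (or another closer) got there first:
                // reset only the stripes this thread has marked.
                for (int j = 0; j < i; j++) {
                    stripes.set(j, 0L);
                }
                return false;
            }
        }
        // All stripes are owned by this thread: commit the close.
        for (int i = 0; i < n; i++) {
            stripes.set(i, CLOSED);
        }
        return true;
    }
}
```

In this sketch acquire is still a CAS loop, but only threads mapped to 
the same stripe contend with each other; how a thread picks its stripe 
is left out. A second closer fails on the first stripe it finds in the 
special state, without touching anything.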

Cheers
Maurizio


On 15/12/2025 12:29, Maurizio Cimadamore wrote:
> Another possibility would be to split the 64-bit state into, say, 
> 8-bit words.
>
> Each word would act as a counter (up to 255 acquires, theoretically).
>
> This would allow acquire/release/close to work on different parts of 
> the state field (e.g. by issuing a byte-level CAS at the correct 
> offset), while still allowing the close operation to atomically CAS 
> the entire counter.
>
> But, I'm not sure this would reduce contention, as I'd assume that a 
> byte-level CAS will probably translate to a 64-bit CAS with some extra 
> bit masking logic on top...
>
> Maurizio
>
> On 14/12/2025 23:07, Stuart Monteith wrote:
>> What I'm finding with the getAndAdd version is there is often an 
>> improvement, but the split counting, the multicounting as I called 
>> it, is much better in terms of performance (I'll share them in a 
>> week). I've tried to avoid weird issues with the split counting by 
>> having the code as simple as I could make it. Keeping the states 
>> consistent is important - if the code is in the middle of closing, it 
>> is important that getting the state of the counter pauses while that 
>> is decided.
>>
>> BR,
>>     Stuart
>>
>>
>>
>> On 12/12/2025 19:25, Chris Vest wrote:
>>> Yeah, we previously also tried split counting, but reverted it 
>>> because we observed some weird rare issues, and got suspicious of it.
>>>
>>> On Wed, Dec 10, 2025 at 8:21 AM Maurizio Cimadamore 
>>> <maurizio.cimadamore at oracle.com> wrote:
>>>
>>>     What I like (a lot) about this is that now we're back to using the same
>>>     "bit" of information for both liveness and acquire count (IIUC). If
>>>     that's the case, it would be much simpler to convince ourselves this is
>>>     correct.
>>>
>>>     Thanks
>>>     Maurizio
>>>
>>>     On 10/12/2025 14:48, Stuart Monteith wrote:
>>>      > Thanks Chris,
>>>      >     I've taken a look and implemented SharedSession with something
>>>      > similar to your RefCnt. One of the differences with SharedSession is
>>>      > that we have a separate close method. I can implement acquire0 with
>>>      > getAndAdd(2), release0 with getAndAdd(-2) and close with
>>>      > compareAndSwap(0, 1). With the additional tests against 0x80000001 for
>>>      > acquire0 and release0, I have something that passes the unit tests for
>>>      > java/foreign.
>>>      >
>>>      > The benchmarking is quite promising, but I'll need to look more
>>>      > closely at it - it doesn't scale better on all platforms.
>>>      >
>>>      > Thanks,
>>>      >     Stuart
>>>      >
>>>      >
>>>      >
>>>      >
>>>      > On 08/12/2025 19:45, Chris Vest wrote:
>>>      >> For what it's worth, in Netty we implement our reference counting
>>>      >> with incrementing by 2 instead of 1, and use the low odd bit to
>>>      >> indicate the released state.
>>>      >> This allows us to acquire using getAndAdd, which scales much better
>>>      >> than a CAS loop.
>>>      >> Unfortunately we still need to use a CAS loop when implementing
>>>      >> release, so that still has contention problems.
>>>      >>
>>>      >> For reference:
>>>      >> https://github.com/netty/netty/blob/2b29b5e87656203fecd1732ffb472a366a1918cc/common/src/main/java/io/netty/util/internal/RefCnt.java#L258-L295
>>>      >>
>>>      >> On Mon, Dec 8, 2025 at 10:42 AM Maurizio Cimadamore
>>>      >> <maurizio.cimadamore at oracle.com> wrote:
>>>      >>
>>>      >>
>>>      >>      > sum() is really just a snapshot, it adds up the counters (Cells), so
>>>      >>      > it wouldn't ensure the counter was at zero. Immediately after
>>>      >>      > returning zero a thread could have already incremented it.
>>>      >>     Yes. What I mean is: you can check if close() should throw because of
>>>      >>     pending acquires. But, as I said, we can't use that in any way to "block"
>>>      >>     other acquires from happening in case we _do_ want to close. Which
>>>      >>     leaves us exposed.
>>>      >>      >
>>>      >>      >
>>>      >>      >> For the purpose of implementation clarity -- would it be useful to
>>>      >>      >> wrap the various counters plus logic to acquire/release (and
>>>      >>      >> "closing" state) into a separate abstraction, which is then used by
>>>      >>      >> SharedMemorySession? A sort of "atomic" LongAdder, if you will :-)
>>>      >>      >>
>>>      >>      >> That might make it easier to verify the correctness of the
>>>      >>      >> implementation, by validating each aspect (the atomic long adder, and
>>>      >>      >> its use from SharedMemorySession) separately.
>>>      >>      >
>>>      >>      > Sure, that would be a bit cleaner, thanks.
>>>      >>
>>>      >>     Thanks.
>>>      >>
>>>      >>
>>>      >>     Maurizio
>>>      >>
>>>      >
>>>
>>


More information about the panama-dev mailing list