Improve scaling of downcalls using MemorySegments allocated with shared arenas

Stuart Monteith stuart.monteith at arm.com
Mon Dec 8 18:40:19 UTC 2025


On 08/12/2025 17:31, Maurizio Cimadamore wrote:
> 
>>
>>> This problem is quite similar to a read/write lock scenario (as you also mention):
>>>
>>> * the threads doing the acquires/release are effectively expressing a desire to "read" a segment in a given piece of 
>>> code. So, multiple readers can co-exist.
>>> * the thread doing the close is effectively expressing a desire to "write" the segment -- so it should only be 
>>> allowed to do so when there's no readers.
>>>
>>> In principle, something like this
>>>
>>> https://docs.oracle.com/en/java/javase/25/docs/api//java.base/java/util/concurrent/locks/StampedLock.html
>> I experimented with StampedLock, but found that it scaled more or less the same as the existing implementation, with
>> acquire0() calling tryReadLock(), release0() calling tryUnlockRead() and justClose() calling tryWriteLock(). It appears
>> the compare-and-swap operation is a bottleneck.
> Interesting -- thanks for sharing!
>>
>>> Should work quite well for this use case. Or, even using a LongAdder as an acquire/release counter:
>>>
>>> https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/util/concurrent/atomic/LongAdder.html
>> I found LongAdder/Striped64 interesting as, if necessary, we could look at how it is handling memory. The 
>> SharedSession code I wrote allocates memory upfront, such that a scenario with one thread would use as much memory as 
>> when there are 128 threads on 128 cores. But besides that, getting the sum is not atomic, and neither is acting upon 
>> it. I experimented with AtomicLong, with a close operation subtracting a very large value to force the counter
>> negative, but that wasn't too different from before, as it still depends on atomic reads/writes to a single memory location.
> 
> Yeah --- after I sent the message I realized that sum() is good enough for ensuring e.g. that when closing there's no 
> pending acquire -- but not for the opposite: e.g. ensuring that acquire can't happen during a close... so it's weaker 
> than a RW lock (but the internals use some redundancy to reduce contention, which is kind of what you also do here).
> 

sum() is really just a snapshot: it adds up the counters (Cells), so a result of zero wouldn't ensure the counter is
still at zero. Immediately after sum() returns zero, another thread could already have incremented it.
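
To make the check-then-act gap concrete, here is a minimal sketch. The class and method names are hypothetical, and the interleaving is simulated on one thread so the result is deterministic; in a real run the increment would come from another thread.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical demo class, not part of the SharedSession code under discussion.
final class SumSnapshotDemo {

    // Returns true if a "safe to close" decision based on sum() was already
    // stale by the time the closer would have acted on it.
    static boolean staleCloseDecision() {
        LongAdder acquires = new LongAdder();

        // Closer: reads the snapshot and sees no outstanding acquires.
        boolean sawZero = acquires.sum() == 0;

        // Acquirer: an acquire lands after the snapshot was taken
        // (done inline here to stand in for a concurrent thread).
        acquires.increment();

        // Closer: would now go ahead and close, yet one acquire is outstanding.
        return sawZero && acquires.sum() != 0;
    }

    public static void main(String[] args) {
        System.out.println("stale decision: " + staleCloseDecision());
    }
}
```

Nothing ties the sum() read to the later close, which is why a plain LongAdder is not enough on its own.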

Striped64 uses JavaUtilConcurrentTLRAccess to get a random number, which could be an improvement over using Thread.hashCode().


> For the purpose of implementation clarity -- would it be useful to wrap the various counters plus logic to
> acquire/release (and "closing" state) into a separate abstraction, which is then used by SharedMemorySession? A sort of
> "atomic" LongAdder, if you will :-)
> 
> That might make it easier to verify the correctness of the implementation, by validating each aspect (the atomic long 
> adder, and its use from SharedMemorySession) separately.

Sure, that would be a bit cleaner, thanks.
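
As a rough illustration of what such a standalone abstraction might look like, here is a minimal sketch. It is not the code from the patch: the class name, the method names, the CLOSING sentinel and the choice of AtomicLongArray are all assumptions made for the example. Each stripe holds an acquire count; close locks the stripes one at a time by CASing 0 to a sentinel, and rolls back if any stripe is still in use.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical closeable striped counter; an illustration only.
final class StripedAcquireCounter {
    private static final long CLOSING = -1L;  // stripe is locked by a closer
    private final AtomicLongArray stripes;     // one acquire count per stripe (unpadded)
    private volatile boolean closed;

    StripedAcquireCounter(int nStripes) {
        stripes = new AtomicLongArray(nStripes);
    }

    // Returns the stripe index to pass to release(), or -1 if already closed.
    int acquire() {
        int i = Math.floorMod(Thread.currentThread().hashCode(), stripes.length());
        for (;;) {
            long v = stripes.get(i);
            if (v >= 0) {
                if (stripes.compareAndSet(i, v, v + 1)) return i;
            } else if (closed) {
                return -1;
            } else {
                Thread.onSpinWait();  // close in progress: wait for commit or rollback
            }
        }
    }

    // Must be called with the index returned by the matching acquire().
    void release(int stripe) {
        stripes.decrementAndGet(stripe);
    }

    // Returns true if the counter was closed by this call; false if any
    // stripe is still acquired or the counter was already closed.
    synchronized boolean tryClose() {
        if (closed) return false;
        for (int i = 0; i < stripes.length(); i++) {
            if (!stripes.compareAndSet(i, 0, CLOSING)) {
                for (int j = 0; j < i; j++) stripes.set(j, 0);  // roll back locked stripes
                return false;
            }
        }
        closed = true;
        return true;
    }
}
```

Because a stripe only becomes CLOSING via a CAS from zero, and acquire never increments a negative stripe, close and acquire can't both succeed on the same stripe. The sketch ignores padding against false sharing and still does one CAS per acquire, just spread over several memory locations, so it says nothing about how well this would scale.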

> 
> Cheers
> Maurizio
> 


