Improve scaling of downcalls using MemorySegments allocated with shared arenas
Stuart Monteith
stuart.monteith at arm.com
Mon Dec 8 18:40:19 UTC 2025
On 08/12/2025 17:31, Maurizio Cimadamore wrote:
>
>>
>>> This problem is quite similar to a read/write lock scenario (as you also mention):
>>>
>>> * the threads doing the acquires/release are effectively expressing a desire to "read" a segment in a given piece of
>>> code. So, multiple readers can co-exist.
>>> * the thread doing the close is effectively expressing a desire to "write" the segment -- so it should only be
>>> allowed to do so when there are no readers.
>>>
>>> In principle, something like this
>>>
>>> https://docs.oracle.com/en/java/javase/25/docs/api//java.base/java/util/concurrent/locks/StampedLock.html
>> I experimented with StampedLock, but found that it scaled more or less the same as the existing implementation.
>> acquire0() called tryReadLock(), release0() called tryUnlockRead() and justClose() called tryWriteLock(). It appears
>> the compare-and-swap operation is a bottleneck.
> Interesting -- thanks for sharing!
>>
>>> Should work quite well for this use case. Or, even using a LongAdder as an acquire/release counter:
>>>
>>> https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/util/concurrent/atomic/LongAdder.html
>> I found LongAdder/Striped64 interesting as, if necessary, we could look at how it is handling memory. The
>> SharedSession code I wrote allocates memory upfront, such that a scenario with one thread would use as much memory as
>> when there are 128 threads on 128 cores. But besides that, getting the sum is not atomic, and neither is acting upon
>> it. I experimented with AtomicLong, with a close operation subtracting a very large value to force the counter
>> negative, but that wasn't too different from before, dependent on atomic reads/writes to a single memory location.
>
> Yeah --- after I sent the message I realized that sum() is good enough for ensuring e.g. that when closing there's no
> pending acquire -- but not for the opposite: e.g. ensuring that acquire can't happen during a close... so it's weaker
> than a RW lock (but the internals use some redundancy to reduce contention, which is kind of what you also do here).
>
sum() is really just a snapshot: it adds up the counters (Cells), so a result of zero doesn't guarantee that the counter
is still at zero. Immediately after sum() returns zero, a thread could already have incremented one of the cells.
Striped64 uses JavaUtilConcurrentTLRAccess to get a random number, which could be an improvement over using
Thread.hashCode().
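
To make the snapshot problem concrete, here is a minimal single-threaded sketch that replays the problematic
interleaving by hand. The class name SumSnapshotDemo and the helper looksIdle() are made up for illustration; only
LongAdder and its sum()/increment() methods are real API.

```java
import java.util.concurrent.atomic.LongAdder;

class SumSnapshotDemo {
    // Hypothetical check that a close() might be tempted to make.
    static boolean looksIdle(LongAdder pending) {
        return pending.sum() == 0; // sum() only reports what the cells held while it was adding them up
    }

    public static void main(String[] args) {
        LongAdder pending = new LongAdder();

        boolean idle = looksIdle(pending); // the "closer" takes its snapshot
        pending.increment();               // an "acquire" lands right after the snapshot

        // The closer would now act on a stale answer: idle is true, yet one acquire is pending.
        System.out.println("snapshot said idle=" + idle + ", pending now=" + pending.sum());
    }
}
```

With real threads the increment can land at any point between sum() returning and the closer acting on the result,
which is why the check and the close would have to be one atomic step.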
> For the purpose of implementation clarity -- would it be useful to wrap the various counters plus logic to acquire/
> release (and "closing" state) into a separate abstraction, which is then used by SharedMemorySession? A sort of "atomic"
> LongAdder, if you will :-)
>
> That might make it easier to verify the correctness of the implementation, by validating each aspect (the atomic long
> adder, and its use from SharedMemorySession) separately.
Sure, that would be a bit cleaner, thanks.
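
As a rough idea of what such a wrapper could look like, here is a sketch. It is not the SharedMemorySession patch
discussed in this thread: the class name StripedAcquireCounter, the CLAIMED marker, the returned stripe token and the
use of Thread.hashCode() to pick a stripe are all assumptions made for illustration, and padding against false sharing
is left out.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical "atomic LongAdder": per-stripe acquire counts plus a close that claims every stripe.
final class StripedAcquireCounter {
    private static final long CLAIMED = Long.MIN_VALUE; // stripe is currently held by a closer
    private final AtomicLongArray stripes;               // real code would pad stripes apart
    private volatile boolean closed;

    StripedAcquireCounter(int stripeCount) {
        stripes = new AtomicLongArray(stripeCount);
    }

    /** Returns the stripe index to hand back to release(), or -1 if the counter is closed. */
    int acquire() {
        int i = Math.floorMod(Thread.currentThread().hashCode(), stripes.length());
        while (true) {
            long v = stripes.get(i);
            if (v >= 0) {
                if (stripes.compareAndSet(i, v, v + 1)) return i;
            } else if (closed) {
                return -1;
            } else {
                Thread.onSpinWait(); // a close attempt is in flight; it will finish or roll back
            }
        }
    }

    /** Must be given the stripe index returned by the matching acquire(). */
    void release(int stripe) {
        stripes.decrementAndGet(stripe);
    }

    /** Succeeds only if every stripe is at zero; otherwise undoes its claims and returns false. */
    boolean tryClose() {
        for (int i = 0; i < stripes.length(); i++) {
            if (!stripes.compareAndSet(i, 0, CLAIMED)) {
                for (int j = i - 1; j >= 0; j--) stripes.set(j, 0); // roll back in reverse order
                return false;
            }
        }
        closed = true;
        return true;
    }

    public static void main(String[] args) {
        StripedAcquireCounter c = new StripedAcquireCounter(4);
        int token = c.acquire();
        System.out.println("close while acquired: " + c.tryClose());
        c.release(token);
        System.out.println("close when idle: " + c.tryClose());
        System.out.println("acquire after close: " + c.acquire());
    }
}
```

In this sketch acquire/release only touch one stripe, while close pays for visiting all of them, which matches the
expectation that close is rare. Testing the counter on its own, separately from the session logic, is then
straightforward.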
>
> Cheers
> Maurizio
>
More information about the panama-dev mailing list